Hadoop

Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license. "Hadoop is a framework for running applications on large clusters of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named map/reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both map/reduce and the distributed file system are designed so that node failures are automatically handled by the framework." (Hadoop Overview) It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google's MapReduce and Google File System (GFS) papers.

The MapReduce model is documented in this paper from Google labs.

This page does not describe how to use MapReduce or Hadoop. Instead, it focuses on two sentences toward the end of the MapReduce paper, specifically the statement:

The MapReduce implementation relies on an in-house cluster management system that is responsible for distributing and running user tasks on a large collection of shared machines. Though not the focus of this paper, the cluster management system is similar in spirit to other systems such as Condor [16].

Using the SGE cluster management solution on Cheaha to run Hadoop is the focus of this document.

Requirements

Running Hadoop on Cheaha has two dependencies.

  1. You must download and unpack Hadoop into your home directory (a sketch of this step follows the list).
  2. You must reserve a mini-cluster in which to run your Hadoop cluster, as described in the mini-cluster example.
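
For the first requirement, the commands below are a minimal sketch of downloading and unpacking a Hadoop release; the release version (1.2.1) and the archive URL are assumptions, so substitute whichever release you actually intend to use. The job script later on this page expects the unpacked tree to be reachable as hadoop-current in your home directory.

# Download and unpack a Hadoop release into the home directory.
# The version and URL here are assumptions; adjust to the release you use.
cd $HOME
wget https://archive.apache.org/dist/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz
tar -xzf hadoop-1.2.1.tar.gz
# The hadoop.qsub script below refers to the tree as "hadoop-current",
# so one option is to point a symlink at the unpacked directory.
ln -s hadoop-1.2.1 hadoop-current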

Additionally, you may want to set up a VNC session to interact with your Hadoop cluster via a web browser, especially when first exploring Hadoop.

Background

Before approaching Hadoop to solve problems, you need to become familiar with the MapReduce model for solving problems. This is a model common to functional programming (e.g. Lisp), so an understanding of that domain helps as well.

A simplified description of map/reduce is to think in terms of lists of data (e.g. arrays). The map function iterates over the list (e.g. recursively) and identifies elements of interest. The reduce step works on the identified elements and processes them further. For example, if you are working on lists of numbers, it could add the terms selected by the map operation and produce a final sum. Or, more simply, the reduce operation could just print the terms selected by the map operation. (Note: DBAs may recognize this as the pattern "SELECT function(x) WHERE ... GROUP BY x")

More MapReduce examples are described in this paper from Google labs.

There are much better ways to learn the MapReduce model than the Hadoop framework. Working with Hadoop means you will write code in Java and deal with a framework designed to handle a wide range of problems at scale. This means there is some complexity to understand before you can use the framework effectively.

A simple example provided by Dr. Jonas Almeida is to start with a simple list of numbers [1,2,3,4,5] and create a map operation, say "times 2", that maps each element (multiplies it by 2) into a new result list. The reduce operation then adds the mapped numbers together, producing the result 30. You can try out this simple example in any browser; if your JavaScript is rusty, go to the CoffeeScript site's "Try CoffeeScript" tab and enter the following example:

x = [1,2,3,4,5]
y = x.map((x) -> 2*x).reduce((a,b) -> a+b)
alert(y)

It will translate this into JavaScript for you, and pressing the "Run" button produces the answer. If you liked that, try the same thing directly in JavaScript in your browser's console:

[1,2,3,4,5].map(function(x){return 2*x}).reduce(function(a,b){return a+b})

The MapReduce pattern makes this basic construct much more generic and powerful by having the map function emit a key with each value, so that each unique key can be associated with its own reduce operation. Complex workflows can be distributed using this pattern, as elegantly illustrated in mapreduce-js.googlecode.com. A shell sketch of this keyed grouping follows.
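
As a rough illustration of the keyed pattern (plain shell tools, not Hadoop itself), the sketch below uses the first awk as the mapper to emit key/value pairs, sort to stand in for the shuffle that groups records by key, and the second awk as the reducer to sum the values for each key. The sample records are invented for the illustration.

# Toy keyed map/reduce using shell tools; the input records are made up.
printf '%s\n' "apple 1" "banana 2" "apple 3" "banana 4" \
  | awk '{ print $1 "\t" $2 }' \
  | sort \
  | awk -F'\t' '{ sum[$1] += $2 } END { for (k in sum) print k, sum[k] }'
# map: emit key<TAB>value; sort: group by key; reduce: sum per key
# prints the per-key sums: apple 4 and banana 6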

This should give you a sense of the logic behind MapReduce. A large class of problems can be broken down into map and reduce operations, which frameworks like Hadoop can then distribute across large pools of computers; hence the popularity of Hadoop. The MapReduce model makes it much easier to describe a solution in a form that can be readily distributed.

Hadoop Single-Process Mode

The first step to getting Hadoop running on the cluster for testing is to set up a job for single-node mode.

The following hadoop.qsub will get you started after downloading Hadoop:

#!/bin/bash
#
#$ -S /bin/bash
#$ -cwd
#
#$ -N hadoop-single
#$ -pe smp 1
#$ -l h_rt=00:10:00,s_rt=0:08:00,vf=1G
#$ -j y
#
#$ -M jpr@uab.edu
#$ -m eas
#
# Load the appropriate module files
#$ -V

#env | sort

#cat $TMPDIR/machines
#
# reset environment
#
cd hadoop-current
rm -rf input/ output/
cd ..

# 
# sets up the hadoop node lists; in this single-process example
# - localhost acts as the master for the namenode and jobtracker
# - localhost also serves as the slave node for the task and data daemons
#
echo "configure hadoop:"
echo localhost > hadoop-current/conf/masters
echo localhost > hadoop-current/conf/slaves
cp hadoop-current/conf/core-site.xml-template-single hadoop-current/conf/core-site.xml
cp hadoop-current/conf/hdfs-site.xml-template-single hadoop-current/conf/hdfs-site.xml
cp hadoop-current/conf/mapred-site.xml-template-single hadoop-current/conf/mapred-site.xml

cd hadoop-current

mkdir input
cp conf/*.xml input
bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
cat output/*

exit
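
# NOTE: the "exit" above ends the single-process run here. The steps below
# walk through the usual pseudo-distributed workflow (format HDFS, start the
# daemons, run a job, inspect the output, stop the daemons) and are kept for
# reference; as written, this job never reaches them.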

# Format a new distributed-filesystem:
bin/hadoop namenode -format

# Start the hadoop daemons:
bin/start-all.sh

# The hadoop daemon log output is written to the ${HADOOP_LOG_DIR} directory (defaults to ${HADOOP_HOME}/logs).
#
# Browse the web interface for the NameNode and the JobTracker; by default they are available at:
#
#    NameNode - http://localhost:50070/
#    JobTracker - http://localhost:50030/
#

# Copy the input files into the distributed filesystem:
bin/hadoop fs -put conf input

# Run some of the examples provided:
bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'

# Examine the output files:

# Copy the output files from the distributed filesystem to the local filesystem and examine them:
bin/hadoop fs -get output output
cat output/*

# or

# View the output files on the distributed filesystem:
bin/hadoop fs -cat output/*

sleep 520

# When you're done, stop the daemons with:
bin/stop-all.sh
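
Once the script is saved as hadoop.qsub next to your hadoop-current directory, submission follows the usual SGE workflow on Cheaha. A minimal sketch, assuming you have substituted your own email address for the -M value:

# Submit the single-process Hadoop job and watch it in the queue.
qsub hadoop.qsub
qstat -u $USER
# When the job completes, the grep example's results appear in the job's
# standard output file (hadoop-single.o<jobid>) in the submit directory.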