
The MapReduce model is documented in this paper from Google labs.

This page does not describe how to use MapReduce or Hadoop. It focuses on two sentences toward the end of the MapReduce paper, specifically the statement:

The MapReduce implementation relies on an in-house cluster management system that is responsible for distributing and running user tasks on a large collection of shared machines. Though not the focus of this paper, the cluster management system is similar in spirit to other systems such as Condor [16].

Using the SGE cluster management solution on Cheaha to run Hadoop is the focus of this document.

Requirements

Running Hadoop on Cheaha has two dependencies.

2. You must reserve a mini-cluster in which to run your Hadoop cluster, as described in the mini-cluster example.

Additionally, you may want to set up a VNC session to interact with your Hadoop cluster via a web browser, especially when first exploring Hadoop.

Background

Before using Hadoop to solve problems, you need to become familiar with the MapReduce model for solving problems. This model is common in functional programming (e.g., Lisp), so an understanding of that domain helps as well.

A simplified way to describe MapReduce is to think in terms of lists of data (e.g., arrays). The map function iterates over the list (e.g., recursively) and identifies elements of interest. The reduce step takes the identified elements and processes them further. For example, if you are working on lists of numbers, it could add the terms selected by the map operation and produce a final sum. Or, more simply, the reduce operation could just print the terms selected by the map operation. (Note: DBAs may recognize this as the pattern "SELECT function(x) WHERE ... GROUP BY x")
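As a concrete sketch of this description, here is the select-then-reduce pattern in plain JavaScript (the data and the "elements of interest" predicate are just illustrative choices, not part of any framework):

```javascript
// Select the elements of interest (here, the even numbers), then
// reduce the selection to a single sum -- roughly the database
// pattern "SELECT sum(x) WHERE x is even".
var numbers = [1, 2, 3, 4, 5, 6];

var selected = numbers.filter(function (x) { return x % 2 === 0; });
var sum = selected.reduce(function (a, b) { return a + b; }, 0);

console.log(selected); // [2, 4, 6]
console.log(sum);      // 12
```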

More MapReduce examples are described in this paper from Google labs.

There are much better ways to learn the MapReduce model than the Hadoop framework. Working with Hadoop means you will write code in Java and deal with a framework designed to handle a wide range of problems at scale. This means there is some complexity to understand before you can use the framework effectively.

A simple example provided by Dr. Jonas Almeida is to start with a simple list of numbers, [1,2,3,4,5], and create a map operation, say "times 2", that maps each element (multiplies it by 2) into a new result list. The reduce operation then adds the mapped numbers together, producing the result 30. You can try out this simple example in any browser, or, if your JavaScript is rusty, you can go to the CoffeeScript site, open the "Try CoffeeScript" tab, and enter the following example:

x=[1,2,3,4,5]
z=x.map((x)->2*x).reduce((a,b)->a+b)


It will translate this into JavaScript for you, and pressing the "Run" button produces the answer. If you liked that, try the same thing directly in JavaScript in your browser's console:

[1,2,3,4,5].map(function(x){return 2*x}).reduce(function(a,b){return a+b})


The MapReduce pattern makes this basic construct much more generic and powerful by having the map function emit a key with each value, so that each unique key can be associated with its own reduce function. Complex workflows can be distributed using this pattern, as elegantly illustrated in mapreduce-js.googlecode.com
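The keyed form can be sketched in plain JavaScript as a word count (the `pairs`/`groups`/`counts` names are ours, purely for illustration; Hadoop distributes exactly this shape of computation across machines):

```javascript
// Keyed MapReduce sketch: map emits [key, value] pairs, the pairs
// are grouped by key, and each key's values are reduced independently.
var words = ["the", "cat", "sat", "on", "the", "mat", "the"];

// Map: emit a [word, 1] pair for each word.
var pairs = words.map(function (w) { return [w, 1]; });

// Shuffle: group the emitted values by key.
var groups = {};
pairs.forEach(function (p) {
  var key = p[0], value = p[1];
  if (!groups[key]) groups[key] = [];
  groups[key].push(value);
});

// Reduce: sum each key's values to produce the per-word count.
var counts = {};
Object.keys(groups).forEach(function (key) {
  counts[key] = groups[key].reduce(function (a, b) { return a + b; }, 0);
});

console.log(counts); // { the: 3, cat: 1, sat: 1, on: 1, mat: 1 }
```

In a real framework the groups live on different machines, so each key's reduce can run in parallel with no coordination beyond the shuffle.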

This should give you a sense of the logic behind MapReduce. A large class of problems can be broken down into Map and Reduce operations, which frameworks like Hadoop can then distribute to large pools of computers; hence the popularity of Hadoop. The MapReduce model makes it much easier to describe a solution that can then be distributed automatically.

The first step to getting Hadoop running on Cheaha for testing is to set up a job script for single-node mode.

#!/bin/bash
#
#$ -S /bin/bash
#$ -cwd
#
#$ -N hadoop-single
#$ -pe smp 1
#$ -l h_rt=00:10:00,s_rt=0:08:00,vf=1G
#$ -j y
#
#$ -M jpr@uab.edu
#$ -m eas
#
# Load the appropriate module files
#$ -V
#env | sort
#cat $TMPDIR/machines
#
# reset environment
#
rm -rf input/ output/
cd ..

#
# sets up the hadoop nodes based on the assigned host slots
# - one node is selected as the the master for namenode and jobtracker
# - the rest of the nodes serve as slave nodes for task and data nodes
#

mkdir input
cp conf/*.xml input
# Run the standalone grep example so there is output to examine
# (jar name follows the Hadoop 1.x naming convention)
bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
cat output/*

exit

# Format a new distributed filesystem:

bin/hadoop namenode -format

# Start the hadoop daemons:

bin/start-all.sh

# The hadoop daemon log output is written to the ${HADOOP_LOG_DIR} directory (defaults to ${HADOOP_HOME}/logs).
#
# Browse the web interface for the NameNode and the JobTracker; by default they are available at:
#
#    NameNode - http://localhost:50070/
#    JobTracker - http://localhost:50030/
#

# Copy the input files into the distributed filesystem:

bin/hadoop fs -put conf input

# Run some of the examples provided:

bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'

# Examine the output files:

# Copy the output files from the distributed filesystem to the local filesystem and examine them:
bin/hadoop fs -get output output
cat output/*

# or

# View the output files on the distributed filesystem:

bin/hadoop fs -cat output/*

# When you are done, stop the daemons:

bin/stop-all.sh