(Difference between revisions)

The MapReduce model is documented in this paper from Google labs.

## Requirements

Running Hadoop on Cheaha has two dependencies.

2. you must reserve a mini-cluster in which to run your hadoop cluster as described in the mini-cluster example

Additionally, you may want to set up a VNC session to interact with your hadoop cluster via a web browser, especially when first exploring hadoop.

## Background

Before approaching Hadoop to do solve problems, you need become familiar with the Map Reduce model for solving problems. This is a model common to functional programming (Lisp) so an understanding of that domain helps has well.

A simplified description of map reduce is to think in terms of lists of data (eg. arrays). The map functions iterate over the list (eg. recurse) and identify elements of interest. The reduce step works on the identified elements and further processes them. For example, it could add the terms selected by the map operation and produce a final sum, if you are work on lists of numbers. Or, more simply, the reduce operation could just print the terms selected by the map operation. (Note: DBA's may recognize this as the pattern "SELECT function(x) WHERE ... GROUP BY x")

There are much better ways to learn the MapReduce model than the Hadoop framework. Working with Hadoop means you will write code in Java and have to deal with framework designed to handle a wide range of problems at scale. This means there's some complexity to understanding the framework before you can use it effectively.

A simple example provided by Dr. Jonas Almeida is to start with a simple list of numbers [1,2,3,4,5] and create a map operation, say "times 2", that selects elements and maps the (multiplies them by 2) into a new result list. The reduce operation then adds the mapped numbers together producing the result 30. You can try out this simple example in any browser, or if your JavaScript is rusty, you can go to the CoffeScript->Try CoffeScript tab and enter the following example

x=[1,2,3,4,5]
z=x.map((x)->2*x).reduce((a,b)->a+b)
disp(z)


It will translate this into JavaScript for you and when you press the "Run" button produce an answer.

This should give you a sense of the logic behind MapReduce. A large class of problems can be broken down into Map and Reduce operations which frameworks like Hadoop can then easily distribute to large pools of computers, hence the popularity of Hadoop. The Map-Reduce model makes it much easier to describe a solution that can then be easily distributed.