Hadoop

Attention: Research Computing Documentation has Moved
https://docs.rc.uab.edu/

Please use the new documentation url https://docs.rc.uab.edu/ for all Research Computing documentation needs.

As a result of this move, we have deprecated use of this wiki for documentation. We are providing read-only access to the content to facilitate migration of bookmarks and to serve as an historical record. All content updates should be made at the new documentation site. The original wiki will not receive further updates.

Thank you,

The Research Computing Team

Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license. "Hadoop is a framework for running applications on large clusters of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named map/reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both map/reduce and the distributed file system are designed so that node failures are automatically handled by the framework." Hadoop Overview</ref> It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Googles's MapReduce and Google File System (GFS) papers.

The MapReduce model is documented in this paper from Google labs.

Requirements

Running Hadoop on Cheaha has two dependencies.

you must download and unpack hadoop into your home directory
you must reserve a mini-cluster in which to run your hadoop cluster as described in the mini-cluster example

Additionally, you may want to set up a VNC session to interact with your hadoop cluster via a web browser, especially when first exploring hadoop.

Background

Before approaching Hadoop to do solve problems, you need become familiar with the Map Reduce model for solving problems. This is a model common to functional programming (Lisp) so an understanding of that domain helps has well.

A simplified description of map reduce is to think in terms of lists of data (eg. arrays). The map functions iterate over the list (eg. recurse) and identify elements of interest. The reduce step works on the identified elements and further processes them. For example, it could add the terms selected by the map operation and produce a final sum, if you are work on lists of numbers. Or, more simply, the reduce operation could just print the terms selected by the map operation. (Note: DBA's may recognize this as the pattern "SELECT function(x) WHERE ... GROUP BY x")

There are much better ways to learn the MapReduce model than the Hadoop framework. Working with Hadoop means you will write code in Java and have to deal with framework designed to handle a wide range of problems at scale. This means there's some complexity to understanding the framework before you can use it effectively.

A simple example provided by Dr. Jonas Almeida is to start with a simple list of numbers [1,2,3,4,5] and create a map operation, say "times 2", that selects elements and maps the (multiplies them by 2) into a new result list. The reduce operation then adds the mapped numbers together producing the result 30. You can try out this simple example in any browser, or if your JavaScript is rusty, you can go to the CoffeScript->Try CoffeScript tab and enter the following example

x=[1,2,3,4,5]
disp=(x)->alert(x)
z=x.map((x)->2*x).reduce((a,b)->a+b)
disp(z)

It will translate this into JavaScript for you and when you press the "Run" button produce an answer.

This should give you a sense of the logic behind MapReduce. A large class of problems can be broken down into Map and Reduce operations which frameworks like Hadoop can then easily distribute to large pools of computers, hence the popularity of Hadoop. The Map-Reduce model makes it much easier to describe a solution that can then be easily distributed.

Hadoop

Requirements

Background

Navigation menu

Search