Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license. "Hadoop is a framework for running applications on large clusters of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named map/reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both map/reduce and the distributed file system are designed so that node failures are automatically handled by the framework." Hadoop Overview</ref> It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Googles's MapReduce and Google File System (GFS) papers.
The MapReduce model is documented in this paper from Google labs.
Running Hadoop on Cheaha has two dependencies.
- you must download and unpack hadoop into your home directory
- you must reserve a mini-cluster in which to run your hadoop cluster as described in the mini-cluster example
Additionally, you may want to set up a VNC session to interact with your hadoop cluster via a web browser, especially when first exploring hadoop.
Before approaching Hadoop to do solve problems, you need become familiar with the Map Reduce model for solving problems. This is a model common to functional programming (Lisp) so an understanding of that domain helps has well.
A simplified description of map reduce is to think in terms of lists of data (eg. arrays). The map functions iterate over the list (eg. recurse) and identify elements of interest. The reduce step works on the identified elements and further processes them. For example, it could add the terms selected by the map operation and produce a final sum, if you are work on lists of numbers. Or, more simply, the reduce operation could just print the terms selected by the map operation. (Note: DBA's may recognize this as the pattern "SELECT function(x) WHERE ... GROUP BY x")
There are much better ways to learn the MapReduce model than the Hadoop framework. Working with Hadoop means you will write code in Java and have to deal with framework designed to handle a wide range of problems at scale. This means there's some complexity to understanding the framework before you can use it effectively.
x=[1,2,3,4,5] disp=(x)->alert(x) z=x.map((x)->2*x).reduce((a,b)->a+b) disp(z)
This should give you a sense of the logic behind MapReduce. A large class of problems can be broken down into Map and Reduce operations which frameworks like Hadoop can then easily distribute to large pools of computers, hence the popularity of Hadoop. The Map-Reduce model makes it much easier to describe a solution that can then be easily distributed.