NoSQL Next Up: Hadoop and Cloudera
Jack Dennis is a true computer pioneer, and he was already famous at MIT by the time I got there. He was famous for Multics (which we endured) and for his stewardship of the Tech Model Railroad Club (which we smiled curiously at). He was not famous for the eponymous "Dennis Machine" -- a terrific model for parallel processing -- because back then, with computer time so expensive that a 5:00 AM session counted as "good time," nobody thought we would ever be able to run anything on bunches of computers.
We can run on bunches of computers today, and the means of doing so has become so routine that I'm not even going to devote a full blog post to it. For the details, and a great way to get started with NoSQL processing through Hadoop, take a look at Phil Whelan's terrific blog entry: Map-Reduce With Ruby Using Hadoop. An investment of an hour or two with Phil's article will get you familiar with MapReduce and the Google-y way of solving problems with piles of computers.
MapReduce itself takes some getting used to. The basic idea is to take a single function, "map" it out to process lots of data in parallel on separate servers, and then "reduce" the results so that a summary of the map function gets returned to the user. This is the approach Google has used for search from the beginning: Google can take my query ("Accenture new CEO"), Map it over hundreds or thousands of servers, each of which performs the search over its own little corner of the Internet, and then Reduce the results with a PageRank summation of the pages returned from each mapped search. Google sorts the results, and the front page of my search results shows me the best ones.
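If the pattern is new to you, a toy illustration helps. Here is the same two-step shape in plain Ruby -- no Hadoop, and the data is made up -- where we "map" a counting function independently over chunks of words and then "reduce" the partial counts into one answer:

```ruby
# Toy map/reduce in plain Ruby: pretend each chunk of words lives on its own server.
chunks = [["the", "cat"], ["the", "hat"], ["cat", "nap"]]

partials = chunks.map { |words| words.count("cat") }  # "map": count within each chunk, independently
total    = partials.reduce(0) { |sum, n| sum + n }    # "reduce": combine the partial counts

puts total  # => 2
```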
Joel Spolsky did a nice writeup of the thinking behind MapReduce back in 2006 in his posting Can Your Programming Language Do This?. In Whelan's example, we'll use a Cloudera script called "whirr" to fire up a cluster of AWS servers running Hadoop, and we'll use that cluster to run a MapReduce job to:
...write a map-reduce job that scans all the files in a given directory, takes the words found in those files and then counts the number of times words begin with any two characters.
That's simple enough, and just the kind of innately parallelizable task that Hadoop is perfect for. Whelan's article has another nice tidbit in it -- the use of a dynamic language to define the "map" and "reduce" tasks. The idea here is simple -- let's see how much code it takes to map the task out and reduce the results of our word count:
Source: Map-Reduce With Ruby Using Hadoop
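The code itself appears as a listing in Whelan's post; a minimal sketch of the mapper's logic in Ruby, reconstructed from the description below (the file layout and variable names are my own, and I'm assuming one word per input line, as the description suggests), looks something like this:

```ruby
#!/usr/bin/env ruby
# Streaming mapper (sketch): Hadoop feeds input lines on STDIN, one word per line.
STDIN.each_line do |line|
  word = line.chomp           # remove the newline
  next if word.length < 2     # ignore shorties
  key = word[0, 2]            # snip the two-character key
  puts "#{key}\t1"            # write "key <tab> 1" to stdout
end
```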
So, Map is simple: for each line, just remove the newline and ignore shorties, then snip the two-character key, give it a value of "1", and write it to stdout. It's a simple idea, and the nice thing about dynamic languages is that they make the code for such a simple task look simple as well. Let's take a look at the reduce function now:
Source: Map-Reduce With Ruby Using Hadoop
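Again, the listing is in Whelan's post; a sketch of the reducer's logic (same caveats -- names and details are mine, not his) might look like:

```ruby
#!/usr/bin/env ruby
# Streaming reducer (sketch): Hadoop delivers the mapper output sorted by key,
# so all the lines for a given key arrive together.
prev_key = nil
total    = 0

STDIN.each_line do |line|
  key, value = line.chomp.split("\t")
  if key != prev_key
    puts "#{prev_key}\t#{total}" if prev_key  # emit the finished key
    prev_key = key
    total    = 0                              # reset the total for the new key
  end
  total += value.to_i                         # otherwise, sum up the values
end

puts "#{prev_key}\t#{total}" if prev_key      # don't forget the last key
```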
Simple as well: for each line, reset the total every time we get a new key; otherwise, sum up the values for that key. Again, the code here very transparently accomplishes the reduction.
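For completeness: a job like this gets handed to the cluster through Hadoop Streaming, which pipes HDFS data through the two Ruby scripts. The exact invocation is in Whelan's post; the general shape (the jar path, directory names, and script names here are assumptions, not his) is:

```
$ hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-*.jar \
    -input input -output output \
    -mapper map.rb -reducer reduce.rb \
    -file map.rb -file reduce.rb
```

With the job done, the results land in HDFS, and we can pull them back to see what we got (from my terminal, with the good parts highlighted):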
```
wireless:whirr-0.1.0+23 johnrepko$ hadoop fs -cat output/part-00000 | head
11/01/05 16:11:06 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
aa 13
ab 666
ac 1491
ad 867
ae 337
af 380
ag 507
ah 46
ai 169
aj 14
```
...and there are our results: only 13 aardvarks and aardwolves, but plenty (1491) of words that start with "ac". It's a nice, clean result, and absolutely worth an hour or two to start to get a sense of the power of Hadoop, the beautiful cleanness of the Cloudera implementation, and the scale of the problems you can solve with such a massively parallel approach. To wrap up, make sure you clean up your Hadoop sessions in AWS with:
$ ec2-describe-instances
$ ec2-terminate-instances <instance-ids>
(feed the instance IDs from the first command into the second); otherwise you can easily pile up AWS charges.
It wasn't that long ago that the "Dennis Machine" was just a theoretical construct, and parallel processing was fine for graduate work but was never going to solve real problems. Google brought "massively parallel" and MapReduce to the masses, and there are lots of business problems we can now solve easily once we're comfortable with the tools.
Jack Dennis laid the tracks... Phil Whelan's terrific blog entry, Map-Reduce With Ruby Using Hadoop, shows you how... let's get this train rolling!