At CompassLabs, my colleague Vivek and I are using Hadoop and Amazon’s Elastic MapReduce to process social network data. I can’t talk about what we are doing except to say that it is cool.
I blogged last week about taking the time to create a one-page diagram showing all map-reduce steps and data flow (with examples showing data snippets): this really helps manage complexity. I have a few other techniques that I have found useful enough to share:
Take the time to set up a good development environment. Almost all of my map-reduce applications are written in either Ruby or Java (with a few experiments in Clojure and Python). I like to create Makefiles to quickly run multiple map-reduce jobs in a workflow on my laptop. For small development data sets, after editing source code, I can run a workflow and be looking at output in about 10 seconds for Ruby, and a little longer for Java apps. Complex workflows are difficult to write and debug, so get comfortable with your development environment. My Makefiles build local JAR files (if I am using Java), copy map-reduce code and test data to my local Hadoop installation, remove the output directories, run the jobs in sequence, and optionally open the outputs for each job step in a text editor.
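To make the "remove the output directories, run the jobs in sequence" part concrete, here is a minimal sketch of the same idea as a small Python driver rather than a Makefile. It is not my actual Makefile: the streaming JAR path, the step list, and the script names (map1.rb and so on) are all hypothetical placeholders, and it assumes the jobs are Hadoop streaming jobs and that the hadoop command is on your path.

```python
import subprocess

# Hypothetical path to the Hadoop streaming JAR; adjust for your installation.
STREAMING_JAR = "hadoop-streaming.jar"

# Hypothetical two-step workflow: the second job reads the first job's output.
STEPS = [
    {"input": "test_data", "output": "output1",
     "mapper": "map1.rb", "reducer": "reduce1.rb"},
    {"input": "output1", "output": "output2",
     "mapper": "map2.rb", "reducer": "reduce2.rb"},
]


def cleanup_command(step):
    """Command that removes a step's old output directory.

    Hadoop refuses to start a job whose output directory already exists.
    """
    return ["hadoop", "fs", "-rm", "-r", step["output"]]


def job_command(step):
    """Command that runs one streaming map-reduce job."""
    return [
        "hadoop", "jar", STREAMING_JAR,
        "-input", step["input"],
        "-output", step["output"],
        "-mapper", step["mapper"],
        "-reducer", step["reducer"],
    ]


def run_workflow(steps, run=subprocess.call):
    """Run every step in order, stopping at the first job that fails.

    `run` takes an argument list and returns an exit status; it defaults to
    subprocess.call and can be swapped out to try the logic without Hadoop.
    """
    for step in steps:
        # The cleanup status is ignored: on a first run there is nothing to remove.
        run(cleanup_command(step))
        status = run(job_command(step))
        if status != 0:
            raise RuntimeError("job writing %s failed with status %d"
                               % (step["output"], status))
```

Calling run_workflow(STEPS) would run both jobs against the local Hadoop installation. Stopping at the first failure is deliberate: a later step that reads a broken intermediate output only makes debugging harder.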
Take advantage of Amazon’s Elastic MapReduce. I have only limited experience setting up and using custom multi-server clusters because, for my own needs and so far for my work for two customers, Elastic MapReduce has provided good value and saved a lot of setup and administration time. I think that you really need to reach a fairly large scale of operations before it makes sense to maintain your own large Hadoop cluster.