Sunday, September 05, 2010

I've improved my Hadoop map reduce development process

I had to design a fairly complicated workflow over the last several days, and I hit upon a development approach that worked really well for getting things written and debugged on my laptop:

I started by hand-crafting small input data sets for all input sources. I then drew a quick and dirty diagram in OmniGraffle (any other diagramming tool would do) showing how I thought my multiple map reduce jobs would fit together. I marked up the diagram with job names and the input and output directories for each job, along with sample data. Each time a job produced new output, I added a sample of it to the diagram. Because the workflow was complicated, it was tricky to keep everything on one page for reference, but the overview diagram made it much easier to keep track of what each map reduce job in the workflow needed to do, and to hand-check each job.
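As a rough illustration of hand-checking a job against a tiny data set, here is a minimal Python sketch that runs one map reduce job entirely in memory. It is not Hadoop code: the sample visit records, the function names, and the page-count job itself are all hypothetical, chosen only to show how an output worked out by hand can be compared with what the job produces.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Run one map reduce job in memory: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):  # sort keys so the output order is predictable
        output.extend(reducer(key, groups[key]))
    return output

# Hypothetical hand-crafted input: one "user<TAB>page" line per visit.
SAMPLE_VISITS = [
    "alice\t/home",
    "bob\t/home",
    "alice\t/about",
]

def visit_mapper(line):
    """Emit (page, 1) for every visit line."""
    user, page = line.split("\t")
    yield page, 1

def count_reducer(page, counts):
    """Sum the visit counts for one page."""
    yield page, sum(counts)

# The expected output is small enough to work out by hand from the sample lines.
EXPECTED = [("/about", 1), ("/home", 2)]
result = run_job(SAMPLE_VISITS, visit_mapper, count_reducer)
assert result == EXPECTED, result
```

With inputs this small, the expected output for each job fits comfortably next to that job on the diagram.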

As I refactored my workflow by adding or deleting jobs and changing code, I took a few minutes to keep the diagram up to date, which was well worth it. Another technique that I find convenient is to rely on good old-fashioned makefiles, both to run multiple jobs together on my laptop with a local Hadoop setup and to organize the Elastic MapReduce command lines that run on AWS.
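The idea behind using make here is that each job depends on the output of earlier jobs, and the tool works out a valid run order. The sketch below shows that same idea in Python rather than in a makefile, using the standard library's graphlib module (Python 3.9 or later). The job names and their dependencies are hypothetical and are not taken from my actual workflow.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs whose output it reads.
JOB_DEPENDENCIES = {
    "clean_logs": set(),
    "clean_users": set(),
    "join_logs_users": {"clean_logs", "clean_users"},
    "summarize": {"join_logs_users"},
}

def run_order(dependencies):
    """Return the jobs in an order where every job follows its prerequisites."""
    return list(TopologicalSorter(dependencies).static_order())

order = run_order(JOB_DEPENDENCIES)
for position, job in enumerate(order, start=1):
    # A real runner would launch the job here; this sketch only prints the plan.
    print(f"{position}. {job}")
```

The two cleaning jobs may come out in either order, since neither depends on the other, but the join always follows both of them and the summary always runs last.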

I have been experimenting with higher-level tools like Cascading and Cascalog that help manage workflows, but I decided to just write my own data source joins, etc., and organize everything as a set of individual map reduce jobs that run in a specific order.
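For readers who have not written a join by hand, one common approach is a reduce-side join: the map phase tags each record with its source, and the reduce phase pairs up records that share a key. The Python sketch below simulates that in memory. It is only an assumption about what a hand-written join might look like, not the code from my workflow, and the users and orders data sets are made up.

```python
from collections import defaultdict

def tag_mapper(source, line):
    """Map phase: return (join key, (source tag, rest of the record))."""
    key, rest = line.split("\t", 1)
    return key, (source, rest)

def join_reducer(key, tagged_values):
    """Reduce phase: pair every users value with every orders value."""
    left = [value for tag, value in tagged_values if tag == "users"]
    right = [value for tag, value in tagged_values if tag == "orders"]
    return [(key, l, r) for l in left for r in right]

def reduce_side_join(users, orders):
    """Simulate the shuffle by grouping tagged records by key, then reduce."""
    groups = defaultdict(list)
    for source, lines in (("users", users), ("orders", orders)):
        for line in lines:
            key, tagged = tag_mapper(source, line)
            groups[key].append(tagged)
    joined = []
    for key in sorted(groups):
        joined.extend(join_reducer(key, groups[key]))
    return joined

# Hypothetical hand-crafted inputs: "id<TAB>value" lines from two sources.
USERS = ["u1\talice", "u2\tbob"]
ORDERS = ["u1\tbook", "u1\tpen", "u3\tlamp"]

joined = reduce_side_join(USERS, ORDERS)
assert joined == [("u1", "alice", "book"), ("u1", "alice", "pen")], joined
```

This behaves like an inner join: u2 has no orders and u3 has no user record, so neither appears in the output.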
