Sunday, July 04, 2010

Reading two good books on using MapReduce algorithms for large-scale text processing

I have a fair amount of experience with Hadoop, but little experience with associated tools like Pig and Mahout. I can spend more time with Pig in my local sandbox, but I wanted more formal help getting up to speed with Mahout and general MapReduce application programming. I purchased the MEAP for Mahout In Action and have been reading new chapters as they become available. The authors (especially Robin Anil) have been very helpful on the online forum for the book, and I have found the material to be useful and interesting.

Another book I bought was delivered yesterday morning: Data-Intensive Text Processing with MapReduce. I have only read the first few chapters, but so far the book has been very interesting and informative.

I have done Hadoop-based work for about half of the customers I have had in the last year and a half, and I believe that knowing how to horizontally scale out machine learning and text analytics applications has become a must-have skill.


Alex Ott said...

The last beta of Data-Intensive Text Processing with MapReduce is also available online at

Mark Watson, author and consultant said...

Thanks Alex! Good link.

Alex Ott said...

Mark, have you heard about Cascalog? It is built with Clojure on top of Cascading and makes it easy to write queries against data in Hadoop.

P.S. I'm currently experimenting with the same technologies: Mahout, etc. ;-)

Mark Watson, author and consultant said...

I have looked at Cascalog, but not actually tried it.