Sunday, September 26, 2010

Big productivity gain: not having an Internet connection in the middle of the Pacific Ocean

Carol and I are on a long cruise, and because of the high cost of Internet connectivity, I am only getting on the web for about 5 minutes every other day.

I have been spending about 2 hours each day working on the Lisp edition of my Semantic Web book, and I must say that my productivity seems a lot better when I am not distracted with an Internet connection.

So far, we have been very good about not overeating on this trip - enjoying the food but eating small portions. Except for some complimentary Champagne the first night, we have avoided alcohol, which makes it easier not to overeat!

We will be onboard for 25 days, so we don't feel pressured to take part in every activity that might be fun. So far, we have been enjoying a series of onboard lectures, the movie theater, and lots of walking on deck.

Tuesday, September 21, 2010

I am going to be traveling for 4 weeks: temporarily turning off blog comments

Carol and I are leaving on a long trip. Unfortunately, I get spam comments on my blog; they are easy enough to remove, but I will be off the Internet for long periods of time, so I am turning comments off for now. I'll turn them back on when I get home.

I have my laptop set up to work on the Common Lisp edition of my Semantic Web book, so that will probably be available in final form in about 6 weeks.

Wednesday, September 15, 2010

Rich client web apps: playing with SproutCore, jQuery, and HTML5

In the last 14 years I have worked on two very different types of tasks: AI and text mining, and (mostly server side) web applications. Putting aside the AI stuff (not the topic for today), I know that I need to make a transition to developing rich client applications. This is not such an easy transition for me because I feel much more comfortable with server side development using Java, Ruby on Rails, Sinatra, Merb, etc. On the client side, I just use simple Javascript for AJAX support, plus HTML and CSS.

As background learning activities I have been working through Bear Bibeault and Yehuda Katz's jQuery in Action and Mark Pilgrim's HTML5 book. Good learning material.

When I read that Yehuda Katz is leaving Engine Yard to work on the SproutCore framework, I took another good look at SproutCore last night and worked through parts of the tutorial with both Ruby + Sinatra and Clojure + Compojure server backends. I find Javascript development to be awkward, but OK. I need to spend some time getting set up with IntelliJ on both jQuery and SproutCore learning projects. If anyone has any development environment suggestions, I am listening.

Tuesday, September 14, 2010

MongoDB "good enough practices"

I have been using MongoDB for about a year for customer jobs and my own work and I have a few practices that are worth sharing:

I use two levels of backup and vary the details according to how important or replaceable the data is. The first level is rolling backups to S3, performed periodically. This is easy enough to do using cron, putting something like this in crontab:
5 16 * * 2 (cd /mnt/temp; rm -f -r *.dump*; /usr/local/mongodb/bin/mongodump -o myproject_tuesday.dump > /mnt/temp/mongodump.log; /usr/bin/zip -9 -r myproject_tuesday.dump.zip myproject_tuesday.dump > /mnt/temp/zip.log; /usr/bin/s3cmd put myproject_tuesday.dump.zip s3://mymongodbbackups)
The other level of backup is to always run at least one master and one read-only slave. By design, MongoDB's preferred method for robustness is replicating mongod processes on multiple physical servers. Choose a master/slave or replica set installation, but don't run just a single mongod.

I often need to do a lot of read operations for analytics or simply serving up processed data. Always read from a read-only slave unless your application cannot tolerate the small consistency hit (master writes take a very short amount of time to replicate to slaves). For applications that need to read and write, either keep two connections open or use a MongoDB ORM like Mongoid that supports multiple read and write mongods.
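The two-connection pattern can be sketched in a driver-agnostic way. This is a minimal Ruby sketch, assuming the master and slave connection objects are created elsewhere and injected; the class and method names here are my own illustration, not any particular driver's API:

```ruby
# Routes writes to the master connection and reads to the slave.
# The connection objects are treated as opaque, so this sketch
# does not depend on a specific MongoDB driver.
class ReadWriteRouter
  def initialize(master, slave)
    @master = master
    @slave  = slave
  end

  # All writes must go to the master.
  def write_db
    @master
  end

  # Reads default to the slave; pass true when the small
  # replication lag is not acceptable for a given query.
  def read_db(require_fresh = false)
    require_fresh ? @master : @slave
  end
end
```

An ORM like Mongoid hides this kind of routing behind its configuration, but keeping the rule explicit in one small class makes it obvious which queries pay the consistency cost.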

Another thing I try to do is to place applications that need to perform high volume reads on the same server that runs a MongoDB slave; this eliminates network bandwidth issues for high volume "mostly read" applications.

Saturday, September 11, 2010

Very interesting technology behind Google's new Instant Search

Anyone using Google search who is paying attention has noticed the very different end-user experience. Showing search results while queries are being typed now requires that Google generate at least 5 times the number of results pages, use new Javascript support for fast rendering of instant search results, and, most interesting to me, take a new approach to their backend processing:

It has been about 7 years since I read the original papers on Google's Bigtable and MapReduce, so it is not at all surprising to me that Google has re-worked their web indexing and search. The new approach using Caffeine forgoes the old approach of batch map reduce processing and instead maintains a large database, which I think is based on Bigtable, that is updated continuously and incrementally.

I am sure that Google will release technical papers on Caffeine - I can't wait!

Using Hadoop for analyzing social network data

At CompassLabs my colleague Vivek and I are using Hadoop and Amazon's Elastic MapReduce to process social network data. I can't talk about what we are doing except to say that it is cool.

I blogged last week about taking the time to create a one-page diagram showing all map-reduce steps and data flow (with examples showing data snippets): this really helps manage complexity. I have a few other techniques that I have found useful enough to share:

Take the time to set up a good development environment. Almost all of my map-reduce applications are written in either Ruby or Java (with a few experiments in Clojure and Python). I like to create Makefiles to quickly run multiple map-reduce jobs in a workflow on my laptop. For small development data sets, after editing source code I can run a workflow and be looking at output in about 10 seconds for Ruby, a little longer for Java apps. Complex workflows are difficult to write and debug, so get comfortable with your development environment. My Makefiles build local JAR files (if I am using Java), copy map-reduce code and test data to my local Hadoop installation, remove the output directories, run the jobs in sequence, and optionally open the outputs of each job step in a text editor.
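As a minimal example of the kind of Ruby map-reduce code such a workflow runs, here is a sketch in the Hadoop Streaming style, where the mapper emits tab-separated key/value lines and the reducer receives the mapper's output sorted by key. The word-count task and the function names are my own illustration, not code from an actual job:

```ruby
#!/usr/bin/env ruby

# Mapper: emit "word\t1" for each word on an input line.
def map_line(line)
  line.downcase.scan(/[a-z']+/).map { |word| "#{word}\t1" }
end

# Reducer: sum the counts per word. Hadoop Streaming sorts mapper
# output by key, so all lines for one word arrive together.
def reduce_lines(lines)
  counts = Hash.new(0)
  lines.each do |line|
    word, n = line.chomp.split("\t")
    counts[word] += n.to_i
  end
  counts.map { |word, n| "#{word}\t#{n}" }
end

if __FILE__ == $0
  case ARGV[0]
  when "map"    then STDIN.each_line { |line| puts map_line(line) }
  when "reduce" then puts reduce_lines(STDIN.readlines)
  end
end
```

The nice property of this style is that each step is just a filter over standard input and output, so a Makefile (or a plain shell pipeline with sort between the two steps) can run the whole workflow locally on small hand-made data sets before anything touches a cluster.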

Take advantage of Amazon's Elastic MapReduce. I have only limited experience setting up and using custom multi-server clusters, because for my own needs, and so far for work for two customers, Elastic MapReduce has provided good value and saved a lot of setup and administration time. I think that you need to reach a fairly large scale of operations before it makes sense to maintain your own large Hadoop cluster.

Thursday, September 09, 2010

Why doesn't iTunes support Ogg sound files 'out of the box'?

You know why: Apple does not mind inconveniencing users in order to keep their little walled garden the way they want it. I have been a long-time Apple supporter (I wrote the chess game they gave away with the early Apple IIs, and wrote a commercial Mac app in 1984) but sometimes they do aggravate me.

Two new books today

I just got my delivery from Amazon: "Linear Algebra" (George Shilov) and "Metaprogramming Ruby" (Paolo Perrotta).

I have a degree in Physics but I find my linear algebra to be a little rusty so I bought Shilov's book to brush up. I bought Perrotta's book because while reading over some of the Rails 3 codebase, too often I find bits of code that I don't quite understand, at least without some effort.

Sunday, September 05, 2010

I've improved my Hadoop map reduce development process

I had to design a fairly complicated work flow in the last several days, and I hit upon a development approach that worked really well for me to get things written and debugged on my laptop:

I started by hand-crafting small input data sets for all input sources. I then created a quick and dirty diagram using OmniGraffle (any other diagramming tool would do) showing how I thought my multiple map reduce jobs would fit together, and marked up the diagram with job names and the input/output directories for each job, including sample data. Each time new output appeared, I added sample output to the diagram. I had a complicated workflow, so it was tricky to keep everything on one page for reference, but the advantage of having this overview diagram was that it made it much easier to keep track of what each map reduce job in the workflow needed to do, and easier to hand-check each job.

As I refactored my workflow by adding or deleting jobs and changing code, I took a few minutes to keep the diagram up to date - well worth it. Another technique that I find convenient is to rely on good old-fashioned Makefiles, both to run multiple jobs together on my laptop with a local Hadoop setup and to organize the Elastic MapReduce command lines to run on AWS.

I have been experimenting with higher level tools like Cascading and Cascalog that help manage work flows, but I decided to just write my own data source joins, etc. and organize everything as a set of individual map reduce jobs that are run in a specific order.

Friday, September 03, 2010

Efficient: just signed up to write an article on Rails 3 after spending weeks spinning up on Rails 3

I was just asked to write an article on my first impressions of Rails 3. This is very convenient because I have been burning a lot of off-work cycles spinning up on Rails 3 (I have done no work using Rails in 5 months because I have been 100% booked doing text/data mining). Architecturally and implementation-wise, Rails 3 rocks: I will have fun writing about it.

Very cool: a tutorial on using the MongoDB sniff tool

No original material here; I just wanted to link to someone else's cool article on using mongosniff to watch all network traffic going into and out of a mongod process. The output format is easy to read and useful.

Very good news that Google will be providing a "Wave in a Box" open source package

Early this year I played around with the open source code on the Wave protocol site, but "play" is the operative word here: I did nothing practical with it.

Although I never used Wave's web UI very much, I did find writing Wave robots interesting and potentially very useful, and I invested a fair amount of time in learning the technology. I was disappointed when Google recently announced that they were phasing out support for Wave, but today's announcement that they are completing the open source project to the point of being a complete system is very good news.

Wednesday, September 01, 2010

I finished reviewing a book proposal tonight for an AI text book

Based on the number of books I have written, it is obvious that I love writing. I also enjoy reviewing book proposals and serving as a tech editor, as long as I am fascinated by the subject matter! The proposal that I just reviewed for Elsevier was very interesting.

I believe that the world (some parts faster than others) is transitioning to a post industrial age where the effective use of information might start to approach the importance of raw labor, physical resources, and capital (and who knows how the world's money systems will transition).

When reading this book proposal, and in general when reading books and material on the web, one litmus test I have for "being interesting" is how forward-thinking the technical material is - that is, how well it will help people both cope with and take advantage of new world economic systems.

GMail Priority Inbox

Finally, I got an invitation and I am trying it. One problem that I have is feeling that I have to read email as it arrives, so I find myself not running an email client when I am really concentrating on work or writing. With the new display, I will only see emails at the top of GMail's page if they are deemed important - because they are from people I always respond to, etc. It is also convenient to be able to switch back and forth between the old style inbox and the priority inbox.

Command line tips for OS X and Linux

I wrote last year about keeping .ssh, .gpg, and other sensitive information on an encrypted disk and creating soft links so that when the disk is mounted, the sensitive information is available.

I have a few command line tricks that save me a lot of time that are worth sharing:
  • Use a pattern like history | grep rsync to quickly find recent commands. Much better than wading through your history.
  • Make aliases for accessing services on specific servers, for example: alias kb2_mongo='mongo xxx.xxx.xxx.xxx:11222'. By using consistent naming for your server aliases and for running specific services like the mongo console, it is easy both to remember your aliases and to use them.
  • Create aliases with consistent naming conventions to ssh to all of your servers. I use different prefixes for my servers and for each of my customers.
  • Create an alias like alias lh='ls -lth | head' to quickly see just the most recently modified files in a directory, most recent first.
  • For your working development system create two letter aliases to get to common working directories (most recent projects, writing, top level code experiment directory, etc.). I try to be consistent and use some of the same aliases on my servers.

Consistent APIs for collections

I have been using Clojure a lot for work this year, and its consistent API for anything that is a seq (lists, vectors, maps, trees, etc.) is probably my favorite language feature. Scala 2.8 collections offer the same uniform API. For me, Clojure and Scala, with a fairly small number of operations to remember across most collections, represent a new paradigm for programming compared to older languages like Java, Scheme, and Common Lisp that force you to remember too many different operation names. The Ruby Enumerable module also provides a nice consistent API over collections. Most Ruby collection classes mix in Enumerable, though the API consistency is not as good as in Scala and Clojure. That said, a class only has to implement each to include Enumerable, and it then gets map, find, select, and many other methods for free; the ability to combine these methods with blocks is very flexible.
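To make the Enumerable point concrete, here is a small Ruby example (my own illustration): the same map/select/inject vocabulary applies to an array, a range, and a hash, even though the backing collections are completely different.

```ruby
array = [1, 2, 3, 4]
range = (1..4)
hash  = { "a" => 1, "b" => 2, "c" => 3, "d" => 4 }

# The same Enumerable methods work regardless of collection type:
doubled_from_array = array.map { |n| n * 2 }           # [2, 4, 6, 8]
doubled_from_range = range.map { |n| n * 2 }           # [2, 4, 6, 8]

# On a hash, the block receives key/value pairs:
evens = hash.select { |_key, value| value.even? }      # {"b"=>2, "d"=>4}
total = hash.inject(0) { |sum, (_key, value)| sum + value }  # 10
```

This is the small, uniform operation set the paragraph above describes: once you know map, select, and inject, you can apply them to any class that mixes in Enumerable.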