Sunday, November 21, 2010

Distributed NoSQL datastores: Cassandra and Cloudant's BigCouch

In my customer work in recent years, almost everything I do uses PostgreSQL (sometimes with PostGIS) and/or MongoDB. (I write a lot about the Semantic Web, but so far no one has paid me to work on a project with an RDF data store like Sesame or AllegroGraph.)

While I think PostgreSQL and MongoDB are great, their replication stories have not been. MongoDB's master/slave and replica pairs work OK, and replica sets (MongoDB 1.6 and above) look to be a big improvement; it only takes a few minutes to work through the MongoDB 1.6.x replica set tutorial example. I have not tried replica sets in a production environment yet, but I am looking forward to it! I find MongoDB to be extremely developer friendly, with convenient client libraries in Clojure and Ruby (I don't like dealing with JSON data and hashes in Java).

PostgreSQL 9 replication is easier to set up and administer than Slony, but I have not had to use it in production. It supports master/slave hot standby, but it does not make PostgreSQL a masterless distributed data store.

Cassandra was designed to be distributed with no specific master server. I am almost done reading "Cassandra: The Definitive Guide." I have been enjoying experimenting with Cassandra a lot recently, both on my laptop and on transient EC2 instances. One negative about Cassandra for me is that I find the Java and Clojure client libraries inconvenient to use compared with their MongoDB counterparts. The Ruby client library is very developer friendly!

CouchDB was the first NoSQL datastore that I used (except for RDF data stores), but I have never found it to be as convenient for my work as MongoDB. That may change thanks to Cloudant's BigCouch, an open-source version of CouchDB with built-in clustering capability. It is extremely easy to set up a test system by following the directions on the BigCouch GitHub page. On both my MacBook and on EC2 instances, it only took about 15 minutes to set up a cluster. If I had to set up a fault-tolerant distributed data store on a cluster of small (or even micro) EC2 instances, BigCouch would be a strong candidate because of its relatively low RSIZE memory footprint compared to Cassandra (for empty systems, about 15MB for CouchDB and 150MB for Cassandra - but expect these memory requirements to increase rapidly with large data stores). BTW, check out the people working at Cloudant: it is interesting that so many physicists work there. Cloudant bases its CouchDB hosting business on BigCouch.

2 comments:

Jonathan Ellis said...

Hi Mark! Thanks for your comments.

The Cassandra Thrift API isn't meant to be used directly. Use a client like Hector (https://github.com/rantav/hector) for Java. I believe some Clojure clients exist, but I don't know how actively maintained they are.

An empty Cassandra system takes about 8-10MB, btw, but it's not designed to aggressively return to that state since in production use it's a non-issue.

Mark Watson, author and consultant said...

Hello Jonathan,

I tried Hector, but I don't like dealing with maps/hashes and JSON in Java. The Clojure libraries are OK, but one of them calls 'column family' 'table', etc. I would bet that a very good Clojure library, comparable to the Ruby client library, will be available soon.

BTW, good luck with Riptano - nice business idea!

-Mark