Monday, February 25, 2013

Building custom data stores

Creating a custom datastore may seem like a bad idea when such great tools like Postgres, MongoDB, CouchDB, etc. are available in their open source goodness as well as good commercial products such as Datomic, AllegroGraph, Stardog, etc. Still, frustration of not having just what I needed for a project (more on requirements later) convinced me to spend some time building my own datastore based on some available open source libraries.

Much of the motivation for my work developing kbsportal.com is to make possible the development of a larger turnkey information appliance. I have been using MongoDB for this, but even with an application specific wrapper MongoDB has been a little awkward for my requirements, which are:

  • I want a reasonably efficient document store that supports the usual CRUD operations on arbitrary Clojure maps (which can be nested to any depth). Clojure maps are basically what I use to contain and use data so I wanted a datastore that supports this, simply.
  • I want all text in documents (embedded at any depth in the document) to be searchable.
  • I need to be able to annotate data stored documents and sometimes relationships between documents.
  • My preferred notation for annotating data is RDF
  • I need to be able to efficiently perform SPARQL queries on the RDF annotations.
  • Coupling between documents and RDF: auto delete of any triples referencing a document ID, if the referenced document is deleted.

Initially I was going to write a wrapper library using two datastores as SaaS products: Cloudant (for CouchDB with Lucene indexing) and Dydra.com (for a RDF datastore, with extras). A small wrapper API would have made this all work but since a lot of what I am doing is in the experimenting phase I decided that I didn't want to use remote web services for coding experiments. Using these services, with a wrapper would be nice for production, but not for hacking.

Anyway, I have built a small project that uses HSQLDB (relational database) and Sesame (RDF :

EDIT: Patrick Logan asked about my use of HSQLDB; not specific to HSQLDB really, but here is the important code (hand edited to try to get it to fit on this web page) for adding documents that are nested maps, indexing them, and searching (note: I usually use Clucy/Lucene for search in Clojure code, but for what I am doing right now, this suffices):

(defn index-if-str [x id]
  (if (= (class x) java.lang.String)
    (sql/with-connection hsql-db
      (doseq [token (map (fn [s] (.toLowerCase s))
                     (clojure.string/split x #"[ ;.,]()"))]
        (if token
          (sql/insert-record "search" {:doc_id id :word token}))))))

(defn insert-doc [map]
  (let [id
        (:id (sql/with-connection hsql-db
               (sql/insert-record
                 "docs" {:json (json/write-str map)})))]
    (postwalk (fn [x] (index-if-str x id)) map)
    id))

;; (insert-doc {:foo "bar" :i 101 :name "sue jones"})

(defn search [s]
  (map
    first
    (let [indices
          (map
            :doc_id
            (let [tokens
                  (apply str (interpose ", "
                     (map (fn [s] (str "'" (.toLowerCase s) "'"))
                       (clojure.string/split s #"[ ;.,]()"))))]
              (sql/with-connection hsql-db
                 (sql/with-query-results results
                    [(str "select * from search where word in (" tokens ")")]
                    (into [] results)))))]
      (sort (fn [a b] (compare (second b) (second a))) (into [] (frequencies indices))))))

2 comments:

patrickdlogan said...

1. If you could write more about why you chose hsqldb and how you're using it. (I've only used it for very simple SQL tables quite some time ago.)

2. Did you consider jena and favor sesame?

Mark Watson, author and consultant said...

Hello Patrick,

HSQLDB is simple to use, and all I needed was a storage engine for 2 tables: blob storage for documents and for an inverted word index. Probably crufty code, but just added the bit of code to my blog article for creating the index table rows when inserting a document and doing search.

I have used Jena very little, but I have tons of experience over the years using Sesame.