Saturday, April 30, 2011

Text search in SimpleDB: a Ruby example

If you do not want to run and administer Solr yourself, you can use SimpleDB both for storage and for text indexing and search. Here is a short snippet that shows how to store searchable documents in SimpleDB and search them:
require 'rubygems'
require 'aws_sdb'

SERVICE = AwsSdb::Service.new

# assuming that this domain is already created
DOMAIN = "some_test_domain_7854854"

class Document

  def initialize name, text
    # Index the unique, lowercased words from
    # both the name and the text:
    words = (name + ' ' + text).downcase.split.uniq
    attributes = {:words => words, :text => text}
    SERVICE.put_attributes(DOMAIN, name, attributes)
  end
  
  def Document.search query
    # The last inject takes the intersection and
    # ensures that all search terms are present:
    keys = query.downcase.split.collect {|x|
      SERVICE.query(DOMAIN,
                    "['words' starts-with '#{x}']")[0]
    }.inject {|x, y| x & y } || [] # [] if query is empty
    keys.collect {|key|
                  SERVICE.get_attributes(DOMAIN, key)}
  end

end

Document.new('title1',
             'The bird flew to the lake for water')
Document.new('title2',
             'The dog chased the cat')

p Document.search 'flew lake'
The formatting of this code snippet is odd because I was trying to keep the lines short enough to fit the page width. The code is not terribly efficient, but the first 25 Amazon SimpleDB Machine Hours consumed per month are free for your Amazon AWS account, so using this example in your applications can end up being almost free (there are small data storage and bandwidth charges). You also get the advantage of no administration hassles. The output for the above code snippet is:
[{"text"=>["The bird flew to the lake for water"],
  "words"=>["bird", "flew", "for", "lake", "the",
            "title1", "to", "water"]}]
There are two improvements that you could implement: remove noise (stop) words from the words attribute, and make the code multithreaded so that the individual SimpleDB queries run in parallel where possible. I left both out to keep this example concise. For simple or moderately used applications, these improvements aren't necessary.
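
Here is a rough sketch of what those two improvements might look like. It is not part of the original example: the STOP_WORDS list, index_words, parallel_search, and the in-memory INDEX and FAKE_LOOKUP stand-ins are all hypothetical names that I made up for illustration. The lookup is passed in as a callable so that the sketch runs without an AWS account; in real use you would pass a lambda that wraps SERVICE.query, assuming that the aws_sdb client can safely be called from several threads (I have not verified this).

```ruby
require 'set'

# Hypothetical stop-word list; extend it for your own data.
STOP_WORDS = Set.new(%w[a an and for of the to])

# Unique, lowercased index words with stop words removed.
def index_words(name, text, stop_words = STOP_WORDS)
  (name + ' ' + text).downcase.split.uniq.
    reject { |w| stop_words.include?(w) }
end

# Runs one lookup per search term, each in its own thread,
# and intersects the resulting lists of item names.
# `lookup` is any callable that maps a term to an array of
# item names (an assumption made for this sketch).
def parallel_search(query, lookup, stop_words = STOP_WORDS)
  terms = query.downcase.split.uniq.
    reject { |w| stop_words.include?(w) }
  return [] if terms.empty?
  threads = terms.map { |term|
    Thread.new { lookup.call(term) }
  }
  # Thread#value waits for the thread and returns its result.
  threads.map(&:value).inject { |x, y| x & y }
end

# In-memory stand-in for SimpleDB, used only for this demo.
INDEX = {
  'title1' => index_words('title1',
                'The bird flew to the lake for water'),
  'title2' => index_words('title2',
                'The dog chased the cat')
}

# Mimics the starts-with query against the words attribute.
FAKE_LOOKUP = lambda { |term|
  INDEX.select { |_, words|
    words.any? { |w| w.start_with?(term) }
  }.keys
}

p index_words('title1',
              'The bird flew to the lake for water')
p parallel_search('flew lake', FAKE_LOOKUP)
```

With the stand-in index this prints ["title1", "bird", "flew", "lake", "water"] and then ["title1"]. Note that stop words are dropped from the query as well as from the index, since a query term like "the" would otherwise never match anything.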

If you run this example remotely from your laptop, you will notice that remote SimpleDB access is a little slow. When run on a small EC2 instance, it takes about 0.05 seconds to add a "document" to SimpleDB and about 0.1 seconds to search using two search terms.
