Sunday, June 17, 2007

Using Lucene with JRuby

I use the Ruby Ferret indexing and search library a lot. Ferret is a port (some Ruby, mostly C) of Lucene. I have recently been getting into JRuby. A few days ago, I discovered that it was reasonably easy to run a simple Rails web application on the Java application server JBoss using JRuby (this took me an hour; next time will be easy). Today, I spent a short while getting Lucene and JRuby working together:
require "java"
require "lib/lucene-core-2.1.0.jar"

class Lucene
  def initialize(an_index_path = "data/")
    @index_path = an_index_path
  end

  # Add or replace documents, given [id, text] pairs
  def add_documents id_text_pair_array # e.g., [[1,"test1"],[2,'test2']]
    # create a new index only if none exists yet at @index_path
    index_available = org.apache.lucene.index.IndexReader.index_exists(@index_path)
    index_writer = org.apache.lucene.index.IndexWriter.new(
      @index_path,
      org.apache.lucene.analysis.standard.StandardAnalyzer.new,
      !index_available)
    id_text_pair_array.each {|id_text_pair|
      term_to_delete = org.apache.lucene.index.Term.new("id", id_text_pair[0].to_s) # if it exists
      a_document = org.apache.lucene.document.Document.new
      a_document.add(org.apache.lucene.document.Field.new('text', id_text_pair[1],
        org.apache.lucene.document.Field::Store::YES,
        org.apache.lucene.document.Field::Index::TOKENIZED))
      # index the id as a single untokenized term so it matches the Term used for deletes exactly
      a_document.add(org.apache.lucene.document.Field.new('id', id_text_pair[0].to_s,
        org.apache.lucene.document.Field::Store::YES,
        org.apache.lucene.document.Field::Index::UN_TOKENIZED))
      index_writer.updateDocument(term_to_delete, a_document) # delete any old docs with same id
    }
    index_writer.close
  end

  # Search the 'text' field; returns an array of [score, id, text]
  def search(query)
    parse_query = org.apache.lucene.queryParser.QueryParser.new(
      'text',
      org.apache.lucene.analysis.standard.StandardAnalyzer.new)
    query = parse_query.parse(query)
    engine = org.apache.lucene.search.IndexSearcher.new(@index_path)
    hits = engine.search(query).iterator
    results = []
    while (hits.hasNext && hit = hits.next)
      id = hit.getDocument.getField("id").stringValue.to_i
      text = hit.getDocument.getField("text").stringValue
      results << [hit.getScore, id, text]
    end
    engine.close
    results
  end

  # Remove every document whose id is in id_array
  def delete_documents id_array # e.g., [1,5,88]
    index_available = org.apache.lucene.index.IndexReader.index_exists(@index_path)
    index_writer = org.apache.lucene.index.IndexWriter.new(
      @index_path,
      org.apache.lucene.analysis.standard.StandardAnalyzer.new,
      !index_available)
    id_array.each {|id|
      index_writer.deleteDocuments(org.apache.lucene.index.Term.new("id", id.to_s))
    }
    index_writer.close
  end
end
This code assumes that the Java Lucene JAR file lucene-core-2.1.0.jar is in the subdirectory lib. A short test program is:
require "lucene"
require 'pp'

ls = Lucene.new
ls.add_documents([[1,"test one two"],[2,'testing 1 2 3'], [3,'this is a longer test string']])
ls.delete_documents([1]) # optional: test document delete from index
pp ls.search("test")
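The search method returns a plain Ruby array of [score, id, text] triples, so the results can be handled with ordinary Ruby collection methods. Here is a minimal sketch of that, in plain Ruby with no Lucene calls. The sample triples, their scores, and the min_score cutoff are made up for illustration and are not real Lucene output:

```ruby
# Hypothetical results in the same [score, id, text] shape that search returns.
# The scores below are invented for this sketch, not produced by Lucene.
results = [
  [0.62, 7, 'another test'],
  [0.31, 3, 'this is a longer test string']
]

# Hypothetical cutoff: keep only hits scoring at least this much
min_score = 0.5
good_hits = results.select { |score, _id, _text| score >= min_score }

# Pull out just the document ids of the hits that survived the cutoff
good_ids = good_hits.map { |_score, id, _text| id }

# Build an id => text lookup for all hits
text_by_id = {}
results.each { |_score, id, text| text_by_id[id] = text }

good_hits.each do |score, id, text|
  puts "#{id}: #{text} (score #{score})"
end
```

Because the results are just nested arrays, nothing Java-specific leaks out of the Lucene class into the calling code.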
I had some hesitations about JRuby: I was concerned that it would lack the lightweight feel of hacking in native Ruby. No worries though: JRuby is easy and quick to work with.

2 comments:

Charles Oliver Nutter said...

Very nice...I think there's potential here. Perhaps there's a way to make something that joins ferret and lucene syntaxes, but uses Lucene where appropriate under the covers?

Mark Watson, author and consultant said...

Hello Charles,

I thought of that, letting people switch between:

require 'ferret'

or:

require 'lucene'

My example here was just a quick hack. BTW, I was pleased at how easy it was to run a Rails application on JBoss - that was cool!