Friday, December 30, 2005

A novel search engine application

During a couple-hour conversation with my publisher today, he posed the problem of determining which word sequences can logically go together in natural language. Some examples:
  • the cow is brown - OK
  • the cow is green - probably does not make sense
  • the recess was one hour - OK
  • the recess is brown - makes no sense if "recess" means a school break, but could make sense for WordNet sense #2, "an enclosure that is set back or indented"
I would normally treat this as a kind of hidden Markov model problem: count how often words appear together, allowing a few wildcard words in between.

However, a better solution came to me in about 15 seconds: use a search engine like Nutch to index a reasonably large part of the web. To test word sequences like "the cow is brown" and "the cow is green", we would count the number of times the words [cow, brown] appear in that order and close together, and do the same for [cow, green]. A much lower count for [cow, green] would suggest that the sequence is unlikely to make sense. This approach would even give reasonable answers for sequences like [recess, brown], which might make sense for some WordNet senses of the words but would rarely occur in actual use.
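
To make the counting idea concrete, here is a minimal Python sketch. It is not Nutch code and does not use any Nutch API: a tiny in-memory positional index stands in for the search engine, and the four sentences in DOCS are invented purely for illustration. The names build_index and proximity_count, and the max_gap parameter (how many intermediate words to allow), are my own assumptions.

```python
import re
from collections import defaultdict

# Toy stand-in for an indexed slice of the web.
# These sentences are invented for illustration only.
DOCS = [
    "The cow is brown and stands in the field.",
    "A brown cow gave milk; the cow was mostly brown.",
    "The recess was one hour long.",
    "The grass is green and the cow is hungry.",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """Map each word to a list of (doc_id, position) postings."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for pos, word in enumerate(tokenize(text)):
            index[word].append((doc_id, pos))
    return index


def proximity_count(index, first, second, max_gap=3):
    """Count places where `first` is followed by `second` in the same
    document with at most `max_gap` other words in between."""
    second_positions = defaultdict(set)
    for doc_id, pos in index.get(second, []):
        second_positions[doc_id].add(pos)

    count = 0
    for doc_id, pos in index.get(first, []):
        # offset 1 means adjacent words; max_gap + 1 is the farthest allowed.
        for offset in range(1, max_gap + 2):
            if pos + offset in second_positions[doc_id]:
                count += 1
    return count


INDEX = build_index(DOCS)

if __name__ == "__main__":
    for pair in [("cow", "brown"), ("cow", "green"), ("recess", "brown")]:
        print(pair, proximity_count(INDEX, *pair))
```

On this toy corpus, [cow, brown] is counted twice while [cow, green] and [recess, brown] are never seen, even though "green" and "cow" share a sentence; the order and distance constraints are what filter that case out. A real index would need far more text before the counts meant anything.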
