Saturday, March 09, 2013

Google Research's wiki-links data set

wiki-links was created by searching Google's web crawl for back links to Wikipedia articles. The complete data set is less than 2 gigabytes in size, so playing with the data is "laptop friendly."

The data looks like:

MENTION vacuum tubes 10838 http://en.wikipedia.org/wiki/Vacuum_tube
MENTION electron gun 598  http://en.wikipedia.org/wiki/Electron_gun
MENTION oscilloscope 1307 http://en.wikipedia.org/wiki/Oscilloscope
MENTION radar        1657 http://en.wikipedia.org/wiki/Radar
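
Here is a minimal sketch of how lines like these might be parsed in Python. It assumes each record is whitespace-separated with the URL last and an integer just before it, which is what the sample above suggests; the names `Mention`, `parse_mention`, and the field name `number` are my own, since the meaning of the integer is not spelled out here.

```python
from typing import NamedTuple, Optional


class Mention(NamedTuple):
    text: str    # anchor text, possibly several words
    number: int  # integer field from the record (meaning not assumed here)
    url: str     # Wikipedia URL the mention links to


def parse_mention(line: str) -> Optional[Mention]:
    """Parse one 'MENTION <text> <int> <url>' line; return None if it does not fit."""
    parts = line.split()  # split() with no argument also absorbs repeated spaces
    if len(parts) < 4 or parts[0] != "MENTION":
        return None
    # Everything between the tag and the last two fields is the mention text.
    *words, number, url = parts[1:]
    if not number.isdigit():
        return None
    return Mention(" ".join(words), int(number), url)


if __name__ == "__main__":
    sample = "MENTION vacuum tubes 10838 http://en.wikipedia.org/wiki/Vacuum_tube"
    print(parse_mention(sample))
```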

One possible use for this data might be to compare two (possibly multiple-word) terms by looking up their Wikipedia pages, removing the stop words (noise words) from both pages, and calculating a similarity based on "bag of words", etc. Looks like a great resource!
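
As a rough sketch of the "bag of words" idea, the following Python computes a cosine similarity between two pieces of text after dropping stop words. Cosine similarity is just one reasonable choice of measure, and the tiny `STOP_WORDS` set and the function names are placeholders of mine; fetching the actual Wikipedia page text is left out.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop word list; a real one would be much longer.
STOP_WORDS = frozenset(
    {"a", "an", "and", "are", "as", "by", "for", "in", "is", "of", "on", "the", "to", "with"}
)


def bag_of_words(text: str, stop_words=STOP_WORDS) -> Counter:
    """Lowercase the text, split it into alphanumeric tokens, and count non-stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors; 0.0 if either is empty."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    page_a = "An oscilloscope displays the signal on a cathode ray tube."
    page_b = "The electron gun in a cathode ray tube produces the beam."
    print(cosine_similarity(bag_of_words(page_a), bag_of_words(page_b)))
```

With real page text in place of the two sample sentences, pages for related terms should score closer to 1 and unrelated ones closer to 0.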

Another great data set from Google for people interested in NLP (natural language processing) is the Google ngram data set, which has ngram sets for "n" in the range [1,5]. This data set is huge and not "laptop friendly", so last year I leased a very large-memory server from hetzner.de for a few months while I used the ngram data sets. I wish that I still had this data online, but the cost of the server eventually became greater than the value of ready access to the data. The next time I need it, I am planning on configuring a large-memory EC2 instance with enough EBS storage for the data, indices, and application-specific stuff. Then I can stop the large-memory instance when I don't need the data online, which is probably 99% of the time: most of the costs will just be for the EBS storage itself, and not the (approximately) $0.50/hour that I pay while the instance is running.

Edit: I just did the math: renting a Hetzner server turns out to be much less expensive than using an EC2 instance that is usually spun down because 1 terabyte of EBS storage is $100/month (almost double what a Hetzner server costs).
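
For what it's worth, the arithmetic can be sketched in a few lines of Python. The $100/month per terabyte, the $0.50/hour, and the "online about 1% of the time" figures come from the text above; the 730 hours per month and the function and parameter names are assumptions of mine. The Hetzner price is deliberately left out because no exact figure is given above.

```python
def ec2_monthly_cost(
    storage_tb: float = 1.0,
    ebs_usd_per_tb_month: float = 100.0,  # EBS price quoted in the post
    instance_usd_per_hour: float = 0.50,  # approximate instance price quoted in the post
    fraction_running: float = 0.01,       # instance online roughly 1% of the time
    hours_per_month: float = 730.0,       # assumption: average hours in a month
) -> float:
    """Estimated monthly cost: EBS storage is billed all month, compute only while running."""
    storage = storage_tb * ebs_usd_per_tb_month
    compute = instance_usd_per_hour * hours_per_month * fraction_running
    return storage + compute


if __name__ == "__main__":
    print(f"${ec2_monthly_cost():.2f} per month")
```

Under these assumptions the storage charge dominates: the compute part is only a few dollars a month, so stopping the instance barely helps.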
