Monday, February 07, 2011

Curated data

It is difficult to predict what data will have long-term value, so it is often safest to archive everything. With data storage costs approaching zero, I think that we can expect high-value data to last forever, barring a nuclear war or the collapse of society.

Curated data has higher value than an archive of "everything." I think that the search engine Blekko is interesting and useful because of what it does not have: human-powered curation yields fewer results but very little spam. The Guardian's curated structured data stores have much higher value than the original raw data (from government sources, etc.). I can imagine The Guardian's curated data becoming a permanent part of our history, much like the ancient stone tablets we see in museums.

I have long planned to provide curated news and technology data with semantic markup, either on my ancient knowledgebooks.com domain or on a new placeholder, kbsportal.com, but I seldom have free time because of my consulting business. Hint: I would like to have a few partners who are into statistical natural language processing, or who are general data geeks, to help me with this. I don't know whether it would end up being a viable business or just a public service portal.

1 comment:

Mental Contrail said...

I work in NLP in biological research. Curated data is everything. We're working to improve gene identification and event detection. Genes are tough because the names and the uses of the names are far from standardized. In many cases you see the same name among many species, and identifying the species becomes important. Adding event detection makes it easier to narrow down exactly what is being discussed in a particular paper. Are the two genes (or proteins) in question involved together in a particular process, or do they just happen to be mentioned together? This stuff is all developed with annotated corpora and curated ontologies in hopes of making it much easier to find just the right paper. Where similar technology would find application in the rest of the world is an interesting question.