Wednesday, August 18, 2004

Archiving data (semantic web, business, etc.) in XML

The other night I needed some data that I had processed a few years ago - no problem; I have been archiving data in ad hoc XML documents for years. I say ad hoc because I usually don't use a DTD or Schema to define structure or to validate the XML - instead, I write a program that collects and/or processes data and writes directly to well-formed XML files, with the format determined by the application.
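
As a rough illustration of writing data straight to well-formed XML with no DTD or Schema, here is a minimal Python sketch. The element names (companies, company, director) and the write_records function are made up for this example and are not the format of any of my actual files; it uses XMLGenerator from the standard library so that special characters get escaped.

```python
import io
from xml.sax.saxutils import XMLGenerator


def write_records(out, records):
    """Write (company_name, [director, ...]) pairs to `out` as well-formed XML.

    The element and attribute names are hypothetical, chosen only for
    this sketch; the application decides the format.
    """
    gen = XMLGenerator(out, encoding="utf-8")
    gen.startDocument()
    gen.startElement("companies", {})
    for name, directors in records:
        # Attribute values are quoted and escaped by XMLGenerator.
        gen.startElement("company", {"name": name})
        for director in directors:
            gen.startElement("director", {})
            gen.characters(director)  # escapes &, < and >
            gen.endElement("director")
        gen.endElement("company")
    gen.endElement("companies")
    gen.endDocument()


if __name__ == "__main__":
    buf = io.StringIO()
    write_records(buf, [("Acme & Co", ["Jane Doe", "John Roe"])])
    print(buf.getvalue())
```

Letting a generator class do the escaping, rather than concatenating strings by hand, is what keeps the output well formed even when the data contains characters like & or <.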

The important thing is that I can look at an old XML data file, see the format that I used, and in a minute or two have a little code that uses a SAX-style parser to get out what I need. I have used XML files for:
  • Data scraped from the web matching board of directors members with companies (used for an experiment to detect interlocking board members)
  • Data from the CIA World Factbook for countries
  • US State and city names
  • Categorization data from training on the 2-gigabyte Reuters news story corpus
  • etc.
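
The "little code that uses a SAX-style parser" might look something like the following Python sketch, using the board of directors data as the example. The XML layout assumed here (company elements with a name attribute, containing director elements) and the names DirectorHandler and interlocking_directors are hypothetical, invented for illustration; a real archive file would have whatever format the original program wrote.

```python
import xml.sax


class DirectorHandler(xml.sax.ContentHandler):
    """Collect, for each director, the set of companies they sit on."""

    def __init__(self):
        super().__init__()
        self.boards = {}      # director name -> set of company names
        self._company = None  # name of the company element we are inside
        self._text = None     # text chunks of the current director element

    def startElement(self, name, attrs):
        if name == "company":
            self._company = attrs.get("name")
        elif name == "director":
            self._text = []

    def characters(self, content):
        # SAX may deliver element text in several chunks.
        if self._text is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == "director":
            director = "".join(self._text).strip()
            self.boards.setdefault(director, set()).add(self._company)
            self._text = None


def interlocking_directors(xml_bytes):
    """Return {director: companies} for directors on more than one board."""
    handler = DirectorHandler()
    xml.sax.parseString(xml_bytes, handler)
    return {d: c for d, c in handler.boards.items() if len(c) > 1}
```

Because the handler only reacts to the element names it cares about, any extra elements in an old file are simply ignored, which suits data whose format was never formally specified.
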
I used to keep data in a relational database - handy for ad hoc queries, etc., but now I favor simply archiving interesting data in XML files.

I have thought about setting up a repository of free interesting data in XML - hopefully if I share with others then I will get some interesting stuff back in return. That is on my to-do list :-)
