The other night I needed some data that I had processed a few years ago – no problem; I have been archiving data in ad hoc XML documents for years. I say ad hoc because I usually don’t use a DTD or Schema to define structure or to validate the XML. Instead, I write a program that collects and/or processes data and writes directly to well-formed XML files, with the format determined by the application.
The important thing is that I can look at an old XML data file, see the format that I used, and in a minute or two have a little code that uses a SAX-style parser to get out what I need. I have used XML files for:
- Data scraped from the web matching board of directors members with companies (used for an experiment to detect interlocking board members)
- Data from the CIA World Factbook for countries
- US State and city names
- Categorization data from training on the 2-gigabyte Reuters news story corpus
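To give a feel for the "minute or two of SAX code" idea, here is a minimal sketch in Python using the standard library's xml.sax module. The file format is made up for illustration: the element names (boards, company, director), the name attribute, and the sample data are all assumptions, not the format of my actual archive files. The handler just collects which companies each director sits on and then picks out the interlocks.

```python
import xml.sax

# Hypothetical archive format: element names, attribute names, and data
# are invented for this example.
SAMPLE = """<?xml version="1.0"?>
<boards>
  <company name="Acme Corp">
    <director>Jane Doe</director>
    <director>John Smith</director>
  </company>
  <company name="Widget Inc">
    <director>Jane Doe</director>
  </company>
</boards>"""


class DirectorHandler(xml.sax.ContentHandler):
    """Collects a mapping of director name -> list of company names."""

    def __init__(self):
        super().__init__()
        self.boards = {}
        self.company = None  # name of the <company> currently open
        self.text = None     # character buffer while inside a <director>

    def startElement(self, name, attrs):
        if name == "company":
            self.company = attrs.get("name")
        elif name == "director":
            self.text = []

    def characters(self, content):
        # SAX may deliver element text in several chunks, so buffer it.
        if self.text is not None:
            self.text.append(content)

    def endElement(self, name):
        if name == "director":
            director = "".join(self.text).strip()
            self.boards.setdefault(director, []).append(self.company)
            self.text = None
        elif name == "company":
            self.company = None


def companies_by_director(xml_text):
    handler = DirectorHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.boards


boards = companies_by_director(SAMPLE)
# Directors who sit on more than one board.
interlocks = {d: c for d, c in boards.items() if len(c) > 1}
print(interlocks)
```

For a real archive file, xml.sax.parse(filename, handler) would take the place of parseString. Since there is no DTD or Schema, the handler simply ignores any elements it does not recognize.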
I used to keep data in a relational database, which is handy for ad hoc queries, etc., but now I favor simply archiving interesting data in XML files.
I have thought about setting up a repository of free, interesting data in XML. Hopefully, if I share with others, I will get some interesting stuff back in return. That is on my to-do list.