Archiving data (semantic web, business, etc.) in XML

The other night I needed some data that I had processed a few years ago. No problem: I have been archiving data in ad hoc XML documents for years. I say ad hoc because I usually don’t use a DTD or Schema to define structure or to validate the XML. Instead, I write a program that collects and/or processes data and writes directly to well-formed XML files, with the format determined by the application.
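As a rough illustration of writing data straight to well-formed XML without a DTD or Schema, here is a minimal Python sketch. The function name `records_to_xml`, the element names, and the sample record are all made up for this example; it only assumes that each record is a flat dictionary whose keys are valid XML element names.

```python
import xml.etree.ElementTree as ET


def records_to_xml(records, root_tag="companies", item_tag="company"):
    """Serialize a list of flat dicts to an XML string (hypothetical format)."""
    root = ET.Element(root_tag)
    for record in records:
        item = ET.SubElement(root, item_tag)
        for field, value in record.items():
            # One child element per field; ElementTree escapes special
            # characters such as & and < when the tree is serialized.
            child = ET.SubElement(item, field)
            child.text = str(value)
    return ET.tostring(root, encoding="unicode")


# Made-up sample data, only to show the shape of the output.
xml_text = records_to_xml([{"name": "Acme & Sons", "director": "Jane Doe"}])
print(xml_text)
```

Building the tree with a library rather than concatenating strings is one way to keep the output well formed even when the data contains characters like `&`.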

The important thing is that I can look at an old XML data file, see the format that I used, and in a minute or two have a little code that uses a SAX-style parser to get out what I need. I have used XML files for:

  • Data scraped from the web matching board of directors members with companies (used for an experiment to detect interlocking board members)
  • Data from the CIA World Factbook for countries
  • US State and city names
  • Categorization data from training on the 2-gigabyte Reuters news story corpus
  • etc.
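To give a sense of how little code a SAX-style reader needs, here is a minimal Python sketch using the standard `xml.sax` module. The file layout (`<company name="...">` elements containing `<director>` elements), the class name `DirectorHandler`, and the sample data are assumptions invented for this example, not the format of my actual files. It collects the companies each director serves and reports directors who sit on more than one board.

```python
import xml.sax
from collections import defaultdict

# Made-up sample in a hypothetical format.
SAMPLE = b"""<companies>
  <company name="Acme">
    <director>Jane Doe</director>
    <director>John Roe</director>
  </company>
  <company name="Globex">
    <director>Jane Doe</director>
  </company>
</companies>"""


class DirectorHandler(xml.sax.ContentHandler):
    """Collect the set of companies for each director name."""

    def __init__(self):
        super().__init__()
        self.boards = defaultdict(set)
        self._company = ""
        self._in_director = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "company":
            self._company = attrs.get("name", "")
        elif name == "director":
            self._in_director = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_director:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "director":
            director = "".join(self._buffer).strip()
            self.boards[director].add(self._company)
            self._in_director = False


handler = DirectorHandler()
xml.sax.parseString(SAMPLE, handler)

# Directors who appear on more than one board.
interlocks = {d: sorted(c) for d, c in handler.boards.items() if len(c) > 1}
print(interlocks)
```

For an archived file on disk, `xml.sax.parse(path, handler)` could be used in place of `parseString`.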

I used to keep data in a relational database, which is handy for ad hoc queries, etc., but now I favor simply archiving interesting data in XML files.

I have thought about setting up a repository of free, interesting data in XML. Hopefully, if I share with others, I will get some interesting stuff back in return. That is on my to-do list :-)
