Sunday, November 09, 2008

Don't repeat yourself: for code, sure, but how about for data?

In individual applications we want to make sure that we don't have replicated code that is identical or very similar. What about replication across all projects, both active and in "freeze mode"?

I periodically like to consolidate source code: keep a single latest svn trunk version on my system and organize the code that I have written and frequently reuse into libraries. I am in the process of packaging up most of my Ruby code in local gems.

I also have issues with many copies of textual data files. For data used in Java libraries and applications the solution is simple: I keep data with the code that needs it in JAR files that are kept in a single library directory on my development system. I have been doing this for over 10 years and this is a really nice way to keep data assets and code together.

Sometimes I simply link data statically into compiled applications that I use (e.g., in the last year I have reimplemented many of my statistical NLP tools in Gambit-C Scheme and I generate a single command line utility program with all the required data statically lined.)

For data assets used in programs developed in multiple programing languages, a "separation of concerns" between code and data assets makes more sense.

I need to better organize other data assets like tagged training data, raw text organized into a hierarchy of categories, data that I have culled form the web and stored in XML files, etc. I am starting the process of putting the most up to date versions into a single directory and tweaking my code to check the DATA environment variable value and then load data assets as-needed. I will probably not import this data directory into svn or git: most of the data seldom changes and some of the assets are huge.

No comments: