Saturday, June 05, 2004

Complexity of Java code for reading OpenOffice.org documents vs. Microsoft documents

I have spent more time than I would like to admit writing Java code to pull plain text out of Microsoft Word, PowerPoint, and other Office files. This morning, I added support for reading OpenOffice.org documents to my Knowledge Management system: easy!

It took about 15 minutes of coding. An OpenOffice.org document is just a ZIP archive, so I used the ZipFile API to open the document file, looked up the ZIP entry named "content.xml", got an input stream for that entry, and fed it to a custom SAX parser class that simply aggregates the character data inside <text:p> tags.
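
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the Knowledge Management system: the names OpenOfficeTextExtractor, ParagraphHandler, and extractText are hypothetical, it uses newer Java syntax (try-with-resources), and it assumes the default non-namespace-aware SAX parser, so the paragraph element is reported with the qualified name "text:p". Returning an empty source from resolveEntity is also an assumption, added so the parser does not try to load a DTD that a document may reference but that is not on disk.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class OpenOfficeTextExtractor {

    /** Hypothetical handler: collects character data found inside <text:p> elements. */
    static class ParagraphHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private int depth = 0; // greater than 0 while inside at least one <text:p>

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if ("text:p".equals(qName)) {
                depth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("text:p".equals(qName)) {
                depth--;
                text.append('\n'); // one output line per paragraph
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (depth > 0) {
                text.append(ch, start, length);
            }
        }

        // Assumption: skip any external DTD by handing the parser an empty document.
        @Override
        public InputSource resolveEntity(String publicId, String systemId) {
            return new InputSource(new StringReader(""));
        }

        String getText() {
            return text.toString();
        }
    }

    /** Opens the document as a ZIP archive and parses its content.xml entry. */
    public static String extractText(String path) throws Exception {
        try (ZipFile zip = new ZipFile(path)) {
            ZipEntry entry = zip.getEntry("content.xml");
            if (entry == null) {
                throw new IOException("no content.xml entry in " + path);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                ParagraphHandler handler = new ParagraphHandler();
                SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
                return handler.getText();
            }
        }
    }
}
```

Calling OpenOfficeTextExtractor.extractText with the path to a document returns its paragraph text, one paragraph per line. Text in other elements, such as headings, is ignored in this sketch because only <text:p> content is collected.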
