Saturday, May 19, 2007

Why the ODF is better than Microsoft's document formats

It takes a few lines of Ruby code to process OpenOffice.org document files:
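require 'rubygems'
require 'rexml/document'
require 'rexml/streamlistener'
# Install with: gem install rubyzip
# (rubyzip 1.0 and later use require 'zip' and Zip::File instead)
require 'zip/zipfilesystem'
include REXML

# Stream listener that collects the character data found in ODF text elements.
class OOXmlHandler
  include StreamListener
  attr_reader :plain_text

  def initialize
    @plain_text = ""
    @last_tag_name = ""
  end

  # Remember the most recently opened tag, e.g. "text:p".
  def tag_start(name, attrs)
    @last_tag_name = name
  end

  # Keep character data only when the last opened tag name contains "text".
  def text(s)
    @plain_text << s << "\n" if @last_tag_name.index('text')
  end
end

# Opens an OpenOffice.org file (a zip archive) and extracts the plain text
# from the content.xml entry inside it.
class ReadOpenOffice
  attr_reader :text

  def initialize(file_path)
    Zip::ZipFile.open(file_path) { |zipFile|
      xml_handler = OOXmlHandler.new
      Document.parse_stream(zipFile.read('content.xml'), xml_handler)
      @text = xml_handler.plain_text
    }
  end
end

puts ReadOpenOffice.new('KBrecipes.odt').text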
I have spent too much of my time over the last 10 years dealing "programmatically" with Microsoft document formats. I am tired of wasting my time when open document formats are so much easier and less expensive to use.
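
To try the stream-listener approach without an .odt file on hand, here is a minimal sketch that runs against an inline XML string. It is not the code from my library: SAMPLE_CONTENT_XML is a hand-written stand-in for a real content.xml, and OdfParagraphCollector and odf_paragraphs are hypothetical names. Unlike the handler above, it also listens for closing tags so that each paragraph or heading comes back as one string. It assumes text:p and text:h blocks are not nested inside each other.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical, hand-written stand-in for the content.xml inside an .odt file.
SAMPLE_CONTENT_XML = <<~XML
  <office:document-content
      xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
      xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
    <office:body>
      <office:text>
        <text:h>Recipes</text:h>
        <text:p>Boil the <text:span>pasta</text:span> first.</text:p>
      </office:text>
    </office:body>
  </office:document-content>
XML

# Illustrative listener: gathers one string per heading or paragraph.
class OdfParagraphCollector
  include REXML::StreamListener

  BLOCK_TAGS = %w[text:p text:h].freeze
  attr_reader :paragraphs

  def initialize
    @paragraphs = []
    @current = nil # buffer for the block being read; nil outside a block
  end

  # Start a fresh buffer when a paragraph or heading opens.
  def tag_start(name, _attrs)
    @current = +'' if BLOCK_TAGS.include?(name)
  end

  # Append character data, including text inside nested spans.
  def text(s)
    @current << s if @current
  end

  # Store the finished block when its closing tag arrives.
  def tag_end(name)
    return unless BLOCK_TAGS.include?(name) && @current

    @paragraphs << @current
    @current = nil
  end
end

# Hypothetical helper: returns an array of paragraph strings for an XML string.
def odf_paragraphs(xml)
  collector = OdfParagraphCollector.new
  REXML::Document.parse_stream(xml, collector)
  collector.paragraphs
end

puts odf_paragraphs(SAMPLE_CONTENT_XML)
```

For the sample string this prints "Recipes" and "Boil the pasta first." on separate lines. Pairing it with the zip-reading step shown earlier would be a matter of passing the content.xml string to odf_paragraphs.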

9 comments:

Ara Vartanian said...

Yes, it is ridiculous that, given the growth around Office 2.0, we are still hamstrung by binary (largely undocumented) document formats: what amount to legacy formats from two decades ago.

Interoperable applications will become so much more productive when we can crack open the rich user-provided content (i.e. word processing documents, spreadsheets) rather than treating it basically as a black box. For too long, the cost of reverse engineering binary formats has been prohibitive.

There is a lot of creativity out there, and a lot of things we can hardly imagine now will become possible once these document formats catch on.

Chris Ward said...

Try 'ODFHandler'. 'OOXmlHandler' sounds too Microsoft-ish.

ISO/IEC 26300 is like moving to DVDs after years of struggling with VHS videotape. Or to paper tape after decades of punch-cards.

Yes, there will be some losers; but there will be a lot of winners.

The faster we move, the better.

Mark Watson, author and consultant said...

Ara: I agree with your excitement over the future of building creative custom workflows around open document formats. I would mix in enthusiasm for open web APIs like GData, MetaWeb, DabbleDB, etc.

Chris: I like your name change - just edited my local library code.

Doug Mahugh said...

Yes, XML-based formats are easier to work with than binary formats, but it's not clear to me what difference you're trying to illustrate here.

You could use the same few lines of code to scan all the "t" nodes in an Open XML package and echo them out -- in fact, your Freudian slip on the classname makes me wonder if you've already done that. :-)

So the difference is "text" vs. "t"?
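
A rough sketch of what scanning the "t" nodes could look like, using the same REXML stream-listener style as the post. Several things here are assumptions for illustration: SAMPLE_DOCUMENT_XML is a hand-written stand-in for the word/document.xml entry of a .docx package, WordTextCollector and word_plain_text are hypothetical names, and matching on the literal names 'w:t' and 'w:p' relies on the conventional w prefix, which a given file is not required to use.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical, hand-written stand-in for word/document.xml in a .docx package.
SAMPLE_DOCUMENT_XML = <<~XML
  <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
    <w:body>
      <w:p><w:r><w:t>First paragraph.</w:t></w:r></w:p>
      <w:p><w:r><w:t>Second paragraph.</w:t></w:r></w:p>
    </w:body>
  </w:document>
XML

# Illustrative listener: keeps character data found inside w:t elements
# and adds a newline at the end of every w:p paragraph.
class WordTextCollector
  include REXML::StreamListener
  attr_reader :plain_text

  def initialize
    @plain_text = +''
    @in_text_run = false
  end

  # Assumes the conventional "w" prefix for WordprocessingML elements.
  def tag_start(name, _attrs)
    @in_text_run = true if name == 'w:t'
  end

  def text(s)
    @plain_text << s if @in_text_run
  end

  def tag_end(name)
    @in_text_run = false if name == 'w:t'
    @plain_text << "\n" if name == 'w:p'
  end
end

# Hypothetical helper: returns the plain text for a document.xml string.
def word_plain_text(xml)
  collector = WordTextCollector.new
  REXML::Document.parse_stream(xml, collector)
  collector.plain_text
end

puts word_plain_text(SAMPLE_DOCUMENT_XML)
```

The listener tracks whether it is inside a w:t element rather than remembering the last opened tag, so whitespace between elements is not picked up. Reading the XML out of the zip package is left out here; it would follow the same pattern as the post's code with a different entry name.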

Mark Watson, author and consultant said...

Hello Doug,

It was not a Freudian slip, OO stood for OpenOffice.org.

You are right about being able to pull text as easily from Microsoft's Open XML format, but unless you are a Microsoft employee (I am not) or own Microsoft stock (I sold mine as a tiny and meaningless protest against their document format policies), you may agree with this:

Microsoft is reluctantly being open in this case, but it is contrary to their underlying business model of lock-in.

Microsoft's format is also made less appealing due to keeping compatibility with old formats with binary attachments, etc.

I have blogged before on how I would "fix Microsoft": go to a yearly subscription model to get away from the feature-creep marketing ploys that force upgrades, and strive for security and robustness. This goes for both cash cows: Windows (single user and server) and Office.

Doug Mahugh said...

Silly me, Mark -- I've seen OOXML used by so many people that I made the Freudian slip here.

I am indeed a Microsoft employee, but I agree with you that the path we've taken with the new file formats is different from how we've approached these issues in the past. (I started at Microsoft the same week Open XML was submitted to Ecma, so I don't have personal experience of the old file-format policies except as a user.)

I also agree that compatibility with the binary formats makes some things messy. But there's not a simple solution there -- we have customers who would probably leave if we didn't maintain that compatibility, and we are a profit-driven corporation after all. There is always tension between backward compatibility and innovation, especially for successful products with large market share.

Mark Watson, author and consultant said...

Doug: thanks for the comments. I certainly am not an "open source only" developer: most of my work in the last year uses the expensive Franz Common Lisp tools; I use OS X and Windows on a regular basis; etc. So, I understand that Microsoft must act responsibly with their shareholders' equity.

Several years ago, my wife and I were weekend guests at a friend's house and their son-in-law (way high up the food chain in Microsoft management) was also there with his family. He explained the economics of Microsoft and a lot more - interesting discussions.

Muhammad Haggag said...

The new docx format used by Word 2007 is simply a zip file containing XML files, one of which is document.xml. Microsoft Word 2003 can save documents in XML format too, although it's not the default.

Granted, this isn't as good as being governed by an open specification, but it's a far cry from working with a binary format.

Anonymous said...

Doug, what I don't get is this: MS has this bloated Office Open XML spec because it chose to include details about legacy formats. Was there a need to include that? Nothing prevents MS from marketing translators from the old formats to Open XML. I sincerely hope it's not ISO certified until it's cleaned up.

Another thing I don't get is why they call it Open XML. Something with human-readable markup, a DTD or a schema, is open by definition.

BTW, very nice blog. I see Rails, Java and others discussed without being too partisan.