May 21, 2003

The Openness of XML formats

There has been a lot written about how XML means the end of closed document formats. For example, in Scott McNealy's XML hype piece [10/12/01] from 18 months ago, he claims "I believe the cure to all our file-format headaches lies in a technology known as XML". The article is introduced with the question, "Should open, XML-based file formats replace today's proprietary ones?"

However, it's not clear that "open" and "XML" are necessarily joined at the hip.

The debate is muddled because "open" data formats can mean different things to different people:

  1. A standard which is text-based as opposed to binary.
  2. A standard which is fully documented.
  3. A standard which was produced and/or certified by a standards body.

Data stored in XML is described using a schema, which defines what various tags mean. You could think of standard HTML as defining a schema, which every browser supports; with XML, anyone can define a schema.

An XML format would satisfy rule #1 and arguably satisfy #2, in the sense that XML schemas tend to be self-documenting. Microsoft's current binary format for Word satisfies none of those, so it is certainly not open. With ODFI I am trying to get Microsoft (and other companies) to satisfy #2 only--and even with XML formats I want companies to provide actual written documentation, not simply say "here is our schema, that's all the documentation you need".

Having a data format that satisfies #1 could be useful, but is not a requirement. And I am opposed to pushing for a data format, XML or other, that satisfies #3.

Microsoft has announced that Office 2003 is going to support storing data in XML. So that should make Scott McNealy happy, right? Well, not exactly. The article "At Microsoft's Mercy" [4/23/03] by Kendall Grant Clark captures some of the feelings about Microsoft's use of XML, from conspiracy theories that it is all a publicity scam, to those who think XML is over-hyped in any case.

Microsoft has defined one schema for Word, called WordML, but is also allowing users to define their own schemas in certain versions of Office. Will this help data interchange? As the Register puts it [4/25/03], "In the future, you may be faced with two flavors of nonsense. XML Word documents that have been mangled by Microsoft's XML-creation tools, and XML Word documents that have been mangled by users who add their own non-standard entities."

To really allow complete exchange of data between Word and other word processors, Microsoft would need to support not WordML, but a standard XML schema, one that satisfies rule #3 above. Many people seem to think that storing data in XML would automatically satisfy #3, based on the misperception that XML defines one overall standard schema for all data, or that computers would be able to automatically interpret the semantics of any XML schema. Others feel that doing XML "correctly" requires using a standard schema. None of these beliefs is true, as Microsoft has pointed out, and it apparently has no intention of supporting a standard schema.
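To make the point about semantics concrete, here is a minimal sketch in Python. Both vocabularies below are made up for illustration (neither is WordML or any real schema), and the function names are hypothetical. A generic XML parser reads both documents without complaint, but finding the bold text takes code written for each vocabulary.

```python
import xml.etree.ElementTree as ET

# Two hypothetical vocabularies that mark the same word as bold.
DOC_A = "<para>XML is <b>not</b> magic.</para>"
DOC_B = "<text>XML is <run weight='heavy'>not</run> magic.</text>"


def bold_words_a(xml_text):
    """Vocabulary A: bold text lives in <b> elements."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter("b")]


def bold_words_b(xml_text):
    """Vocabulary B: bold text lives in <run> elements with weight='heavy'."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter("run") if el.get("weight") == "heavy"]


if __name__ == "__main__":
    print(bold_words_a(DOC_A))  # reader written for vocabulary A, on document A
    print(bold_words_b(DOC_B))  # reader written for vocabulary B, on document B
    print(bold_words_a(DOC_B))  # reader A parses document B fine, but finds no bold
```

The parser handles the syntax in both cases; knowing that `weight='heavy'` means bold has to come from documentation or an agreed standard.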

The article "Why Standards?" [5/18/03] by Jim Waldo points out that standards that codify existing practice are much better than those that attempt to define something from the ground up. The problem with standards bodies is that they are slow and they can get political. If Microsoft wants to include a new feature in Word and therefore in its WordML schema, what should it do if the standards body that is certifying it a) takes too long to approve it or b) refuses to allow it altogether? Keep in mind that one of the main goals of ODFI is to allow information to be retrieved from a data file long after the program that reads it is gone. The key to this is having the format documented, and it doesn't matter if the documentation comes from one company or from a standards body.

I'll also point out that Microsoft is not going to make XML the default way to store data in Office 2003; the old .doc format will still be used unless the user chooses to save as XML. Microsoft has to do this; otherwise, when one user in an organization upgrades to Office 2003 and starts producing XML documents, everyone else will have to upgrade at the same time or be left unable to read them. In fact Microsoft got roasted for causing this type of disruption when it changed its binary format between Word 95 and Word 97. The only way XML can become the default is to allow several versions of Office to ship that can all read XML; then perhaps in Office 2008 XML can become the default way to save files.

That is not to say that Microsoft's support of WordML has no benefits. To begin with, XML is text-based, not binary, so it is less susceptible to corruption, and a minor typo can be fixed with any editor (the same is true of the existing standard RTF). Also, as this post on XML-DEV [4/18/03] by John Cowan points out, most users do not crack open data formats themselves, but they do want third-party utilities that can do so. While it may take a little while for third parties to support a new flavor of WordML that accompanies a new version of Office, it is easier and more reliable for a third party to change its code to support reading a new XML schema than it is for them to reverse-engineer a new binary data format.
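As a rough illustration of why a third-party tool can cope with a new XML flavor more easily than with a new binary layout, here is a small Python sketch. The element names (`doc`, `p`, `i`, `comment-ref`) are invented for this example and are not taken from WordML. The extractor pulls paragraph text and keeps working when a later version of the format adds an element it has never seen.

```python
import xml.etree.ElementTree as ET

# Hypothetical "version 1" document.
OLD_DOC = "<doc><p>Hello <i>world</i>.</p><p>Bye.</p></doc>"

# Hypothetical "version 2" document: same text, plus a new element
# (<comment-ref>) that the extractor below knows nothing about.
NEW_DOC = (
    "<doc><p>Hello <i>world</i><comment-ref id='1'/>.</p>"
    "<p>Bye.</p></doc>"
)


def paragraphs(xml_text):
    """Return the plain text of every <p>, ignoring any inline markup."""
    root = ET.fromstring(xml_text)
    # itertext() walks all nested text, so unknown child elements
    # do not stop the text around them from being collected.
    return ["".join(p.itertext()) for p in root.iter("p")]


if __name__ == "__main__":
    print(paragraphs(OLD_DOC))
    print(paragraphs(NEW_DOC))
```

A binary reader usually has no equivalent of "skip what you do not recognize" unless the format was designed for it, which is part of why reverse-engineering each new version is so much more work.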

Posted by Adam Barr at May 21, 2003 02:20 PM

Comments

An open, XML-based file format is good for two reasons:

A. it's open (which is good for #2 above)
B. it's XML

An XML-based file format is good for a variety of well-known reasons:
- it can be validated;
- it is easy to parse (compared to binary and non-XML text formats; XML parsers are common; many people understand XML);
- third parties can extend the file format with a variety of standard techniques;
- it's a better foundation for content reuse, document automation, and other good practices.

It is clear that both open and XML is better than just open.

Posted by: Jason Harrop at June 5, 2003 12:19 AM

A point of clarification: assuming the reference below is still current, if you buy the Microsoft Office Professional 2003 Enterprise Edition or Microsoft Office Professional Edition 2003, you get support for "Customer-defined XML schema" (as defined on that web page - but it means a plain old XSD file)

See http://www.microsoft.com/presspass/newsroom/office/factsheets/OfficeSKUFS.asp

Posted by: Jason Harrop at June 5, 2003 12:32 AM

Agreed. Open *and* XML is better than just open. But I would rather have open and ugly-binary (with real documentation) than XML and no documentation.

Posted by: Adam Barr at June 5, 2003 06:50 AM

I agree completely about documented vs. undocumented. However, it might be useful for the success of your project if you would try to clarify at the top level:

1) What kinds of data formats you're talking about. XML is a good choice for comparatively simple, non-realtime document types, like those of office applications. For things like playable media data (audio, video, etc.), however, XML would balloon file sizes absurdly, transmission rates would plummet to the point of unusability, etc. In other words, for technical reasons there are domains where 'documented but binary' has its place.

2) Whether or not you're addressing the issue of IP-rights encumbrances. In other words, a file format could be both documented and XML-based, but nonetheless copyright- or patent-protected, such that it cannot be legally used without a license agreement, and perhaps even payment of money. I suggest you consider adding 'IPR-unencumbered' to your short list of important criteria. This is probably your assumption, but disambiguating it would be good.
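The size concern in point 1 can be checked with a quick Python sketch. The XML layout here (`<audio>` wrapping one `<s>` element per sample) is a made-up encoding used only for comparison, and the sample values are synthetic. The sketch packs the same 16-bit samples as raw binary and as XML text and compares byte counts before any compression.

```python
import struct


def as_binary(samples):
    """Pack samples as little-endian signed 16-bit integers (2 bytes each)."""
    return struct.pack("<%dh" % len(samples), *samples)


def as_xml(samples):
    """Encode each sample as a hypothetical <s> element holding decimal text."""
    body = "".join("<s>%d</s>" % s for s in samples)
    return ("<audio>" + body + "</audio>").encode("utf-8")


# Synthetic samples within the signed 16-bit range; not real audio.
samples = [(i * 37) % 20000 - 10000 for i in range(1000)]

if __name__ == "__main__":
    binary_size = len(as_binary(samples))
    xml_size = len(as_xml(samples))
    print("binary bytes:", binary_size)
    print("xml bytes:   ", xml_size)
    print("ratio:        %.1fx" % (xml_size / binary_size))
```

Compression would narrow the gap, but the parsing cost per sample remains, which matters for realtime playback.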

Don't mean to nit-pick, just trying to be helpful as I think your basic idea here is a good one.

Posted by: Chris Grigg at June 5, 2003 02:50 PM

1) I am not talking about any particular data format--whatever manufacturers want to define. I think XML and open is better than open, but I don't want to codify a preference for any particular format in the bill, because then what does that say about a company with an open non-XML format? I don't want to make anybody move to a new format unless they want to.

2) Patents etc. are a sticky issue. Again I don't want to take away any rights someone already has or make a format they own be worth less. So you can't just say "only formats with no IP encumbrance." However I will put something in there that if there is a patent on a data format, the company has to license it to the government in a Reasonable and Non-Discriminatory (RAND) way.

If I were being more hardcore I would say they had to license it to the government for free, but that would be an easy target for industry opposition.

- adam

Posted by: Adam Barr at June 5, 2003 03:15 PM