June 04, 2003

Sample Open Data Format Bill

This is my first attempt to write a sample open data format bill. It is based, in large part, on the open source bill proposed in Oregon, with some parts inspired by the bill proposed in Peru (which is an ancestor of the Oregon bill).

A BILL FOR AN ACT

  1. The [governing body] finds that:
    • The [government] archives, handles, and transmits information which does not belong to it, but which is entrusted to it by citizens. The [government] must take measures to safeguard the integrity and accessibility of this public data.
    • It is necessary to the functioning of the [government] that computer data owned by the [government] be permanently available to the [government] throughout its useful life;
    • While managers of computer systems give due attention to backing up data files and preserving such backups for future retrieval, much less attention is paid to ensuring that software that can read such backups, and the computer systems needed to run such software, remain available;
    • To guarantee the succession and permanence of public data, it is necessary that the [government]'s access to that data be independent of the goodwill of the [government]'s computer system suppliers and of the continued existence of those suppliers;
    • It is in the public interest to ensure exchange of computer data through the use of software and products that promote data stored in open formats;
    • It is also in the public interest that the [government] be free, to the greatest extent possible, of restrictions imposed by parties outside the [government]'s control on how, and for how long, the [government] may access the data it is storing on behalf of its citizens;
  2. The [governing body] further finds that:
    • Software that stores data in open data formats guarantees that its encoding of data is not tied to a single provider;
    • Complete documentation of formats used to encode data ensures that data files could be read at a future time, by writing new software to interpret the data files, even if the original software that encoded it was unavailable due to lack of computer hardware or software;
    • Open data format software encourages exchange of data between different software products, and
    • Properly designed encryption systems depend on the secrecy of keys and other information that is distinct from the format in which data is stored, thus public knowledge of a data format used to store encrypted data should have no negative effect on the security of that format.
  3. Therefore, it is in the public interest that the [government] use open data format software in its public computing functions.

Be It Enacted by the [people]:

SECTION 1:

  • 'Open data format' means a data format that encodes computer data in such a way that the specifications for the encoding:
    (A) Are available, in a human-readable, complete format, for all to implement;
    (B) Do not lock the user into a particular vendor or group;
    (C) Are free for all to implement with no royalty or fee except for a fee or fees required by the standards organization for certification of compliance;
    (D) Do not favor one implementer over another for any reason other than the technical standards compliance of an implementation;
    (E) Do not prohibit the implementation of extensions, but may employ license terms that prevent subversion of the standard through predatory practices; and
    (F) Have no restrictions on the creation of programs that store, transmit, receive or access data codified in such way.
  • 'Open data format' software is software that stores all user data that it receives, processes, and/or stores in data files that are encoded using open data formats. This rule extends to the on-disk file formats used by operating systems; however, it does not include the protocols used for network communication. It also excludes the following:
    • Data formats of files used only by the internal workings of the software, for example those that store configuration information not needed to retrieve user data.
    • Data formats that are not "native" to a particular piece of software, but are read or written by that software only to improve the exchange of data with other pieces of software.
  • 'Proprietary data format' software is software that is not open data format software.

SECTION 2. For all new software acquisitions, the person or governing body charged with administering each administrative division of [government], including every department, division, agency, board or commission, without regard to the designation given the entity, shall avoid purchasing software products that are not open data format software.

If the software in question is not publicly available, but was written for the [government]'s internal use, the data formats need not be made available to the public as provided in Section 1(A) above, but instead can be provided only to [government]. The data formats used must still satisfy Sections 1(B) through 1(F) above, and alternatives considered as described in Section 3 below.

SECTION 3. If no open data format software is available for a particular task, then [government] shall consider the following alternatives, listed in descending order of preference:

  • Software for which the source code is publicly available.
  • Software which stores its data in a text format, as opposed to a binary format.
  • Software which stores its data in a binary format.

In such cases, the [government] must provide written justification to the central purchasing agency as to why it was unable to obtain open data format software.

SECTION 4. For existing software, the [government] shall give manufacturers one year from the day this bill becomes law to provide documentation on the file formats used. If after one year a manufacturer has not provided such documentation, then a replacement for that piece of software must be sought, following the guidelines outlined above.

Posted by Adam Barr at June 4, 2003 12:22 PM

Comments

This bill does not catch the case of using open formats like XML or ASN.1 . . . but keeping the data type definitions proprietary. An open encoding or transport (like XML or the OGG bitstream) is not sufficient to qualify as an "open data format".

Posted by: Eric Northup at June 4, 2003 07:05 PM

Does this mean government databases that store their files in non-open format (because they use something they wrote themselves, or because they are using Oracle or some other non-open database) won't be allowed to back up their files?

Posted by: Joseph Shraibman at June 4, 2003 07:12 PM

"Software for which the source code is publicly available" might be read to include things like M$'s "Shared Source" where they will give you the source for free if you sign a restrictive NDA

Posted by: Joseph Shraibman at June 4, 2003 07:14 PM

Section 3:
"
Software which stores its data in a text format, as opposed to a binary format.
Software which stores its data in a binary format.
"

Perhaps the definition of 'text format' and 'binary format' is not precise enough. 'human readable' might be a better term, but that too has problems (is XML 'human readable'?)

maybe 'Software which stores its data in a format readily interpreted by any competent professional'

At any rate, 'text format' is a slight loophole. There are many encodings that are in 'text format' which could not be understood without knowing the encoding algorithm, which effectively makes it unreadable. A piece of proprietary software can get a leg up on the competition in this way.

Just look at the 'whitespace' language. If some code in that language were sent back ten years, it would be completely unintelligible to anyone. But it would still qualify as text format.

Posted by: witheld at June 4, 2003 07:28 PM
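
The "text format" loophole raised in the comment above can be sketched in a few lines of Python. This is only an illustration: the record layout, the values, and the names `LAYOUT` and `is_printable_ascii` are made up for the example and do not come from the bill or from any real product.

```python
import base64
import struct

# Hypothetical record (an ID and an amount) packed in an undocumented binary layout.
LAYOUT = "<If"  # assumed for illustration: little-endian uint32 followed by float32
record = struct.pack(LAYOUT, 1042, 99.5)

# Base64 turns those bytes into a file made only of printable ASCII characters...
as_text = base64.b64encode(record).decode("ascii")


def is_printable_ascii(s: str) -> bool:
    """True if every character is a printable 7-bit ASCII character."""
    return all(32 <= ord(c) < 127 for c in s)


print(as_text, is_printable_ascii(as_text))

# ...but recovering the values still requires knowing LAYOUT,
# which a vendor could keep secret.
record_id, amount = struct.unpack(LAYOUT, base64.b64decode(as_text))
print(record_id, amount)
```

The file would pass a naive "is it text?" test, yet it is no more readable without the layout than the raw binary record was.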

While the sentence beginning "It is in the public interest..." broadly covers this point, it is well worth making explicit that the government should not compel those with a right to government data to purchase proprietary products necessary to read closed data formats used in the storage of government data.

Posted by: Karl O. Pinc at June 4, 2003 07:59 PM

Does the government have "computing functions" that are not "public computing functions"?

Posted by: Karl O. Pinc at June 4, 2003 08:01 PM

The government could require vendors of software which utilize proprietary data formats to place in escrow the source code of the product, or the source code of another program which successfully reads the proprietary data format and converts it in a lossless fashion to an open data format as certified by a third party via testing on a statistically valid sample of existing documents similar to those used by the government. The escrowed program would be released into the public domain when the vendor ceases supporting the proprietary data format program. This option would be placed somewhere in the hierarchy of preference if no open data format software is available.

It is not enough to have a specification of the closed data format. We all know that programs do not perform according to specification. Working code is required.

Posted by: Karl O. Pinc at June 4, 2003 08:10 PM
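
The kind of third-party check suggested in the comment above (convert a sample of documents and confirm nothing is lost) might look roughly like the following Python sketch. Everything here is an assumption made for illustration: the "closed" record layout, the JSON target format, and the function names `to_open`, `to_closed`, and `is_lossless` are all hypothetical.

```python
import json
import struct

HEADER = "<IB"  # assumed closed layout: uint32 id, uint8 name length, then the name bytes
HEADER_SIZE = struct.calcsize(HEADER)


def to_open(blob: bytes) -> str:
    """Convert one hypothetical closed-format record to JSON."""
    record_id, name_len = struct.unpack_from(HEADER, blob, 0)
    name = blob[HEADER_SIZE:HEADER_SIZE + name_len].decode("utf-8")
    return json.dumps({"id": record_id, "name": name})


def to_closed(doc: str) -> bytes:
    """Rebuild the closed-format record from its JSON form."""
    obj = json.loads(doc)
    name = obj["name"].encode("utf-8")
    return struct.pack(HEADER, obj["id"], len(name)) + name


def is_lossless(samples: list[bytes]) -> bool:
    """True if every sample survives closed -> open -> closed unchanged."""
    return all(to_closed(to_open(blob)) == blob for blob in samples)


good = [struct.pack(HEADER, 7, 4) + b"deed", struct.pack(HEADER, 8, 0)]
# A record carrying an extra byte that the converter silently ignores.
with_hidden_data = struct.pack(HEADER, 9, 4) + b"lien" + b"\x01"

print(is_lossless(good))
print(is_lossless(good + [with_hidden_data]))
```

A round-trip test of this sort only shows that the converter handles the sampled documents, which is why the comment asks for a statistically valid sample.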

I suggest that the standard for an "open data format" be that of an IETF RFC, that there be at least 2 independent, working implementations of the open data format that seamlessly interoperate. Without working code it's all vapor.

Posted by: Karl O. Pinc at June 4, 2003 08:13 PM

The federal government already has many standards
for acquisition of information technology.

It is also very well aware of legacy hardware,
software, and data format issues, as well as
the cost of dealing with such systems over time.

Moreover, many packages like Word allow storage
of data in either openly specified or proprietary
formats, so it is an open question as to what
category the software falls in, and categorization
could result in many lawsuits.

Big standardization didn't end up working
out so great (Ada), so the government standardizes
internally where necessary, and tries to
choose the most cost effective solution meeting
its technical needs.

p.s. 1B really is ambiguous. Software written
for a particular platform 'locks' one into a
group, independent of whether its data is
openly specified. It is also out of context.
The purpose of the bill is to encourage open
formats, not to keep the government from
implementing vendor specific solutions; they are
a necessity as often one vendor is leaps and
bounds ahead of another.


Posted by: er at June 4, 2003 08:20 PM

This is a highly worthy effort. I really hate it when my kid's school emails me documents in WORD format. There is no way a government office should be forcing me to use a particular vendor's software to the exclusion of all others.

I believe that in communicating with parties outside of government, a higher standard should apply.

It should be made explicit that in documents and other data that are created for communication with parties outside of a government function, Open File Formats shall ALWAYS be used. The exclusion that allows for a non-Open format to be used if no Open format tools are available should not apply in this case. If no Open format tool exists then one should be created - or some other way to achieve the desired ends should be found.

Posted by: Steve Baker at June 4, 2003 09:00 PM

The section:

If the software in question is not publicly available, but was written for the [government]'s internal use, the data formats need not be made available to the public as provided in Section 1(A) above, but instead can be provided only to [government].

Is still not acceptable. Any data stored by the "government" must be available to the "people" because the people ARE the government.
The government can easily document the data format since they created it in the first place. There is simply no reason for it to be exempt.

Posted by: Mark Alexander at June 4, 2003 09:17 PM

The gov't is already supposed to use SGML for lots of stuff. The USPS also has a file format based on some old-ass COBOL or some krap like that. Maybe XML-ize everything? Then there needs to be some giant FOIA repository that can store & search all this krap. It doesn't get much simpler than XML.

Posted by: skewld00d at June 4, 2003 09:30 PM

Replies to comments:

Eric (and skewld00d): I agree, I would not consider XML to automatically be "open". I'll have to clarify that.

Joseph: Ideally if Oracle does not open its format a government would stop using Oracle...but if there is no alternative, then they could keep using it per section 3. "Publicly" available source would not include "shared source". However, I should probably put something like "shared source" on the list of acceptable alternatives, but below truly open source. At least it is better than nothing.

withheld (!): Yes, I need to tighten up what "text format" is. Should be defined separately in Section 1 at least.

Karl: Well, ideally government software would use open formats. However, if there is nothing else available, then proprietary formats could be used. I think you have to leave some sort of "out", you can't just require that all formats be open. I agree it is a problem guaranteeing that specs are accurate...not sure how to codify this in law however.

er: Word stores in many formats but only the native format is "lossless", keeping all the exact semantics of Word. 1B could be improved, I cut-and-pasted it from an open source bill.

Steve: I don't want to put too much of a burden on the government. If there really is no open data format software available for some particular function, then I don't want to just require that the government produce one.

Mark: That section was meant to cover software that is not generally available to the public. Yes there might be value in having the format totally public...but I am trying to avoid the situation where there are only a few vendors selling a particular solution and they all get together and say "NO we won't release our formats." Allowing a company to release the format only to the government (i.e. *not* to its competitors) might be enough to make one of them defect and agree to do so.

Thanks for all the comments, I will try to integrate them into version 2 of the bill soon.

- adam

Posted by: Adam Barr at June 4, 2003 11:20 PM

When is a format open? MS says that RTF is an "open format", because there exists a "human readable" description of the RTF format. But everyone who has been writing software that can read or write RTF knows that it has many hidden features and quirks. A format should only be regarded as open when there is a formal description of it, including both syntax and semantics.

Posted by: Frans Faase at June 4, 2003 11:52 PM

What about PDF? I don't know about the US, but in the UK a huge amount of government data is being put in PDF format - ironically to make it available to the public. The data may be readable, but it locks government departments that use it into using proprietary software if they want to edit it in the future. You use a single term - 'accessibility' - without defining this further; is a write-once read-many format accessible?

Posted by: graham at June 5, 2003 03:46 AM

There are a few limitations not specifically addressed in this draft bill. Specifically, section 3 needs clarification and refinement and the identification of support software required is not addressed in this bill.

Section 3 identifies three items for consideration. First was 'Software for which the source code is publicly available.' The entire object of this proposed bill (I thought) was not to identify or establish rules for source code. The end result is that we don't care "how" you deal with the data, only that it remains accessible, is protected (if necessary), and retains its integrity. I recommend sticking to the data format argument to avoid the "open source" argument.

Next, "Software which stores its data in a text format, as opposed to a binary format." Let's face it. When you get right down to it, all data on a computer is in a binary format. Your letter A may equate to letter A to a user, but its machine code is 41 in hex, or 65 in decimal (I think you get the point). It is the software that interprets this data to make it readable. Thus this should not be an option.

The last of this list, "Software which stores its data in a binary format." As stated above, all data is stored in a binary format making this, by default, applicable to all software.

I would think a better list might be:
1) Independent solutions that allow data to be imported and exported without compromising the data integrity.
2) Solutions which require minimal support systems to operate and maintain (minimal would need to be defined, but indicate level or type of platform support required)
3) Solutions for which application and data are currently only supported by particular vendors/software (the purchasing agency should then be required to seek or construct an application to support this function).

The second item is the necessity of "support" required. For instance, Word for Windows won't run (without a significant amount of work) on a Linux-based operating system, which bears directly on reliability (though objective data is hard to come by and everyone has an opinion on this topic) and cost factors. It may be necessary to indicate that the solution(s) should not be directly related to any hardware/software used to store or utilize data (again, focus on data rather than code).

Posted by: Jason at June 5, 2003 06:15 AM
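
The observation in the comment above that text is also stored as binary can be checked directly. This is a minimal Python sketch; the variable names are just for illustration.

```python
# The character "A" and the byte that stores it under the ASCII encoding.
encoded = "A".encode("ascii")

as_hex = encoded.hex()                # the byte written in hexadecimal
as_decimal = encoded[0]               # the same byte as an integer
as_bits = format(encoded[0], "08b")   # the same byte as eight binary digits

print(as_hex, as_decimal, as_bits)

# The byte only becomes "text" again through an agreed, documented encoding.
decoded = bytes([0x41]).decode("ascii")
print(decoded)
```

What separates a "text" file from any other binary file, then, is that the mapping from bytes to characters is itself a published encoding.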

Frans: I don't know details on RTF, but people I have talked to about it say it is a standard. It depends what the "quirks" are. I agree it would be best to have a formal BNF-type definition of file formats, but again I did not want to put that in the law. Perhaps a good spec could be defined something like "such that one skilled in the art could write a program to interpret the data etc..."

graham: Yes I would consider PDF to be "open". It's an interesting point about "write once read many", but the main goal is to be able to retrieve data, for use by other programs and for preserving later. I may be open to being convinced otherwise on this point.

Jason: Section 3 is the "if we can't get docs, what is the next best thing" catchall. So open source is better than no source because you could reverse-engineer the format (at great length). "Text" format does need to be defined better, perhaps in terms of certain characters used. And the last one, about binary formats, is meant to be all other programs that store data in binary proprietary formats, like Word -- the least acceptable way, but what you may be stuck with.

Those are good ideas for other more acceptable alternatives.

- adam

Posted by: Adam Barr at June 5, 2003 07:05 AM

I think this is a great idea. I wonder if the concept of an open data format has been sufficiently nailed down by the language of the bill template, however.

Perhaps giving a few examples of data stored in a Word or Excel file in the text of the bill would help.

Posted by: Chris Marshall at June 5, 2003 07:29 AM

Er has a good point, the government often already has the capability to save all its data in open formats (ASCII if nothing else) and does not. Perhaps you should be working on a separate bill which requires the government to keep its documents stored in open formats and to use open formats whenever it makes information publicly available.
Without such a practice, an open format data law is moot.

Posted by: Karl O. Pinc at June 5, 2003 07:33 AM

The definition of open format should include something like the phrase "a complete, accurate, and publicly-available formal specification".

As a bridge, you might want to add as the first alternate in Section 3, software licensed conditionally upon the promise to deliver an open format and a conversion utility within a year.

In the case of a program that can write multiple formats, it might be useful to require an open format as the default format for storing data on baseline (not customized) installations.

Posted by: Scott Zak at June 5, 2003 08:13 AM

Karl: Even if a program can save data in an open format (and usually the "open" formats are not as semantically rich as their native format), most people won't, because they choose the default. So in 20 years you will still have a lot of Word .DOC files lying around (even in Word 2003 which supports saving as XML, .DOC is still the default).

Scott: Good suggestion about the conditional license. For a program that writes multiple formats that are all native to it (such as Word 2003 doing .DOC and WordML) I would rather that all formats be documented...most users will choose the default format but some may not. So if the default is open but the secondary one is closed, binary, etc. then documents in that secondary format would still be "lost". I agree we should not let companies get away with saying "We do support one open format but it is not the default."

Posted by: Adam Barr at June 5, 2003 08:55 AM

One more thing for Frans, when I wrote the initial article about ODFI over a year ago (http://www.osopinion.com/perl/story/16034.html) I did talk about

"Design a standard way for describing data formats (an 'ODFI description') and a program to validate that a data file conforms to the ODFI description (known as being 'ODFI compliant'). I envision something along the lines of Backus-Naur Form with a validation program that, given an ODFI description and a data file, could give the meaning of any byte in the file."

At the time I figured I would need such a thing as technical backing before I could propose any actual legislation -- that I could not simply say "requires complete, accurate documentation" or somesuch. However in the meantime the various open source bills have been proposed and I thought it was time to get moving on an open data format law quickly. Plus, obviously it is a fair bit of work to produce something like that!

- adam

Posted by: Adam Barr at June 5, 2003 09:05 AM
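
To make the idea of a machine-checkable format description more concrete, here is a rough Python sketch of a validator that, given a description and a data file, reports the meaning of any byte. It is not the ODFI design: it uses Python `struct` format codes rather than Backus-Naur Form, handles only a fixed-layout record, and the `Field` type, the `DESCRIPTION` contents, and both function names are hypothetical.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str      # hypothetical field name
    fmt: str       # struct format code giving the field's size and type
    meaning: str   # human-readable explanation of the field


# Hypothetical description of a made-up fixed-layout file format.
DESCRIPTION = [
    Field("magic", "<4s", "file signature"),
    Field("version", "<H", "format version number"),
    Field("count", "<I", "number of items recorded"),
]


def annotate(data: bytes, description: list[Field]) -> list[tuple[int, int, Field, object]]:
    """Return (start, end, field, decoded value) per field, or raise if the size is wrong."""
    expected = sum(struct.calcsize(f.fmt) for f in description)
    if len(data) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(data)}")
    spans = []
    offset = 0
    for field in description:
        size = struct.calcsize(field.fmt)
        (value,) = struct.unpack_from(field.fmt, data, offset)
        spans.append((offset, offset + size, field, value))
        offset += size
    return spans


def meaning_of_byte(index: int, data: bytes, description: list[Field]) -> str:
    """Look up which described field the byte at `index` belongs to."""
    for start, end, field, _ in annotate(data, description):
        if start <= index < end:
            return field.meaning
    raise IndexError(index)


sample = struct.pack("<4sHI", b"DEMO", 1, 3)
print(meaning_of_byte(5, sample, DESCRIPTION))
```

A real description language would also need variable-length fields, repetition, and semantic constraints, which is where the "fair bit of work" comes in.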

This bill is a piece of crap!!!
I can't believe that you would entrust any info to the government. The first line of this bill destroys it!!

Posted by: Josh Truby at June 5, 2003 01:15 PM

On first blush, this is a compelling idea: take a lesser, and less controversial step towards openness. But as the discussions on slashdot and here have shown, there are potentially many problems with an open data format, just as there are with an open source requirement. Even defining your terms can be a significant challenge. In fact, it is difficult to make a truly and completely compelling argument for either open *or* proprietary. Or perhaps, an equally compelling argument for both.

I think you need to look more closely at the real goal. Is open source the goal? I may believe open source is better, but I also believe that compelling you to believe what I believe is the real mistake. I don't think we'll *ever* reach unanimity on the subject of "better," and that's my real point. Is proprietary better? Is open source better? What is truly better is that we be allowed a choice.

The truth is, I don't care how software vendors and OS purveyors do it. I don't care what or whether they charge for it. I care whether or not I have other choices, and that the vendor's means do not *prohibit* me from switching to another vendor. It is not per se important that kword can open a .doc file. It is instead important that the work I do to create a file in either program ought to transfer seamlessly from one to the other through some reasonable means (i.e. an external standard), and that either program clearly identify anything that won't transfer.

The real goal should be that anyone, be it government, corporation or consumer, should have a fair choice of what vendor they want for any service or product. And they should be *guaranteed* the right to switch when they want to, to whom they want to, seamlessly. Although the Open Data Format is an attempt at this guarantee, I think it falls short, and is fraught with the same potential debates that open source is. Instead, I argue that the goal should be a guarantee of adherence to open standards and/or specifications for the purposes of data portability. If Oracle wants to use a patented and proprietary method of storing my data so that I can access it faster, that's actually of benefit to me. However, I should not be *stuck* with Oracle.


Briefly, I might state it like this:

"Any software product that this [government, corporation, individual] uses to create and manage data of any kind, whether static or dynamic, should be required, on demand, to produce all of this [government, corporation, individual]'s data in a standard or specification approved by at least one of the following international standards bodies:

[list goes here - W3C, ISO, etc. etc.]

"Further, any software product that meets the above must also be able to demonstrate on demand that such output unanimously pass all rules of an independent test that guarantees the data will be output according to that specification or standard without deviation, error, omission or exclusion. This test must be approved by the standards or specification body in question."

This is obviously too brief, but is perhaps a start in the right direction. The problem is not whether or not I have access to the source, nor whether or not I can read the format the data is stored in by default, but whether I can take my data elsewhere if need be.

$.02

neil

Posted by: neil verplank at June 6, 2003 06:21 PM

Let me categorically state the goal is NOT open source. And I am leery of requiring standards bodies, because standards can't always capture the full flavor of the data. Plus a lot of kinds of data probably don't have a standard.

Allowing output in a documented format is a good start. In fact I might put that as one of the preferred alternatives. But that still requires a system that can run the program, which may not be the case in the future. People do understand about preserving backups of files (and putting them on new media as old media ages) but they are not (and maybe cannot be) doing much about preserving the hardware and OS that applications need to run.

- adam

Posted by: Adam Barr at June 6, 2003 10:28 PM