Issues in Data Formats

Archie Warnock, A/WWW Enterprises

Original Version Posted to Usenet in sci.data.formats


We're addressing the question of whether or not HDF is a good archival format (and along with that, the question of whether any other formats are better ones). I guess we'd better define some terms first. Some time ago, I described three different, competing and often conflicting uses for data formats:

1. Working formats - formats designed primarily for the purposes of data analysis and processing. Working formats should primarily emphasize operational efficiency - native numerical formats, direct access and efficient (minimal) storage requirements. Advanced (possibly arbitrary) data structures should be supported, and portable toolboxes for manipulation should exist to ensure broad utility.

2. Interchange formats - formats designed primarily for the purposes of exchanging data between different hardware platforms and software applications, perhaps on a variety of media. Interchange formats should be designed to utilize portable data representations, minimal file sizes and good software support for conversion to working formats. Preferably, they should be self-describing to avoid the problem of having the documentation become separated from the data.

3. Archival formats - formats designed primarily for the purpose of assuring the long-term storage and readability of data on a variety of platforms. Archival formats must ensure that the data is recoverable by future generations, on hardware and with software that we can't possibly anticipate. They should be designed to be self-describing without requiring special tools and should utilize portable data representations. The byte-by-byte descriptions of how the data is written should be commonly available in the standard archives of human knowledge.

Questions? Comments? Ok, let's go on.

Self-describing format - this means that the description of the data stream is an inherent part of the data stream and that the description is human-readable. A simple dump of the byte stream must yield information about the contents of the byte stream. Why? Because there is no assurance that, in all cases, a potential user will have access to specialized software which interprets the data stream.
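To make the idea concrete, here is a minimal Python sketch of a file whose description travels inside the byte stream. The layout is loosely modelled on FITS-style 80-character ASCII header records followed by binary data; it is an illustration under that assumption, not a conforming FITS writer, and the helper names (card, build_file, naive_dump) are invented for this example.

```python
# Sketch only: a FITS-like layout, NOT a conforming FITS writer.
# All function names here are hypothetical and exist just for illustration.

def card(keyword, value, comment=""):
    """Format one 80-character ASCII header record: KEYWORD = value / comment."""
    text = f"{keyword:<8}= {value:>20}"
    if comment:
        text += f" / {comment}"
    return text.ljust(80)[:80]


def build_file(width, height, pixels):
    """Return header bytes (padded to a 2880-byte block) followed by raw pixels."""
    cards = [
        card("SIMPLE", "T", "conforms to the standard"),
        card("BITPIX", 16, "bits per pixel"),
        card("NAXIS", 2, "number of axes"),
        card("NAXIS1", width),
        card("NAXIS2", height),
        "END".ljust(80),
    ]
    header = "".join(cards).encode("ascii")
    header += b" " * (-len(header) % 2880)  # pad header with ASCII blanks
    # Pixels as 16-bit big-endian signed integers (a portable representation).
    data = b"".join(p.to_bytes(2, "big", signed=True) for p in pixels)
    return header + data


def naive_dump(raw, width=80):
    """Dump bytes with no knowledge of the format: printable ASCII or '.'."""
    lines = []
    for start in range(0, len(raw), width):
        chunk = raw[start:start + width]
        lines.append("".join(chr(b) if 32 <= b < 127 else "." for b in chunk))
    return lines


if __name__ == "__main__":
    raw = build_file(2, 2, [1, 2, 3, 4])
    # The first few lines of a blind dump already say what the file holds.
    for line in naive_dump(raw)[:6]:
        print(line.rstrip())
```

The point of the sketch is naive_dump: it knows nothing about the format, yet its first lines show the keywords, values and comments needed to interpret the binary part that follows.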

Now, here are some assertions which I claim to be true. They address the issues of suitability of purpose and archival safety.

Assertion #1: At this time, no single format meets the requirements for all three uses. In fact, it may not be possible for any single format to meet the requirements. This is not due to inherent failings of the formats, but to the trade-offs required between efficiency and portability.

Justification: I believe this assertion is self-evident from the properties of the formats we've been discussing - HDF, CDF, netCDF, PDS VICAR and FITS. If there are other formats that are significantly better in all three areas, I'd like to know more about them. We all would.

Assertion #2: At this time, data in FITS format has greater assurance of being recoverable by future generations than does data in HDF. Note: this specifically does not address what to do with data that cannot be represented in one format or the other, nor does it imply that data stored in HDF will be necessarily lost. We're talking about assurances here.

Justification: Assume you have data written in format XXX, where XXX = your favorite format. Now assume that N decades from now, where N is some positive number, perhaps large, your data is found by someone poking through some old file cabinets at your institution. It may be on a tape or a CD-ROM or some newer medium. No documentation is found with the data.

(Note: this has, in fact, happened to a colleague of mine - he received a tape of data from the gas chromatograph mass spectrometer on Viking that had been deposited at the NSSDC. The documentation had been misplaced. The tape had been written in a binary floating point format on a (now obsolete) IBM mainframe. No current machine uses this numerical representation. If the documentation hadn't been subsequently located, that expensive data - millions of dollars - would have been permanently lost.)
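As an illustration of why the missing documentation mattered, the Python sketch below decodes one 32-bit word under the assumption that it is IBM System/360-style hexadecimal floating point (1 sign bit, 7-bit excess-64 exponent to base 16, 24-bit fraction). That the tape used this exact layout is an assumption made for the example, not something stated above, and the function names are hypothetical.

```python
import struct

# ASSUMPTION: the words are IBM System/360 single-precision hexadecimal
# floating point. The anecdote above does not specify the exact layout.

def ibm32_to_float(raw: bytes) -> float:
    """Decode a 4-byte big-endian IBM hexadecimal float into a Python float."""
    if len(raw) != 4:
        raise ValueError("expected exactly 4 bytes")
    word = int.from_bytes(raw, "big")
    sign = -1.0 if word >> 31 else 1.0
    exponent = (word >> 24) & 0x7F      # excess-64, power of 16
    fraction = word & 0x00FFFFFF        # 24-bit fraction, value in [0, 1)
    return sign * (fraction / 2**24) * 16.0 ** (exponent - 64)


def misread_as_ieee(raw: bytes) -> float:
    """Interpret the same 4 bytes as an IEEE 754 single, as a modern tool would."""
    return struct.unpack(">f", raw)[0]


if __name__ == "__main__":
    word = bytes.fromhex("41100000")
    print(ibm32_to_float(word))   # value under the assumed IBM layout
    print(misread_as_ieee(word))  # a different value from the same bits
```

The same four bytes give two different numbers depending on which layout the reader assumes, and nothing in the bytes themselves says which one is right. Only the written description of the format settles it.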

Questions: Is that data recoverable? What format has the greatest likelihood of being recoverable?

Which is safer - data in a self-documenting format as described above, or data which requires a format-specific "dump tool" to render the metadata readable? Note: this does not address the question of which does a better job of formatting the metadata or making it readable - nor should it. You don't care, at this point, how "nice" the presentation is. You just want to see "what's in the file".

Which is safer - data in a format described in detail in the refereed literature, available in virtually every university in the world, and in the Library of Congress, or data in a format described in detail in an internal publication of an institution which may no longer be in existence, or may have terminated support for the project development due to funding constraints?

Which is safer - data in a format which is documented in such a way that a programmer may, if necessary, write software to ingest the data in any arbitrary language on any arbitrary hardware platform, or data in a format that requires access through a toolbox which only works with certain languages on certain platforms and is supported by an institution which will not necessarily have the resources to broaden that support?
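For a sense of how little code "write software to ingest the data" can mean when the byte layout is published, here is a rough Python sketch of a header reader that uses no format-specific toolbox. It assumes FITS-style 80-character ASCII records with the keyword in columns 1-8 and "= " in columns 9-10; the function name read_header is hypothetical, and the value parsing is deliberately simplified (it would mishandle quoted strings containing a slash).

```python
# Sketch only: a simplified reader for FITS-style header records.
# ASSUMPTION: 80-character ASCII records, keyword in columns 1-8,
# "= " in columns 9-10, optional comment after a "/".

def read_header(raw: bytes) -> dict:
    """Collect KEYWORD -> value (as text) until the END record is reached."""
    header = {}
    for offset in range(0, len(raw), 80):
        record = raw[offset:offset + 80].decode("ascii")
        keyword = record[:8].strip()
        if keyword == "END":
            break
        if record[8:10] != "= ":
            continue  # commentary records carry no keyword value
        # Simplification: everything before the first "/" is the value.
        value = record[10:].split("/", 1)[0].strip()
        header[keyword] = value
    return header


if __name__ == "__main__":
    records = [
        "SIMPLE  =                    T / standard",
        "BITPIX  =                   16",
        "COMMENT free text",
        "END",
    ]
    raw = "".join(r.ljust(80) for r in records).encode("ascii")
    print(read_header(raw))
```

Nothing here depends on Python in particular: slicing fixed-width ASCII records can be done in any language on any platform, which is the property the question above is probing.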

Now look, I know NCSA tries to support HDF. I know that access to HDF is, in many cases, more convenient. I know that NCSA has internal documentation describing the internals of HDF. And I know that EOSDIS "trusts" HDF to be safe for archival storage. But none of those points are relevant to the discussion of whether HDF offers the assurances to make it safe for long-term archival storage.

Assertion #3: The process by which FITS has gathered the agreement and support of its user community and adopted the requirements and specifications of the format represents the minimum set of steps which must be taken to ensure the archival safety of data in FITS format. In addition, the process can be emulated by the user community of any other format to ensure the archival safety of data in that format. The standardization process is reproducible, and IT IS THE PROCESS THAT IS IMPORTANT, NOT THE DETAILS OF THE FORMAT.

Justification: There is, of course, a fundamental difference between placing information in the refereed literature and on any old piece of paper. The publication of the details of FITS format in the refereed literature not only ensures the stamp of approval by the individual referees and editors of the journal, but places the information and subsequent related discussion in the public record. Libraries and publishers take seriously the job of archival storage of the professional literature. It is the absolute minimum necessary to assure the availability of that information to future generations. Anyone who has done academic research knows this to be true.

This alone would ensure that data in FITS (or any other format that did the same) will be readable at any time in the future - I can always find out how to read the bits. But the FITS community has gone one step further. The format is tested and subjected to the consensus evaluation and approval of the user community - there are FITS committees in North America (under the aegis of the AAS), in Europe (under ESO) and in Japan. The recommendations of those committees are then adopted by the International Astronomical Union. These steps ensure that the details of FITS are part of the permanent body of knowledge maintained by the astronomical community. It will be accessible as long as the professional astronomical community exists. Data in FITS format is permanently accessible. It is standardized by the force of international agreements, by the entire astronomical community. The same cannot be said of any of the other formats we've discussed.

OK - this is long and is going to generate, I suspect, a lot of discussion. I want to add a couple of disclaimers.

1. We simply must separate the issues of ease of use and archival safety. They aren't the same thing. I know HDF is easy to use. That doesn't make it safe in the long term. I know FITS can be a pain to get right (though FITSIO makes the job easier). That doesn't make it unsafe.

2. I'm not advocating that anyone in particular force their data into FITS. Frankly, I don't care if any single data set is readable in 50 years or not, unless it's my data set. I am concerned about the body of scientific data as a whole, though. There are lots of things FITS doesn't do well - it's not a panacea. But the one thing it does do well is make data safe and permanently accessible.

3. Repeat after me:

The standardization process is reproducible, and IT IS THE PROCESS THAT
IS IMPORTANT, NOT THE DETAILS OF THE FORMAT.
The standardization process is reproducible, and IT IS THE PROCESS THAT
IS IMPORTANT, NOT THE DETAILS OF THE FORMAT.
The standardization process is reproducible, and IT IS THE PROCESS THAT
IS IMPORTANT, NOT THE DETAILS OF THE FORMAT.
The standardization process is reproducible, and IT IS THE PROCESS THAT
IS IMPORTANT, NOT THE DETAILS OF THE FORMAT.
The standardization process is reproducible, and IT IS THE PROCESS THAT
IS IMPORTANT, NOT THE DETAILS OF THE FORMAT...

warnock@awcubed.com