Digital Imaging Technology for Preservation: Proceedings from an RLG Symposium Held March 17 and 18, 1994: Cornell University, Ithaca, New York, ed. Nancy E. Elkington (Mountain View, Calif.: Research Libraries Group http://www-rlg.stanford.edu/welcome.html, 1994), p. 41-58.
What follows focuses first on ideas of authenticity, or intellectual preservation, and then on ideas of longevity, both of which are crucial to preservation of electronic information. This is a symposium on digital imaging technology, but digital images are a very important subset of electronic information more generally and the issues of intellectual preservation and longevity are important to both.
Lesk describes the rapid changes in the means of recording, in the storage formats and in the software that allows electronic information to be of use. Urging what might be called technology preservation, he asserts that for electronic information, "preservation means copying, not physical preservation." That is, the preservation of electronic information into the indefinite future requires its being "refreshed" from old to new technologies as they become available and as the old technologies cease being supported by vendors and the user community.
New versions and drafts are familiar to us from dealing with authorial texts, for example, or from working with successive book editions, legislative bills, or revisions of working papers. It is desirable to keep track bibliographically of the distinction between one version and another. We are accustomed to visual cues to tell us when a version is different; in addition to explicit numbering we observe the page format, the typography, the producer's name, the binding, the paper itself. These cues are not available or dependable for distinguishing electronic versions.
Structural updates, changes that are inherent in the document, also cause changes in information content. A dynamic data base by its nature is frequently updated: Books in Print, for example, or architectural drawings, or elements of the human genome project, or a University directory. How do we identify a given snapshot and authenticate it as representing a certain time?
In an electronic future the opportunities for a Stalinist revision of history will be multiplied. An unscrupulous researcher could change experimental data without a trace. A financial dealer might wish to cover tracks to hide improper business, or a political figure might wish to hide or modify inconvenient earlier views; this is the 1984 scenario. Imagine if you will that the only evidence of Reagan's Iran-Contra scandal was in electronic mail, or that the only record of Bill Clinton's draft correspondence was in e-mail. Consider the political benefit that might derive if each of the parties could modify their own past correspondence without detection. Then consider the case if each of them could modify the other's correspondence without detection. Society needs a defense against such cases, and the parties involved need also to be able to say "here is a true electronic record."
A solution will have to be algorithmic; that is, it will have to be based on simple software rather than on hardware, which rapidly becomes obsolete. This would seem to be a problem, for software, like documents themselves, can easily be tampered with and modified.
A small group of researchers has developed a solution that offers promise, and I will discuss it here as an example of how the problem can be solved. Scott Stornetta and Stuart Haber of Bellcore have named their proposal digital time-stamping (DTS).# It calls upon the cryptographic technique of one-way hashing and uses the concept of the "widely-witnessed event," and is a means of authenticating not only a particular document, but its existence at a specific time. The technique is analogous to rubber-stamping incoming papers with the date and time they are received. In electronic form its use is proposed to be by a document's creator (or other responsible intermediate party) to set up the necessary conditions for later authentication by an eventual user.#
The researchers were initially prompted to develop DTS by charges of intellectual fraud made against a biologist. They became interested in how to demonstrate that there had been no tampering with electronic evidence. In addition, they were aware that the technique could be useful as a means for determining priority of thought (e.g. in patents). The technique they developed makes use of cryptographic theory but does not require the encryption of documents.
Note that using this technique the document itself may be kept private if its creator wishes. However, it need not be, and in many cases would not be. For librarianship and scholarship generally, the public accessibility of documents without human intervention is a necessity, and the one-way hash allows both a document and its hash to be public without fear of undetected change. Note also that the algorithm (software) for creating the hash may also be public; its mechanism need not be private, for knowledge of it will affect neither the uniqueness nor the one-way nature of the created hash.
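The behavior of a one-way hash can be illustrated with a short sketch. It is only an illustration: the choice of SHA-256 from Python's standard hashlib module, the function name document_hash, and the sample texts are all assumptions made here, not part of the DTS proposal.

```python
import hashlib


def document_hash(data: bytes) -> str:
    """Return a fixed-length hex fingerprint of the document's bytes.

    SHA-256 is an assumed stand-in; the talk does not name a hash function.
    """
    return hashlib.sha256(data).hexdigest()


# Two hypothetical documents that differ in a single character.
original = b"Minutes of the meeting, 17 March 1994."
altered = b"Minutes of the meeting, 18 March 1994."

fingerprint = document_hash(original)

# An unchanged copy always reproduces the same fingerprint...
unchanged_copy_matches = document_hash(original) == fingerprint
# ...while even a one-character change produces a different one.
altered_copy_matches = document_hash(altered) == fingerprint
```

Because the function is public and deterministic, anyone holding both the document and its published fingerprint can repeat the calculation and see for themselves whether the two still agree.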
But one can also imagine situations some years, decades, or centuries from now in which it will be desirable to be assured as to when the document first existed. In patent and contract law, which DTS will also serve, this is a daily necessity. In scientific research the need is clear, as it is if one considers stylistic analysis of an author's growth using electronic manuscripts as evidence.
The "widely-witnessed event" is a concept that draws on the difficulty of tampering with a fact that is known to many outside the circle of interested parties. State lotteries prevent both collusion and the appearance of collusion by publicizing the drawing of the winning numbers, often on television. Everyone sees the numbers drawn as they are drawn so that it is not possible for officials of the lottery to arrange the winner in advance.
DTS draws on the principle of the widely-witnessed event by openly intertwining the hash of a given document with the hashes of other documents submitted unpredictably by unknown other parties. The combined hashes for each document (known as "certificates") depend upon a visible chain of actions of other similar parties such that tampering cannot occur without being immediately evident to an observer.
The user at a client workstation, perhaps a PC, creates the hash of a document (this can be very quickly done at the click of a software button). She sends the hash over the network to a time-stamping server, which combines the hash with a hash previously received.# The resulting number is called the "certificate" for the present hash, and is sent back to the user's workstation. This certificate becomes part of the authentication means for the original document whether used in the next half hour or the next half century. Note that the certificate is inextricably intertwined with those previously created for hashes received in unpredictable order from unknown (and unpredictable and uninfluenceable) users. The time-stamping server might easily be constructed to serve a region as large as the United States.
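The exchange between workstation and server can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the Haber/Stornetta protocol itself: the class name TimeStampServer, the verify function, the certificate layout, the use of SHA-256, and the all-zero starting value are hypothetical choices made for the example.

```python
import hashlib


class TimeStampServer:
    """Toy server that links each submitted hash to the one before it."""

    def __init__(self) -> None:
        # Arbitrary starting value for the chain (an assumption).
        self.previous = b"\x00" * 32

    def stamp(self, doc_hash: bytes, when: str) -> dict:
        # Combine the new hash with the previously issued value and the time.
        linked = hashlib.sha256(
            self.previous + doc_hash + when.encode("utf-8")
        ).digest()
        certificate = {
            "doc_hash": doc_hash,
            "time": when,
            "previous": self.previous,
            "certificate": linked,
        }
        self.previous = linked  # the next submission will depend on this one
        return certificate


def verify(document: bytes, cert: dict) -> bool:
    """Recompute the certificate from the document and compare."""
    doc_hash = hashlib.sha256(document).digest()
    expected = hashlib.sha256(
        cert["previous"] + doc_hash + cert["time"].encode("utf-8")
    ).digest()
    return doc_hash == cert["doc_hash"] and expected == cert["certificate"]


# The client hashes locally and sends only the hash, never the document.
server = TimeStampServer()
thesis = b"Draft thesis, chapter one."
other = b"A stranger's laboratory notebook."
cert_a = server.stamp(hashlib.sha256(thesis).digest(), "1994-03-17T09:00")
cert_b = server.stamp(hashlib.sha256(other).digest(), "1994-03-17T09:01")
```

In this sketch the second certificate is computed from the first, so altering either record after the fact would break every later link in the chain.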
The time-stamping server creates a root certificate which is widely published at regular intervals. As a demonstration, the technique's authors for several years have published a root certificate once a week in the personals column of The New York Times.# Such a widely-witnessed event, available for centuries on microfilm or other means, is a tamper-proof tool of authentication. In real-world practice, the intervals would be much shorter, perhaps one minute.
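The talk does not say how the many certificates issued during an interval are condensed into the single published value. One possibility, offered here only as an assumption, is a binary hash tree in which pairs of values are hashed together until one root remains. The function name interval_root and the use of SHA-256 are hypothetical.

```python
import hashlib
from typing import Sequence


def interval_root(hashes: Sequence[bytes]) -> bytes:
    """Fold all hashes received in one interval into a single root value.

    Assumed construction: pairwise hashing, with the last value repeated
    whenever a level holds an odd number of entries.
    """
    if not hashes:
        raise ValueError("at least one hash is required")
    level = [hashlib.sha256(h).digest() for h in hashes]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


# Three hypothetical submissions received during one interval.
submissions = [b"hash from user one", b"hash from user two", b"hash from user three"]
root = interval_root(submissions)
```

Under this construction the published root depends on every submission in the interval, so no single entry can be changed later without the printed value ceasing to match.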
DTS is being presented as a solution of value to a number of information communities, for example banking, law, pharmaceutical companies, and government. Its proposers have been intrigued by unique library requirements including long functional life on the human scale. But if DTS were to be used in research librarianship, several practical matters would have to be worked out.# These include
Digital time-stamping may provide for many of the needs of the library (and archival) communities for long-term authentication of electronic information. If this approach turns out not to be suitable, however, it is likely that one relying on similar techniques will be found. In any case, it is important that libraries identify some solution that allows scholars, students, readers, publishers and information users to have confidence that their electronic resources are authentic.
I can informally put the new problem in the following way (and this audience will know how oversimplified this is): roughly speaking, if I put a book in a room and close the door, when I open the door in 500 years the book will still be there. If I do that for a quantum of electronic information, stored on some device, the same will not be true, not even perhaps in ten years. Different requirements arise out of the difference between the artifactual nature of information up until now and the evanescent instantiations of electronic information.
In this decade the increasing volume of scholarly electronic information has brought many research libraries to attempt ways of providing it to their clienteles. As yet, however, no research library has taken on the expensive, uncharted and difficult task of providing an electronic repository which organizes, provides and preserves information at levels of long-term commitment established for print materials.
Here I'd like to set out some of what is likely to be required for a digital repository to be successful. Such a repository, a Digital Research Collection (DRC), will from the start be committed to organizing, storing and providing electronic information for periods of time longer than human lives. The information will be accessible through present and emerging access and retrieval technologies. A DRC will benefit from the experience of libraries in providing standardized cataloging techniques and other means of organization. A DRC will be preserved over long periods through several technological means: by protection of the medium on which it resides, by regular transfer of data to new technological media, and by migration of information through software technologies.# The repository will guarantee the integrity of the information stored against accident, thoughtlessness and fraud by providing authentication techniques that combine mathematical security with ease of use, public trustworthiness and privacy protection.#
Implementation of the DRC will require two kinds of products, the repository itself and the tools for use with it, and will require commitments of several new kinds. In what follows the technical requirements are given the most space, yet they probably present the easiest problems to solve; they just cost money. The final section on organizational commitments comprises by far the most difficult set of tasks. The issues are described here in cursory form. Each could be developed in great detail (and I'm glad to see that Don Waters is also addressing them at this symposium), but at the moment the outline and overall program need most to be defined. Work needs to begin.
All these are only examples. None, of course, should automatically be selected; collection development policies should be adapted and followed. The continuing substantial costs of providing electronic information will require that electronic collection decisions be made even more carefully and parsimoniously than for print.
Staged access: Alternatives for providing immediate online access to a great volume of electronic information need to be examined. Does it all need to be immediately available? What can be off line, and how would it be retrieved? Present alternatives include magnetic disk, optical disks and jukeboxes, optical disks shelved, and remote storage; perhaps even magnetic tape.
Backup mechanisms: Backup/restore procedures must be in place and must be automated and economical, for libraries are never likely to have expensive labor available in quantity. Backups must be multigenerational, all the way to Iron Mountain, with regular disaster simulations and tests.
Data structure standards: In a repository, does information simply exist as is, or is complementary information associated with it (widely differing examples might include SGML headers, ICPSR codebooks, picture captions, hypertext links and early software versions for use with data files)? If there is an association, is it through use of header portions of a file or supplemental files? How are they indicated and connected?
Refreshing mechanisms: Refreshing will be necessary for long-term preservation across both media and software. There will be organizational and bureaucratic issues in addition to the simply technical: e.g. copyright (if information is copied from magnetic to optical disk through an upgraded software version, copyright issues must be recognized), automation to reduce labor costs, workflow and record-keeping, migration techniques, and standards and techniques that will apply independently of technology. It might be possible to link refreshing to backup techniques for expedience and economy.
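The simplest case of refreshing, a bit-for-bit transfer from an old medium to a new one, lends itself to automation along the following lines. This is a minimal sketch under stated assumptions: the function names file_digest and refresh, the file names, and the choice of SHA-256 are hypothetical, and two files in a temporary directory stand in for the old and new media. Migration through software versions, which changes the bits themselves, is not covered by a check of this kind.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """Hash a file in chunks so that large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(source: Path, destination: Path) -> str:
    """Copy a file to a new location and confirm the copy is identical."""
    before = file_digest(source)
    shutil.copyfile(source, destination)
    after = file_digest(destination)
    if before != after:
        raise IOError(f"refreshed copy of {source} does not match the original")
    return after  # a value that could be kept in the repository's records


# Stand-ins for the old and new storage media.
with tempfile.TemporaryDirectory() as workdir:
    old_copy = Path(workdir) / "old_medium.dat"
    new_copy = Path(workdir) / "new_medium.dat"
    old_copy.write_bytes(b"electronic text held by the repository")
    recorded_digest = refresh(old_copy, new_copy)
    copy_is_identical = new_copy.read_bytes() == old_copy.read_bytes()
```

Recording the digest at each refresh would give later staff a way to confirm that nothing changed between one generation of storage and the next, and the same check could run as part of routine backups.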
Fees and freedom, which in practice are often linked issues: Standards and techniques will be necessary to solve a knot of problems surrounding access and ownership, including
Authentication and Integrity: Intellectual preservation, as discussed above. If using the DTS technique or one like it, the DRC would: establish a satellite DTS server and certificate validator, implement DTS on multiple formats, publish root certificates to make authentication practicable, and establish standards and conventions for use and citation.
Redundancy: It will be important to establish standards for the number of redundant repository locations necessary to assure long-term existence of specific electronic information and access to it. One location won't do for a particular major electronic document or set; will two, or three? How many? Major institutions may separately or consortially establish repositories. It is not yet clear how much redundancy of their components will be desirable among them.
Aside from assuring longevity, other issues come to bear on decisions to provide multiple permanent copies of electronic information. Geographic location will still play a role (at least trans-oceanically, and probably transcontinentally). The historical interplay between costs of network bandwidth and response time, and costs of storage, will require informed decisions which will require change from time to time.
Over time, we will learn how collection development plays out in an access environment as well as in an ownership environment. It is sometimes loosely proposed that libraries need not acquire electronic information, for it will be available somewhere on the network. Such proposers ignore the obvious truth that some institution must still, in the end, take responsibility for the information--that is, in some way must own it. That has always been a library responsibility.#
Cataloging: Providing access to voluminous information is an intellectual problem that has historically been solved in the print environment by cataloging, with its attendant rules and procedures to ensure consistency and accuracy. These tools, adapted to suit new needs, will work for electronic information as well.# They should be linked to the new retrieval mechanisms so that users can smoothly navigate from having located information to retrieving it without having to shift their mode of use. Early mechanisms will probably link catalog records to documents using tools such as the newly proposed MARC 856 field working with the representation of virtual locations through the Uniform Resource Identifier, Locator and Name (URI, URL, and URN).# SGML techniques may also offer possibilities for location and linking of documents through their document description techniques.#
If the DRC's catalog system works well, users will be able to search for information, locate bibliographic records for desiderata, and use those records directly to draw the desired information to their workstation.# If the digital time-stamping technique is used (see above), means for including and testing the DTS certificate must be provided. Standards for such cataloging and remote access still need to be developed, particularly for providing catalog access to non-owned materials. The present review of AACR2 Chapter 9 is to be applauded, as is the recent OCLC study on the cataloging of non-book materials.#
Remote Access: A DRC should from the outset be intended for access from multiple remote locations. Procedures for dissemination of such catalog records will be needed; it will be not only a technical matter but a policy matter for libraries associated with the DRC to provide non-local access to their local patrons. Presumably the existing bibliographic utilities will play their accustomed role.
Institutional commitments will be of at least two kinds: organizational and fiscal.
Organizational Commitment: The organization of libraries is already changing as electronic information increasingly becomes part of their charge. Most research libraries now have substantial systems departments. Some libraries locate the responsibility for electronic information distinctly from that for print. Other libraries see the forms as inseparable and include electronic responsibilities along with artifactual responsibilities in assignments for collection development, cataloging and public service.
What is new will be the permanent assignment of staff responsibility for the long-term maintenance of electronic information within a library. There is no obvious artifactual parallel for this responsibility. For print the long-term responsibility is now shared among the departments of circulation, stack maintenance, preservation and physical plant. Nor are there present preservation parallels in academic computing centers, where staffs focus on technological advance and leave data to the users. The electronic preservation responsibility will be concentrated, since it will require technical expertise likely to be located in a single functional area. We can see possible beginnings of such responsibilities in the electronic text personnel recently established at Iowa State, Virginia, UC Berkeley and Georgetown.
It is by no means clear that this functional area will be what we have called the library's systems department. As libraries move more into the electronic environment the historic tripartite division of libraries into public services, technical services and collection development continues, but in more fluid arrangements. People who combine bibliographic understanding, problem-solving abilities and process orientation have often been found in technical services. Similar librarians will take on the demanding new technical, collection and service responsibilities for long-term support of digital collections.
Fiscal Commitment: The permanent existence of a digital research collection will require assured continuity in operational funding. Almost any other library activity can survive a funding hiatus of a year or more. Acquisitions, building maintenance, and preservation can be suspended, or an entire staff can be dispersed and a library shut down for several years, and the artifactual collections will more or less survive. Digital collections, like the online catalog, require continual maintenance if they are to survive more than a very brief interruption of power, environmental control, backup, migration and related technical care.
Online catalogs offer a partial parallel. Their maintenance costs have reached a rough steady state, and the capital costs are decreasing relative to the capabilities provided. The catalog size will continue to increase, but catalog records are small relative to the information to which they refer. However, DRCs as a proportion of the library's supply of information will grow for the foreseeable future, and the quantity of information requiring care will become considerable (and much larger than the catalog). Storage costs are likely to continue to drop substantially for some time, which may make the financial burden manageable. (Staffing costs are not expected to increase, as most libraries now recognize that overall staff growth for any reason will not be allowed for some time; reassignments, however, are likely.)
Long-term funding will be required to assure long-term care. Libraries and their parent institutions will need to develop new fiscal tools and use familiar fiscal tools for new purposes. Public institutions, constrained to annual funding, will have particular difficulties; existing procedures for capital or plant funding may provide precedents. One familiar technique is the endowment. It has been difficult to obtain private funding for endowments of concepts and services rather than books and mortar, but it is possible; and institutions might also build endowments out of operating funds over periods of time.
Some revenue streams associated with DRCs may be practical. Consortial arrangements may allow for lease or purchase of shares in a DRC. Shorter-term access might be provided to other institutions on a usage basis. Access could be sold to certain classes of users, e.g. businesses, non-local clienteles, or specific information projects. Relations with publishers, presently difficult to perceive through the mists of intellectual property, might result in fee income for storage of electronically published materials during the copyright lifetime in which publishers collect usage fees. With commitment and imagination, long-term fiscal tools will be found.
Institutional Commitment: All these are instrumental means of accomplishing the greatest requirement, that of conscious, planned institutional commitment to preserve that part of the human record and human culture which will flower in electronic form. Where museums have preserved artifacts, libraries have preserved information (until now in artifactual form). For the past century most research libraries have been associated with universities, and this connection seems likely to continue in the immediate future.# Whatever the governance structure, the institution will have to make a conscious decision that commits itself to providing resources.
For scholars and other libraries to have confidence that a given DRC is indeed likely to exist for the long term, the commitment will have to be clearly and publicly made. It will probably be desirable for guidelines or standards to be established defining what is meant by a long-term commitment, and defining what electronic data bases can qualify to be called a Digital Research Collection. Just as donors of books, manuscripts and archives look for demonstration of long-term care and commitment, so too will scholars and publishers as electronic information is created and requires a home.
Note 1. Portions of this talk were derived from two articles of mine, Intellectual Preservation: Electronic Preservation of the Third Kind (Washington, DC: Commission on Preservation and Access, March, 1994) and "Requirements of the Digital Research Collection" (submitted for publication to College & Research Libraries, January, 1994 [subsequently published, July, 1995, p. 331-339]).
Note 2. For one concise summary of the implications of the "sharp distinction between the carrier and the intellectual knowledge it contains," see Patricia Battin, "From Preservation to Access--Paradigm for the Future," Annual Report July 1, 1992--June 30, 1993 (Washington, DC: Commission on Preservation and Access, 1993), p. 1-4.
Note 3. See especially Lesk (below), but also Janice Mohlhenrich, ed., Preservation of Electronic Formats: Electronic Formats for Preservation (Fort Atkinson, Wis.: Highsmith, 1993), the proceedings of the 1992 WISPPR preservation conference. In it, Karen L. Hanus provides an extensive "Annotated Bibliography on Electronic Preservation" (p. 121-136). See also "Implications of Electronic Formats for Preservation Administrators," Newsletter Insert: Newsletter, Commission on Preservation and Access No. 62 (Nov.-Dec. 1993), p. 1-2.
Note 4. Lesk, Michael, Preservation of New Technology: A Report of the Technology Assessment Advisory Committee to the Commission on Preservation and Access (Washington, DC: CPA, 1992; available from the Commission at $5: 1400 16th St. NW, Suite 740, Washington, DC 20036-2217).
Note 5. The archive community speaks of the importance of provenance in establishing that a piece of information is in fact a record. Electronic information by itself can have no demonstrable provenance; the authentication solution hereinafter described may be able to provide the equivalent.
Note 6. Haber, Stuart, and W. Scott Stornetta, "How to Time-stamp a Digital Document," Journal of Cryptology (1991) 3:99-111; also, under the same title, as DIMACS Technical Report 90-80 ([Morristown,] New Jersey: December, 1990). See also D. Bayer, S. Haber and W.S. Stornetta, "Improving the Efficiency and Reliability of Digital Time-stamping," Sequences II: Methods in Communication, Security, and Computer Science, ed. R. M. Capocelli et al. (New York: Springer-Verlag, 1993), p. 329-334. A useful brief account is in Barry Cipra, "Electronic Time-Stamping: The Notary Public Goes Digital," Science Vol. 261 (July 9, 1993), p. 162-163 (I have used Cipra's diagram as the basis for my own).
Note 7. This is consonant with what Battin notes as in the future for librarians: "For analog information, we must develop triage strategies for the past; for digital, prospective triage strategies at the point of acquisition or creation" (my emphasis); Battin, p. 3-4.
Note 8. Client/server software assumes a planned, cooperative relationship between two computers. The server typically provides a generalized source of information or a generalized service to a wide clientele, while the client provides computing intelligence physically close to the user and tailored to the user's specific machine and needs.
Note 9. Intellectual Preservation (see note 1) contains further text and figures which amplify the discussion here of hashing and, below, of time-stamping. See also Peter S. Graham, "Preserving the Intellectual Record and the Electronic Environment," Scholarly Communication and the Electronic Environment: Issues for Research Libraries, ed. Robert Sidney Martin (Chicago: ALA, 1993), p. 71-101. (Also published as: "Intellectual Preservation and the Electronic Environment," After the Electronic Revolution,...: Proceedings of the 1992 Association for Library Collections and Technical Services President's Program, ed. Arnold Hirshon (Chicago: ALA, 1993), p. 18-38.)
Note 10. Stornetta reports that the newspaper was initially reluctant to accept the advertisement, fearing that a numbers scam or a drug connection was involved (informal communication).
Note 11. Bellcore in 1994 spun off a small startup company, named Surety Technologies, Inc., which intends to market Digital Notary based on this concept.
Note 12. The Research Libraries Group has determined to embark on a pilot project to develop a repository of electronic research collections. Identifying authentication requirements and solutions is seen as one task of such a project, and the Haber/Stornetta technique is under consideration.
Note 13. Preliminary forms of what follows were presented at the ALCTS Institute: The Electronic Library (October, 1993) and at a Task Force meeting of the Coalition for Networked Information (November, 1993).
Note 14. I've chosen the DRC name in preference to "Digital Library." The latter term, preempted and given currency by Vice President Albert Gore and by the National Science Foundation, has become used typically to define merely a quantity of data bases available for use at a given time. A library however is an organization, not a building full of books nor a network full of data bases. In library terms, therefore, a DRC is a set of electronic information organized for the long term.
Note 15. Artifactual materials include books, journals, manuscripts, recordings and other information resources which are inseparably linked to the objects that are their medium, and therefore exist in space and require specific physical handling to use. In contrast with such materials, where to preserve the artifact is to preserve the information contained in it, electronic information is easily transferred from one medium to another with no loss.
Note 16. See above and Lesk, as cited in note 4.
Note 17. See above and Graham, as cited in notes 1 and 9.
Note 18. Others will be found listed in Directory of Electronic Journals, Newsletters and Academic Discussion Lists, ed. Ann Okerson (Washington, DC: Association of Research Libraries, 1993, and other editions).
Note 19. I take note of Stuart Lynn's comment in his initial presentation that there is much information out there on the network that is not dreamt of in our libraries and will never be maintained by us; yet the nature of what we select as libraries does force us to consider long-term implications. There are interesting print analogies with ephemera, gray literature, popular culture (such as ballads and now comic books), and libraries' dependence on the zealous initiative of collectors (e.g. Pepys) as we eventually build great collections.
Note 20. Lynch, Clifford, A Framework for Identifying, Locating, and Describing Networked Information Resources (March 24, 1993; electronic "Draft for discussion at March-April 1993 IETF Meeting"), n.p., section "Referencing Parts of Objects" (my citation in this form exemplifies the problem).
Note 21. Burnard, L., What is SGML and How Does it Help? TEI document TEI ED W25, October 1991, available from TEI fileserver (listserv@uicvm.uic.edu; send the line "get tei-L filelist"); International Organization for Standards, ISO 8879: Information Processing - Text and Office Systems - Standard Generalized Markup Language (SGML), ISO, 1986; Sperberg-McQueen, C.M. and Burnard, L. (eds), ACH-ACL-ALLC Guidelines for the Encoding and Interchange of Machine-Readable Texts [TEI], Draft version 1.1 (Chicago and Oxford, 1990; Draft version 2 to appear in 1994).
Note 22. Lynch, in Framework, proposes "that the emphasis be on describing content...rather than access mechanisms" (§"Cataloging Networked Information Resources").
Note 23. Berners-Lee, Tim, July 14 1993. Uniform Resource Locators [online, as ftp://ds.internic.net/internet-drafts/draft-ietf-uri-url-01.txt (or ...-01.ps)]. See also MARBI Proposal 93-4 (Nov. 20, 1992), p. 5 ff, for comments on the possible relations between the URL and the proposed MARC field 856 (Electronic Location and Access); and MARBI Proposal 94-3 (Dec. 6, 1993), which specifically proposes adding a subfield $u to field 856 to accommodate a URL.
Note 24. See the proceedings to be edited by Lisa Horowitz of the Workshop on Documenting E-texts held at Somerset, New Jersey under the auspices of the Center for Electronic Texts in the Humanities (Rutgers/Princeton), May 16-18, 1994 (in press).
Note 25. For a further description of this potential for integration see Peter S. Graham, "The Mid-Decade Catalog," in ALCTS Newsletter (January, 1994), pp. A-D.
Note 26. Dillon, Martin, et al., Assessing Information on the Internet (Dublin, Ohio: OCLC, 1993).
Note 27. The national libraries are the great exceptions, such as those of Britain, Russia, France, Canada and the United States. Exceptions in this country include the handful of independent research libraries such as the Folger, the Huntington and the American Antiquarian Society, and some of the great civic institutions such as the Boston and New York Public Libraries. For the possibility of the link between research libraries and universities being lost, see the 1991 Malkin Lecture of Terry Belanger, The Future of Rare Book Libraries (Charlottesville: Book Arts Press, in preparation; text available from Dec. 16, 1991 archive of ExLibris, a listserv at rutvm1.rutgers.edu, message from: terry@cunixa.cc.columbia.edu, subject: Malkin Lecture).