DRAFT DRAFT DRAFT DRAFT DRAFT DRAFT DRAFT DRAFT
This is a draft document in progress.
It is subject to change without notice.
Last change date: April 26, 1995
Comments are most welcome.
Contact: Donald Waters
Address: donald_waters@yale.edu
The Social Organization of Archiving Digital Information
- I. Overview of Major Issues
- A. Statement of the problem
- 1. The Task Force on Archiving of Digital Information is concerned
primarily with ensuring that information in digital form endures for future
generations.
- 2. Preserving the media on which information is electronically recorded is
now well understood to be a relatively short-term and partial solution to the
general problem of preserving digital information.
- 3. Even if the media could be physically well-preserved, rapid changes in
the means of recording, in the formats for storage, and in the software for use
threaten to render the life of information in the digital age as, to borrow a
phrase from another arena of discourse on civil society, "nasty, brutish and
short."
- a) Given the threat of technological obsolescence, Michael Lesk has argued
that the preservation of electronic information into the indefinite future
"means copying, not physical preservation." In this sense, preservation means
"refreshing" information from old to new technologies.
- (1) Devices, processes and software used to record, store, and retrieve
digital information are being replaced with new products and methods on a 2- to
5- year cycle.
- (2) Backward compatibility between versions of software and generations of
hardware is not assured.
- (3) Interoperability among competing hardware and software product lines is
not assured.
- (4) Therefore, "copying" for preservation purposes is more complex than
replicating digital information on new storage media.
- b) Jeff Rothenberg has recently suggested another possible solution:
Create and archive emulators of software and operating systems which allow the
content of digital information objects to be carried forward and used in its
original format.
- 4. Given various technical options, however, preserving electronic
information is not only, or even primarily, a technical matter, as anyone knows
who has participated in the copying, say, of a bibliographic or corporate
financial database from one on-line system to another.
- B. Some issues are important but fall outside the purview of the Task
Force.
- 1. Questions of intellectual judgment -- what information to eschew and
what to carry forward in what structure and format -- are always among the more
difficult issues in creating and maintaining an archive.
- a) Some principles that would influence the selection of materials in
electronic form are treated below in this report.
- b) But selection is an issue common to all archiving functions.
- c) Moreover, selection criteria cannot be generalized because they are
dependent on the goals and policies of each archive and so are highly
archive-specific.
- d) In general, the Task Force views the criteria for selecting digital
materials for archiving as beyond the scope of its work.
- 2. The Task Force also regards as out of scope issues pertaining to the
preservation of materials in analog form.
- a) We recognize, however, that materials in analog form may be converted
and need thereafter to be preserved in digital form.
- b) The principles and practices established here will no doubt inform
responsible parties who are converting information objects from analog to
digital form.
- c) Moreover, interoperability between analog and digital preservation
systems is an important goal.
- C. Mobilizing the will and resources of a variety of agents to a common
good.
- 1. How can we build a meaningful, dynamic organizational framework for
these agents to work together productively?
- 2. What are the principles, economic incentives and contractual
relationships that might serve to create an environment most conducive to the
preservation of digital information into the indefinite future?
- D. Among the many factors conditioning intellectual judgment and the
application of technological wizardry in the archiving of digital information
is, of course, the social organization of information, its generation and
use.
- 1. We intend to observe the 80/20 rule
- a) Our objective is not to resolve all issues associated with the
preservation of all kinds of digital information.
- b) We want to address the most vexing issues that currently stand in the
way of an organized approach to digital archiving.
- c) We hope to articulate enough basic principles to stimulate progress, but
not so many as to stifle it.
- 2. We advance a four-part approach to the social organization of archiving
digital information.
- a) How is the emerging digital information environment changing the life
cycle of information and the identity and role of the various stakeholders in
that information?
- b) What features of electronic information objects most affect our ability
to archive them?
- c) What kinds of roles, functions and other organizational structures are
needed to preserve those objects?
- d) What infrastructures are needed to enable digital
preservation?
- II. The Digital Environment: Stakeholders in the Information Life
cycle
- A. Critical factors in the emerging digital environment
- 1. Digital networks are central to information access and distribution
- a) Assume the existence of reliable, secure, high-bandwidth networks
- b) Specific policy decisions regarding pricing, security and extension of
networks, however, will greatly affect the viability of efforts to preserve
digital information residing on the networks.
- 2. Digital technologies facilitate the reclamation and reuse of information
objects.
- a) Ease of reuse may increase value of archive to all stakeholders
- b) The creator/publisher may have sufficient incentive to maintain its own
archive
- c) Need to plan for cases where there is little incentive for preserving
digital information objects
- d) Easily reused digital objects may complicate selection decisions, as
well as indexing and cataloging
- 3. Digital technologies also provide greater flexibility in the
distribution of information
- a) Emergence of consortial arrangements to explore distribution mechanisms
among various partners in the information cycle
- (1) Role of RLG/DPC
- (2) Rise of regional consortia
- b) Distribution of responsibilities for collecting information as a basis
for archiving function
- 4. Digital technologies increasingly are serving to integrate the delivery
of information in various media
- a) Digital storage and transmission of voice, text, images and video
introduces a common layer of technology.
- b) Integration affects user behavior.
- c) Integration also gives rise to new (e.g. multimedia) information objects
that need to be preserved.
- d) These objects in many cases will exist only in digital form
- B. Stakeholders in the information life cycle
- 1. Stakeholders include the following
- a) Author/creator, Publisher, Distributor, Library/Archive, Reader/Consumer
- b) In the digital environment, the traditional relationships among these
identities are shifting and new stakeholders will emerge.
- 2. There are various models to represent these shifting relationships and
to suggest where archiving might fit in the emerging digital environment.
These models cannot well be represented linearly.
- a) The traditional model of the information life cycle emphasizes the
tension between copyright and fair use.
- (1) Provider --> publisher --> distributor
- (2) From the distributor the information may flow directly to the reader or
to a library and then to the reader.
- (3) One function of the library is to assure fair use
- b) Fair use analysis depends on a static information object to which the
fair use tests can be applied; the traditional model therefore does not carry
over well to the digital environment, where objects are often dynamic.
- c) Alternative models posited for the digital environment generally are
silent on fair use but assume copyright or other rights for some objects.
- d) These alternative models emphasize other values in the information life
cycle.
- (1) Direct to reader: information costs are so low that information does
not need to pass through central repositories.
- (2) Author as publisher: barriers to publication are so low that authors
can publish directly (e.g. as pre-prints)
- (3) Library as publisher/distributor: libraries are seen as archives for
publishing on demand.
- III. Information objects
- A. Digital information objects have attributes that are structured in
multiple dimensions and which influence the technical, economic and operational
characteristics of information use.
- B. Among the attributes that are critical to digital preservation are the
following:
- 1. Source and mode of distribution
- a) Sources include individually owned materials (such as mail, notes,
manuscripts, preprints, databases, etc.); corporately owned materials
(employment and financial records, planning documents, reports, etc.);
materials from publishers (books, serials, films, recordings, etc.); and
materials from libraries, museums and other educational institutions.
- b) Personal and organizational archives require an information management
infrastructure for ongoing preservation
- c) Publishers maintain an infrastructure that may or may not meet
preservation requirements
- 2. Context
- a) Meta-information (bibliographic catalogs, indices, data dictionaries,
directory systems, etc.); as opposed to
- b) The documents and other objects to which they refer, such as monographic
and serial texts, graphic and photographic images, sound recordings, data
collections, software-dependent data objects (GIS and CAD)
- 3. Encoding of structure, format and content may vary and affect both use
and ability to archive
- a) All features of information objects may be encoded in proprietary
software that runs on specific operating systems
- b) They may be encoded in standard formats: ASCII, TIFF, etc.
- c) They may have self-defining qualities: SGML with DTD; data and codebooks
- d) Multimedia objects may incorporate all of these qualities
- 4. The attributes may be dynamic in various ways:
- a) Information objects may be revised and updated so that there are
instances, versions or editions of the object
- b) Information objects may change cumulatively or interactively, as in
contributions to a listserv
- c) Information objects may be dynamic in the various views one has or takes
of the information
- d) Information objects may change in the linkages made among them
- (1) For objects that are dynamic in this way, like WWW pages, there appears
to be no good archiving solution other than to take periodic snapshots or to
archive everything
- (2) Solution only for component pieces
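The periodic-snapshot approach described in (1) can be sketched as follows.
This is a minimal illustration under stated assumptions, not a recommended
design: the `SnapshotStore` class, its method names and the example address
are hypothetical, timestamps are assumed to be ISO-style date strings (which
sort correctly as text), and content is passed in as bytes rather than
fetched over a network.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class SnapshotStore:
    """Keeps dated copies of a changing object, one entry per distinct state."""
    # Maps object name -> list of (timestamp, sha256 digest, content) tuples.
    snapshots: dict = field(default_factory=dict)

    def capture(self, name: str, content: bytes, timestamp: str) -> bool:
        """Record a snapshot; return False if the content has not changed."""
        digest = hashlib.sha256(content).hexdigest()
        history = self.snapshots.setdefault(name, [])
        if history and history[-1][1] == digest:
            return False  # Same bytes as the last snapshot: nothing new to keep.
        history.append((timestamp, digest, content))
        return True

    def as_of(self, name: str, timestamp: str):
        """Return the latest content captured at or before the given timestamp."""
        result = None
        for taken, _digest, content in self.snapshots.get(name, []):
            if taken <= timestamp:
                result = content
        return result


store = SnapshotStore()
page = "http://example.org/index.html"  # hypothetical address
store.capture(page, b"<h1>Version 1</h1>", "1995-04-01")
store.capture(page, b"<h1>Version 1</h1>", "1995-04-15")  # unchanged, skipped
store.capture(page, b"<h1>Version 2</h1>", "1995-04-26")
```

Any state of the object that comes and goes between two captures is lost,
which is why snapshots are only a partial answer for objects of this kind.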
- 5. Perhaps the most important attribute is use
- a) What observers seem most to fear is that digital information objects
that are no longer used will simply be deleted without consideration for future
use and without being made available for archiving by some other custodian.
- b) Put a different way: the best insurance that information will endure is
use, but use is not a sufficient criterion for continuous archiving
- c) Put another way: a digital archive serves as a safety net for user
demand.
- IV. Roles, Functions and Organization for Archiving
- A. Hypotheses about effective organization for archiving digital
information follow from the preceding analysis.
- 1. Effective management of relationships among stakeholders is the key to
successful archiving
- 2. The distributed network of information suggests the need for distributed
responsibility for archiving the information
- a) Stakeholders will invoke a variety of consortial models as the emerging
digital environment gives rise to new ways of interacting and dividing labor
and responsibility
- b) Collaborative models will likely include partnerships, federations,
contractor/subcontractor relationships, etc.
- c) Organizations will form around intellectual discipline, types of
material, functional role such as storage or cataloging, and across
regional, national or international boundaries
- d) As the digital preservation environment takes shape the most effective
organizational structures will likely be those that are agile and bear the
least overhead
- 3. Given that digital technologies facilitate the reuse of information
objects and given that the best insurance that information will endure is use,
it seems plausible to suggest the following tiered structure of
responsibility:
- a) Information creators/providers/owners have initial responsibility to
provide for the archiving of digital information.
- b) The creator/provider/owner may engage libraries and archives to take
over some or all of the archival responsibility
- (1) Libraries and archives may also interact with creators/providers as
subcontractors for maintaining an archive even during the active life of
information objects
- (2) Libraries and archives may exercise an aggressive rescue function to
preserve information objects that become endangered because the
creator/provider/owner no longer takes responsibility for the archiving
function and does not take steps formally to hand it over
- c) Libraries and archives would assume responsibility for selecting and
archiving material for which there is no natural institutional home
- 4. The organization for archiving ought to be designed to accommodate
information objects that are self-describing; that is, packaged with
information about what they are, what is needed to use them effectively and
how to use them.
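A self-describing object of the kind named in item 4 might be packaged as
sketched below. The field names, the JSON packaging and the sample survey
record are all assumptions made for illustration; the outline does not
prescribe any particular packaging format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class SelfDescribingObject:
    """An information object packaged with the facts needed to use it."""
    what_it_is: str        # description of the object
    needed_to_use: list    # software, formats or documents required
    how_to_use: str        # brief instructions
    payload: str           # the content itself (plain text here, for simplicity)

    def to_package(self) -> str:
        """Serialize the description and the content together."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_package(cls, text: str) -> "SelfDescribingObject":
        """Rebuild the object from a package produced by to_package()."""
        return cls(**json.loads(text))


# Hypothetical example record.
survey = SelfDescribingObject(
    what_it_is="Survey responses, one record per line",
    needed_to_use=["ASCII text reader", "codebook describing each column"],
    how_to_use="Split each line on commas; see the codebook for column meanings.",
    payload="1,34,F\n2,51,M\n",
)
package = survey.to_package()
restored = SelfDescribingObject.from_package(package)
```

Because the description travels with the content, an archive receiving
`package` does not need a separate channel to learn what the object is.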
- B. A commitment to enduring access is a defining feature of a digital
archive and is fulfilled in practice by the exercise of these critical
functions:
- 1. Managing the operating environment, which consists of the following
areas of responsibility
- a) Storage of the copy of record
- (1) Storage may be on-line, near-line, or off-line but must be accessible
when needed
- (2) Storage practice may support just-in-case as well as just-in-time
distribution strategies
- b) Access policies
- (1) Level of access
- (2) Nature of access
- c) Connectivity
- d) Description
- (1) In order to decide what to preserve, you must first decide what the "what" is
- (2) Presume that provider/publisher will provide a basic set of metadata
- (3) The archiving system must support generating and managing common
metadata from multiple objects
- e) Retrieval
- (1) You do not have what you cannot retrieve
- (2) Develop and maintain mechanisms for searching metadata and information
objects
- f) Assure authority and provenance via cryptographic techniques
- g) Capacity
- (1) Are there sufficient access points?
- (2) Does the archive meet current computational standards, including user
display capabilities?
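Item f) above names cryptographic techniques without specifying any. One
common family is message digests and keyed authentication codes, sketched
here with Python's standard `hashlib` and `hmac` modules. The choice of
SHA-256, the shared key and the sample texts are assumptions for
illustration only; an archive that must prove provenance to third parties
would more likely need public-key signatures, which this sketch does not
cover.

```python
import hashlib
import hmac


def fingerprint(content: bytes) -> str:
    """Digest recorded at deposit; recomputing it later detects any change."""
    return hashlib.sha256(content).hexdigest()


def sign(content: bytes, key: bytes) -> str:
    """Keyed code that only a holder of the key could have produced."""
    return hmac.new(key, content, hashlib.sha256).hexdigest()


def verify(content: bytes, key: bytes, expected: str) -> bool:
    """Check the content against a previously issued keyed code."""
    return hmac.compare_digest(sign(content, key), expected)


provider_key = b"example-shared-secret"   # hypothetical key
deposited = b"Annual report, final text."
recorded_digest = fingerprint(deposited)
recorded_tag = sign(deposited, provider_key)

altered = b"Annual report, revised text."
```

Here `verify(deposited, provider_key, recorded_tag)` succeeds, while the
same check on `altered`, or with a different key, fails.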
- 2. Managing the migration of the archive as the operating environment
changes.
- a) Working Definition: Migration is the periodic transfer of digital
information from one hardware/software configuration to another or from one
generation of technology to a subsequent generation in order to retain the
ability to access, display, retrieve, manipulate, and use the information.
- b) Migration is different from copying. Copying is transferring the same
bit stream from one medium or storage device to another.
- c) In some migrations, it may not be possible to migrate an exact "replica"
or "copy" of the original object and still retain software compatibility.
- d) Changing hardware and software will drive the need for migration.
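The distinction drawn in b) and c) between copying and migration can be
made concrete with a small sketch. Copying leaves the bit stream identical,
so a digest of the copy matches the original; migration re-expresses the
content, so the bits change even when the meaning survives. The function
names and the sample record are hypothetical, and a change of text encoding
stands in for the much larger format changes a real migration involves.

```python
import hashlib


def copy_bits(data: bytes) -> bytes:
    """Copying: the same bit stream is written to a new medium."""
    return bytes(data)


def migrate_text(data: bytes, old_encoding: str, new_encoding: str) -> bytes:
    """Migration: the content is re-expressed for a new environment."""
    return data.decode(old_encoding).encode(new_encoding)


def digest(data: bytes) -> str:
    """SHA-256 digest, used here only to compare bit streams."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical stored record containing two accented characters.
original = "R\u00e9sum\u00e9 of holdings".encode("latin-1")

copied = copy_bits(original)
migrated = migrate_text(original, "latin-1", "utf-8")
```

The digest of `copied` equals that of `original`, while the digest of
`migrated` differs even though both decode to the same text. A bit-level
check is therefore enough to validate a copy but not a migration.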
- 3. Managing the costs of the operating environment of the archive and of
periodic migrations.
- a) The ability to estimate/predict the costs of operation and of migration
will be an important factor in planning and resource allocation (and possibly
selection).
- b) Costs of the operating environment will likely vary over time:
- (1) The principal cost factors are those associated with storage, use,
property rights transactions, and the systems engineering needed to maintain
the distributed infrastructure.
- (2) Storage costs need to be managed as an amortized capital cost and will
likely continue to decline both absolutely and relative to the other cost
factors.
- (3) The costs of access and of property rights transactions are relatively
high because the supporting systems are highly immature (or non-existent);
these systems are developing very rapidly and their relative costs will fall.
- (4) In the long run, the primary cost factor in the management of digital
archives will likely be the costs of systems engineering to support the highly
distributed network-based functions needed to operate a digital archive
effectively.
- c) Costs of migration will vary depending on:
- (1) complexity of original data structures
- (2) frequency of migration (e.g. the life cycle of software and how vendors
are positioning themselves in different application environments).
- (3) the extent to which the functionality for computation, display,
indexing, linkage, etc. must be migrated in addition to content.
- (4) the need to compensate for acquisition or intellectual property
rights.
- d) We have little reliable data on these costs and little experience in
managing them.
- e) There are unresolved issues regarding the distribution of costs (e.g.
whether to charge to recover migration costs, and whom to charge).
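The cost structure outlined in item 3 (storage treated as an amortized
capital cost, ongoing operating costs, and periodic migration costs) can be
sketched as a simple estimate. Every figure below is invented for
illustration; as d) notes, reliable data on these costs are scarce. The
model also assumes constant prices, whereas b)(2) expects storage costs to
decline.

```python
def annual_storage_cost(purchase_price: float, service_years: int) -> float:
    """Straight-line amortization: spread the capital outlay evenly over its life."""
    return purchase_price / service_years


def total_cost(purchase_price: float, service_years: int,
               annual_operations: float, migration_cost: float,
               years_between_migrations: int, horizon_years: int) -> float:
    """Total cost of keeping an archive over a planning horizon.

    Hardware is assumed to be repurchased at the same price at the end of each
    service life, and one migration is assumed to fall due at the end of each
    full migration interval. Both are simplifications.
    """
    storage = annual_storage_cost(purchase_price, service_years) * horizon_years
    operations = annual_operations * horizon_years
    migrations = (horizon_years // years_between_migrations) * migration_cost
    return storage + operations + migrations


# All figures below are invented for illustration only.
estimate = total_cost(
    purchase_price=50_000.0, service_years=5,
    annual_operations=20_000.0, migration_cost=30_000.0,
    years_between_migrations=4, horizon_years=20,
)
```

Shortening `years_between_migrations` raises the total directly, which is
one way to see why c)(2) lists migration frequency as a cost driver.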
- C. Organizational mechanisms to facilitate distributed responsibilities for
archiving
- 1. Is there a need for a central repository, like the Iron Mountain
facility for master microfilm copies?
- 2. Migration Strategies
- a) There are a variety of migration strategies, none of which is entirely
satisfactory or universally applicable.
- (1) Migration strategies may vary in different application environments,
for different types of material and depending on the need to preserve various
levels of functionality (computational, display, indexing, etc.).
- (2) Our community is only beginning to address migration issues and our
experience is limited in terms of technical feasibility, costs, benchmarks, etc.
- (3) Migration should become more effective as the community matures, gains
experience, and learns how to select appropriate migration strategies.
- (4) We still need to work on some useful ways to break digital preservation
into several specific types of material and then refine ideas for migration
strategies in these scenarios.
- b) Strategy 1: Migrate digital materials from less stable to more stable
media and/or from formats that are highly software dependent to formats that
are less software-intensive.
- (1) This strategy is most commonly implemented by printing to paper or
microfilm.
- (2) This strategy is also used for some digital materials (keeping ASCII
text or delimited ASCII data files) when retaining content is paramount or when
display, computational, indexing and other functionality is not critical.
- (3) This is a feasible and cost-effective strategy for a certain slice of
digital materials because it eliminates the need for future migrations or it
reduces migration to simple copying.
- (4) As long as we lack skills, standards, and more robust strategies to
avoid this hybrid solution, printing to paper or film will remain a migration
strategy for certain types of materials in many institutions.
- (5) Many types of digital materials are not amenable to this strategy (e.g.
how do you microfilm a database or print out full-motion video?)
- (6) Migration strategies developed for digital preservation may be
applicable to business environments where there is a desire to reduce or
eliminate paper documents.
- c) Strategy 2: Migrate digital materials from the multiplicity of formats
present at any time to a smaller number of common formats.
- (1) Subsequent migrations will involve economies of scale and fewer
customized transformations.
- (2) Developing standard interchange formats toward which all documents can
converge (such as OpenDoc or SQL/CCL for databases) may be a more
cost-effective approach.
- (3) Wide scale adoption of standards may be difficult to achieve because
vendors will determine whether open systems are desirable.
- (4) The need within communities to interchange or share documents, not the
need for preservation, will drive interchange formats and standards (e.g.
current trends within the GIS community or within the business community
around EDI). We will not solve the compatibility problem ourselves, but we
should take advantage of these trends.
- (5) Make sure that institutional hardware/software platforms comply with
standards or common configurations (e.g. do not install Apple systems in the
library or archives when the rest of the organization is using Windows; also
try to keep hardware/software platforms in the library or archives on the
same generation of technology as the rest of the organization).
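The claim in Strategy 2 that consolidating formats leads to "fewer
customized transformations" can be illustrated with a small count. The
mapping below is hypothetical, the format names are examples rather than
recommendations, and the sketch assumes that each distinct format held
requires one converter at the next migration.

```python
# Hypothetical mapping from formats found in a collection to a smaller set of
# common target formats.
COMMON_FORMAT = {
    "wordprocessor-a": "sgml",
    "wordprocessor-b": "sgml",
    "plain-text": "ascii",
    "scanner-vendor-x": "tiff",
    "scanner-vendor-y": "tiff",
}


def normalize(holdings: list) -> list:
    """Replace each object's format with its common target format."""
    return [(name, COMMON_FORMAT[fmt]) for name, fmt in holdings]


def converters_needed(holdings: list) -> int:
    """Count distinct formats: one converter each at the next migration."""
    return len({fmt for _name, fmt in holdings})


holdings = [
    ("memo-1", "wordprocessor-a"),
    ("memo-2", "wordprocessor-b"),
    ("notes", "plain-text"),
    ("map-1", "scanner-vendor-x"),
    ("map-2", "scanner-vendor-y"),
]
normalized = normalize(holdings)
```

In this example the five formats held before normalization collapse to
three afterward, so later migrations need three converters rather than five.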
- d) Strategy 3: Develop/impose standards
- (1) Look to common usage rules; Adopt de facto standards or commonly used
packages as the only acceptable formats for preservation.
- (2) This strategy is not likely to succeed in many environments where
imposition of standards is viewed as a limitation on freedom of choice.
- e) Strategy 4: Work with industry to develop backward compatibility paths
as a standard feature in all software
- (1) Where migration paths are not commonly included in software packages
(such as between software product lines), raise user awareness of the need for
a migration path for: new versions, different vendors, vendors that go out of
business
- f) Strategy 5: Develop "processing centers" that can handle migration and
reformatting of materials in obsolete formats
- (1) Even if we succeed with standards and migration paths, there is a large
body of materials in non-standard formats (and this is likely to continue to be
produced).
- (2) Processing centers should be established that specialize in conversions
of materials in one or a few obsolete formats (e.g. text, certain types of
databases, GIS, CAD, multimedia)
- (3) Technical strategies might involve maintaining obsolete
hardware/software to provide the look and feel of the original material
- (4) Processing centers might develop software emulators or retrospective
migration programs
- (5) This approach would take advantage of economies of scale and maximize
use of expertise
- (6) Possible models are consortia of institutions, regional centers,
commercial firms (similar to services that convert old movies to current video
formats); national labs (e.g. establish a national hardware and software
laboratory like the national media lab).
- (7) We have very limited experience with the technical feasibility of this
approach
- 3. Registries for distribution of functions
- a) Registry of archives
- (1) Archives must be self-describing, so that users can interrogate them
on-line and understand how they are organized
- (2) Rather than a standard set of elements or a standard organization, we
need instead a standard method for declaring the existence of an archive and
describing what it contains and what services it provides.
- (3) Perhaps a tool like the finding aid might satisfy this requirement
- b) Registries of locations
- (1) Distinction between Uniform Resource Name (URN) and Uniform Resource
Locator (URL)
- (2) Need for indirection between the name and location so that there can be
multiple instances of the object and the underlying location can change without
affecting pointers to the name.
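The indirection between name and location described in b) can be sketched
as a small registry. The `NameRegistry` class, the `urn:example:` name and
the host names are all hypothetical; this is an in-memory illustration of
the idea, not a description of any actual resolution service.

```python
class NameRegistry:
    """Maps a persistent name to one or more current locations."""

    def __init__(self):
        self._locations = {}   # name -> list of locations, in preference order

    def register(self, name: str, location: str) -> None:
        """Add a location for the name, ignoring exact duplicates."""
        locations = self._locations.setdefault(name, [])
        if location not in locations:
            locations.append(location)

    def relocate(self, name: str, old: str, new: str) -> None:
        """Change where a copy lives without changing the name."""
        locations = self._locations[name]
        locations[locations.index(old)] = new

    def resolve(self, name: str) -> list:
        """Return every known location for the name."""
        return list(self._locations.get(name, []))


registry = NameRegistry()
name = "urn:example:report-1995-04"   # hypothetical name
registry.register(name, "ftp://archive-a.example.org/reports/r1.txt")
registry.register(name, "http://archive-b.example.org/r1.txt")

citation = name   # a pointer held by a reader refers to the name only

registry.relocate(name,
                  "ftp://archive-a.example.org/reports/r1.txt",
                  "http://archive-a.example.org/holdings/r1.txt")
```

After the relocation, `registry.resolve(citation)` returns the new address
for the first copy and the unchanged address for the second; the citation
itself never had to be updated.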
- c) Registry of ownership
- (1) Means for transacting rights clearance
- (2) Principle: clarity in legal rights will drive decisions to
preserve
- d) Registry of bootstrapping tools
- (1) DTDs, codebooks
- (2) Licensed software and operating systems
- V. Enabling Infrastructure: Factors affecting the efficacy and
durability of the organizational model
- A. Scale and timing
- 1. We are in the early stages of this business
- 2. Current migration activities are sub-critical in terms of scale; larger
scale activities that better apply or distribute expertise may be more
economically feasible
- 3. The ability to spread migration costs across multiple users may also
make migration more cost-effective.
- 4. We need benchmarks to measure what migration costs, when it is
worthwhile, and where we are improving
- 5. How does emulation compare with copying or refreshing digital
information? Are there other technical solutions that we can imagine or
propose?
- B. What market, contractual and other kinds of support are necessary to
facilitate the transfer of archival responsibility?
- 1. Legal protections, similar to the preservation functions under present
copyright law, are needed to enable libraries/archives to assume responsibility
for preserving digital information objects in the event that the
creator/provider/owner no longer takes responsibility for the archiving
function and does not take steps formally to hand it over
- 2. To the extent that present motivation to maintain archives arises from
the principle of fair use, how is the principle maintained in the model?
- C. What financial incentives might motivate the archival enterprise?
- 1. Financial support through life cycle budgeting, access charges etc.
- 2. Digital technologies require capital budgeting techniques
- 3. Tax implications for providers/publishers building digital stock rather
than paper stock
- D. Feasibility of registries
- 1. Transactions for rights permissions and payments
- 2. Bootstrapping techniques, particularly with obsolete software and
operating systems
- E. Search engines and metadata
- F. Links to other subsystems inside and outside the infrastructure (e.g.
commercial repositories, GILS, analog archives, etc.)
- VI. Recommendations
- A. Identify best practices in these environments and learn from them.
- 1. There are many situations where digital preservation will be
handled/solved by others to meet their own business needs (e.g. in government,
industry, medicine, etc.).
- 2. We should focus efforts on areas where solutions in industry,
government, etc. will not meet the needs of our users for continuing access to
digital information; e.g. identify where migration will not occur if we do not
intervene.
- B. Encourage the creators of digital materials to keep digital information
in a live, native software environment as long as possible.
- 1. We need to concentrate on migration strategies for distributed archives
if and when the original custodian goes out of business or has no continuing
interest in preserving digital materials.
- C. Develop strategies to encourage/support preservation of software as a
significant intellectual and cultural product in its own right.
- 1. Build alliances with software engineers, computing enterprises, etc. as
sources of both financial support and expertise for software preservation.
- 2. Create computer hardware/software repositories that are both museums
documenting the significance of computing and working laboratories for research
and migration.
- 3. Integrate software preservation strategies with support for
migration