Preservation of Digital Information: Issues and Current Status

by Alison Bullock

Network Notes #60
ISSN 1201-4338
Information Technology Services
National Library of Canada
April 22, 1999
1.0 Introduction

For several decades, preservation specialists have voiced concern about the preservation of the portion of our cultural heritage in electronic form. The major challenge, the rapid obsolescence of the hardware and software required to interpret and present digital documents, has been widely discussed. Ensuring continued access to digital information necessarily involves copying or transforming digital documents to run on current media, software, hardware and operating systems. This document explores the issues surrounding the preservation of digital information and highlights work to address them.

2.0 What is digital preservation?

"Digital preservation" or "digital archiving" means taking steps to ensure the longevity of electronic documents. It applies both to documents that are "born digital" and stored on-line (or on CD-ROM, diskettes or other physical carriers) and to the products of analog-to-digital conversion, if long-term access is intended.

3.0 The problems in a nutshell

The fundamental problem of preserving electronic documents or "digital objects" stems from the nature of the objects themselves. Unlike non-digital formats such as books, magazines, manuscripts, or microfilms, digital objects are accessible only by using combinations of computer hardware and software. Market competition means that this hardware and software can become obsolete in cycles of less than three years. Ensuring ongoing access, therefore, requires keeping current with technology changes and moving digital objects from obsolete to current file formats, storage media, operating systems and so on.

A number of other technical, social and legal issues add to the difficulty of the task. These include
4.0 Preservation requirements

Preservation measures ensure that a document or artifact, digital or otherwise, is accessible in a usable form over time. Maintaining the accessibility of digital media, however, is much more complex than with such non-digital media as paper. For example, when a book is preserved in its original format, all aspects of the book are preserved: its physical presence, its format, its layout, and its content. It is practically impossible to extract individual elements (e.g., content without layout) because they are inextricably linked. Even reformatting to paper or microfilm does not completely divorce content from layout, as page sequences and physical appearance, for instance, can still be captured.

Digital objects, in contrast, are easily decomposed into individual elements, and significantly more effort must be made to preserve them as a "whole." For example, one can retain the content of an electronic document, while losing the layout. Further, one can keep its physical presence (i.e., a file), but fail to preserve its readability.

In the digital world, then, the first task is to identify the multiple aspects of a work that must be preserved. Next, to succeed in the preservation of digital objects, preservation measures must ensure that as many of these aspects as possible persist over time. In preserving a digital object, we aim to:
5.0 Proposed preservation strategies

Several strategies attempt to address the primary digital preservation problem of technological obsolescence. These include migrating information through successive generations of technology; using software to emulate the behavior of older machines; preserving original hardware and software to run obsolete programs; and creating hard copies (paper or microform) of digital objects. Each of these strategies meets some, but not all, preservation goals.
5.1 Migration

Migration is the primary strategy articulated by most organizations that plan to preserve digital objects. It covers a range of activities to periodically copy, convert or transfer digital information from one generation of technology to subsequent ones. Migration may involve copying digital information from a medium that is becoming obsolete or physically deteriorating to a newer one (e.g., floppy disk to CD-ROM), and/or converting from one format to another (e.g., Microsoft Word to ASCII), and/or moving documents from one platform to another (e.g., VAX to UNIX).

Migration certainly preserves the physical presence and the content of a digital object. However, it may not preserve presentation, functionality and context. For example, presentation elements such as bolding and italics may disappear, and the functionality and context provided by links between database entries may be lost because the links break. Successive migrations may eventually result in unacceptable data loss. The focus is on limiting the loss and retaining the content in a usable form.

Data archives have a long history of using migration successfully, but are generally dealing with relatively homogeneous information deposited under guidelines specifying a limited number of acceptable formats and modes of transmission. Some archives convert non-standard formats to one or two standards on receipt.

In contrast, the National Library of Australia encountered problems migrating a small sample of commercial publications from floppy disks to CD-ROM. First, a large portion (35 percent) could not be tested or used because the library lacked the necessary hardware or software. Then, for various reasons, only half of the 40 or so disks that were copied from floppy disk to CD-ROM could be confirmed to be functional after copying [2].
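Confirming that a copy is at least bit-identical to its source is the most basic check in a migration trial of this kind. The sketch below is a minimal illustration in Python, not a description of the procedure the National Library of Australia used: the function names, the file name and the choice of SHA-256 are all assumptions made for the example. It copies a file and compares checksums of source and target.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source: Path, target: Path) -> bool:
    """Copy source to target and report whether the copy is bit-identical."""
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, target)
    return sha256_of(source) == sha256_of(target)


def demo() -> bool:
    """Copy a throwaway file between two temporary directories.

    The directories stand in for an old and a new storage medium;
    the file name and content are hypothetical.
    """
    with tempfile.TemporaryDirectory() as old_medium, \
            tempfile.TemporaryDirectory() as new_medium:
        source = Path(old_medium) / "report.txt"
        source.write_bytes(b"sample publication content\n")
        return copy_and_verify(source, Path(new_medium) / "report.txt")


if __name__ == "__main__":
    print("copy verified:", demo())
```

A matching checksum shows only that the bits survived the copy. It says nothing about whether the publication can still be opened and used on current hardware and software, which is the harder question.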
The experience of the National Library of Australia illustrates that even the simplest form of migration, copying, may pose problems for some types of digital objects residing in library collections. The preservation of physical format publications (CD-ROMs, floppy disks) will be particularly challenging. For example, in copying these objects, we may also be copying executable programs in languages that may become obsolete.

Nonetheless, there are a number of ways to increase the chances of using migration successfully as a preservation strategy. These include

Migration is undeniably an important strategy for preserving digital objects. However, it has yet to be tested and proven as a mechanism for managing complex multimedia objects over the long term.

5.2 Emulation

Emulation refers to creating new software that mimics the operations of older hardware or software in order to reproduce its performance. Thus, not only are physical presence and content preserved, but digital objects could display original features (e.g., layout) and functionality available with the older software.

Emulation has recently attracted attention as a potential strategy to assist preservation, recognizing that some electronic material that is highly dependent on particular hardware and software will not lend itself to migration. Emulation is used to provide "backward compatibility" for video games, and to model how future systems might run. Emulators exist for some obsolete systems; however, emulation for preserving digital objects over the long term has not been widely tested or priced.

5.3 Output to permanent paper or microfilm

Outputting a hard copy of a digital file is a "low tech" solution that can result in a well-standardized product with a life expectancy of several hundred years. Certainly, this strategy could fix the object as a whole and preserve content and to some extent layout.
However, a decreasing number of publications (flat files, printable formats) lend themselves to such methods. For example, output to paper will lead to great functional loss for hypertext documents, and cannot capture multimedia.

Despite these drawbacks, a "hybrid strategy" of creating both microfilm and digital copies is gaining support as a technique for reformatting paper originals. The digital copy enhances access and functionality, and the microform copy acts as an archival surrogate.

5.4 Preserve technology

Another method for ensuring ongoing access to digital objects would be to simply keep older technology available for use. Although this would preserve content and enable future generations to view digital objects in their native format with original layout and functionality, creating hardware or software "museums" is prohibitive in cost, space and technical support requirements. At best, this method is an interim measure when migration is not possible.

Research is currently under way to explore the preservation strategies outlined above. In an early initiative in the U.K., the National Preservation Office and the Joint Information Systems Committee (JISC) co-funded several projects on digital archiving. One outcome of this work is a tool for measuring the complexity of the preservation process and guiding selection of a preservation approach. The tool uses a scorecard whose elements include the type of material, file format, medium and platform/operating system. Other JISC/NPO reports focus on comparing digital preservation methods, and performing ad-hoc rescue of digital material.

The JISC is building on this foundation by funding the CEDARS [3] project (CURL Exemplars in Digital ARchiveS) through the Consortium of University Research Libraries. The three-year project began in March 1998. Among other objectives, CEDARS will investigate methods of preserving different sorts of digital resources and develop priced and scalable models.
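To make the scorecard idea mentioned above more concrete, the following Python sketch adds up a difficulty score over the four elements the text names (type of material, file format, medium, platform). It does not reproduce the JISC/NPO tool: every category, score and threshold below is invented for illustration, and a real scorecard would need values grounded in an institution's own experience.

```python
from dataclasses import dataclass

# Hypothetical difficulty scores (0 = easy, 3 = hard), invented for this sketch.
SCORES = {
    "material": {"text": 0, "image": 1, "database": 2, "multimedia": 3},
    "file_format": {"open_standard": 0, "proprietary": 2, "unknown": 3},
    "medium": {"online": 0, "cd_rom": 1, "floppy_disk": 2},
    "platform": {"current": 0, "aging": 1, "obsolete": 3},
}


@dataclass(frozen=True)
class DigitalObject:
    """The four scorecard elements named in the text."""
    material: str
    file_format: str
    medium: str
    platform: str


def complexity(obj: DigitalObject) -> int:
    """Sum the hypothetical scores across all four elements."""
    return sum(SCORES[element][getattr(obj, element)] for element in SCORES)


def suggest_approach(score: int) -> str:
    """Map a score to a strategy using made-up thresholds."""
    if score <= 3:
        return "migration"
    if score <= 7:
        return "migration with testing, or emulation"
    return "emulation or ad-hoc rescue"


if __name__ == "__main__":
    simple = DigitalObject("text", "open_standard", "online", "current")
    hard = DigitalObject("multimedia", "proprietary", "floppy_disk", "obsolete")
    for obj in (simple, hard):
        score = complexity(obj)
        print(obj.material, score, suggest_approach(score))
```

The point of such a structure is less the numbers than the discipline: recording the same four facts for every object makes it possible to compare holdings and to see which ones are likely to need attention first.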
Other research on digital preservation approaches is being carried out in the United States. Cornell University, for example, is working to create risk management tools for the management of digital information, and to develop a plan for the long-term preservation of Cornell’s digital documents [4]. The Council on Library and Information Resources, which is partially funding the Cornell work, has also just released a report on emulation [5]. The report analyzes proposed preservation strategies, outlines why emulation is the most feasible, and identifies further research that needs to be done to support emulation as the preservation strategy of choice for digital objects.

Clearly, there is more work to be done before current strategies will satisfy most preservation needs. Furthermore, strategies such as migration and emulation will require ongoing commitment and significant resources. As in the analog realm, a combination of approaches will be required to ensure that digital information survives.

6.0 Assuming control

To increase the probability that digital objects will be preserved, organizations need to lay appropriate groundwork. One way is to develop and implement the most effective practices in acquiring, describing and managing digital resources.
6.1 Adopting standards

To facilitate preservation, the best practice is to adopt a three-part approach: 1) use current standards to create digital objects; 2) monitor standards as they change; 3) migrate to new standards as they are established.

Most digital preservation guidelines advocate collecting digital objects in standard formats. Unfortunately, while standards are well defined for text (e.g., ASCII), images (e.g., JPEG, TIFF) and document encoding (e.g., SGML, HTML), standards have not emerged for some other types of information (e.g., databases). In addition, not only do standards change rapidly, but vendors seeking to increase their market share add "value-added" features to accepted standards. Thus, some valuable information that should be preserved is found in non-standard and "almost standard" digital objects.

Institutions performing the digital preservation activity can also be subject to standardization. The RLG Task Force on Digital Archiving, for example, proposes "a formal process of certification, in which digital archives meet or exceed the standards and criteria of an independent certifying agency" [6]. In a similar vein, the International Organization for Standardization (ISO) has produced a reference model (CCSDS 650.0-W-4.0) for an Open Archival Information System (OAIS) [7]. The model establishes the minimum requirements for a digital archive to ensure long-term preservation of digital information, and provides a framework for describing and comparing archival architecture and operation. A consortium of European libraries has adopted the OAIS as the reference model for the Networked European Depository Library (NEDLIB). The model is also being used in digital preservation initiatives in the U.K. and Australia.

6.2 Developing Digital Preservation Guidelines

Archives and record-keeping bodies in Europe, North America and Australia have taken the lead in developing best practices and functional requirements that address some preservation issues [8].
Common elements include

Some libraries have developed similar guidelines. However, a recent survey concluded that most digital preservation guidelines focus on creation, receipt and capture of digital objects, and do not satisfactorily address their long-term preservation [9].

Although long-term preservation needs may not be completely addressed by current guidelines, one other area of preserving digital objects is well documented. Guidelines tell us how to store and handle physical format digital objects properly and how to manage risk. These measures maximize the possibility that the resources can be moved through successive technological generations, and work well in the short term. Unfortunately, they are not in themselves sufficient to ensure long-term accessibility to the material.

6.3 Documenting Resources

A recurrent theme in digital preservation guidelines is documentation and description of electronic resources. The need for such deliberate description stems in part from the fact that digital objects do not carry the visible evidence of creation and use (imprints, bindings, bookplates, marginalia, or Scotch tape) of non-electronic formats. Such clues guide preservation decisions. They also help users to establish that the work is whole and intact, and to understand its provenance and the context in which it was created.

6.3.1 Metadata

A description of a digital object is "data about data," or metadata. Such descriptive data should include the contextual information crucial to the long-term management of electronic information. Metadata elements useful in preservation might include

Conversion projects would employ additional metadata elements such as the capture device, resolution, compression, source material, and producer (of the digital document).

Existing metadata schemes (e.g., MARC, Dublin Core) provide for some such information capture, but there is no consensus on which approach will work best for preservation purposes.
In MARC, for example, some of the necessary data is captured in notes fields that may not use sufficiently detailed or consistent language for subsequent search and retrieval.

6.3.2 Unique identifiers

One element of describing a digital object is to assign it a unique and persistent identifier. An identifier is a number, like an ISBN, which is associated with a particular instance of a digital object. Unlike a URL, it is independent of the location of that object.

A unique, widely supported identifier for digital objects would help establish the authenticity of the object by confirming to a user that the resource being accessed is the one cited. It could also help establish the relationship between copies or versions of digital objects, as any modification of the original would result in a modification of its identifier.

An outline of the various naming schemes is given in A Glossary of Digital Library Standards, Protocols and Formats [10]. Some organizations involved in digital preservation are currently using PURLs (Persistent Uniform Resource Locators), URNs (Uniform Resource Names) or modified Digital Object Identifiers. As yet, no single digital identifier has achieved widespread, international acceptance.

6.3.3 Linking metadata with content

Metadata can be stored either as an integral part of the document it describes (e.g., embedded in an HTML header) or as part of a separate file of information (e.g., a MARC record). One way of linking metadata and the digital object is to package them together. To this end, the aforementioned Reference Model for an Open Archival Information System (OAIS) proposes an "information package" composed of "content information" and "preservation descriptive information" [11]. Along similar lines, a working group of the Society of Motion Picture and Television Engineers has developed a Universal Preservation Format (UPF) [12], a data-file mechanism that uses a container structure to incorporate metadata into digital media objects.
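The idea of packaging a digital object together with its description, and of an identifier that changes when the object changes, can be sketched in a few lines of Python. This is an illustration only and follows neither the OAIS model nor the UPF in detail. The class and field names are assumptions; the descriptive fields are the conversion metadata elements listed in section 6.3.1; and the identifier is a content hash with a made-up "example-id:" prefix, used as a stand-in for assigned schemes such as PURLs, URNs or Digital Object Identifiers.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def content_identifier(content: bytes) -> str:
    """Derive a location-independent identifier from the bytes themselves.

    The "example-id:" prefix is hypothetical, not a registered scheme.
    """
    return "example-id:" + hashlib.sha256(content).hexdigest()


@dataclass
class PreservationDescription:
    """Descriptive fields for a converted document (names are assumptions)."""
    identifier: str
    producer: str
    source_material: str
    capture_device: str
    resolution_dpi: int
    compression: str


@dataclass
class InformationPackage:
    """Content and its description, kept together as one unit."""
    content: bytes
    description: PreservationDescription

    def is_intact(self) -> bool:
        """True if the content still matches the recorded identifier."""
        return content_identifier(self.content) == self.description.identifier

    def description_as_json(self) -> str:
        """Serialize the description so it can travel with the content."""
        return json.dumps(asdict(self.description), indent=2, sort_keys=True)


def build_package(content: bytes, **details) -> InformationPackage:
    """Create a package whose identifier is computed from the content."""
    description = PreservationDescription(
        identifier=content_identifier(content), **details)
    return InformationPackage(content=content, description=description)


if __name__ == "__main__":
    package = build_package(
        b"scanned page bytes",          # placeholder content
        producer="Example Library",     # all values here are hypothetical
        source_material="paper original",
        capture_device="flatbed scanner",
        resolution_dpi=300,
        compression="none",
    )
    print(package.description_as_json())
    print("intact:", package.is_intact())
```

Because the identifier is derived from the content, altering even one byte makes `is_intact()` return False, which is one simple way to show a user that the resource in hand is the one that was described.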
Although primarily developed for audio-visual data, the principles underlying the UPF may have wider applicability.

6.4 Building Partnerships

Another way in which libraries and archives are assuming control of digital preservation is by forging partnerships. Examples of library consortia include the eLib (Electronic Library) project in the U.K., the Digital Library Federation and the Research Libraries Group ARCHES (Archival Server and Test Bed) in the U.S., the NEDLIB (Networked European Depository Library) in Europe, and the Canadian Initiative on Digital Libraries in Canada [13]. These groups were founded primarily to build digital libraries, in which managing preservation is a necessary component.

6.5 Establishing Infrastructure

A final way in which libraries are gaining control over the preservation of digital objects is by building preservation requirements into systems development. The National Library of Australia, for example, has started a Digital Services Project to develop systems to manage its digital collections and to support cooperative, shared access [14]. The National Library of Canada has incorporated preservation requirements in its Digital Library Infrastructure Project. The next step is to develop the systems.

7.0 Next steps

In developing policies and establishing and promoting best practices and procedures to select, acquire, describe and store digital objects, archives, libraries, and others have taken the first steps to preserving these resources. The next steps remain ill defined. To reduce the uncertainty surrounding digital preservation, we can
We are not yet dealing satisfactorily with the problems of digital preservation. Rapid technological obsolescence combined with relatively short-lived media means that collections must be actively managed. Simply collecting and "shelving" important works, a passive strategy that works to some extent for paper-based publications, is insufficient to ensure digital objects will survive in perpetuity.

Resources

Bearman, David; Trant, Jennifer. -- Authenticity of Digital Resources: Towards a Statement of Requirements in the Research Process. -- D-Lib Magazine. -- (June 1998). <http://sunsite.anu.edu.au/mirrors/dlib/dlib/june98/06bearman.html>

Graham, Peter S. -- Long-Term Intellectual Preservation. -- Collection Management. -- Vol. 22, no. 3/4 (1998). -- P. 81-98

Hedstrom, Margaret. -- Digital preservation: a time bomb for Digital Libraries. -- (n.d.).

Day, Michael. -- Metadata for Preservation. -- CEDARS Project Document AIW01. -- (August 3, 1998). <http://www.ukoln.ac.uk/metadata/cedars/AIW01.html> (February 24, 1999)

National Library of Australia. -- PADI: Preserving Access to Digital Information Web site. <http://www.nla.gov.au/padi/>

Rothenberg, Jeff. -- Ensuring the Longevity of Digital Information. -- (Rev. February 22, 1999). <http://www.clir.org/programs/otheractiv/ensuring.pdf>

Task Force on the Archiving of Digital Information. -- Preserving digital information: report of the Task Force on Archiving of Digital Information. -- Commissioned by the Commission on Preservation and Access and the Research Libraries Group. -- Washington, D.C.: Commission on Preservation and Access, 1996. <http://www.rlg.org/ArchTF/>

Acknowledgements

Thanks to my colleagues at the National Library of Canada -- namely, Gary Cleveland, Susan Haigh and Nancy Brodie -- for reviewing and suggesting revisions to this paper.
__________
Notes

[1] Many of these concepts (content, fixity, reference, context and provenance) are derived from the Task Force on the Archiving of Digital Information, Preserving digital information: report of the Task Force on Archiving of Digital Information commissioned by the Commission on Preservation and Access and the Research Libraries Group. Washington, D.C.: Commission on Preservation and Access, 1996. <www.rlg.org/ArchTF/>
[2] Colin Webb, "Migration Trials: Migrating publications on floppy disk to CD-R". October 1997. <www.nla.gov.au/nla/staffpaper/cwebb7.html> (Jan. 20, 1999)
[4] <www.news.cornell.edu/releases/Nov98/preserving.digital.bs.html>
[5] The full text of this report, titled "Avoiding Technological Quicksand: Finding a Viable Technological Foundation for Digital Preservation", is available at <www.clir.org/pubs/reports/reports.html>
[7] <ftp://nssdc.gsfc.nasa.gov/pub/sfdu/isoas/int07/CCSDS-650.0-W-4.pdf>
[8] For example, see Martin Bangemann, Guidelines for best practices for using electronic information, European Commission, 1997. <www.echo.lu/dlm/en/gdlines.html>
[9] Marc Fresko and Kenneth Tombs. "Digital Preservation Guidelines: The state of the art in libraries, museums and archives." European Communities, 1998. <www.echo.lu/digicult/en/backgrd.html>
[10] Susan Haigh. "A Glossary of Digital Library Standards, Protocols and Formats." Network Notes #54, May 6, 1998. <www.nlc-bnc.ca/pubs/netnotes/notes54.htm>
[12] UPF Homepage: <info.wgbh.org/upf/>
[13] See summary in Nancy Brodie. "Collaboration among National Libraries in the Preservation of Digital Information", National Library News, vol. 31, no. 3-4, Mar-Apr. 1999, pp. 5-6. <www.nlc-bnc.ca/pubs/nl-news/1999/mar99e/mar99e.htm>
[14] An explanation of the project, and supporting documentation, is available at <www.nla.gov.au/dsp/>

Copyright. The National Library of Canada. (Revised: 1999-5-26).