
D-Lib Magazine
January 2000

Volume 6 Number 1

ISSN 1082-9873

Best Practices for Digital Archiving

An Information Life Cycle Approach

Gail M. Hodge
Information International Associates, Inc.
Consultant to the International Council for Scientific and Technical Information
gailhodge@aol.com

As we move into the electronic era of digital objects it is important to know that there are new barbarians at the gate and that we are moving into an era where much of what we know today, much of what is coded and written electronically, will be lost forever. We are, to my mind, living in the midst of digital Dark Ages; consequently, much as monks of times past, it falls to librarians and archivists to hold to the tradition which reveres history and the published heritage of our times. - Terry Kuny, XIST/Consultant, National Library of Canada [Kuny 1998]

1.0 Introduction

The rapid growth in the creation and dissemination of digital objects by authors, publishers, corporations, governments, and even librarians, archivists and museum curators, has emphasized the speed and ease of short-term dissemination with little regard for the long-term preservation of digital information. However, digital information is fragile in ways that differ from traditional technologies, such as paper or microfilm. It is more easily corrupted or altered without recognition. Digital storage media have shorter life spans, and digital information requires access technologies that are changing at an ever-increasing pace. Some types of information, such as multimedia, are so closely linked to the software and hardware technologies that they cannot be used outside these proprietary environments [Kuny 1998]. Because of the speed of technological advances, the time frame in which we must consider archiving becomes much shorter. The interval between a digital object's creation and the point at which it must be preserved is shrinking.

While there are traditions of stewardship and best practices that have become institutionalized in the print environment, many of these traditions are inadequate, inappropriate or not well known among the stakeholders in the digital environment. Originators are able to bypass the traditional publishing, dissemination and announcement processes that are part of the traditional path from creation to archiving and preservation. Groups and individuals who did not previously consider themselves to be archivists are now being drawn into the role, either because of the infrastructure and intellectual property issues involved or because user groups are demanding it. Librarians and archivists, who traditionally managed the life cycle of print information from creation to long-term preservation and archiving, must now look to information managers from the computer science tradition to support the development of a system of stewardship in the new digital environment. There is a need to identify new best practices that meet the requirements of long-term preservation and are practical for the various stakeholder groups involved.

2.0 The Background of the ICSTI Study

In an effort to advance the state-of-the-art and practice of digital archiving, the International Council for Scientific and Technical Information (ICSTI), a community of scientific and technical information organizations that includes national libraries, research institutes, publishers, and bibliographic database producers, sponsored a study in March 1999 [Hodge 1999]. This study is the most recent in a series of efforts on the part of ICSTI to highlight the importance of digital archiving. The topic was first raised at the joint UNESCO/International Council of Scientific Unions (ICSU) Conference on Electronic Publishing in 1996, and it was highlighted again at the technical session of the June 1997 Annual ICSTI Meeting, where a working group was formed. The Electronic Publications Archive Working Group presented a white paper on the major issues in December 1998 [ICSTI 1998]. At its December 1998 meeting, the ICSTI Board approved the study on which this report is based. Based on common interest in this topic, CENDI, an interagency working group of scientific and technical information managers in the U.S. federal government, cosponsored the study.

3.0 Study Methodology

The study began with an initial survey of the ICSTI and CENDI membership, a literature review and contacts with experts in order to identify digital archiving projects. Over 30 projects were identified, from which 18 were selected as the most "cutting edge". The highlighted projects covered six countries (U.S. (9), UK (2), Canada (1), Australia (1), Sweden (1) and Finland (1)) and four international organizations. They came from a variety of sectors including government scientific and technical programs, national archives, national libraries, publishers, and research institutes.

Project managers from the selected projects were asked a series of questions aimed at identifying emerging models and best practices for digital archiving. While technologies for storage and retrieval were discussed, technology was of secondary interest to the understanding of policy and practice.

For purposes of the study, "digital archiving" was defined as the long-term storage, preservation and access to information that is "born digital" (created and disseminated primarily in electronic form) or for which the digital version is considered to be the primary archive. (The study did not include the digitization of material from another medium unless the digital version became the primary one.) The study aimed to provide new insights into digital archiving issues raised by many of the baseline studies and white papers on digital archiving [Garrett 1996; Hedstrom 1998; NRC 1995; Haynes 1997; Beagrie 1998]. Primary attention was given to operational and prototype projects involving scientific and technical information at an international level. The study included a variety of digital format types applicable to scientific and technical information, including data, text, images, audio, video and multimedia, and a variety of object types, such as electronic journals, monographs, satellite imagery, biological sequence data, and patents. The results, while not statistically representative, identify emerging models and best practices for digital archives in an effort to support the development of a tradition of digital stewardship.

4.0 Digital Archiving in the Framework of Information Life Cycle Management

The project managers from the "cutting edge" projects emphasized the importance of considering best practices for archiving at all stages of the information management life cycle. Acknowledging this important philosophy, the best practices identified by the study are presented in the framework of the information life cycle -- creation, acquisition, cataloging/identification, storage, preservation and access.

4.1 Creation

Creation is the act of producing the information product. The producer may be a human author or originator, or a piece of equipment such as a sensing device, satellite or laboratory instrument. Creation is viewed here in the broadest sense, as increasingly science is based on a variety of data types, products and originators.

All project managers acknowledged that creation is where long-term archiving and preservation must start. Even in rigorously controlled situations, digital information may be lost if the originator is not aware from the outset of the importance of archiving. Practices used when a digital object is created ultimately affect the ease with which the object can be digitally archived and preserved.

In addition, there are several key practices involving the creator that are evolving within the archiving projects. First, the creator may be involved in assessing the long-term value of the information. In lieu of other assessment factors, the creator’s estimate of the long-term value of the information may be a good indication of the value that will be placed on it by people within the same discipline or area of research in the future. The U.S. Department of Agriculture’s Digital Publications Preservation Steering Committee has suggested that the creator provide a preservation indicator in the document. This would not take the place of formal retention schedules, but it would provide an indication of the long-term value that the creator, as a practicing researcher, attaches to the document’s contents.

Second, the preservation and archiving process is made more efficient when attention is paid to issues of consistency, format, standardization and metadata description at the very beginning of the information life cycle. The Oak Ridge National Laboratory (Tennessee, USA) recently announced guidelines for the creation of digital documents. Limits are placed both on the software that can be used and on the format and layout of the documents in order to make short- and long-term information management easier.

Many project managers acknowledged that the best practice would be to create the metadata at the object creation stage, or to create the metadata in stages, with the metadata provided at creation augmented by additional elements during the cataloging/identification stage. However, only in the case of data objects is the metadata routinely collected at the point of creation. Many of the datasets are created by measurement or monitoring instruments, and the metadata is supplied along with the data stream. This may include location, instrument type, and other quality indicators concerning the context of the measurement. In some cases, this instrument-generated metadata is supplemented by information provided by the original researcher.

For smaller datasets and other objects such as documents and images, much of the metadata continues to be created "by hand" and after the fact. Metadata creation is not yet sufficiently built into authoring tools for archives to rely on the creation process alone. As standards groups and vendors incorporate XML (eXtensible Mark-up Language) and RDF (Resource Description Framework) architectures into their word processing and database products, creating metadata as part of the origination of the object will become easier.
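
As an illustration of metadata captured at the point of creation, the minimal Python sketch below builds a small Dublin Core description in RDF/XML, of the kind an authoring tool might emit when a document is saved. The Dublin Core elements and RDF namespaces are the real ones; the document URL, the field values, and the "preservation-indicator" element (standing in for the creator-supplied indicator suggested by the USDA committee above) are hypothetical.

    # A minimal sketch of metadata captured at creation time, expressed as
    # a Dublin Core description in RDF/XML.
    import xml.etree.ElementTree as ET

    RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    DC = "http://purl.org/dc/elements/1.1/"
    ET.register_namespace("rdf", RDF)
    ET.register_namespace("dc", DC)

    def creation_metadata(url, title, creator, date, fmt, keep_long_term):
        root = ET.Element(f"{{{RDF}}}RDF")
        desc = ET.SubElement(root, f"{{{RDF}}}Description",
                             {f"{{{RDF}}}about": url})
        for tag, value in [("title", title), ("creator", creator),
                           ("date", date), ("format", fmt)]:
            ET.SubElement(desc, f"{{{DC}}}{tag}").text = value
        # Hypothetical preservation indicator of the kind the USDA
        # committee suggested; not a Dublin Core element.
        indicator = "long-term" if keep_long_term else "routine"
        ET.SubElement(desc, "preservation-indicator").text = indicator
        return ET.tostring(root, encoding="unicode")

    print(creation_metadata("http://example.org/report-42",
                            "Soil Survey 1999", "J. Researcher",
                            "1999-11-03", "text/sgml", True))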

4.2 Acquisition and Collection Development

Acquisition and collection development is the stage in which the created object is "incorporated" physically or virtually into the archive. The object must be known to the archive administration. There are two main aspects to the acquisition of digital objects -- collection policies and gathering procedures.

4.2.1 Collection Policies

In most countries, the major difference in collection policies between formal print and electronic publications is the question of whether digital materials are included under current deposit legislation. Where legislation does not yet apply, guidelines help to establish boundaries. There is also simply too much material on the Internet that could be archived, so guidelines are needed to tailor the general collection practices of the organization. The collection policies answer questions related to selecting what to archive, determining extent, archiving links, and refreshing site contents.

4.2.1.1 Selecting What to Archive

Both the National Library of Canada (NLC) and the National Library of Australia (NLA) acknowledge the importance of selection guidelines. The NLC’s Guidelines state, "The main difficulty in extending legal deposit to network publishing is that legal deposit is a relatively indiscriminate acquisition mechanism that aims at comprehensiveness. In the network environment, any individual with access to the Internet can be a publisher, and the network publishing process does not always provide the initial screening and selection at the manuscript stage on which libraries have traditionally relied in the print environment… Selection policies are, therefore, needed to ensure the collection of publications of lasting cultural and research value." [NLC 1998]

While the scope of NLA’s PANDORA (Preserving and Accessing Networked DOcumentary Resources of Australia) Project is only to preserve Australian Internet publishing, the NLA also acknowledges that it is still impossible to archive everything. Therefore, the NLA has formulated guidelines for the Selection of Online Australian Publications Intended for Preservation by the National Library of Australia. These guidelines are key to the successful networking of the state libraries into the National Collection of Australian Electronic Publications, since they provide consistency across multiple acquisition activities. Scholarly publications of national significance and those of current and long term research value are archived comprehensively. Other items are archived on a selective basis "to provide a broad cultural snapshot of how Australians are using the Internet to disseminate information, express opinions, lobby, and publish their creative work." [NLA]

4.2.1.2 Determining Extent

Directly connected to the question of selection is the issue of extent. What is the extent or the boundary of a particular digital work? This is particularly an issue when selecting complex Web sites.

"[For PANDORA] internal links only are archived. Both higher and lower links on the site are explored to establish which components form a title that stands on its own for the purposes of preservation and cataloguing. …preference is given to breaking down large sites into component titles and selecting those that meet the guidelines. However, sometimes the components of larger publications or sites do not stand well on their own but together do form a valuable source of information. In this case, if it fits the guidelines, the site should be selected for archiving as an entity." [NLA]

4.2.1.3 Archiving Links

The extensive use of hypertext links to other digital objects in electronic publications raises the question of whether these links and their content should be archived along with the source item. This issue has been addressed by the selected projects in a variety of ways.

Most organizations archive the links (the URLs or other identifiers) but not the content of the linked objects. The American Institute of Physics archives the links embedded in the text and references of its electronic journal articles but not the text or content of any of these links, unless the linked item happens to be in its publication archive or in the supplemental material that it also archives. Similarly, the Office of Scientific and Technical Information of the U.S. Department of Energy (DOE OSTI) does not intentionally archive any links beyond the extent of the digital object itself. However, the document may be linked to another document if that document is another DOE document in the OSTI archive. The NLA's decision about archiving the content of linked objects is based on its selection guidelines. If a linked item meets the selection guidelines, its contents will be archived; otherwise, they will not be.

In a slightly different approach, the NLC has chosen to archive the text of the linked object only if it is on the same server as the object that is being archived. The NLC cites difficulties in tracking down hypertext links and acquiring the linked objects as the reason for its decision not to include the content of other links. The previous issue of the same periodical, accessed through a hypertext link, would be considered a part of the original publication. Another publication accessed through a hypertext link would not be considered part of the original publication.

Only two of the reviewed projects archive the content of all links. Internet guru Brewster Kahle’s Internet Archive retains all links (unless they are to "off-limits" sites), because the aim of the project is to archive a snapshot of the entire Internet. Within a specific domain, the American Astronomical Society also maintains all links to both documents and supporting materials in other formats, based on extensive collaboration among the various international astronomical societies, researchers, universities, and government agencies. Each organization archives its own publications, but links are maintained not only from references in the full text and cited references of the articles, but between and among the major international astronomical databases. Within this specific domain, the contents of all linked objects are available.
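
Viewed side by side, these policies are different answers to one question: given a link in an archived object, should the linked content itself be captured? A minimal sketch follows, with hypothetical policy labels and a stand-in predicate for selection guidelines.

    # A minimal sketch contrasting the link-archiving policies described
    # above as a single decision function. Policy names are hypothetical.
    from urllib.parse import urlparse

    def archive_linked_content(policy, source_url, linked_url,
                               meets_guidelines=False):
        """Should the content behind a link be captured with the source?"""
        if policy == "links-only":    # e.g., AIP, DOE OSTI: keep URL only
            return False
        if policy == "same-server":   # e.g., NLC: content from same server
            return urlparse(source_url).netloc == urlparse(linked_url).netloc
        if policy == "selection":     # e.g., NLA: content if guidelines met
            return meets_guidelines
        if policy == "all-links":     # e.g., Internet Archive, AAS domain
            return True
        raise ValueError(f"unknown policy: {policy}")

    print(archive_linked_content("same-server",
                                 "http://journal.example.org/v1/art1.html",
                                 "http://journal.example.org/v1/art0.html"))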

4.2.1.4 Refreshing the Archived Contents

In cases where the archiving is taking place while changes or updates may still be occurring to the digital object, as in the case of on-going Web sites, there is a need to consider refreshing the archived contents. A balance must be struck between the completeness and currency of the archive and the burden on the system resources. Obviously, the burden of refreshing the content increases as the number of sources stored in the archive increases. For example, NLA allocates a gathering schedule to each "publication" in its automatic harvesting program. The options include on/off, weekly, monthly, quarterly, half-yearly, every nine months, or annually. The selection is dependent on the degree of change expected and the overall stability of the site.
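
A minimal sketch of how such a gathering schedule might be applied is shown below; the frequency labels mirror the options listed above, while the data structures and day counts are hypothetical simplifications.

    # A minimal sketch of a per-publication gathering schedule of the
    # kind PANDORA allocates. Intervals are approximate stand-ins.
    from datetime import date, timedelta

    INTERVALS = {
        "weekly": timedelta(weeks=1),
        "monthly": timedelta(days=30),
        "quarterly": timedelta(days=91),
        "half-yearly": timedelta(days=182),
        "nine-monthly": timedelta(days=273),
        "annually": timedelta(days=365),
    }

    def due_for_regathering(schedule, last_gathered, today=None):
        """Return True if the publication should be recaptured today."""
        if schedule == "off":
            return False
        today = today or date.today()
        return today - last_gathered >= INTERVALS[schedule]

    print(due_for_regathering("quarterly", date(1999, 9, 1),
                              date(2000, 1, 15)))  # True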

4.2.2 Gathering Approaches

There are two general approaches to the gathering of relevant Internet-based information -- hand-selected and automatic. In the case of the NLA, the sites are reviewed and hand-selected. They are monitored for their persistence before being included in the archive. Alternatively, the Royal Library, the National Library of Sweden, acquires material by periodically running a robot to capture sites for its Kulturarw3 project without making value judgments [National Library of Sweden]. The harvester automatically captures sites from the .se country domain and from known Web servers that are located in Sweden even though they have .com extensions. In addition, some material is obtained from foreign sites with material about Sweden, such as travel information or translations of Swedish literature. While the acquisition is automatic, the National Library gives priority to periodicals, static documents, and HTML pages. Conferences, usenet groups, ftp archives, and databases are considered lower priority.

The EVA Project at the University of Helsinki, National Library of Finland, uses techniques similar to those used in Sweden. However, the guidelines from EVA address issues to be considered when using robots for harvesting. In order not to overload the servers being harvested, particularly those belonging to the public networks, the EVA guidelines establish time limits between visits to a single Web server and between capturing and recapturing a single URL. Even though this approach has allowed the EVA project to progress, developers at EVA consider this approach to be "very rough and not flexible enough for archiving purposes." [Helsinki University Library] The EVA developers would prefer that the time limits be configurable at the server and, preferably, at the individual URL level. This flexibility would require that the scheduler be a database application that can be modified by the librarian.
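
The politeness rules EVA describes reduce to two checks before any fetch: has enough time passed since the last visit to this server, and since the last capture of this URL? A minimal sketch follows, with hypothetical limits.

    # A minimal sketch of EVA-style politeness rules: a minimum delay
    # between visits to one server, and a minimum interval before
    # recapturing one URL. The limits and structures are hypothetical.
    import time
    from urllib.parse import urlparse

    SERVER_DELAY = 30.0          # seconds between hits on one server
    URL_RECAPTURE = 7 * 86400.0  # seconds before recapturing one URL

    last_server_visit = {}
    last_url_capture = {}

    def may_fetch(url, now=None):
        now = now if now is not None else time.time()
        host = urlparse(url).netloc
        if now - last_server_visit.get(host, 0.0) < SERVER_DELAY:
            return False
        if now - last_url_capture.get(url, 0.0) < URL_RECAPTURE:
            return False
        last_server_visit[host] = now
        last_url_capture[url] = now
        return True

Moving the two fixed limits into per-server and per-URL database records would provide the finer-grained configurability that the EVA developers call for.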

4.2.3 Intellectual Property Concerns

Intellectual property remains a key issue in the acquisition process. The approaches to intellectual property vary based on the type of organization doing the archiving. In the case of data centers or corporate archives where there is a close tie between the center and the owner or funding source, there is little question about the intellectual property rights related to acquisition. However, in the case of national libraries, the approaches to intellectual property rights differ from country to country. The differences are based on variant national information policies or legal deposit laws. In many countries, the law has not yet caught up with the digital environment, and the libraries must make their own decisions. In the absence of digital deposit legislation, the PANDORA Project seeks permission from the copyright owner before copying the resource for the archive. In contrast, the Swedish and Finnish national library projects have an automated system and do not contact the owners.

4.3 Identification and Cataloging

Once the archive has acquired the digital object, it is necessary to identify and catalog it. Both identification and cataloging allow the archiving organization to manage the digital objects over time. Identification provides a unique key for finding the object and linking that object to other related objects. Cataloging in the form of metadata supports organization, access and curation. Cataloging and identification practices are often related to what is being archived and the resources available for managing the archive.

4.3.1 Metadata

All archives use some form of metadata for description, reuse, administration, and preservation of the archived object. There are issues related to how the metadata is created, the metadata standards and content rules that are used, the level at which metadata is applied and where the metadata is stored.

The majority of the projects created metadata in whole or part at the cataloging stage. However, there is increasing interest in automatic generation of metadata, since the manual creation of metadata is considered to be a major impediment to digital archiving. A project is underway at the U.S. Environmental Protection Agency to derive metadata at the data element level from legacy databases. The Defense Information Technology Testbed (DITT) Project within the U.S. Department of Defense is also investigating automated metadata generation.

A variety of metadata formats are used by the selected projects, depending on the data type, discipline, resources available, and cataloging approaches used. Most national libraries use traditional library cataloging standards, with some fields that cannot be filled and others taking on new meanings. All titles in the NLA's PANDORA Archive receive full MARC cataloging by the Electronic Unit staff. However, several newer abbreviated formats developed specifically for Web-based resources are also in use. EVA uses a Dublin Core-like format. It is anticipated that an abbreviated format such as the Dublin Core may facilitate receipt of metadata directly from the publisher, eliminating the need for extensive library cataloging.

There is an even greater variety of content standards used by the projects when entering data into the metadata fields. The national libraries tend to use traditional library cataloging rules such as AACR2. Some communities, such as the geospatial community, have information standards, such as latitude and longitude, which are easily incorporated as metadata content standards. However, work remains to identify the specific metadata elements needed for long-term preservation as opposed to discovery, particularly for non-textual data types like images, video and multimedia.

The level at which metadata is applied depends on the type of data and the anticipated access needs. Datasets are generally cataloged at the file or collection level. Electronic journal articles may be cataloged individually, sometimes with no concern about metadata for the issue or journal title levels. Homepages provide a particularly difficult problem for determining the level at which metadata should be applied. Generally, the metadata is applied to whatever level is considered to be the full extent of the intellectual resource.

In the projects reviewed, the metadata files generally are stored separately from the archives themselves. Libraries may store the metadata in their online public access catalogs. Publishers may store the metadata in a bibliographic or citation database. However, in some instances, such as electronic journals with tagged headers for title, authors, author affiliation, etc., the information may be stored with the object itself and extracted for the catalog. In the case of distributed archives, the metadata may be stored centrally, with the objects distributed throughout the network, or the metadata may be stored as embedded tags in the digital resource.

Discussions surrounding the interoperability of archives, both within and across disciplines, focus on the need to crosswalk, or translate, between the various metadata formats. This is key to the development of networked, heterogeneous archives. The Open Archival Information System (OAIS) Reference Model developed by the ISO Consultative Committee for Space Data Systems addresses this issue by encapsulating specific metadata as needed for each object type in a consistent data model [CCSDS 1998]. The Long Term Ecological Research (LTER) Network has developed mechanisms for "fitting" its network-specific metadata into the broader scheme of the Federal Geographic Data Committee content standard for geographic data and other standards related to ecology.
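
A crosswalk is, at its simplest, a mapping from the elements of one metadata scheme onto the fields of another. The sketch below maps a few Dublin Core elements onto MARC-like tags; the correspondences shown are commonly cited ones, but the mapping is illustrative and far from complete.

    # A minimal sketch of a metadata crosswalk from Dublin Core element
    # names to MARC-like tags. The record layout is illustrative.
    DC_TO_MARC = {
        "title": "245",       # title statement
        "creator": "100",     # main entry, personal name
        "subject": "650",     # topical subject heading
        "date": "260",        # publication information (date)
        "identifier": "856",  # electronic location and access
    }

    def crosswalk(dc_record):
        """Translate a {dc_element: value} record into {marc_tag: [values]}."""
        marc = {}
        for element, value in dc_record.items():
            tag = DC_TO_MARC.get(element)
            if tag:
                marc.setdefault(tag, []).append(value)
        return marc

    print(crosswalk({"title": "Kulturarw3 Report",
                     "creator": "Royal Library", "date": "1999"}))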

4.3.2 Persistent Identification

For those archives that do not copy the digital material immediately into the archive, the movement of material from server to server or from directory to directory on the network, resulting in a change in the URL, is problematic. The use of the server as the location identifier can result in a lack of persistence over time both for the source object and any linked objects.

Despite possible problems, most archives continue to use the URL when referencing the location of a digital object. However, some projects are changing this practice. The OCLC archive uses PURLs <http://purl.oclc.org/>, persistent identifiers to which the changeable URL is mapped. The American Chemical Society (ACS) uses the Digital Object Identifier <http://www.doi.org> for its journal articles and also maintains the original Manuscript Number assigned to the item at the beginning of the publication process. The Defense Technical Information Center of the U.S. Department of Defense is using the Handle® system <http://www.handle.net/> developed by CNRI.
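
All of these schemes share one resolution pattern: the persistent identifier is a stable key into a registry that holds the current, changeable location, so a move updates one registry entry rather than every citation. A minimal sketch, with hypothetical identifiers:

    # A minimal sketch of the idea behind PURLs, DOIs and Handles: a
    # stable identifier resolves through a registry to the current URL.
    registry = {
        "purl:example/12345": "http://server1.example.org/docs/12345.pdf",
    }

    def resolve(persistent_id):
        """Return the current location for a persistent identifier."""
        return registry[persistent_id]

    def relocate(persistent_id, new_url):
        """Record a move; citations using the persistent ID stay valid."""
        registry[persistent_id] = new_url

    relocate("purl:example/12345",
             "http://server2.example.org/archive/12345.pdf")
    print(resolve("purl:example/12345"))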

A multifaceted identification system is used by the American Astronomical Society (AAS). Name resolution is used instead of URLs. In addition, the AAS uses astronomy’s standard identifier, called a "Bibcode", which has been in use for fifteen years. In the spring of 1999, AAS added PubRef numbers (a linkage mechanism originally developed by the U.S. National Library of Medicine); other identifiers can be added as needed to maintain links.

4.4 Storage

Storage is often treated as a passive stage in the life cycle, but storage media and formats have changed repeatedly, and legacy information has sometimes been lost forever as a result. Block sizes, tape sizes, tape drive mechanisms and operating systems have all changed over time. Most organizations that responded to the question about the periodicity of media migration anticipate a 3-5 year cycle.

The most common solution to this problem of changing storage media is migration to new storage systems. This is expensive, and there is always concern about the loss of data or problems with the quality when a transfer is made. Check algorithms are extremely important when this approach is used.
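
A minimal sketch of this safeguard: fixity is computed before the file leaves the old medium and verified after it lands on the new one. The paths and the choice of hash are illustrative.

    # A minimal sketch of a "check algorithm" guarding a media migration:
    # compute a digest before copying, verify it after.
    import hashlib
    import shutil

    def checksum(path):
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(1 << 20), b""):
                digest.update(block)
        return digest.hexdigest()

    def migrate(src, dst):
        before = checksum(src)
        shutil.copyfile(src, dst)
        if checksum(dst) != before:
            raise IOError(f"integrity failure migrating {src} -> {dst}")
        return before  # store alongside the object's metadata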

The most rigorous media migration practices are in place at the data centers. The Atmospheric Radiation Measurement (ARM) Center at the Oak Ridge National Laboratory plans to migrate to new technologies every 4-5 years. During each migration, the data is copied to the new technology. Each migration will require 6-12 months. According to Ray McCord of the ARM Center, "This is a major effort and may become nearly continuous as the size [of the archive] increases."

4.5 Preservation

Preservation is the aspect of archival management that preserves the content as well as the look and feel of the digital object. While the study showed that there is no common agreement on the definition of long-term preservation, the time frame can be thought of as long enough to be concerned about changes in technology and changes in the user community. Depending on the particular technologies and subject disciplines involved, the project managers interviewed estimated the cycle for hardware/software migration at 2-10 years.

4.5.1 Hardware and Software Migration

New releases of databases, spreadsheets, and word processors can be expected at least every two to three years, with patches and minor updates released more often. While software vendors generally provide migration strategies or upward compatibility for some generations of their products, this may not be true beyond one or two generations. Migration is not guaranteed to work for all data types, and it becomes particularly unreliable if the information product has used sophisticated software features. There is generally no backward compatibility, and where backward conversion is possible, the result almost certainly loses some integrity.

Plans are less rigorous for migrating to new hardware and applications software than for storage media. In order to guard against major hardware/software migration issues, organizations try to procure mainstream commercial technologies. For example, both the American Chemical Society and the U.S. Environmental Protection Agency purchased Oracle not only for its data management capabilities but also for the company's longevity and its influence on standards development. Unfortunately, this level of standardization and ease of migration is not as readily available among technologies used in specialized fields, where niche systems are required because of the interfaces to instrumentation and the volume of data to be stored and manipulated.

Emulation, which encapsulates the behavior of the hardware/software with the object, is being considered as an alternative to migration. For example, an MS Word 2000 document would be labeled as such, with accompanying metadata indicating how to reconstruct such a document at the engineering (bits and bytes) level. An alternative to encapsulating the software with every instance of the data type is to create an emulation registry that uniquely identifies the hardware and software environments and provides information on how to recreate the environment in order to preserve the use of the digital object. [Heminger 1998; Rothenberg 1999]
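
Such a registry might, in the simplest case, amount to objects carrying a key into a shared table of environment descriptions rather than a full copy of the environment itself. The sketch below is purely illustrative; as noted next, no such registry yet exists.

    # A minimal sketch of the emulation-registry idea: each archived
    # object carries a key into a registry describing how to recreate
    # its hardware/software environment. All entries are hypothetical.
    EMULATION_REGISTRY = {
        "msword-2000-win98-x86": {
            "software": "Microsoft Word 2000",
            "os": "Windows 98",
            "cpu": "Intel x86",
            "recreate": "run Word 2000 under an x86/Windows 98 emulator",
        },
    }

    archived_object = {
        "id": "doc-0042",
        "bitstream": "doc-0042.bin",
        "environment": "msword-2000-win98-x86",  # a key, not a full copy
    }

    print(EMULATION_REGISTRY[archived_object["environment"]]["recreate"])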

At this time, there is no system in place to provide the extensive documentation and emulation information required for this approach to be operable, particularly to allow an archive to deal with the variety of older technologies. Most importantly, there is no policy that requires the manufacturers to deposit the emulation information. The best practice for the foreseeable future will be migration to new hardware and software platforms; emulation will begin to be used if and when the hardware and software industries begin to endorse it.

4.5.2 Preservation of the Look and Feel

At the specific format level, there are several approaches used to save the "look and feel" of material. For journal articles, the majority of the projects reviewed use image files (TIFF), PDF, or HTML. TIFF is the most prevalent for those organizations that are involved in any way with the conversion of paper backfiles. For example, JSTOR, a non-profit organization that supports both storage of current journal issues in electronic format and conversion of back issues, scans everything from paper into TIFF and then runs OCR over the TIFF image. Because the OCR cannot achieve 100% accuracy, its output is used only for searching; the TIFF image is the actual delivery format that the user sees. However, this does not allow the embedded references to be active hyperlinks.

HTML/SGML (Standard Generalized Mark-up Language) is used by many large publishers after years of converting publication systems from proprietary formats to SGML. The American Astronomical Society (AAS) has a richly encoded SGML format that is used as the archival format from which numerous other formats and products are made [Boyce 1997]. The SGML version that is actually stored by the publisher is converted to HTML. PDF versions can also be provided by conversion routines.

For purely electronic documents, PDF is the most prevalent format. It provides a replica of the PostScript rendering of the document, but relies upon proprietary encoding technologies. PDF is used both for formal publications and for grey literature. The National Library of Sweden transforms dissertations that are received in formats other than PDF into PDF and HTML. While PDF is increasingly accepted, concerns remain about its suitability for long-term preservation, and because of its proprietary nature it may not be accepted as a legal deposit format.

Preserving the "look and feel" is difficult in the text environment, but it is even more difficult in the multimedia environment, where there is a tightly coupled interplay between software, hardware and content. The U.S. Department of Defense DITT Project is developing models and software for the management of multimedia objects. Similarly, the University of California at San Diego has developed a model for object-based archiving that allows various levels and types of metadata with distributed storage of various data types. The UCSD work is funded by the U.S. National Archives and Records Administration and the U.S. Patent and Trademark Office.

4.5.3 Transformation vs. Native Formats

A key preservation issue is the format in which the archival version should be stored. Transformation is the process of converting the native format to a standard format. On the whole, the projects reviewed favored storage in native formats. However, there are several examples of data transformation. The AAS and ACS transform incoming files into an SGML-tagged ASCII format. The AAS believes that "The electronic master copy, if done well, is able to serve as the robust electronic archival copy. Such a well-tagged copy can be updated periodically, at very little cost, to take advantage of advances in both technology and standards. The content remains unchanged, but the public electronic version can be updated to remain compatible with the advances in browsers and other access technology." [Boyce 1997]

The data community also provides some examples of data transformation. For example, the NASA Distributed Active Archive Centers (DAACs) transform incoming satellite and ground-monitoring information into the standard Common Data Format. The U.K.'s National Digital Archive of Datasets (NDAD) transforms the native format into one of its own devising, since NDAD could not find an existing standard that dealt with all of its metadata needs. These transformed formats are considered to be the archival versions, but the bit-wise copies are retained, so that someone can replicate what the center has done.
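
A minimal sketch of this practice: ingest produces a transformed archival version while retaining the original bitstream and a checksum of it, so the transformation can later be audited or redone. The transform here is a toy placeholder, not a real format converter.

    # A minimal sketch of transform-and-retain ingest: the archival copy
    # is derived, but the bit-wise original is kept with its digest.
    import hashlib

    def ingest(native_bytes, transform):
        original_digest = hashlib.sha256(native_bytes).hexdigest()
        archival_bytes = transform(native_bytes)
        return {
            "original": native_bytes,          # bit-wise copy retained
            "original_sha256": original_digest,
            "archival": archival_bytes,        # standard-format version
        }

    record = ingest(b"raw instrument frame", lambda b: b.upper())  # toy
    print(record["original_sha256"])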

In some countries, there are intellectual property questions related to native versus transformed formats. According to Canadian Copyright Law, an author’s rights are infringed if the original work is "distorted, mutilated or otherwise modified." After much discussion, the NLC decided that converting an electronic publication to a standard format to preserve the quality of the original and to ensure long-term access does not infringe on the author's right of integrity. However, this assumption has not been tested in court.

4.5.4 Standards and Interoperability

One of the paradoxes of the networked environment is that in an environment so dynamic and open to change, there is a greater and greater emphasis on standards. The projects that have been archiving for a long period indicated that while they started out with a large number of incoming formats -- primarily textual -- the number of formats has decreased. DOE OSTI began its project with a limited number of acceptable input formats, because there were so many different native formats. In the political environment of that time, it was difficult to gain support for the standardization of word processing packages. However, documents are currently received in only a few formats. Text is received in SGML (and its relatives HTML and XML), PDF (Normal and Image), WordPerfect and Word. Images are received in TIFF Group 4 and PDF Image.

Market forces have reduced the number of major word processing vendors. To a lesser extent, consolidation has occurred among spreadsheet and database formats. However, there is less consistency in the modeling, simulation and special-purpose software areas; much of this software continues to be specific to the project. The emphasis in these areas is therefore on the development of standards for interoperability and data exchange (e.g., the Open GIS Consortium for interoperability between geographic information systems), recognizing that market forces may not play as large a role here as with more general-purpose software applications.

4.6 Access

The previous life cycle functions that have been discussed are performed for the purpose of ensuring continuous access to the material in the archive. Successful practices must consider changes to access mechanisms, as well as rights management and security requirements over the long term.

4.6.1 Access Mechanisms

Most project managers interviewed consider the access and display mechanisms to be another source of change in the digital environment. Today it is the Web, but there is no way of knowing what it might be tomorrow. It may be possible in the future to enhance the quality of presentation of items from the digital archive based on advances in digitization and browser technologies. The U.S. National Library of Medicine's (NLM) Profiles in Science product creates an electronic archive of the photographs, text, videos, etc. that are provided by donors to the project. This electronic archive is used to create new access versions as the access mechanisms change. However, the originals are always retained. Project manager Alexa McCray stated that "The evolution of technology has shown that whatever level of detail is captured in the conversion process, it will eventually become insufficient. New hardware and software will make it possible to capture and display at higher quality over time. It is always desirable to capture and recapture using the original item."

4.6.2 Rights Management and Security Requirements

One of the most difficult access issues for digital archiving involves rights management. What rights does the archive have? What rights do various user groups have? What rights has the owner retained? How will the access mechanism interact with the archive’s metadata to ensure that these rights are managed properly? Rights management includes providing or restricting access as appropriate, and changing the access rights as the material’s copyright and security level changes.

Security and version control also impact digital archiving. Brewster Kahle raises many interesting questions concerning privacy and "stolen information," particularly since the Internet Archive policy is to archive all sites that are linked to one another in one long chain [Kahle 1997]. Similarly, there is concern among image archivists that images can be tampered with without the tampering being detected. Particularly in cases where conservation issues are at stake, it is important to have metadata to manage encryption, watermarks, digital signatures, etc. that can survive changes in the format and media on which the digital item is stored.
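
As one illustration, a keyed digest stored in the object's metadata (standing in here for a digital signature or watermark) makes tampering evident: altered bytes no longer verify, and the seal travels with the metadata across format and media changes. The key and identifiers below are hypothetical.

    # A minimal sketch of tamper evidence for archived images: a keyed
    # digest is stored at ingest and re-verified on access or migration.
    import hashlib
    import hmac

    ARCHIVE_KEY = b"hypothetical-archive-signing-key"

    def seal(image_bytes):
        return hmac.new(ARCHIVE_KEY, image_bytes, hashlib.sha256).hexdigest()

    metadata = {"id": "photo-17", "seal": seal(b"...image bytes...")}

    def verify(image_bytes, record):
        if not hmac.compare_digest(seal(image_bytes), record["seal"]):
            raise ValueError("possible tampering: seal does not verify")

    verify(b"...image bytes...", metadata)  # altered bytes would fail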

5.0 Conclusions

Within the sciences, there are a variety of digital archiving projects that are at the operational or pilot stage. A review of the cutting-edge projects shows the beginning of a body of best practices for digital archiving across the stages of the information life cycle.

Standards for creating digital objects and metadata description, which specifically address archiving issues, are being developed at the organization and discipline levels. Regardless of whether acquisition is done by human selection or automated gathering software, there is a growing body of guidelines to support questions of what to select, the extent of the digital work, the archiving of related links and refreshing the contents of sites. Standards for cataloging and persistent, unique identification are important in order to make the material known to the archive administration. A variety of metadata formats, content rules and identification schemes are currently in use, with an emphasis on crosswalks to support interoperability, while standardizing as much as possible. Issues of storage and preservation (maintaining the look and feel of the content) are closely linked to the continuous development of new technologies. Current practice is to migrate from one storage medium, hardware configuration and software format to the next. This is an arduous and expensive process that may be eliminated if emulation strategies are developed among standards groups and hardware and software manufacturers. Access mechanisms, being hardware and software based, have their own migration issues. In addition, there are concerns about rights management, security and version control at the access and re-use stage of the life cycle.

While there are still many issues to be resolved and technology continues to develop apace, there are hopeful signs that the early adopters in the area of digital archiving are providing lessons learned that can be adopted by others in the stakeholder communities. Through the collaborative efforts of the various stakeholder groups -- creators, librarians, archivists, funding sources, and publishers -- and the involvement of information managers, a new tradition of stewardship will be developed to ensure the preservation of, and continued access to, our scientific and technological heritage.

Acknowledgments

Grateful acknowledgment is given to the project managers who contributed their time and expertise, and to ICSTI and CENDI for sponsoring the study on which this article is based. ICSTI also sponsored the writing of this article.

References

[Beagrie 1998] Neil Beagrie and Daniel Greenstein. "A Strategic Policy Framework for Creating and Preserving Digital Collections." July 14, 1998. <http://www.ahds.ac.uk/manage/framework.htm>

[Boyce 1997] Peter Boyce. "Costs, Archiving, and the Publishing Process in Electronic STM Journals." Against the Grain, v. 9 #5, November 1997, p. 86. <http://www.aas.org/~pboyce/epubs/atg98a-2.html>

[CCSDS 1998] Consultative Committee for Space Data Systems. "Reference Model for an Open Archival Information System (OAIS): Recommendation Concerning Space Data Systems Standards." White Book CCSDS 650.0-W-4.0, September 17, 1998. <http://ssdoo.gsfc.nasa.gov/nost/isoas/ref_model.html>

[CENDI] <http://www.dtic.mil/cendi/>

[ICSTI] <http://www.icsti.org/>

[ICSTI 1998] "The Electronic Publications Archive -- Report of a Working Group of the International Council for Scientific and Technical Information." December 1998.

[Garrett 1996] John Garrett and Donald Waters. "Preserving Digital Information: Report of the Task Force on Archiving of Digital Information." Commissioned by the Commission on Preservation and Access and the Research Libraries Group, Inc. 1996. <http://www.rlg.org/ArchTF/tfadi.index.htm>

[Haynes 1997] David Haynes, David Streatfield, Tanya Jowett and Monica Blake. "Responsibility for Digital Archiving and Long Term Access to Digital Data." JISC/NPO Studies on Preservation of Electronic Materials. 1997. <http://www.ukoln.ac.uk/services/papers/bl/jisc-npo67/digital-preservation.html>

[Hedstrom 1998] Margaret Hedstrom and Sheon Montgomery. "Digital Preservation Needs and Requirements in RLG Member Institutions." A study commissioned by the Research Libraries Group. December 1998. <http://www.rlg.org/preserv/digpres.html>

[Helsinki University Library] Helsinki University Library and Center for Scientific Computing in Finland. "Functional and Technical Requirements for Capturing On-line Documents (EVA-Project)." No date.

[Heminger 1998] Alan R. Heminger and Steven B. Robertson. "Digital Rosetta Stone: A Conceptual Model for Maintaining Long-term Access to Digital Documents" November 21, 1998. <http://tuvok.au.af.mil/au/database/research/ay1996/afit_la/rober_sb.htm>

[Hodge 1999] Gail Hodge. "Digital Electronic Archiving: The State of the Art, The State of the Practice." April 26, 1999. <http://www.icsti.org>

[Kahle 1997] Brewster Kahle. "Preserving the Internet." Scientific American, March 1997. <http://www.sciam.com/0397issue/0397kahle.html>

[Kuny 1998] Terry Kuny. "The Digital Dark Ages? Challenges in the Preservation of Electronic Information." International Preservation News, No. 17, May 1998. <http://www.ifla.org/VI/4/news/17-98.htm#2>

[NLC 1998] National Library of Canada, Electronic Collections Coordinating Group. Networked Electronic Publications Policy And Guidelines, October 1998. <http://www.nlc-bnc.ca/pubs/irm/eneppg.htm>

[NLA] National Library of Australia. "Selection of Online Australian Publications Intended for Preservation by the National Library of Australia." <http://www.nla.gov.au/scoap/guidelines.html>

[NRC 1995] National Research Council. Preserving Scientific Data on Our Physical Universe: A New Strategy for Archiving the Nation's Scientific Information Resources. 1995.

[National Library of Sweden] Royal Library. National Library of Sweden. "Kulturarw3." <http://kulturarw.kb.se/html/kulturarw3.eng.html>

[Rothenberg 1999] Jeffrey Rothenberg. "Avoiding Technological Quicksand: Finding a Viable Technical Foundation for Digital Preservation." Report to CLIR, January 1999. <http://www.clir.org/pubs/reports/rothenberg/contents.html>

Copyright © 2000 Conseil International Pour L'Information Scientifique et Technique [International Council for Scientific and Technical Information]


DOI: 10.1045/january2000-hodge