D-Lib Magazine

William G. LeFurgy [1]
Abstract

Digital preservation research has made important gains in recent years, and the capability of libraries and archives to manage digital collections continues to grow. This is obviously good news in that an expanded body of digital records, publications, and other objects will be preserved and made available. What is less obvious is that there is no magic bullet in the offing for dealing with all permanent digital materials: only a fraction will meet the necessary conditions for optimal preservation and use. Materials that deviate from these conditions can still be included in digital collections, but finding and using them will be more difficult, perhaps substantially so. This article outlines the conditions that govern the persistence of digital materials and suggests a model for future levels of service for digital repositories.

Background

Over the course of a 25-year career working with historical documents and publications, I have spent my share of time worrying about the problems posed by materials in digital form. From my first job as a manuscripts curator at the Maryland Historical Society, it was clear that digital materials would eventually have to be included in the holdings of many repositories, both because the technology held great promise for facilitating research and because certain recorded information would exist in no other form. This sense of inevitability only grew as I moved to the Baltimore City Archives and then to the U.S. National Archives and Records Administration. The worry came from wondering how to preserve digital materials through technological change and keep them easily available to researchers. There seemed to be no easy answer, and my personal experience (such as getting "404 Page Not Found" responses when clicking some hyperlinked footnotes) did not instill much confidence that a solution was imminent. I could only assume that technology itself would eventually provide the means for repositories to overcome all the problems associated with digital materials.

My work over the last several years in helping federal agencies manage a diverse and ever-growing array of digital materials has led to a rethinking of this assumption. It became obvious that some digital materials presented more problems than others in terms of keeping them available for the long term. Results from digital preservation research also began to indicate that the future held an uneven promise for various kinds of materials. What this all means is that digital holdings for many repositories will not be equal in terms of how effectively they can be preserved and made available. The easiest way for me to think about it is as differences in levels of service: repositories will be able to do more with certain kinds of digital materials and less with others. This has implications for evaluating materials for potential permanent retention as well as for conceptualizing systems that will manage and preserve such materials.

The concept of levels of service reduces my worry. While it does affirm that poorly accessible materials will always be with us, it also provides a path beyond the monolithic view of digital materials in which all preservation and access challenges are grouped into one insoluble problem. By teasing apart the underlying strands of the issue, it is possible to envision some practical solutions.

Introduction

Archivists, librarians, and others with an interest in preserving and making available digital information face an impending paradox.
This stems from the prospect of developing solutions for long-term management of digital records, publications, and other objects. Once in place, these solutions will help fill a pressing need for repositories to preserve and make available significant information in electronic form. But as digital materials are increasingly acquired, it will become obvious that not all can be equally preserved and used. Digital materials vary in how they are constructed, organized, and described, and these factors will play a huge role in determining preservation and access possibilities, even when advanced systems, technologies, and techniques are available to repositories.

Current research indicates that digital materials can be managed independent of specific technology. "Persistence" is the term used to indicate the degree to which this is possible [2]. For complete persistence, materials must adhere to strict conditions regarding their construction and description. These conditions make it possible to use technology to dynamically recreate a digital object based on explicit and consistent rules defining the object's content, context, and structure. But in a world where few standards govern the technical construction of a digital item (a report can exist in any one of a dozen common file formats) and fewer still govern how an item is described (the report may or may not identify an author, date of issue, or other descriptors), it is realistic to expect that many materials will not fully meet the rigorous conditions for persistence. This likely will remain the case even when user-friendly tools are established for creating persistent digital materials. Inevitably, materials will vary in their degree of compliance with established rules.

Looking ahead, digital collections can be seen as falling into three levels: optimal, enhanced, and minimal. The optimal level will consist of fully persistent digital materials that can be placed in an information technology architecture that permits their maintenance in perpetuity without significant alteration of content, structure, or any other significant characteristic. Such materials will also retain their original context (e.g., their relationships among themselves and with other materials), and they will remain discoverable through multiple attributes. The enhanced level will have materials that possess some persistent qualities but lack others. Perhaps the structural rules are variable or the metadata are incomplete, but the materials will nevertheless permit a degree of continuing preservation and discoverability. The minimal level will be populated by digital materials that have few, if any, persistent characteristics. They might consist of loosely structured files in various native formats with minimal metadata; preserving their significant characteristics and making them discoverable will be difficult. These levels will dictate the extent to which a repository can manage collections of digital materials and make them available to users.

Conditions Required for Persistence

Broadly speaking, persistence requires two parts. The first is an architecture that defines the system that will acquire, manage, preserve, and provide access to digital materials in a repository (or among repositories). The second is a specification for the materials that will go into the system. The most influential conceptual construct for both parts is the Reference Model for an Open Archival Information System (OAIS).
The OAIS model outlines a design where digital materials are placed into a package with three basic elements:

- content information, the digital object itself along with the representation information needed to interpret it;
- preservation description information, covering matters such as provenance, context, fixity, and identification; and
- packaging information, which binds the other elements into a single identifiable package.
Materials are transmitted to repositories through Submission Information Packages (SIPs). Use of SIPs enables persistence: they implement decisions regarding the essential characteristics of digital materials and provide for preservation and access in a manner that is independent of specific technology [3]. The OAIS model depends upon construction of SIP elements according to detailed, rigorous, and transparent rules, since this is the only way an automated system can effectively manage and manipulate digital information. For optimal performance, each individual digital object within a package must be consistently described and structured. It is possible to modify existing materials (adding metadata, converting to different formats, and so forth) to build SIPs, but this is labor intensive and may raise questions about the authenticity of the materials. The preferred approach would be to build the rules and consistency needed for SIPs into the technology used to create digital materials. Creation would have to occur under an enduring, widely accepted, and carefully controlled process, which is a radical departure from current practice.

Extensible Markup Language (XML), a universal format for structured documents and data, offers a practical demonstration of creating materials under the controlled process needed for SIP elements. XML allows for highly reliable abstraction of a digital object's significant properties, such as structure, formatting, and contextual relationships. The abstractions are provided through Document Type Definitions (DTDs) or schemas, which are expressed rules governing how materials are constructed and presented. These rules "allow machines to carry out rules made by people. They provide a means for defining the structure, content and semantics" of digital materials [4]. Methods apart from XML can also be used to create persistent materials, but whatever method is chosen must be capable of generating materials according to known rules that an information technology architecture can manage into the future.

The basic ideas behind OAIS are represented in many of the most promising digital preservation research projects. The Library of Congress National Digital Library Program, for example, relies on creating exacting electronic reproductions using standard formats and assigned metadata [5]. The Australian Victorian Electronic Records Strategy uses highly controlled methods of electronic document creation and description to enable archival management [6]. The CURL Exemplars for Digital Archives (CEDARS) project in the United Kingdom is based directly on the OAIS model and involves, among many other activities, developing practices for creating detailed representation information about the significant properties of digital objects [7]. CEDARS is also exploring emulation as a means of digital preservation. Emulation involves developing encapsulated packages that contain a specification for recreating an original computer application in order to view and interact with objects created by that application [8]. The specification could be an abstraction of the original software code and related documentation, or it could be based on a customized emulator built to provide access via a host platform. In either case, the specification provides the expressed rules needed for technological independence [9].
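To make the idea of expressed, machine-actionable rules more concrete, the following sketch assembles a SIP-like XML package for a single file, recording content, descriptive metadata, and a fixity value under an explicit structure. It is only an illustration under assumed conventions: the element names, the metadata fields, and the build_sip function are invented for this example and are not drawn from the OAIS specification or from any of the projects cited above.

import hashlib
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from pathlib import Path

def build_sip(path: str, title: str, creator: str) -> ET.ElementTree:
    """Wrap one digital object in a simple, rule-governed XML package (illustrative only)."""
    source = Path(path)
    data = source.read_bytes()

    package = ET.Element("package", attrib={"version": "0.1"})

    # Content information: the bitstream's name and the format it claims to be.
    content = ET.SubElement(package, "content")
    ET.SubElement(content, "file").text = source.name
    ET.SubElement(content, "format").text = source.suffix.lstrip(".") or "unknown"

    # Descriptive metadata: the attributes that keep the object discoverable.
    metadata = ET.SubElement(package, "metadata")
    ET.SubElement(metadata, "title").text = title
    ET.SubElement(metadata, "creator").text = creator
    ET.SubElement(metadata, "packaged").text = datetime.now(timezone.utc).isoformat()

    # Preservation description: a fixity value so later custodians can verify
    # that the content has not been altered.
    fixity = ET.SubElement(package, "fixity", attrib={"algorithm": "SHA-256"})
    fixity.text = hashlib.sha256(data).hexdigest()

    return ET.ElementTree(package)

# Hypothetical usage (the file name is invented for the example):
# build_sip("annual_report.pdf", "Annual Report", "Example Agency").write(
#     "annual_report_sip.xml", encoding="utf-8", xml_declaration=True)

In practice a package like this would be validated against a published DTD or schema; it is that step that makes the rules enforceable by machine rather than by convention.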
In developing its Electronic Records Archives (ERA), the U.S. National Archives and Records Administration also looks directly to the OAIS model. The ERA project is focusing on persistent object preservation, which involves managing digital objects with clearly defined structures and metadata to permit ongoing access and retrieval [10]. All of these efforts focus on work with highly persistent digital materials: that is, materials whose context, content, and structure are transparent and well defined.

The bottom line is that we are headed for a future where digital materials that conform to exacting rules can be effectively preserved and accessed. But these materials will constitute only a small fraction of the overall universe of digital information. Nearly all the digital materials now in existence, and many of those yet to be created, do not use clear and consistent rules (proprietary software depends on hiding many of the code-based rules used to structure and display objects, for example) and thus will not be easily managed through applications of the OAIS model. Yet vast quantities will have value that warrants continued preservation, even if they are in a persistently non-persistent form.

Levels of Service Defined

The scenario outlined above will require many digital repositories to adopt a strategy of providing different levels of service for different parts of their collections. Levels of service can best be thought of as a matrix, with one set of values determined by the available technology and the other set determined by the degree to which digital materials have persistent qualities. The first set depends on incremental development of new and improved tools and processes and can be seen evolving as follows:
The second set of values is tied to the degree to which digital materials are persistent (based on consistent and transparent rules for description and structure, standardized file formats, and so forth). In general terms, degrees of persistence can be represented by three categories:
Given that persistence is closely tied to the clarity and consistency of the rules used by digital materials, it follows that highly structured materials tend to be inherently easier to preserve and access over time. Conversely, less structured materials tend to be harder to manage. Another way to categorize inherent persistence is whether the materials are homogeneous (closely tied to known and consistent rules regarding structure, technical parameters, and metadata) or heterogeneous (not closely tied to known and invariable rules).

For some homogeneous materials the rules are completely unambiguous, such as those used by delimited ASCII (along with associated metadata) to represent a database file. Because the rules are so clear, the technology and processes needed to preserve and access the file are comparatively simple, and the file can be kept available in perpetuity. Other bodies of homogeneous materials are tied to rules that are less explicit but that are known and consistent to some minimal extent. For example, if materials are in a format that will remain accessible far into the future, and if the metadata are sufficient, they can have some degree of persistence.

Materials that are not connected to transparent, consistent rules are heterogeneous. Most often with heterogeneous collections the rules are varied, unclear, or both. There could be a mix of file formats based on different operating systems, or a jumble of methods used to structure file content. The files on most personal computer hard drives are a good example: they typically contain a mix of spreadsheets, word processing documents, e-mail, images, and other formats and types. Another example is most World Wide Web sites, which are made up of HTML documents, graphic and audio files, Java and CGI scripts, and other highly variable elements. Heterogeneous materials generally have low persistence. Since each object can differ from the next in unpredictable ways, effective and efficient preservation and access are difficult. There are options for converting heterogeneous materials to more homogeneous forms, but this is not always practical, in terms of both cost and record integrity. Heterogeneous materials can be preserved as a stream of bits, but they will generally become difficult to use over time as file formats grow obsolete and other rules become increasingly opaque.
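As a rough way to make the homogeneous/heterogeneous distinction operational, the hedged sketch below profiles the file formats under a directory and flags a collection as homogeneous when a single format dominates. The threshold, the function names, and the reliance on file extensions are assumptions made for illustration, not a tested appraisal method.

from collections import Counter
from pathlib import Path

def format_profile(root: str) -> Counter:
    """Count file extensions beneath a directory tree."""
    return Counter(
        p.suffix.lower() or "(no extension)"
        for p in Path(root).rglob("*")
        if p.is_file()
    )

def looks_homogeneous(profile: Counter, threshold: float = 0.9) -> bool:
    """Treat a collection as homogeneous if one format accounts for most of it."""
    total = sum(profile.values())
    if total == 0:
        return False
    _, dominant_count = profile.most_common(1)[0]
    return dominant_count / total >= threshold

# Hypothetical usage: a directory of delimited ASCII exports would score as
# homogeneous; a typical hard drive or harvested web site would not.
# profile = format_profile("/data/accession_2002_05")
# print(profile.most_common(5), looks_homogeneous(profile))

A real appraisal would look past extensions to the rules inside the files, but even this crude profile suggests why a set of delimited ASCII tables is easier to plan for than the contents of a typical hard drive or web site.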
The diagram below provides a graphical model of how levels of service will likely evolve.

Levels of Service Over Time

As Phase I of the diagram indicates, repositories now have two basic service options for digital materials. Enhanced service is possible for some homogeneous materials (such as ASCII delimited data), while minimal service is available for other materials. Change will come, however, as a result of new technology and techniques generated by vendors and by the library and archival communities. This overall process of advancement is depicted in the diagram through the first "cloud," which represents the research and technology that will yield improved solutions.

One predictable outcome of this first cloud will be improved service for homogeneous digital materials. Wider collaborative use of markup languages and associated schemas, for example, will expand the categories and formats of digital materials for which repositories can provide an enhanced level of service. Projects such as CEDARS and ERA, among others, will also lead to much greater understanding of the processes and technologies necessary for building persistent information technology architectures. Phase II indicates the general outcomes of these advances. Highly structured homogeneous materials (the green arrows) will continue to occupy the top service level, and the most significant change will be the capability to provide better service for other homogeneous materials (the blue arrows). Note that the blue arrows go to both the enhanced and the minimal levels, since the improvements will be uneven in relation to all the materials potentially eligible for acquisition. This split would occur, for example, if a uniform XML-based process were used to generate some federal government reports while other federal reports were created in a less persistent manner. The XML-associated reports could reside in a higher level of service than the other reports, which would continue to occupy the minimal level. All heterogeneous materials (the red arrows) would also remain in the minimal level of service.

Progression to Phase III will require additional improvements, which are represented in the diagram as flowing from a second cloud. The primary feature of this phase will be wide availability of persistent materials that can be effectively managed in an integrated architecture. With green, blue, and red arrows pointing to the optimal level, the model posits that all varieties of digital materials will have the potential for robust preservation and access. But less persistent materials will also continue to occupy lower levels of service. These materials will include vast stores of legacy data as well as more current items that, for one reason or another, lack complete persistence.

It is difficult to say what percentage of files will be associated with any particular level of service under any of the phases. That will depend on what comes out of the two clouds, most particularly with respect to changes in how stringently digital materials are created. One fact is certain: getting large quantities of materials into the enhanced and optimal service levels will require dramatic change in how digital materials are now produced and maintained.

It will be possible to use tools and processes to modify or manipulate digital materials to move them into higher levels of service. This could involve converting to different file formats, reformatting content, attaching metadata, or "wrapping" files in some kind of software container. Emulators or viewers may also enable suitable access to native formats.
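To make the "wrapping" option concrete, the hedged sketch below copies a file unchanged into a container directory and writes a sidecar record of its checksum, original name, and date of wrapping. The layout and field names are invented for this example (loosely in the spirit of bag-style containers) and do not represent an established container format.

import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path

def wrap(source: str, container: str) -> Path:
    """Copy a file into a container directory with a sidecar metadata record."""
    src = Path(source)
    dest_dir = Path(container)
    dest_dir.mkdir(parents=True, exist_ok=True)

    # Preserve the original bitstream untouched; integrity questions are easier
    # to answer when nothing about the content is altered.
    shutil.copy2(src, dest_dir / src.name)

    record = {
        "original_name": src.name,
        "sha256": hashlib.sha256(src.read_bytes()).hexdigest(),
        "wrapped": datetime.now(timezone.utc).isoformat(),
        "note": "content copied without conversion",
    }
    sidecar = dest_dir / (src.name + ".metadata.json")
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return dest_dir

# Hypothetical usage (file and directory names are invented):
# wrap("minutes_1998.wpd", "wrapped/minutes_1998")

Because the content is copied rather than converted, a wrapper of this kind leaves questions of record integrity largely untouched; format migration or emulation would require considerably more analysis.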
Before undertaking such work, it will be necessary to analyze a number of factors, including the extent to which file integrity could be harmed and how much the effort will cost. Some categories of files will have high enough value to warrant significant effort to move them into higher service levels, but for the foreseeable future many files will likely have no mobility among levels of service.

Conclusion

There is much to be optimistic about with regard to digital preservation. Technologies and processes are on the horizon (or in some cases already here) that will enable libraries and archives to do much better in terms of keeping and servicing digital materials. As important as this development is, however, it raises questions about how repositories will cope with the potentially enormous quantity of materials, both legacy and more contemporary, that will not easily fit into higher levels of service. These issues range from making decisions about timetables for acquiring digital materials, to planning for operational systems, to overall expectations for digital preservation in general.

Given that the initial capabilities of emerging preservation systems will be oriented toward homogeneous materials with uniform, well-defined rules, it makes sense to investigate policies and methods that encourage expanded creation of such materials. Structured markup languages such as XML might prove to be the solution, particularly if uniform schemas are widely and consistently used. The greater the homogeneity, the better the level of service. The path is less clear for more heterogeneous materials, despite the fact that they currently comprise the vast majority of all digital materials in existence. Technology will certainly offer better opportunities for doing more with such materials, but they will lag behind in terms of service. It is clear, however, that archives and libraries will need to make plans for coping with materials that lie on the continuum between optimal and minimal serviceability.

The need to contend with varied levels of service does not, of course, in any way diminish the urgency of bringing digital materials under library and archival control. If anything, the prospect of inevitable differences in serviceability should cause repositories to reexamine strategies that involve deferring responsibility for digital materials. Repositories may recognize an eventual need to manage such materials but hold out hope that some future technology will solve the problem. The problem, however, will resist a simple solution. Most digital materials now in existence, as well as those that will be created in the foreseeable future, will remain a challenge to manage for years to come, regardless of technological advances. The best course might well be to start capturing and managing appropriate digital materials now, with the expectation that the future will bring varied improvements in preservation and access options.

Notes and references

[1] The views and opinions expressed herein are those of the author and do not necessarily reflect those of the U.S. National Archives and Records Administration.

[2] Reagan Moore et al., "Collection-Based Persistent Digital Archives - Part 1," D-Lib Magazine, March 2000, Volume 6 Number 3, <http://www.dlib.org/dlib/march00/moore/03moore-pt1.html>.

[3] Consultative Committee for Space Data Systems, Reference Model for an Open Archival Information System, June 2001, pages 2-4 through 2-7, <http://ssdoo.gsfc.nasa.gov/nost/isoas/ref_model.html>.
[4] World Wide Web Consortium (W3C), <http://www.w3.org/XML/Schema>.

[5] Library of Congress National Digital Library Program, <http://memory.loc.gov/ammem/dli2/html>.

[6] Public Record Office Victoria (Australia), Victorian Electronic Records Strategy Final Report, <http://www.prov.vic.gov.au/vers/published/final.htm>.

[7] CURL Exemplars for Digital ARchiveS (CEDARS), <http://www.curl.ac.uk/projects/cedars.html>.

[8] Jeff Rothenberg, Avoiding Technological Quicksand: Finding a Viable Technical Foundation for Digital Preservation, A Report to the Council on Library and Information Resources, 1999, <http://www.clir.org/pubs/abstract/pub77.html>.

[9] David Holdsworth and Paul Wheatley, Emulation, Preservation and Abstraction, <http://129.11.152.25/CAMiLEON/dh/ep5.html>.

[10] Kenneth Thibodeau, "Building the Archives of the Future: Advances in Preserving Electronic Records at the National Archives and Records Administration," D-Lib Magazine, February 2001, Volume 7 Number 2, <http://www.dlib.org/dlib/february01/thibodeau/02thibodeau.html>.
DOI: 10.1045/may2002-lefurgy