D-Lib Magazine
July/August 1998

ISSN 1082-9873

Archiving Digital Cultural Artifacts

Organizing an Agenda for Action


Peter Lyman
School of Information and Management Systems
University of California
Berkeley, California
Plyman@sims.berkeley.edu


Brewster Kahle
Alexa Internet
San Francisco, California
Brewster@alexa.com

Both authors are directors of the Internet Archive.

Introduction

Our cultural heritage is now taking digital form, whether born digital or born-again by conversion to digital from other media. What will be the consequences of these changes in the nature of the medium for creating and preserving our cultural heritage? What new technological and institutional strategies must be invented to archive cultural artifacts and hand them on to future generations? We will explore these questions through the practical perspective gained in building Alexa Internet and the Internet Archive.

Our purpose is not so much to answer these questions in a definitive manner, but to organize a discussion between communities which must learn to work together if the problem of digital preservation and archiving is to be solved -- computer scientists, librarians and scholars, and policy makers. This paper, then, is the product of a dialogue between a computer scientist/entrepreneur and a political theorist/librarian, and represents an attempt to create a common agenda for action.

Defining the Problem

Who among us can read our old WordStar or VisiCalc files? Find our first email message? Or, on a larger scale, what happens to the history of science if we can't read the first data from the first interplanetary exploration by the Viking mission to Mars? The origins of the digital era are probably already lost, and millions must be spent if strategic government and corporate records are to survive the transition to the year 2000 on digital clocks. (See, for example, Business Week, "From Digits to Dust," April 20, 1998, pp. 128-129; see also U.S. News and World Report, February 16, 1998, "Whoops, there goes another CD-ROM" http://www.usnews.com/usnews/issue/980216/16digi.htm). Digital information is seemingly ubiquitous as a medium for communication and expression, and increasingly strategic for scientific discovery and for the records that constitute institutional order in a modern society. And yet it is at the same time fugitive: the pace of technical change makes digital information disappear before we realize the importance of preserving it. Like oral culture, digital information has been allowed to become a medium for the present, neither a record of the past nor a message to the future. Unless, that is, we redesign it now.

In exploring the consequences of digital artifacts and their use for the way we preserve and archive our culture, we will focus upon the World Wide Web. The Web is a born digital cultural artifact intrinsic to the digital/electronic environment that defines the parameters of the question in useful ways. The advantage of this example is that our experience with Alexa Internet and the Internet Archive provides practical experience with the archival problem. The disadvantage is that although now ubiquitous, the Web is only one example of a digital document in a rapidly changing environment. Moreover, in many ways the Web is modeled on print traditions, a publishing technology more than an unprecedented way of representing cultural expression. Other digital artifacts that might yield different kinds of insights are, for example, simulation software like SimCity, visualization and scenario software, Jurassic Park dinosaur animations, or collaboratories and virtual communities. Nevertheless, precisely because of the ubiquity of web-based information resources, and precisely because of their conceptual proximity to known artifacts whose archiving is well understood (i.e., books and manuscripts), looking at the web offers a comparatively powerful beginning point.

What Are Digital Cultural Artifacts?

Culture is something we do, a performance which fades into memory then disappears, but the record of culture consists of artifacts which we make, which persist but inevitably decay. Like other media, digital cultures are simultaneously performances and artifacts, although digital artifacts are profoundly different from physical artifacts. We will not attempt a formal definition of something that is still being shaped by experimentation and practice, except to describe some of the parameters differentiating digital artifacts from other kinds of cultural artifacts that may be useful in building digital archives. Most notably, while things occupy places (and are therefore always local), digital documents are electronic signals with local storage but global range. As things, digital cultural artifacts are dramatically different from those in other media, as illustrated by these estimates of size.

Type                             Example              Size When Digital
-------------------------------  -------------------  ------------------------------
Newspaper                        Wall Street Journal  100 MB/year (text)
Computer discussion              Netnews              300 GB/year
Television                       CNN News             1 GB/hour, 6 TB/year (compressed)
Radio                            WABC                 270 GB/year (uncompressed)
Internet publishing              World Wide Web       4 TB in 1997
Video rental store               Block Buster Video   9 TB
Research library                 Library of Congress  20 TB (text in all books)
Card catalog                     Library of Congress  17 GB
Branch library version of books  Palo Alto, CA        1.4 TB of scanned
Composer's work                  Mozart               100 MB?

Many other cultural performances and artifacts are entering the digital realm, such as classroom lectures, lecture notes and textbooks, scanned paintings, and government publications. Thus, digital documents are both ubiquitous, by virtue of their global range, and are a universal medium for archiving the record of culture, by virtue of their size and ability to represent cultural expressions in all other media. Relative to physical counterparts:

Two consequences for the problem of digital archives are worth noting. First, digital documents are at once tangible (representation in code) and at the same time intangible (the code is meaningless unless transmitted and represented), thus what must be preserved is the totality of a dynamic performance consisting of both text and context -- the unit of knowledge is the entire Web, over time. Secondly, digital cultural artifacts are not the property of cultural elites, for this medium is profoundly democratic -- millions of people are creating cultural artifacts in intangible forms, using computers and networks. Neither are they archived by the traditional cultural institutions organized and funded by cultural elites.

The World Wide Web as a Cultural Artifact

Although the Web is an original new medium for cultural expression, like all new modes of representation of knowledge the first experiments are likely to imitate the forms of past media. It is original because although other communication technologies are global, this one has no central control points (other, perhaps, than the definition of technical standards). It is new because its cultural expressions will be in multimedia (to use that redundant term), even though today its guiding metaphors are derived from print publication. We know very little about its character as a cultural artifact, both because it is new and because it is decentralized. The key questions about it are not to be answered in the nature of its artifacts alone, but in the emerging social forms which are made possible by these new media: What is a virtual community? A transnational financial market? A collaboratory? Who is "the public" on the Web? What is the nature of personal identity on the Web?

The first step in the archeology of the Web has been to use other kinds of cultural artifacts as guiding metaphors, as if it is a text or library, in order to understand its deep structures. These metaphors are useful, but limited.

However, these metaphors compare the Web to an institution -- a library or an archive -- rather than defining it as a new kind of cultural artifact that will require the invention of new kinds of institutions and management techniques. Described as a cultural artifact:

As with reading a book, every reading of the Web is a unique performance in which the user links information together; but unlike reading a book, every reading leaves a trail, which can be collected and archived. These links are the trails through an information wilderness. Alexa Internet is mapping these trails, trying to discover the structure of the Web by understanding how its information is used.

Alexa Internet's Web statistics sketch a remarkable picture of this new domain of cultural expression; the statistics cited below are from May 1997, by Z Smith (http://www.webtechniques.com/features/1997/05/burner/burner.shtml). See also the inventory of Internet statistics and demographics, at http://www.yahoo.com/Computers_and_Internet/Internet/Statistics_and_Demographics/.

Collectively, they begin to answer some baseline questions about the Web as a cultural artifact: What is the Web, described as a technical artifact? What is the Web, described in terms of the social functions of digital documents?

What is the Web, from a technical point of view? As of January 1997, one million Web site names were in common usage, on 450,000 unique host machines (of which 300,000 appear to be stable, 95% accessible at a given point in time), and there were 80 million HTML pages on the public Web. The figure is incomplete, because some sites are dynamic (generating unique pages in response to queries). The typical Web page had 15 hypertext links to other pages or objects and 5 embedded objects such as sounds or images. The typical HTML page was 5 KB, the typical image (GIF or JPEG) was 12 KB, the average object served via HTTP was 15 KB, and the typical Web site was about 20% HTML text and 80% images, sounds and executables (by size in bytes). The median size for a Web site was about 300 pages; only 50 sites had more than 30,000 pages; the 1000 most popular sites accounted for about half of all traffic. In mid-1997 it took about 400 GB to store the text of a snapshot of the public Web, and about 2 TB to store non-text files.
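As a rough consistency check on these figures, the size of a text snapshot can be recomputed from the page count and the typical page size. The sketch below is illustrative only: the function names are hypothetical, decimal units (1 GB = 1,000,000 KB) are an assumption, and the "typical" values are treated as if they were averages, which the statistics do not claim.

```python
# Figures quoted in the text for 1997; everything else here is an assumption.
HTML_PAGES = 80_000_000   # HTML pages on the public Web
TYPICAL_PAGE_KB = 5       # typical HTML page size, treated here as an average
HTML_SHARE = 0.20         # share of a typical site's bytes that is HTML text

KB_PER_GB = 1_000_000     # decimal units (assumption)


def text_snapshot_gb(pages, page_kb):
    """Estimate the size of all HTML text, in GB."""
    return pages * page_kb / KB_PER_GB


def non_text_gb(text_gb, html_share):
    """Estimate non-text bytes implied by the HTML share of a typical site."""
    return text_gb * (1 - html_share) / html_share


text_gb = text_snapshot_gb(HTML_PAGES, TYPICAL_PAGE_KB)
other_gb = non_text_gb(text_gb, HTML_SHARE)
print(f"text: about {text_gb:.0f} GB, non-text: about {other_gb:.0f} GB")
```

Under these assumptions the text estimate matches the roughly 400 GB quoted for a snapshot, and the 20/80 split implies on the order of 1.6 TB of non-text data, the same order of magnitude as the 2 TB figure.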

Because the Web is dynamic and seems to be doubling yearly, the typical Web page is only about two months old. The typical user downloads around 70KB of data for each HTML page visited, and visits 20 Web pages per day. One percent of all user requests result in '404, File Not Found' responses. After analyzing search engine data from Alexa Internet, Mike Lesk commented that "free" is the most used search term, not "sex," as might have been predicted; sex ranks second, and "free sex" is the most used phrase.

Who uses the Web? The Web is also a society that, although global, is not universal. Worldwide, English speakers are about 65% of the world online population (http://www.euromktg.com/globstats/), but hundreds of languages and dialects are used on the Internet. Business Week (May 5, 1997) commissioned a census of the use of the Web by 1000 U.S. households, which begins to document its expanding cultural domain. 21% of adults, an estimated 40 million people, use the Internet or the World Wide Web, double the number a year ago. The online census is now 41% female (up from 23% in 1995), but still 85% white, and 42% have incomes over $50,000 a year; the study comments that student users probably overrepresent the use of the Net by the poor. According to the survey, the Net is primarily used for research (82%), education (75%), news (68%), and entertainment (61%); online shopping was only 9% of use, but about 25% have bought something on the Net, and the number of .com sites has expanded dramatically since the survey. Entertainment is a more likely use among the young (51% of 10-29 year olds) and surfing among those 18-29 (47%). Surfing was common among only 30% of those 50 and older. Most preferred sites that were not interactive (77%), but among those using interactive sites, 57% said they felt a sense of community.

Who Will Preserve Digital Cultural Artifacts?

The only problem is, digital documents are disappearing. Alexa Internet has created an archive of "404-document missing" Web Pages, because although the World Wide Web is now growing at an estimated 1.5 million pages a day, most of them disappear every year. The Alexa Internet archive of "404" Web pages is now 10 TB, and may be accessed by downloading the Alexa software (http://www.alexa.com). Given the dramatic growth of digital media, it is paradoxical that we do not yet know how to preserve digital cultural artifacts.

Print made possible institutions, in the modern sense of the word in which social order is based upon record keeping. Record keeping combined with archival preservation of other kinds of documents makes possible the historical memory that gives culture continuity and depth. Cultural institutions like universities, publishers, theaters and symphonies are dedicated to enacting the cultural traditions which we call civilization, and institutions like libraries, museums, and archives are dedicated to collecting, organizing, conserving and preserving cultural artifacts. What are, and will be, the social contexts and institutions for preserving digital documents? Indeed, what new kinds of institutions are possible in cyberspace, and what technologies will support them? What kind of new social contexts and institutions should be invented for cyberspace? Consider just a few of these questions, seen through the lens of the transitions from manuscripts to print.

Many of these concerns resolve into sociological questions. While discussion of this social agenda is now beginning to take off in national information policy debates, we believe that it is premature to define final institutional forms before there is a technological response, an agenda for a more robust design for digital documents which recognizes their cultural importance.

Examples of Cultural Innovation on the Web

This is a time of both technological and social invention, indeed the two are inseparable. Alexa Internet and the Internet Archive are only two examples of cultural innovation in the development of a Web literature. In print, a literature is interconnected by citations and has structure because it has been filtered by editorial boards before publication; on the Web, new technologies must be created to define quality and to discover organization. The following list contains only a few examples among many still in process working on the production of a new kind of cultural artifact.

Conclusion: What Technical Work Remains to be Done?

Our goal has been to define problems that might be solved collaboratively rather than to propose solutions; thus, we propose the following schema to organize the work:

1.0 Infrastructure technologies.

Infrastructure technologies are both technological and political, including both legal and engineering standards. The issues and requirements include the following:

1.1 Build a legal infrastructure tolerant of digital libraries, archives, and museums. Will intellectual property law, now being optimized for electronic commerce, allow for the preservation of digital documents by public institutions like libraries and archives? While copyright law evolved provisions for libraries, archives, and museums, the proposed laws tend to treat digital documents exclusively as private property, governed by contract rather than copyright (http://sims.berkeley.edu/BCLT/events/ucc2b/). Trends towards licensing information rather than having information under copyright may end the Fair Use of digital documents for educational purposes, the circulation of information, and copying for preservation and archiving. For example, in the early 20th century, laws governing the preservation of film were differentiated from those governing the preservation of books; as a result, today there is no comprehensive archive of radio or television. In addition, national legal codes governing digital cultural artifacts must be coordinated on a global scale through treaties, since in a sense, local or national jurisdictions are no longer enforceable through traditional means.

1.2 Build a high-speed data network. In the United States, most Internet traffic uses voice communication lines. Whereas this helps make the Internet quickly deployable, it is limiting in terms of cost reduction. Where computer components have cost-performance improvements of 100% every 18 months, the long distance phone technology evolution is measured in decades. Thus, computer processors, RAM, disk, and LAN speeds have all been rapidly improving while the long distance systems have not.

1.3 Define a standard public video format. Sometimes when a public standard is established, use flourishes (e.g., TCP/IP, HTML). Several proprietary video formats are now being promoted on the Internet. To build a popular and long term archival medium, it is very helpful if a format is internationally standardized and non-proprietary.
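The contrast drawn in 1.2 between component improvement and long distance line improvement can be made concrete with a small calculation. This is a minimal sketch, assuming that "100% every 18 months" means a steady doubling; the function name is hypothetical.

```python
def improvement_factor(years, doubling_months=18.0):
    """Cumulative cost-performance gain if it doubles every `doubling_months`."""
    return 2.0 ** (years * 12.0 / doubling_months)


# Compare a single doubling period with five-year and ten-year horizons.
for years in (1.5, 5, 10):
    print(f"after {years} years: about {improvement_factor(years):.0f}x")
```

On that assumption a decade of component improvement compounds to roughly a hundredfold gain, which is the scale of the gap that opens up against a technology whose evolution is measured in decades.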

2.0 Technologies for digital publishing.

The current web publishing tools use freeform hypertext and do not yet encourage a set of templates. If printed book publishing can be used as a precedent, then idioms of page formats, tables of contents, indexes, page numbers, and the like will emerge so that different websites will have standard structures that can be counted on. If these paradigms existed in the web, then navigation and categorization tools could be effectively applied. For instance, it would then be easy to tell the difference between a personal homepage, scholarly publication, or corporate brochure from an email message. At this point, it is very difficult to tell the difference with any certainty. These formats, fortunately, are maturing in the tools and standards committees for websites but currently, digital publishing is a mess. Metadata standards for declaring the intentions of the authors in the areas of structure and usage rights will also be helpful. Furthermore, making it easy for authors to use URLs that stay stable across many versions of their websites will be helpful in allowing others to refer to the documents and services over time.
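To illustrate how declared metadata could help tools tell document types apart, the following sketch reads author-declared fields from a page's meta tags. It is an assumption-laden example, not a description of any tool named here: the field names follow a Dublin Core style chosen only for illustration, and the sample page, class name, and function name are hypothetical.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        fields = dict(attrs)
        name, content = fields.get("name"), fields.get("content")
        if name and content:
            # Store names in lowercase so lookups are case-insensitive.
            self.meta[name.lower()] = content


def declared_metadata(html):
    """Return the metadata an author declared in the page's meta tags."""
    parser = MetaCollector()
    parser.feed(html)
    parser.close()
    return parser.meta


# Hypothetical page: the field names and values are illustrative only.
SAMPLE = """<html><head>
<title>Example paper</title>
<meta name="DC.type" content="scholarly publication">
<meta name="DC.rights" content="copying permitted for preservation">
</head><body>...</body></html>"""

info = declared_metadata(SAMPLE)
print(info.get("dc.type"), "|", info.get("dc.rights"))
```

A gatherer that found such declarations could sort a scholarly publication from a personal homepage without guessing from layout, which is the kind of navigation and categorization the paragraph above anticipates.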

3.0 Technologies for digital libraries.

Alexa Internet is an example of a library of digital materials -- the web and netnews -- but there are many other examples, such as Paul Ginsparg's XXX server at Los Alamos, and archives of astronomical, meteorological or medical data, which can be rendered as images, statistical data sets, and so on. The technologies helpful in building such collections are gatherers, storage mechanisms, data mining tools, and serving tools. In the case of Alexa Internet, some of the components could be purchased, but most of the technology had to be developed internally. More tools for dealing with terabyte digital collections would be very helpful to this field. Most software tools do not support terabytes very well, even though the cost of the equipment is quite low.

We see the opportunities in these areas to be exciting and essential to the evolution of this field.

3.1 Gathering. Alexa and the Internet Archive act as a centralized repository for data so every researcher or company does not need to write its own gathering software. This service needs to be matured and could be helped by cooperation with other organizations that gather large collections.

3.2 Storing. Storing and managing 300 million objects (increasing to 1 billion soon) taxes most existing storage and database technologies. The commercial database technology that Alexa tried could not perform fast enough transactions on low cost hardware; therefore, we needed to write all our own management and indexing code. Offsite storage and redundancy are essential. While Alexa does create a copy and does give it to the non-profit entity Internet Archive for storage and protection, both institutions are in the United States, where a change in law could well make certain kinds of collections and activities illegal. A more resilient strategy would be to engage a set of active organizations and to exchange collections.

3.3 Data mining. Finding patterns in terabyte collections of semi-structured information is different enough from much of the data mining work being done by mailing list companies that we have not yet found a match with commercial tools. Therefore, we have written our own tools. We hope this area will be of interest to academics because of the fertile ground for new ideas in pattern finding and artificial intelligence based on this large semantic network.

3.4 Serving. Alexa Internet serves information about a user's current webpage such as site statistics and related sites. This information must be dispensed for every page turn of every user. As these technologies are built into browsers, this needs to be done on every page turn of every user on the Web. Alexa Internet has built servers to be able to do this. Further work in this area can be quite fruitful for building high capacity services.

3.5 Critical Mass Digitization Project. Michael Lesk has argued that the cost of scanning a book and making it available on the Internet is less than building the space to store the book in a library (Practical Digital Libraries: Books, Bytes and Bucks. Morgan Kaufmann, 1997). While the Web is often the information resource of first resort, print publication has been the medium of record for quality, and print libraries are far more comprehensive and reliable. If a large-scale library were to be scanned and offered on the Internet, then we would be confronted with a potential testbed for new models of copyright and royalties and might then develop new economic models for digitization of print. Current projects relevant to this goal include:

4.0 Technologies for digital archives.

4.1 Low Cost Bulk Storage. If we assume that material worth archiving is limited by our server storage systems, then we need an archival medium that is less expensive than the original copy. Tape storage systems have historically been the inexpensive mechanism to store large amounts of data. If these tapes are put into a tape robotic system, then they can be accessed slowly and inexpensively. Unfortunately, the cost per gigabyte of these robot systems may not offer much cost advantage over disk subsystems, given the historic trends. Currently, the cost per gigabyte on a disk is about $50, and on a tape robotic system it is about $12. David Patterson, of UC Berkeley, said in discussion: "If you follow the curves, they will cross in 4 years." Thus archival storage might soon be on the same medium as the originally published material, which will make it as costly to archive as it is to serve. This could severely limit what can be cost-effectively saved.

4.2 Long Term Storage. While good paper lasts 500 years, computer tapes last 10. As long as there are active organizations to make copies, we can keep our information safe, but we do not have an effective mechanism to make 500-year copies of digital materials. One company, Norsam (http://www.norsam.com), has a technology to write micro images on silicon wafers, which can be used for this purpose.

4.3 Archive Television. We are not aware of any comprehensive collections of television or radio. The original producers may have copies, but the networks often do not have copies, nor does any library. Before television grows much older and mutates much further, we believe it would be important to have a record of these cultural artifacts.
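The disk-versus-tape comparison in 4.1 can be sketched numerically. The two starting prices come from the text; the rates of decline do not. The sketch assumes disk cost per gigabyte halves every 18 months (borrowing the general component trend from 1.2) and that tape robot cost falls 10% a year, a purely hypothetical figure chosen for illustration. The function name is also hypothetical.

```python
import math

DISK_USD_PER_GB = 50.0   # quoted in 4.1
TAPE_USD_PER_GB = 12.0   # quoted in 4.1


def crossover_years(disk0, tape0, disk_halving_months, tape_annual_decline):
    """Years until disk cost per GB falls to the tape-robot cost per GB.

    Both costs are modeled as simple exponential declines.
    """
    disk_rate = math.log(2.0) * 12.0 / disk_halving_months   # per year
    tape_rate = -math.log(1.0 - tape_annual_decline)         # per year
    if disk_rate <= tape_rate:
        raise ValueError("disk never catches up under these rates")
    return math.log(disk0 / tape0) / (disk_rate - tape_rate)


# Assumed rates: disk halves every 18 months, tape falls 10% per year.
years = crossover_years(DISK_USD_PER_GB, TAPE_USD_PER_GB,
                        disk_halving_months=18.0, tape_annual_decline=0.10)
print(f"curves cross after about {years:.1f} years")
```

With those assumed rates the crossing comes out close to four years, in line with Patterson's remark, but the result is sensitive to the rates: a faster decline in tape cost pushes the crossing out considerably.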

5.0 Time capsule technologies.

What will be an Internet Time Capsule, serving archeologists of the distant future? The techniques suggested in much of the digital archiving work require an institution to "refresh" magnetic media every 10 years. If future technologies also require this frequent refreshing, then the digital artifacts will not last through a dark age of the future. Therefore, another writing technology would be needed to endure 1000 years without maintenance. To be viable, this "time capsule" technology would not have to be as easily readable as the archival technology, and could be more expensive to write because it is assumed that it will be more selectively written and read.
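The fragility of a refresh-based strategy can be illustrated with a simple calculation. This is a sketch under strong assumptions: each refresh cycle is treated as an independent event that either happens or does not, and the per-cycle probabilities are hypothetical values, not estimates from any source. The function names are likewise hypothetical.

```python
import math


def refresh_cycles(span_years, media_life_years):
    """Number of copy-forward cycles needed to carry data across a time span."""
    return math.ceil(span_years / media_life_years)


def survival_probability(cycles, p_refresh):
    """Chance that every cycle happens, if each is independent with probability p_refresh."""
    return p_refresh ** cycles


# 1000 years on media refreshed every 10 years, as in the text.
cycles = refresh_cycles(1000, 10)
for p in (0.999, 0.99, 0.95):
    print(f"p={p}: {cycles} cycles, survival about {survival_probability(cycles, p):.3f}")
```

A thousand years on ten-year media means a hundred consecutive refreshes. Even if each one had a 99% chance of being carried out, the chance that all of them happen is only about one in three, which is why a write-once medium that needs no maintenance is attractive for a time capsule.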

How will historians of the digital age read old code? Is it possible, as Danny Hillis has speculated, to build a universal Turing machine that would emulate all of the operating systems of the past?

A discussion of these technologies has been started by The Long Now Foundation, http://www.longnow.org/.

 

©1998 Peter Lyman and Brewster Kahle. Permission to reproduce this story has been granted by the Authors.


hdl:cnri.dlib/july98-lyman