It was my distinct pleasure to listen to a talk by William Y. Arms, given at the AusWeb '96 conference this past July. Bill highlighted a number of areas where the distributed library community was working hard to solve serious problems with the World Wide Web (WWW) infrastructure. In particular, he discussed in detail the difficulties involved in solving two problems:

- providing persistent, globally unique names for objects on the Web, and
- associating metadata with those objects.
As always, Bill made an excellent case for the need for both of these items. He further argued that the existing WWW protocols do not provide solutions to these problems. After hearing him out, I found that I agree completely with the need for persistent, globally unique names and for metadata. But I do not agree that the existing WWW protocols fail to offer solutions. Indeed, I believe that most of the engineering needed to support them is already complete; what remains is work of a societal nature, and that work can be done only by the institutions most directly involved with the problems.
This article, prepared at his request, is my response to his excellent talk. It is, for the most part, a call for the digital library community to join with the Web community in solving the remaining engineering problems, while suggesting that the real obstacles are not primarily technical.
To explain my viewpoint, I must first describe both the World Wide Web Consortium (W3C) and my own work at the Consortium. W3C is a formalized collaboration between internationally known research organizations: the U.S. office is a research group at Massachusetts Institute of Technology's (MIT) Laboratory for Computer Science (LCS); the European office is a pair of research groups at INRIA (Institut National de Recherche en Informatique et en Automatique, the French national computer science laboratory); and the Japan/Korea office is a research group at Keio University. The funding model of the Consortium is primarily membership fees from companies, organizations, and government offices. In addition, each W3C office is free to seek additional funding through traditional research or development grants.
All three offices share a common Director, Tim Berners-Lee, the inventor of the Web. Jean-François Abramatic is the Chairman of the organization and the manager of the W3C team worldwide. W3C has about 30 full-time staff members at its three offices, plus about five additional engineers who have been seconded to the W3C by their employers. The Consortium's mission is to "help the Web reach its maximum potential," which we do by working with our member companies to evolve the specifications that underlie the Web. Our work primarily involves working with companies in a pre-competitive environment, helping to define which parts of the specifications are critical to ensure the long-term growth and interoperability of the Web, and which parts are best left to competition in the marketplace. We develop and refine both specifications and reference code, all of which we make freely available to the public.
We also initiate joint projects whose aim is to use the existing Web infrastructure in new ways. These projects lead both to modifications to the infrastructure and to the design of protocols or agreements on the use of the infrastructure for particular new applications.
Our technical work is organized in three broad areas:

- User Interface (the formats users see directly, such as HTML and style sheets);
- Technology and Society (technologies that address public policy concerns, such as PICS, security, and electronic payments); and
- Architecture (the underlying infrastructure, such as HTTP and addressing).
The W3C has over 150 corporate members, drawn roughly half from the U.S. and half from Europe, and we are currently expanding rapidly in Japan. We have been successful in simultaneously moving the Web specifications forward and making sure that the major browsers and servers interoperate. Ensuring this graceful evolution is not easy, and it constitutes the majority of our work at the W3C. We sponsor workshops, run ongoing working groups, elect editorial review boards, and manage cross-industry projects to make sure that all member companies have an opportunity to understand and help evolve the technology, and that the major technology innovators will work together when necessary to ensure interoperability.
It is against this background that I listened to Bill's talk and found myself in both agreement and disagreement. Let me cover his two points individually.
At the core of this problem lie two issues. The first is the guarantee of perpetual existence, the issue Bill emphasized in his talk. What form of name can be created that can be used forever, outlasting the Internet and current media? The second is the guarantee of uniqueness: two distinct names refer to two distinct objects.
From an engineer's point of view (and good engineering design is the hallmark of the Web: it innovates only when necessary, but remains extremely flexible), the former is easily answered. To an engineer, there is no forever. Instead, there is a fixed lifetime and a mechanism for moving forward before that lifetime expires. This is precisely the work of OCLC on its PURL server, and it can be combined with the work of Hyper-G to allow updates of referring documents as the naming system moves forward (i.e., the federating of separately maintained PURL servers).
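To make the mechanism concrete, here is a minimal sketch, in Python, of a PURL-style resolver: a small HTTP service that maps a persistent name onto the object's current location and answers with a redirect, so that referring documents never need to change when the object moves. The names, paths, and table below are hypothetical illustrations, not OCLC's actual software.

```python
# Minimal sketch of a PURL-style resolver (hypothetical names and locations).
# A persistent name such as /names/doc-1234 is redirected to wherever the
# object currently lives; only this table changes when objects move.
from http.server import BaseHTTPRequestHandler, HTTPServer

CURRENT_LOCATION = {
    "/names/doc-1234": "http://archive.example.org/1996/report.html",
    "/names/doc-5678": "http://mirror.example.edu/papers/metadata.ps",
}

class Resolver(BaseHTTPRequestHandler):
    def do_GET(self):
        target = CURRENT_LOCATION.get(self.path)
        if target is None:
            self.send_error(404, "Unknown persistent name")
            return
        # A redirect tells the client where the object lives today; the
        # persistent name printed in citations never changes.
        self.send_response(302)
        self.send_header("Location", target)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), Resolver).serve_forever()
```

Federating such servers, in the spirit of Hyper-G, is then largely a matter of partitioning the name space among institutions and replicating their tables.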
What we need to move forward on persistent names, then, is not new technology or engineering. Instead, there must be one or more entities that take institutional charge of issuing and resolving unique names, and a mechanism that will allow the entire set of names to be carried forward as the technology progresses. While changes to the Web itself might make the problem simpler or more robust, the need for institutional commitment to the naming system cannot be "engineered away."
Thus my first response: The Digital Library community must identify institutions of long standing that will take the responsibility for resolving institutionalized names into current names for the foreseeable future.
Once this is done, W3C can help explore with these institutions any remaining naming issues. But without the institutions to back the names, there can be no true progress.
What might such an institution be? And what would be the costs? I propose that a small consortium of well-known (perhaps national) libraries could work together to provide the computing infrastructure. What is needed is a small set of replicated servers, perhaps two at each of three sites. Each would need roughly 64 megabytes of memory and 4 gigabytes of disk space to resolve about 10,000,000 names. At current prices, this would cost about $80,000. If we add 4 gigabytes of disk per year to each system (thus supporting 10,000,000 new names per year), the additional cost would be on the order of only $10,000 per year. One funding model for this infrastructure might be a charge for creating a permanent copy of a document, on the order of $1.00 per megabyte. This could cover a notarization step (proving the document existed as of a certain date), copying the data itself, permanent archiving and updating as media and naming systems change, and the creation of the permanent name itself.
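As a rough back-of-envelope check on these figures, using only the numbers quoted in the paragraph above, the storage budget works out to a few hundred bytes per name and a modest cost per server:

```python
# Back-of-envelope check of the capacity and cost figures quoted above.
names = 10_000_000            # names resolved per 4 gigabytes of disk
disk_bytes = 4 * 10**9        # 4 gigabytes per server
print(f"{disk_bytes / names:.0f} bytes per name record")        # ~400 bytes

servers = 2 * 3               # two replicated servers at each of three sites
initial_cost = 80_000         # quoted total for the initial hardware
print(f"~${initial_cost / servers:,.0f} per server")            # ~$13,333

# Funding model: $1.00 per megabyte archived, so a 100-kilobyte
# document would cost about ten cents to register permanently.
print(f"${100_000 / 1_000_000:.2f} per 100-kilobyte document")  # $0.10
```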
There are three separate subproblems that underlie the "metadata problem." The first, and by far the hardest, is the question of what the metadata elements should be. This entails hard decisions about what information must be captured, who can capture it reliably, what will be useful for searching both today and 1,000 years hence, standardization of names, and canonicalization (i.e., standardization of representation) of the metadata values. Again, these problems can be addressed only by institutional agreement, and they are subject to modification over time. In fact, the set of workshops organized by OCLC and its partners over the past year is directly addressing this very hard problem, and the form of the results (the Dublin Core and the Warwick Framework) is becoming clear.
The remaining problems are technical and much easier to solve once the first has been addressed: the encoding of the metadata into a form that can be used by a computer, and the retrieval and transmission of that metadata. I argue that the Internet already has a system, now in the process of wide deployment, that addresses both of these issues: PICS (the Platform for Internet Content Selection). While PICS was initially created to address the problem of child protection on the Internet, it is important to look beyond this at the actual technology and to see that PICS is, fundamentally, a metadata system. At a recent meeting, several members of the digital library community did exactly this and suggested that PICS (with slight modifications) may well form a base for encoding and transmitting metadata derived from the Dublin Core and the Warwick Framework.
The key to understanding PICS as a metadata system is to look carefully at the three things that it specifies:

- a machine-readable format for describing a metadata vocabulary (in PICS terms, a "rating system" and the service that uses it), so that software can interpret labels it has never seen before;
- a format for the labels themselves, the metadata statements attached to an individual document; and
- methods for distributing labels: embedded in the document, transmitted alongside the document by the transport protocol, or retrieved separately from a third party (a "label bureau").
Thus my second response: The Digital Library community must identify the sets of metadata that are important. Once that is done, encoding and distributing metadata using PICS will leverage the existing infrastructure of the Web so that deployment and use of metadata will be a natural extension of existing systems.
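To illustrate what such an encoding might look like, the sketch below renders a few Dublin-Core-style fields as a PICS-style label. The service URL and field names are hypothetical, and carrying textual values in this way is exactly the kind of "slight modification" mentioned above, since PICS-1.1 rating values are numeric.

```python
# Hypothetical sketch: rendering Dublin-Core-style metadata as a PICS-style
# label.  The service URL and field names are invented for illustration, and
# textual values would require the "slight modifications" noted above,
# because PICS-1.1 rating values are numeric.
def pics_style_label(service_url, document_url, fields):
    options = " ".join(f'{name} "{value}"' for name, value in fields.items())
    return (f'(PICS-1.1 "{service_url}"\n'
            f' labels for "{document_url}"\n'
            f' ratings ({options}))')

metadata = {
    "title":   "Persistent Names and Metadata on the Web",
    "creator": "A. N. Author",
    "date":    "1996-11-01",
}

print(pics_style_label("http://purl.example.org/dublin-core/1.0",
                       "http://example.org/report.html", metadata))

# A label like this could travel embedded in the document itself, be sent
# in a header alongside the document, or be fetched separately from a
# third-party label bureau.
```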
The World Wide Web Consortium is interested in working with the Digital Library community to make the vision of a worldwide, searchable information space a reality. The Consortium is prepared to work with institutions that can commit to supporting such an infrastructure. We will help ensure that the infrastructure is sound from an engineering point of view, and we will work towards its universal adoption and integration with the existing Web information space. Toward that end, W3C will work with both the W3C member companies and the PICS community to improve the existing PICS infrastructure to support the full metadata needs of the Digital Library community. But the hardest problems to be solved are not technological: they are problems of our social and institutional structure that can be solved only by cooperation and agreement within the Digital Library community itself. And these processes are well underway.