From Static to Dynamic Surrogates

Resource Discovery in the Digital Age

Carl Lagoze
Department of Computer Science
Cornell University
lagoze@cs.cornell.edu
D-Lib Magazine, June 1997
ISSN 1082-9873

This paper explores how resource discovery in the networked (or digital) environment presents new opportunities for the use of surrogates. Our argument has several elements. We show that resource discovery is a complex process that includes multiple phases, is iterative, and is highly dynamic. In agreement with others, we argue that resource discovery involves the manipulation of object surrogates, which are vital because of their "order-making" role. Resource discovery in the physical realm -- for example searches in the traditional library -- has been constrained by the use of static surrogates. In the traditional library these are the card catalog and its digital analog, the OPAC. These static surrogates have been quite successful as resource discovery aids, albeit frequently with the mediation of trained reference librarians. We argue that significant improvements to resource discovery in the networked realm can be made using techniques that match surrogate semantics to the instance-specific requirements of the resource discovery process. This can be accomplished most easily through architectures, such as the Warwick Framework, that allow the association of multiple surrogates with objects, or more ambitiously through methods that construct derived, or dynamic, surrogates that respond to current resource discovery needs.

The Complexity of the Resource Discovery Process

With or without the computer and the Internet, the resource discovery process is complex and often over-simplified. To begin with, what is the goal of resource discovery? It is easy to assume that what is sought is the answer to some information need, or query. Yet, depending on the situation, person, costs, and other factors, this "answer" may have quite different characteristics. In some cases, what one is seeking is the best possible response to a query (we ignore the vagaries of what characterizes "best"). In other cases, because of constraints such as time, cost, patience, and the like, the information seeker may be satisfied with less specificity in the response. In some cases, the goal of the discovery process may change as the seeker is sidetracked by intervening needs or newly discovered information. Finally, the process may begin without even a clearly defined information goal, and the satisfactory answer might be some information that is of value simply as a result of the serendipity of the process itself.

Our focus here is the character of the process as it proceeds from initiation to realization of goal. By understanding this process we will be better equipped to formulate architectures to facilitate networked resource discovery. For a broader perspective on networked information discovery and retrieval (NIDR), we refer readers to the draft of a White Paper on Networked Information Discovery and Retrieval [CNI], being prepared for the Coalition for Networked Information (CNI). This white paper, while unfinished, is the clearest exploration to date of many of the issues relevant to discovery and retrieval in a networked environment.

Resource discovery is a long-term, multi-threaded, and iterative process with complex and dynamic requirements. We can characterize it as having a number of dimensions, whose relationships range from completely orthogonal to highly interdependent. We briefly describe some of these dimensions below, with the realization that a full examination of the process is the subject of a much more in-depth study.

There are certainly other dimensions to the resource discovery process. The process over time involves the complex interplay of these dimensions and a resulting shifting set of demands on the information architecture. In the remainder of this paper we focus on the central role of surrogates in this information architecture and on the importance of designing resource discovery tools and processes that allow surrogates to adapt to these dynamic requirements.

Cataloging and the Role of Surrogates in the Discovery Process

The importance of surrogates in traditional resource discovery is well known. One familiar example is that which takes place in the physical library, where the widely accepted Anglo-American cataloging rules [AACR2] and their physical manifestation in the card catalog provide the basis for resource discovery. The MARC interchange format [MARC] and its use in online public access catalog (OPAC) systems translate this physical artifact to digital form. There are numerous other examples such as specialized abstracting and indexing services as well as subject bibliographies.

An interesting perspective on the role of surrogates comes from David Levy [LEVY] of Xerox Palo Alto Research Center. Levy depicts cataloging as a method for creating an illusion of order (a schema) of the chaotic information universe. In his words, "...it [cataloging] is a set of practices which quite literally put a library's collections in order and provide access through a set of systematically organized surrogates...". Cataloging and the surrogates it produces allow us, as information seekers, to assume that resources have a common set of attributes, such as title, author, subject, and the like. In fact, these attributes may not actually exist, but are derived from and associated with information objects as the result of professional cataloging. We use these attributes as the basis for formulating search criteria and as a means of conceptualizing and examining the results of the searches.

As argued by Lynch, Michelson, et al., it is tempting to believe that in the age of networked information and digital libraries, the significance of surrogates to resource discovery will diminish. After all, if the full content of the object is available in digital form, why not use it as the basis for resource discovery in preference to some substitute object? They list a number of reasons why this logic is false including:

A distinguishing characteristic of traditional surrogates is that they are static. That is, the surrogate exists in physical form either as ink on paper, for example a card catalog item, or as a set of bits in magnetic storage, for example a MARC record in an OPAC. This does not imply that the physical manifestation is not revisable, but revisions occur with relatively low frequency, and not in response to dynamic schema requirements of the resource discovery process. As a result, information seekers are forced to continually translate their instance-specific requirements to the static format of the catalog record.

The librarian, in the traditional library setting, plays a vital role in this translation from current user requirements to the static surrogates. Terry Smith [SMITH], in what he calls the "meta-information environment" of libraries, describes this as an iterative mapping of cognitive models. The cognitive model of the user is based on her background, depth of knowledge of the subject, familiarity with the library, and current information requirements (manifested in the "query" for which she would like an "answer"). The cognitive model of the library builds on the information resources currently available and the meta-information (e.g., descriptive cataloging records and indexes) that acts as surrogates for those resources and related resources outside the library. Bringing these sometimes divergent models together is the raison d'être of a good reference librarian; and, as a side effect, the librarian hopes to advance the ability of the patron to perform this mapping with less assistance in the future.

We should note before continuing that the traditional library scenario described above, and the role of the reference librarian, has been enormously successful for resource discovery. We do not suggest that the purpose of digital libraries should be to entirely replace this interaction, nor do we think that it is possible for the foreseeable future. We do recognize, however, that the traditional library model is unduly restrictive in many instances. Face-to-face interaction between patron and librarian is possible only when the library is open and the library is nearby. The proximity problem is somewhat alleviated by telephone contact and may be further reduced in the future by video-conferencing technology and attendant improvements in networking technology. More problematic is the cost of the traditional library model, both in terms of the cost of professional cataloging and of the public service and reference service. Ideally, we would like to continue to provide this high-cost service when necessary, but use digital library technology to provide a more lightweight solution for resource discovery. This lightweight solution may be sufficient for the sake of convenience (7-by-24 service), as an alternative for more savvy or experienced researchers, or for more informal resource discovery.

The Current State of Networked Resource Discovery

We are in the midst of a rapid transition of the information infrastructure from the physical to the digital realm. For better or worse, the combination of ease-of-access and sheer quantity of current information has made the Internet, manifested in the World Wide Web, the preferred source of information for a large number of people. We are reminded of a Cornell student who, when asked to do a library literature search said, "...sorry I don't do libraries." While the Web is undeniably valuable as an information resource, any suggestion that it supplants the role of a library is foolhardy. It ignores the fact that a library is not just a repository of information, but a rich collection of services including selection, preservation, categorization, location, and reference.

In this paper, we focus on opportunities for improving resource discovery for information in digital form; the Web and Internet being current realizations of that form. The fact is that while the quantity of resources on the Internet continues to expand at an explosive rate, there is not a commensurate advance in the tools for finding those resources. The current tool-set for networked resource discovery uses a model that has evolved little since the Archie [ARCHIE] tool for finding FTP resources. This model, known as "web-crawlers" or "web-indexers" and characterized by services such as Digital's Alta-Vista, relies on periodic global scans of the directed, cyclic graph that is web-space, using hyperlinks as the guide. The crawler uses the HTTP GET request to download each resource and then uses a variety of IR techniques to index the contents of that resource. Users then submit full-text requests to this centralized service (which may be distributed among many servers and replicated at multiple sites).
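The crawl-and-index model described above can be sketched in a few lines. This is only an illustration under stated assumptions: the WEB dictionary, its URLs, and the function names crawl and search are hypothetical, and the dictionary lookup stands in for the HTTP GET request a real crawler would issue. The sketch follows hyperlinks breadth-first, uses a set of already-seen URLs so that cycles in the graph do not cause endless rescanning, and builds a simple inverted index for full-text queries.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing hyperlinks).
# A real crawler would issue an HTTP GET for each URL instead.
WEB = {
    "http://a.example/": ("digital library surrogates",
                          ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("card catalog records",
                          ["http://a.example/"]),  # link back to a: a cycle
    "http://c.example/": ("library metadata",
                          ["http://b.example/"]),
}


def crawl(seed, fetch):
    """Scan the link graph breadth-first from seed.

    Returns an inverted index mapping each token to the set of URLs
    whose text contains it.
    """
    index = {}
    seen = {seed}            # guards against revisiting pages in a cycle
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        text, links = fetch(url)
        for token in text.lower().split():
            index.setdefault(token, set()).add(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index


def search(index, query):
    """Return the URLs that contain every token of the query."""
    token_sets = [index.get(tok, set()) for tok in query.lower().split()]
    if not token_sets:
        return []
    return sorted(set.intersection(*token_sets))


index = crawl("http://a.example/", WEB.__getitem__)
print(search(index, "library"))
```

Note that nothing in this sketch knows about titles, authors, or subjects: every page is reduced to an unordered bag of tokens, which is the limitation the rest of this section examines.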

The technology behind these indexers has four notable problems.

We do not intend to dismiss the current flock of web indexers as useless. In fact, in the course of writing this paper we found ourselves using them quite frequently. Making innovative use of IR technology, the indexers are often successful at supporting resource discovery in a framework (the Web and HTTP) that provides little infrastructure support for the service. In fact, even when only marginally successful, the web indexers have a definite role in the resource discovery process.

What distinguishes current networked resource discovery from the traditional library model that was discussed in the previous section is the limited, almost non-existent, role of surrogates in the process. Using Kunze's two-phase model as the basis for examination, we find that these web indexing services make no use of surrogates during the location phase. The notion of a common cataloging record format and a means of associating it with HTML documents are still in the development stages. These network resource discovery tools treat resources and queries as unstructured collections of tokens. In the process of query/resource matching, attempts are made to improve precision and recall using heuristics that attempt to interpret which tokens are of greater semantic relevance. As an aid during the examination phase, these services generally construct an informal surrogate to display search hits. This surrogate generally consists of the URL, title of the Web page, and some summary text that is derived with the aid of some heuristics.
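To make the informal surrogate concrete, the following sketch builds one from a page: the URL, the contents of the title element, and a short summary. It is an assumption-laden illustration, not a description of how any particular indexing service works. The function name hit_surrogate and the "first few words of the page" summary heuristic are ours; real services use more elaborate heuristics.

```python
from html.parser import HTMLParser


class _TitleAndText(HTMLParser):
    """Collects the <title> text and the remaining words of a page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data.strip()
        else:
            self.words.extend(data.split())


def hit_surrogate(url, html, summary_words=8):
    """Build an informal surrogate for displaying a search hit.

    The summary is simply the first summary_words words outside the
    title, a deliberately naive stand-in for real summary heuristics.
    """
    parser = _TitleAndText()
    parser.feed(html)
    parser.close()
    return {
        "url": url,
        "title": parser.title,
        "summary": " ".join(parser.words[:summary_words]),
    }


page = ("<html><head><title>Card Catalogs</title></head><body>"
        "<p>The card catalog is a static surrogate for library holdings.</p>"
        "</body></html>")
hit = hit_surrogate("http://x.example/catalogs.html", page, summary_words=5)
print(hit)
```

The resulting record is created only after the match has been made, which is why it helps with examination but plays no part in location.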

We do not doubt that there is value in resource discovery tools that operate without the aid of structured surrogates. They allow automatic indexing and location of any resource. They are suited for "needle in a haystack" type resource discovery tasks. However, they cannot be viewed as the solution for networked resource discovery, but as a complement to more structured methods that make use of surrogates.

In the final section of this paper, which follows, we describe the potential for moving beyond the static surrogate abstraction in the networked realm. Through the use of composite object architectures and processing techniques we can create a more adaptive surrogate abstraction and, as a result, more powerful resource discovery tools.

Beyond Static Surrogates - Opportunities in Networked Resource Discovery

Recognizing the limitations of current Internet resource discovery tools, members of the Internet, World Wide Web, and digital library communities are actively pursuing both standards for descriptive surrogates for networked objects and methods for associating surrogates with those objects. One well-known effort is the Dublin Core Metadata Workshop Series [DC], which has produced a fifteen-element descriptive metadata set, the so-called Dublin Core. The Dublin Core, as originally conceived, is intended to be a core metadata set that is easy to create and maintain and contains the minimum number of elements required to facilitate resource discovery in a networked environment. Using the terms we developed in the first section of this paper, the Dublin Core was conceived as a coarse-granularity, domain-independent metadata scheme.

The effort to develop a method of associating descriptive metadata with HTML documents has most recently centered on PICS [W3C]. PICS was originally designed as an infrastructure for associating ratings with networked content. One targeted use of PICS was to enable parents to control what children access on the Internet. The PICS-NG [PICSWG] effort extends this infrastructure to make it possible to attach any descriptive labels, or metadata, to content.

Creating new descriptive surrogate standards for networked objects is essential, but not sufficient. We argued earlier in this paper that resource discovery is notable for its instance-specific set of requirements. In other words, no single descriptive standard is sufficient for the wide range of needs - specific to role, granularity, phase, etc. - which overlap through the resource discovery process. In fact, any attempt to formulate an all-purpose descriptive standard for networked objects is in danger of revisiting territory already explored by the AACR2 and MARC community.

We argue that the more useful alternative is to consider an information object and resource discovery architecture that allows more complete matching of the semantics of the resource surrogate to the current resource discovery requirements. In the remainder of this section we suggest how this might be done. We propose a relatively simple mechanism that makes use of technology for associating multiple static surrogates with networked objects. We then suggest more complex mechanisms that make use of derived surrogates.

In an effort to provide some scope to the Dublin Core effort we developed the Warwick Framework [WF], a container mechanism for associating multiple sets, or packages, of metadata with intellectual objects. The Warwick Framework concept grew out of the recognition that there are many different forms of metadata that one might associate with networked objects. The information architecture should allow each form to be created, administered, and accessed independently. Finally, it should allow sharing and distribution of individual packages associated with a container.

An important application of the Warwick Framework is to encapsulate semantically distinct metadata forms, such as terms and conditions, provenance, and administrative metadata. We focus here on the capability of the Warwick Framework to encapsulate semantically overlapping metadata packages, in particular multiple descriptive surrogates for intellectual objects. For example, using the Warwick Framework, we can associate content objects with general descriptive forms such as the Dublin Core and MARC, and domain-specific descriptive forms such as that encoded in the Content Standard for Digital Geospatial Metadata (CSDGM) [FGDC]. Each descriptive form can provide data appropriate for a relatively specific niche in the resource discovery process. The Dublin Core is appropriate for coarse-granularity, domain-independent resource discovery. The MARC record is more appropriate for the finer granularity stages of resource discovery. Finally, the CSDGM package enables fine-granularity, domain-specific resource discovery. We expect that more metadata forms and extensions of forms will develop, which will provide other targeted semantics. As a side effect, we hope that evolution and use of the Warwick Framework will provide an incentive for metadata developers and standards efforts to maintain a targeted focus.
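The container idea described above can be illustrated with a small data structure. This is a minimal sketch under our own assumptions, not the Warwick Framework's specified interface: the class names Package and Container, the granularity and domain labels, and the select method are all hypothetical, and the sample field values are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Package:
    """One metadata package inside a container (names are assumptions)."""
    scheme: str        # e.g. "dublin-core", "marc", "csdgm"
    granularity: str   # "coarse" or "fine" (labels assumed for this sketch)
    domain: str        # "general" or a domain name such as "geospatial"
    data: dict


@dataclass
class Container:
    """Associates several independent packages with one content object."""
    object_id: str
    packages: list = field(default_factory=list)

    def add(self, package):
        self.packages.append(package)

    def select(self, granularity, domain="general"):
        """Return the packages suited to the current discovery stage."""
        return [p for p in self.packages
                if p.granularity == granularity and p.domain == domain]


# Illustrative values only; none of these records come from a real catalog.
container = Container("hdl:example/map-42")
container.add(Package("dublin-core", "coarse", "general",
                      {"title": "Coastal Survey", "creator": "A. Cartographer"}))
container.add(Package("marc", "fine", "general",
                      {"245": "Coastal survey of the northern shore"}))
container.add(Package("csdgm", "fine", "geospatial",
                      {"bounding_box": "illustrative value"}))

print([p.scheme for p in container.select("coarse")])
print([p.scheme for p in container.select("fine", "geospatial")])
```

Each package can be added, replaced, or read without touching the others, which mirrors the requirement that each metadata form be created, administered, and accessed independently.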

The mechanism for associating these multiple descriptive forms with network objects is the subject of the PICS-NG effort. We recognize that no technology exists at the client side for automatically selecting among the descriptive surrogate formats. However, a simple manual solution would plausibly make a significant improvement in the current state of network resource discovery. Assume, for example, that over the near-term standards like Dublin Core and PICS-NG become widely adopted. We can then envision that the corpus of networked objects on the World Wide Web evolves to a mixture of "high-integrity" objects, which are packaged with one or more descriptive surrogates, and "low-integrity" objects, which are only stand-alone content. Search service providers, such as Alta-Vista, might then add selectable options to their interfaces that allow users to fine-tune their searches. One option might be to search only high-integrity objects; i.e., those with associated surrogates. Another option might be to search only high-integrity objects in a coarse-granularity search; e.g., use Dublin Core type metadata as the surrogate for the search processing. This selectivity option will take some experimentation over time, but has the potential for being quite effective.
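The selectable search options suggested above might look like the following sketch. Everything here is hypothetical: the object layout, the mode names "all", "high-integrity", and "coarse", and the simple substring match are assumptions chosen to keep the example short, not features of any existing search service.

```python
def search_objects(objects, query, mode="all"):
    """Search a corpus with a user-selected surrogate option.

    mode "all":            match the full content of every object
    mode "high-integrity": match full content, but only for objects
                           that carry at least one surrogate
    mode "coarse":         match only against Dublin Core surrogate values
    """
    needle = query.lower()
    hits = []
    for obj in objects:
        surrogates = obj.get("surrogates", {})
        if mode == "all":
            haystack = obj["content"]
        elif mode == "high-integrity":
            if not surrogates:
                continue
            haystack = obj["content"]
        elif mode == "coarse":
            dublin_core = surrogates.get("dublin-core")
            if dublin_core is None:
                continue
            haystack = " ".join(dublin_core.values())
        else:
            raise ValueError(f"unknown mode: {mode}")
        if needle in haystack.lower():   # naive substring match
            hits.append(obj["id"])
    return hits


# Illustrative corpus mixing high- and low-integrity objects.
corpus = [
    {"id": "obj1",
     "content": "A survey of map projections and coastal charts",
     "surrogates": {"dublin-core": {"title": "Map Projections",
                                    "subject": "cartography"}}},
    {"id": "obj2",
     "content": "my page about maps and hiking"},
    {"id": "obj3",
     "content": "Notes mentioning cartography in passing",
     "surrogates": {"marc": {"245": "Field notes"}}},
]

print(search_objects(corpus, "map"))
print(search_objects(corpus, "map", mode="high-integrity"))
print(search_objects(corpus, "cartography", mode="coarse"))
```

In the coarse mode the third object is not returned for "cartography" even though its text contains the word, because the match is made against the Dublin Core surrogate rather than the content.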

We believe, however, that the greatest potential for improvement to networked resource discovery lies in the use of dynamic, or derived, surrogates. Lynch, Michelson, et al. refer to this capability with the comment "...it is important to recognize that the networked information environment offers new opportunities to derive (by extraction or computation) a much richer and more diverse set of surrogates from networked objects than the surrogates that were typically found in the print world."

We distinguish the notion of deriving surrogates from the essentially surrogate-free resource discovery tools (e.g., Alta-Vista) described earlier. Our intention is to preserve the order-making capacity of surrogates by developing a set of logical surrogate templates that model user requirements in the different stages of resource discovery. Research in this area can proceed with detailed user behavioral studies, both in the physical and networked realm. Research of this type is being undertaken as part of the NSF/ARPA/NASA DLI Projects [BISHOP] and in other venues [PAYETTE]. As a result of these studies we can enumerate key stages in the resource discovery process and develop detailed profiles of these stages. These profiles can then be used to develop a finite number of surrogate templates. With more experience these profiles can be refined and their number increased to allow finer granularity in surrogate/requirements matching.

One way of thinking of this approach is as an extension of the Warwick Framework. The original description of the Framework, in the Lagoze, Lynch and Daniel paper [WF], presented it as a container of physically distinct metadata packages. It is entirely consistent with the semantics of the Framework to move from physical metadata packages to logical metadata packages. In fact, this is by-and-large an implementation detail hidden behind a common interface that makes a set of metadata packages, either static or derived, available to a client.
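The point that static and derived packages can sit behind one interface can be shown with a short sketch. The class names MetadataPackage, StaticPackage, and DerivedPackage, the fields method, and the first-line-as-title derivation are all assumptions made for illustration; the Framework itself does not prescribe them.

```python
from abc import ABC, abstractmethod


class MetadataPackage(ABC):
    """Common interface: a client only ever asks for the fields."""

    @abstractmethod
    def fields(self):
        """Return the package's metadata as a dictionary."""


class StaticPackage(MetadataPackage):
    """A physically stored record, such as a catalog entry."""

    def __init__(self, record):
        self._record = dict(record)

    def fields(self):
        return dict(self._record)


class DerivedPackage(MetadataPackage):
    """A logical package computed from the content on request."""

    def __init__(self, content, derive):
        self._content = content
        self._derive = derive

    def fields(self):
        return self._derive(self._content)


def first_line_title(content):
    """Toy derivation rule: treat the first line as the title."""
    lines = content.strip().splitlines()
    return {"title": lines[0] if lines else ""}


def titles(packages):
    """Client code that cannot tell static packages from derived ones."""
    return [p.fields().get("title") for p in packages]


packages = [
    StaticPackage({"title": "Stored Title", "creator": "A. Cataloger"}),
    DerivedPackage("Computed Title\nBody text follows.", first_line_title),
]
print(titles(packages))
```

The client function never inspects the package type, so a service could replace a stored record with a computed one, or the reverse, without changing the client.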

The mechanisms for deriving these logical surrogates from either source intellectual content or other surrogates are the subject of current and future research. This could be done in a variety of ways. Most simply, it might involve the extraction of structured data based on the DTD and tags within an SGML document. A more interesting possibility is to derive structure [SUMMERS] from documents that are not explicitly tagged. Finally, there is research in both the information retrieval and natural language communities to derive summary information [SALTON] from documents.
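The simplest of these approaches, extraction driven by document tags, can be sketched as follows. The example makes several assumptions: it uses a well-formed XML fragment and Python's xml.etree.ElementTree as a stand-in for a full SGML parser, and the tag names and the TAG_TO_FIELD mapping are hypothetical rather than taken from any real DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from document tags to surrogate field names.
TAG_TO_FIELD = {
    "title": "title",
    "author": "creator",
    "keyword": "subject",
}


def derive_surrogate(markup):
    """Derive a surrogate by collecting the text of mapped tags.

    Returns a dictionary from field name to the list of values found,
    in document order. Unmapped tags and empty elements are skipped.
    """
    root = ET.fromstring(markup)
    surrogate = {}
    for element in root.iter():
        field_name = TAG_TO_FIELD.get(element.tag)
        text = (element.text or "").strip()
        if field_name and text:
            surrogate.setdefault(field_name, []).append(text)
    return surrogate


document = (
    "<article>"
    "<front><title>From Static to Dynamic Surrogates</title>"
    "<author>Carl Lagoze</author></front>"
    "<body><keyword>metadata</keyword>"
    "<keyword>resource discovery</keyword>"
    "<para>Body text is ignored by this mapping.</para></body>"
    "</article>"
)
print(derive_surrogate(document))
```

Because the surrogate is computed on request, a different mapping table could yield a different surrogate from the same document, which is the sense in which such surrogates are dynamic.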

An equally interesting research problem is client or user side mapping to the dynamic surrogates. This can be characterized as two problems. First, there is the issue of tracking and modeling the current user requirements, presenting those requirements to a resource discovery tool, and then matching them to the appropriate surrogate template. Issues related to this are being examined as part of the user agent research within the University of Michigan Digital Library Project [UMDL]. Second, there is the issue of providing search user interfaces that adapt to the current requirements of the resource discovery process. One interesting examination of this area is the work in the University of Maryland HCI Laboratory [DOAN]. The increasing ubiquity of Java will make it possible for increased research in both of these areas.

Conclusion

In this paper we have argued that substantial improvements in networked resource discovery are possible if we think in terms of dynamic rather than static surrogates. The success of static surrogates in the traditional library realm has been possible with large amounts of human mediation to translate from instance-specific patron needs to the notion of information order expressed in the catalog record. While we can probably never match the expertise of the professional research librarian or cataloger, we can improve networked resource discovery by providing mechanisms that map surrogate semantics to the current resource discovery needs of the information seeker. Realizing this goal will require substantial research in agent and user interface technology. In the meantime, we can improve networked resource discovery by standardizing on targeted surrogate formats, developing methods for associating multiple surrogates with content objects, and providing methods for users to adjust the surrogate profile of resource discovery tools.

Acknowledgements

Work on this paper was funded by the Defense Advanced Research Projects Agency under Grant No. MDA 972-96-1-0006 with the Corporation for National Research Initiatives. This paper was inspired by my many stimulating discussions about metadata and resource discovery with John Kunze from University of California San Francisco. Thanks also to Ron Daniel, Jr. of Los Alamos Advanced Computer Lab and Sandy Payette of Cornell Computer Science for their comments on earlier drafts of this paper.

Bibliography

[CNI] Clifford Lynch, Avra Michelson, Cecilia Preston, and Craig A. Summerhill, CNI White Paper on Networked Information Discovery and Retrieval, Incomplete Draft, http://www.cni.org/projects/nidr/www/toc.html

[KUNZE] John Kunze, A Citation Model for Resource Discovery and Retrieval, to appear in D-Lib Magazine

[OCLC] CNI/OCLC Workshop on Metadata for Networked Images - Executive Summary, http://www.oclc.org:5046/research/dublin_core/summary.html

[AACR2] American Library Association, Anglo-American Cataloging Rules, 2nd edition.

[MARC] Library of Congress, MARC Standards, http://lcweb.loc.gov/marc/marc.html

[LEVY] David Levy, Cataloging in the Digital Order, Digital Libraries '95, http://csdl.tamu.edu/DL95/contents.html

[XXX] xxx.lanl.gov e-Print archive, http://xxx.lanl.gov
 
[STEFIK] Mark Stefik, Letting Loose the Light: Igniting Commerce in Electronic Publication, in Internet Dreams Archetypes, Myths, and Metaphors, MIT Press, 1996

[SMITH] Terence R. Smith, The Meta-Information Environment of Digital Libraries, D-Lib Magazine, July/August 1996, http://www.dlib.org/dlib/july96/new/07smith.html

[ARCHIE] Alan Emtage and Peter Deutsch, Archie -- an Electronic Directory Service for the Internet, USENIX Winter 1992 Technical Conference Proceedings, http://www.bunyip.com/research/papers/1992/archie-usenix.ps

[DC] Stuart L. Weibel and Carl Lagoze, An Element Set to Support Resource Discovery: The State of the Dublin Core, to appear in Journal of Digital Libraries, Draft Copy available at http://www2.cs.cornell.edu/lagoze/papers/jodl.html

[W3C] Jim Miller, Paul Resnick and David Singer, Rating Services and Rating Systems (and their Machine Readable Descriptions), Platform for Internet Content Selection Version 1.1, May 1996, http://www.w3.org/pub/WWW/PICS/services.html

[PICSWG] Bob Schloss and Eric Miller, PICS 1.x Changes to support digital libraries, a talk at PICS WG meeting, January 1997, http://www.w3.org/pub/WWW/PICS/970113/DigiLib/pics970113.htm

[WF] Carl Lagoze, Clifford A. Lynch, and Ron Daniel Jr., The Warwick Framework: A Container Architecture for Aggregating Sets of Metadata, Cornell University Technical Report TR 96-1593,
http://cs-tr.cs.cornell.edu/Dienst/UI/2.0/Describe/ncstrl.cornell/TR96-1593

[FGDC] Federal Geographic Data Committee, Content Standards for Digital Geospatial Metadata, http://geochange.er.usgs.gov/pub/tools/metadata/standard/metadata.html

[BISHOP] Ann Peterson Bishop, Working Towards an Understanding of Digital Library Use, D-Lib Magazine, October 1995, http://www.dlib.org/dlib/october95/10bishop.html

[PAYETTE] Sandra D. Payette and Oya Y. Rieger, Supporting Scholarly Inquiry: Incorporating Users in the Design of the Digital Library, to appear in Journal of Academic Libraries

[SUMMERS] Kristen Summers and Daniela Rus, Using Non-Textual Cues for Electronic Document Browsing, in Digital Libraries: Current Issues, Lecture Notes in Computer Science, Springer-Verlag 1995, http://www.cs.cornell.edu/Info/People/summers/segment.html.

[SALTON] Gerard Salton and Amit Singhal, Automatic Text Theme Generation and the Analysis of Text Structure, Cornell Computer Science Technical Report TR94-1438,
http://cs-tr.cs.cornell.edu:80/Dienst/UI/2.0/Describe/ncstrl.cornell%2fTR94-1438

[UMDL] Michael P. Wellman, Edmund H. Durfee and William P. Birmingham, The Digital Library As Community of Information Agents, to appear in IEEE Expert, June 1996, http://ai.eecs.umich.edu/people/wellman/pubs/expert96.html

[DOAN] Khoa Doan, Catherine Plaisant, and Ben Shneiderman, Query Previews in Networked Information Systems, Technical Report CAR-TR-788, University of Maryland, September 1995, ftp://ftp.cs.umd.edu/pub/papers/papers/3524/3524.ps.Z

Copyright © 1997 Carl Lagoze


hdl:cnri.dlib/june97-lagoze