OCLC Internet Cataloging Project Colloquium
Field Report
By Amanda Xu
MIT Libraries
When we joined the OCLC Intercat Project, our first concern was the feasibility of using MARC formats and AACR2 for describing and accessing Internet resources of various types. Are there any other information discovery and retrieval standards or techniques that can adequately replace our traditional cataloging tools?
This field report searches for answers via titles that we contributed to the Project, by mapping the data elements and the data structure designed for describing Internet resources among metadata standards such as the Dublin Core Metadata Element Set, the TEI Header, the Uniform Resource Characteristic (URC), and the USMARC format. This report compares the relative flexibility, compatibility, comprehensiveness, reliability, and sophistication of data structure for each of these standards.
The report also evaluates primary search tools currently available on the Internet, such as robot-based search engines, general purpose catalogs, and locally created "classification" schemes (e.g., library Web pages) that arrange resources alphabetically, chronologically, geographically, by subject, or in various combinations thereof. While these engines and catalogs are powerful tools for retrieving massive amounts of data, their search results are usually indiscriminate, so that the user must spend a great deal of time identifying worthwhile and reliable information. The typical library Web page, on the other hand, presents evaluated materials, but usually offers limited access points, and segregates Internet resources from the library's catalog.
Internet resources organized by MARC formats and AACR2 offer important benefits: (1) they have been filtered by library subject selectors to suit the needs of a given user community; (2) they have been controlled formally and concisely via bibliographic description, authority control, and subject analysis; (3) the automated library systems in which they reside have been developed to handle sophisticated searches of very large data quantities. But most important, Internet resources can be integrated with the millions of bibliographic entities already indexed with MARC formats.
This experience helps us to understand the pros and cons of Internet
information access standards and technology, with emphasis on the
value of USMARC format and AACR2. It also helps us to identify the
limitations of, and potential ways to improve, USMARC format and AACR2.
Introduction
When we joined the OCLC Intercat Project, our first concern was the feasibility of using USMARC format and AACR2 for describing and accessing Internet resources of various types. After all, we now work in an environment crowded with powerful Internet search engines, usually available to us free of charge. Facing increased pressure on our technical services budget and, like most other research libraries, a cataloging backlog, we at MIT had to ask ourselves whether traditional cataloging approaches really made sense, or were necessary at all, at a time when automatic indexing, though imperfect, might provide our users ENOUGH access to networked materials with little or no expensive cataloger intervention.
At the same time, newly emerging metadata standards, such as the TEI header, the Uniform Resource Characteristic (URC), and the Dublin Core Metadata Element Set, have been specifically developed for electronic information discovery and retrieval. Can they adequately replace our traditional cataloging tools such as USMARC and AACR2?
In this field report, we summarize our findings, first by briefly
evaluating the primary Internet search tools, and then by mapping
data elements and data structure among these newer metadata
standards so as to compare them to USMARC.
Internet Search Tools
A number of Internet search tools are commonly used by the library community. Many libraries, for example, maintain their own Web pages of selected Internet resources. This approach presents filtered and arranged information geared to a specific user community, but it usually offers a poor user interface and limited search options, may require surfing hyperlinks page after page, and segregates Internet resources from the library's catalog. Furthermore, since the creation and maintenance of these Web pages depend entirely on scarce and expensive human resources, the databases produced by this approach are seldom comprehensive.
Another prominent tool is the robot-based search engine, such as Lycos or Harvest. These engines automatically navigate through Web spaces, searching for hyperlinks, retrieving relevant documents, indexing them, and creating a database out of them.
An advantage of this type of search is quantitative recall. One of the drawbacks is lack of precision. With the size of the Internet expanding exponentially, it will become more and more difficult to sift through these massive indiscriminate search results. At the same time, the databases created by these search engines could potentially become bigger than the Internet resources themselves.
Furthermore, none of these search engines adequately address the issue
of indexing non-textual files such as executable, image, compressed,
sound, or moving image files. Such problems lead us to conclude that
using metadata--data about data--is an important long-term
approach for finding information on the Internet.
Metadata Approaches
Metadata refers to a set of data elements that can be used to describe
and represent information objects. At MIT, we mapped data elements
for the electronic publication City of Bits into TEI Header, Dublin
Core, and URC metadata standards. We compared their flexibility in
creating new data elements, compatibility with USMARC records,
comprehensiveness in the representation of electronic objects, their
reliability for retrieval, and the sophistication of their data
structure with that of USMARC. This comparison helped us to determine
the feasibility of using AACR2 and USMARC to access information on the
Internet.
TEI Header
The Text Encoding Initiative, or TEI, uses SGML as the basis for
encoding and interchange of machine-readable texts among research
communities. The TEI header, whether directly attached to a
TEI-conformant text or not, consists of (1) a file description, which
describes the electronic object and its sources, (2) an encoding
description which shows how the text was encoded along with editorial
decisions made during the markup of the document, (3) a profile
description which includes contextual nonbibliographic information,
and (4) a revision description which documents each change made to the
electronic text.
TEI Header and USMARC Comparison
The encoding scheme of the TEI header is flexible. Its only mandatory data elements are the title statement, publication statement, and source description statement, which are the components of the file description. Descriptive elements can be defined and added locally by modifying the SGML DTD, and encoders can decide which information to include according to their local needs.
The header can either be directly attached to the document it describes or be freestanding. Since the data elements are labeled intuitively, the encoding procedure requires less training than USMARC.
The reliability of TEI headers for information retrieval is variable. When the header is attached to a TEI encoded text, the user can search not only the full text but the SGML tags as well. However, TEI records may be unreliable for retrieval because of TEI's flexibility in encoding. This will affect data exchange among cooperating cataloging institutions unless there is agreement among them regarding the form and content of specific data elements.
The TEI header's data structure is looser and less compact than that of USMARC. In the publication statement, for instance, a descriptive label is required for each separate data element (publication place, publisher, and date of publication), while USMARC uses a single tag, 260, with three subfields to represent the same information. USMARC carries data very concisely and efficiently, so a TEI header may be longer than a USMARC record for the same electronic object.
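The contrast can be made concrete with a minimal sketch. It renders the same three publication values once with a label per element, as in the TEI header, and once as a single 260 field. The function names are hypothetical, and the explicit $a subfield code is an assumption about display; the values are those of the City of Bits records in the appendix.

```python
def tei_publication_stmt(pubplace: str, publisher: str, date: str) -> str:
    """Label each element separately, as in the TEI header's publication statement."""
    return (
        "<publicationstmt>"
        f"<publisher>{publisher}</publisher>"
        f"<pubplace>{pubplace}</pubplace>"
        f"<date>{date}</date>"
        "</publicationstmt>"
    )


def marc_260(pubplace: str, publisher: str, date: str) -> str:
    """Carry the same three values in one field: tag 260 with subfields $a, $b, $c."""
    return f"260 $a {pubplace} : $b {publisher}, $c {date}."


tei = tei_publication_stmt("Cambridge, Mass.", "MIT Press", "1995")
marc = marc_260("Cambridge, Mass.", "MIT Press", "1995")
print(tei)
print(marc)
print(len(tei) > len(marc))  # the labelled form is the longer of the two
```

The comparison of string lengths is only a rough proxy for compactness, but it shows why a TEI header tends to run longer than the equivalent USMARC record.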
The only relationship between bibliographic entities directly recorded
in the TEI header is the source description statement, which contains
a bibliographic record for the original text. We have discovered no
provision in the standard to connect the electronic text to other
related works. USMARC records, on the other hand, can document
complex and sophisticated relations among bibliographic entities.
Potential for Libraries
The TEI header and USMARC can both be used to describe electronic text.
Encoding in both formats is time-consuming and labor-intensive. The TEI
header can describe electronic objects more vividly than USMARC, but
USMARC strongly supports bibliographic data sharing and distribution.
Libraries will be able to use TEI headers for describing Internet
resources provided that (1) the encoding forms and specifications are
standardized as strictly as USMARC, and (2) library systems can
accommodate automatic mapping from TEI headers to USMARC records.
Dublin Core
The Dublin Core is a recently proposed metadata standard for
describing networked resources and assisting in their discovery and
retrieval. It is a simple set of thirteen data elements. The Core's
flexibility allows these elements to be modified and expanded. It is
so intuitive in design that information providers themselves may
encode their own documents at the point of creation. Indeed, this is
one of the developers' major goals.
Dublin Core and USMARC Comparison
The core elements are designed to assist machine harvesting and indexing in the networked environment. Another goal of the developers is to accommodate mapping the core into USMARC records or other standards, such as TEI.
The thirteen elements originally identified emphasize access over description. But since the source data is so uncontrolled, its reliability for retrieval is questionable.
The data structure of the Dublin Core is simple and unrefined, geared
toward supplying a small number of generally applicable elements.
Dublin Core represents an advance over unaided machine indexing and
searching. Author-supplied metadata will be better than no metadata,
but it still is not the same thing as a professionally prepared
catalog record.
Potential for Libraries
Information objects on the Internet with attached meta-information like Dublin Core's will facilitate great improvements in machine indexing. However, whether the Dublin Core will provide reliable retrieval depends largely on how information providers implement it.
By definition, the Core's encoding forms and specifications will never
be standardized as strictly as USMARC, but it is conceivable that
library systems could adapt elements of the Dublin Core for import and
expansion into full USMARC records.
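As a rough illustration of such an import, the sketch below converts a few Dublin Core elements into skeleton USMARC fields that a cataloger could then expand. The crosswalk table is a simplified assumption made for this example, not an official mapping, and the function name is hypothetical.

```python
# Hypothetical, simplified crosswalk: Dublin Core element -> (USMARC tag, subfield).
CROSSWALK = {
    "Title": ("245", "a"),
    "Author": ("100", "a"),
    "Subject": ("650", "a"),
    "Publisher": ("260", "b"),
    "Date": ("260", "c"),
}


def dublin_core_to_marc(elements):
    """Split (element, value) pairs into mapped MARC data and unmapped leftovers."""
    mapped, unmapped = [], []
    for element, value in elements:
        target = CROSSWALK.get(element)
        if target is None:
            unmapped.append((element, value))  # left for a cataloger to review
        else:
            tag, subfield = target
            mapped.append((tag, subfield, value))
    mapped.sort(key=lambda field: field[0])  # stable sort keeps input order within a tag
    return mapped, unmapped


record = [
    ("Title", "City of Bits: Space, Place, and the Infobahn"),
    ("Author", "Mitchell, William J."),
    ("Publisher", "MIT Press"),
    ("Date", "1995"),
    ("Language", "English"),
]
mapped, unmapped = dublin_core_to_marc(record)
for tag, subfield, value in mapped:
    print(tag, f"${subfield}", value)
print("Needs review:", unmapped)
```

Anything the table does not cover is returned separately rather than discarded, which reflects the point that author-supplied metadata is a starting point for a full record rather than a substitute for one.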
URC
Unlike the other standards we have evaluated, the URC focuses overwhelmingly on guaranteeing machine retrieval of electronic resources. Here are some definitions that helped us to understand the alphabet soup surrounding the URC.
URN stands for Uniform Resource Name, a proposal for assigning persistent, unique, location-independent identifiers to networked information objects, similar to the ISBN in the publishing world.
URL stands for Uniform Resource Locator and is the electronic address for a networked resource.
URC stands for Uniform Resource Characteristic, and has been proposed to serve as a connection between URNs and URLs. If a URL changes, authorized users can go into the URC service to modify the URL associated with the URN. The URN stays the same even though the URL may change.
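The relationship among the three can be sketched as a small lookup service. This is only an illustration under stated assumptions: the class and method names are hypothetical, the URN is invented (no URN had been assigned to the resource), authorization of users is omitted, and the two URLs are those appearing in the appendix records.

```python
class UrcService:
    """Toy stand-in for a URC service: maps a persistent URN to its current URL."""

    def __init__(self):
        self._locations = {}

    def register(self, urn: str, url: str) -> None:
        self._locations[urn] = url

    def update_location(self, urn: str, new_url: str) -> None:
        # Only the URL changes; the URN used to look it up stays the same.
        if urn not in self._locations:
            raise KeyError(f"unknown URN: {urn}")
        self._locations[urn] = new_url

    def resolve(self, urn: str) -> str:
        return self._locations[urn]


service = UrcService()
urn = "urn:example:city-of-bits"  # hypothetical identifier
service.register(urn, "http://mitpress.mit.edu/City_of_Bits/WWWPreamble.html")
service.update_location(
    urn, "http://www-mitpress.mit.edu:80/City_of_Bits/WWWPreamble.html"
)
print(service.resolve(urn))
```

A catalog record that stored the URN instead of the URL would not need editing when the resource moved; only the entry in the URC service would change.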
The URC may contain meta-information such as author, title, publisher, subject, and so on, which could assist in resource discovery and retrieval. It may also include other data elements, such as electronic signature and review information to ensure the veracity of the resource, an access element for usage restrictions, and a version element for revision history.
The URC service would include bibliographic search capabilities at the
URC search site. In the future, it may be possible to connect to a
variety of URC servers, along with some dedicated sites such as OCLC
and the Library of Congress.
URC and USMARC Comparison
Of the metadata standards that we have reviewed, the URC contains the fewest bibliographic elements (10 so far). Its emphasis, instead, is on assuring machine retrieval, authenticity of resources, and an access-restriction capability.
At this time, data elements in the URC are identified by descriptive
labels, such as author, title, and abstract, and the data structure
is loose. We are aware of no provision in the proposal for including
bibliographic links within URC records.
Potential for Libraries
Both the URC and USMARC can be used to document electronic objects. But the meta-information proposed for the URC is far sketchier than that in USMARC.
As with the Dublin Core and TEI header, consistency will be required among cooperating servers regarding which metadata elements should be included in URC records.
Uniquely among these standards, the URC guarantees retrieval of a
resource whose name is known, by mapping URNs to URLs, and it
accommodates resource validation via digital signatures and seals of
approval. Thus the URC can assist librarians in cataloging and
archiving quality information.
USMARC
USMARC is actually a complex set of standards for the description, storage, exchange, manipulation, and retrieval of machine-readable bibliographic data. It is highly developed and refined, with individual data elements defined at a granular level.
Originally developed in the 1960s for the description of printed
books, USMARC is still being adapted to provide description, access
and location information for networked resources. With the
introduction of the 856 field, which allows a hyperlink between the
MARC record and the electronic text it describes, USMARC has become a
feasible standard for the discovery and retrieval of networked resources.
USMARC and Newer Metadata Standard Comparison
The standard is strictly controlled; changes or additions to USMARC take years and painstaking coordination among institutions, industry, and utilities. For example, those of us in the OCLC Intercat Project are familiar with the underscore and tilde problem in the 856 field. The USMARC character set does not include the spacing versions of the tilde and underscore that are widely used in URLs. It is now a full year since MARBI approved a proposal to provide spacing versions of these characters in USMARC, but the change has not yet been implemented because of the implications in a standard that facilitates data exchange among library systems on an international scale. The result is inoperable URLs in MARC records.
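One conceivable workaround, offered here only as a sketch and not as the project's practice, is to store the two problem characters in percent-encoded form, which a Web client decodes back to the original characters. The function name is hypothetical.

```python
from urllib.parse import unquote


def escape_for_856(url: str) -> str:
    """Replace the spacing tilde and underscore with their percent-encoded forms."""
    return url.replace("~", "%7E").replace("_", "%5F")


original = "http://mitpress.mit.edu/City_of_Bits/"
stored = escape_for_856(original)
print(stored)
print(unquote(stored) == original)  # decoding restores the original URL
```

The stored form uses only characters that do not depend on the missing spacing tilde and underscore, at the cost of being harder for a person to read.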
USMARC allows for rich analysis of the content of information objects. It strongly supports both access and description. This is especially true for full-level encoding, which requires professional encoders to follow strictly a standard cataloging code such as the Anglo-American Cataloging Rules, 2nd ed., rev. (AACR2R), a controlled vocabulary such as the Library of Congress Subject Headings (LCSH), and a standard classification scheme such as the Library of Congress Classification (LCC). Although these cataloging tools can also be applied to the newer standards, none of those standards is designed to apply them as strictly and fully as USMARC does. Unlike the newer metadata standards, USMARC also includes an authorities format, which facilitates successful retrieval.
Professional involvement in data creation, USMARC's strict encoding guidelines, and the sophistication of well-developed library systems further enhance access control. Automated library systems use USMARC to exchange data with one another. Most recently developed library systems, such as Web-based OPACs and some window-based OPACs, permit hyperlinks from the Electronic Location and Access (856) field to the primary information, which resides on the Internet. With full Z39.50 connections added to this kind of interface, users can not only perform traditional searches, such as keyword and Boolean searches, but also use new features such as simultaneous searches of multiple databases, with merged or de-duplicated search results drawn from both print and electronic resources.
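The merging and de-duplication step can be sketched as follows. This is a minimal illustration, not a Z39.50 client: the result sets are plain lists standing in for what each database might return, and the matching key (title, author, and medium, compared without regard to case) is an assumption chosen for the example.

```python
def merge_results(*result_sets):
    """Merge result sets from several databases, dropping duplicate records."""
    seen = set()
    merged = []
    for results in result_sets:
        for record in results:
            # Hypothetical match key; real systems use more careful matching.
            key = (
                record["title"].casefold(),
                record["author"].casefold(),
                record["medium"],
            )
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return merged


local_catalog = [
    {"title": "City of bits", "author": "Mitchell, William J.", "medium": "print"},
]
remote_catalog = [
    {"title": "City of Bits", "author": "Mitchell, William J.", "medium": "print"},
    {"title": "City of bits", "author": "Mitchell, William J.", "medium": "electronic"},
]
for record in merge_results(local_catalog, remote_catalog):
    print(record["title"], "/", record["medium"])
```

Including the medium in the key keeps the print and electronic editions as separate hits while collapsing the print record that both databases hold.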
USMARC's data structure is compact and economical, but still allows
for the encoding of complex and sophisticated relationships among
information objects. For instance, it allows encoders to document
sequential relationships, referencing relationships, and so on, among
bibliographic entities. Thus, users can relate one information object
to all its related objects by following the bibliographic links within
the record. None of the newer metadata standards supports such
detailed and complete tracing of bibliographic relationships.
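Following such links can be sketched as a simple traversal. In this illustration the catalog is a dictionary keyed by control number, and each entry lists (linking tag, target control number) pairs. The function name is hypothetical, and the 776 (Additional Physical Form Entry) links between the electronic and print editions of City of Bits are an assumption; the sample record in the appendix records that relationship in a 530 note instead.

```python
from collections import deque


def related_records(catalog, start):
    """Return every record reachable from `start` by following linking fields."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for _tag, target in catalog.get(current, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    seen.discard(start)  # report only the related records, not the starting one
    return sorted(seen)


# Control numbers are the OCLC numbers given in the appendix records.
catalog = {
    "32437789": [("776", "33278259")],  # electronic ed. -> print ed. (assumed link)
    "33278259": [("776", "32437789")],  # print ed. -> electronic ed. (assumed link)
}
print(related_records(catalog, "32437789"))
```

The breadth-first traversal also handles chains of links, such as a serial with several successive titles, without revisiting a record twice.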
Potential for Libraries
With the few titles we cataloged during the Intercat Project, we found that USMARC is indeed a feasible vehicle for metadata on networked resources. In terms of flexibility for creating new data elements, USMARC ranked lowest; in comprehensiveness for representing electronic objects, it ranked second, after the TEI header; but in compatibility with existing bibliographic databases, reliability for retrieval, and sophistication of data structure, it ranked highest of all the standards we compared.
Internet search engines and library Web pages can provide access, more or less successfully, to items on the net. But libraries are about more than this. They are value-added information centers where information in every format should be carefully selected, organized, preserved, and disseminated. If a networked resource is of high quality, of interest to our users, and relevant to our collections, it belongs in our catalog. In this way, we present our users with a more accurate and complete picture of the information available in a given subject area.
The fact that USMARC is labor-intensive, time-consuming, expensive,
and sometimes misused does not prevent it from being useful for
providing access to information on the Internet. USMARC offers us an
immediately available tool for integrating important, relevant,
high-quality networked resources into our catalogs.
Conclusion
The emerging standards function differently than USMARC. Dublin Core and URC are still in the proposal stage. Even when they are fully implemented, they'll only serve as a basis for full catalog records. The TEI header's data elements are particularly suited for use with networked resources, but the standard lacks a provision for authority control, and this has negative implications for data exchange.
USMARC is such a ubiquitous metadata standard that publishers, vendors, suppliers of automated systems, and indeed every aspect of library operation depend upon it. Although it is difficult to modify the standard, we must continue the legacy represented by the millions of USMARC records that are the heart of library service itself.
As we know, USMARC was originally developed for print resources, but it must be updatable on a constant basis to accommodate information in emerging formats. Our cataloging approaches must be adaptable as well.
During our exercise in data element and data structure mapping, we
envisioned a future where selected Internet resources that meet the
criteria of a certain user community will still be cataloged with
USMARC. But we also look forward to a time when automated systems
are capable of unassisted metadata conversion from one standard to
another. This will help us to avoid duplication of effort and maximize
user access to valuable networked resources.
References
Caplan, Priscilla. 1995. "You Call It Corn, We Call it Syntax-Independent Metadata for Document-Like Objects." The Public-Access Computer Systems Review 6, no.4 (1995). URL:http://info.lib.uh.edu/pr/v6/n4/capl6n4.html
Cheong, Fah-Chun. 1996. Spiders for Indexing the Web. In Internet Agents: Spiders, Wanderers, Brokers, and Bots. Indianapolis, Indiana: New Riders Publishing.
Daniel, Ron Jr. 1994. Proposed URC External Representation.
URL: http://www.acl.lanl.gov/URI/ExtRep/urc0.html
Daniel, Ron Jr., Michael Mealling. June 1995. URC Scenarios and Requirements. Internet Engineering Task Force. Internet-Draft, expires Dec. 29, 1995. URL: ftp://ds.internic.net/internet-drafts/draft-ietf-uri-urc-req-00.txt
Gaynor, Edward. 1994. Cataloging Electronic Texts: the University of Virginia Library Experience. Library Resources and Technical Services 38 (4): 403-413.
Giordano, Richard. 1994. The Documentation of Electronic Texts Using Text Encoding Initiative Headers: An Introduction. Library Resources and Technical Services 38 (4): 389-401.
Guenther, Rebecca S. 1994. The Challenges of Electronic Texts in the Library: Bibliographic Control and Access. In Literary Texts in an Electronic Age: Scholarly Implications and Library Services, ed. Brett Sutton. 149-172. Papers Presented at the 1994 Clinic on Library Applications of Data Processing, Apr.10-12, 1994, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign.
IFLA Study Group on Functional Requirements of Bibliographic Records. Draft July 31, 1995. Functional Requirements of Bibliographic Records.
Library of Congress. Network Development and MARC Standards Office. May 1995. Discussion Paper No. 86. Mapping the Dublin Core Metadata Elements to USMARC. URL: gopher://marvel.loc.gov/00/.listarch/usmarc/dp86.doc
Library of Congress. Network Development and MARC Standards Office. 1994. USMARC Specifications for Record Structure, Character Sets, and Exchange Media.
Liu, Jian. Sept. 1995. Understanding WWW Search Tools. URL: http://www.indiana.edu/~librcsd/search/
Patton, Glenn. [Glenn_patton@oclc.org]. "The tilde and the underscore." In [intercat@oclc.org], 20 June 1995.
Sperberg-McQueen, C.M., and Lou Burnard, eds. 1994. TEI P3: Guidelines for Electronic Text Encoding and Interchange. Oxford and Chicago: The Text Encoding Initiative.
Sha, Vianne. 1995. "Cataloging Internet Resources: the Library Approach." The Electronic Library 13, no. 5 (1995).
Vizine-Goetz, Diane, Jean Godby, Mark Bendig. 1995. Spectrum: a Web-based Tool for Describing Electronic Resources. Computer Networks and ISDN Systems 27 (1995) 985-1001.
Weibel, Stuart, Jean Godby, Eric Miller. 1995. OCLC/NCSA Metadata Workshop Report. URL: http://www.oclc.org:5046/oclc/research/conferences/metadata/dublin_core_report.html
Winship, Ian R. 1995. World Wide Web Searching Tools - An Evaluation. VINE (99) 1995, 49-54. URL: http://www.bubl.bath.ac.uk/BUBL/IWinship.html
The USMARC Formats: Background and Principles.
URL: ftp://wais.com/pub/protocol/USMARC.txt
Appendixes: Data Element and Data Structure Mapping
TEI Header Record for City of Bits
<teiheader>
  <filedesc>
    <titlestmt>
      <title>City of bits: Space, Place, and the Infobahn</title>
      <author>William J. Mitchell</author>
    </titlestmt>
    <editionstmt>World-Wide Web ed.</editionstmt>
    <publicationstmt>
      <publisher>MIT Press</publisher>
      <pubplace>Cambridge, Mass.</pubplace>
      <idno type=oclc>32437789</idno>
      <date>1995</date>
    </publicationstmt>
    <notesstmt>
      <note>856 7 $u URL:http://mitpress.mit.edu/City_of_Bits/WWWPreamble.html</note>
    </notesstmt>
    <sourcedesc>
      <bibfull>
        <titlestmt>
          <author>Mitchell, William J.</author>
          <title>City of bits</title>
          <title type=sub>space, place, and the infobahn</title>
        </titlestmt>
        <extent>225 p. : ill., maps, plans ; 24 cm.</extent>
        <imprintstmt>
          <pubplace>Cambridge, Mass.</pubplace>
          <publisher>MIT Press</publisher>
          <idno type=isbn>0262133091</idno>
          <idno type=oclc>33278259</idno>
          <date>1995</date>
        </imprintstmt>
      </bibfull>
    </sourcedesc>
  </filedesc>
  <encodingdesc>NA</encodingdesc>
  <profiledesc>
    <textclass>
      <keywords scheme=lcsh>
        <list>
          <item>Computer networks</item>
          <item>Information technology</item>
          <item>Virtual reality</item>
          <item>Computers and civilization</item>
        </list>
      </keywords>
      <classcode scheme=lc>TK5105.5.M57</classcode>
    </textclass>
  </profiledesc>
  <revisiondesc>NA</revisiondesc>
</teiheader>
Dublin Core Record for City of Bits
Subject:
  scheme=keywords:
    Electronically mediated environments
    Cyberspace
    Urbanism
    Architecture
  scheme=LCSH:
    Computer networks
    Information technology
    Virtual reality
    Computers and civilization
Title: City of Bits: Space, Place, and the Infobahn
Author: Mitchell, William J.
Publisher: MIT Press
OtherAgents:
  otherAgent role=WWW team member: Stevenson, Daniel C.
  otherAgent role=WWW team member: Ehling, Teresa
  otherAgent role=WWW team member: Kalin, Jeffrey T.
  otherAgent role=WWW team member: Schoonover, Regina
  otherAgent role=WWW team member: Beamish, Anne
  otherAgent role=WWW team member: Ishizake, Suguru
  otherAgent role=WWW team member: Urbanowski, Frank
Date: 1995
Identifiers:
  scheme=ISBN: 0262133091
  scheme=URL: http://www-mitpress.mit.edu:80/City_of_Bits/WWWPreamble.html
Object type: book
Form: Text/HTML, Video/(MPEG, Quicktime), Image/GIF
Language: English
Source: type=print ed.: City of Bits: Space, Place, and Infobahn
URC Record for City of Bits
URC.0 {
  // This is a hypothetical record
  URN: Universal Resource Name (not yet available)
  Title: City of Bits: Space, Place, and the Infobahn
  Author {
    Name: Mitchell, William J.
    Email: wjm@mit.edu
    Phone: 617-253-4402
    Facsimile: 617-253-9417
  }
  Subject: electronically mediated environments; cyberspace; urbanism; architecture
  Abstract {
    Textual: A hyperlinked exploration of the "virtual city" which is now emerging through our burgeoning use of the information superhighway.
  }
  Location {
    URL: http://mitpress.mit.edu/City_of_Bits/WWWPreamble.html
    Content type: text/html, image/GIF, video/MPEG, Quicktime
    Content-length:
    Signature (Not yet available)
  }
  Review {
    // There is no URN available for the media reviews.
    // This URL will lead to the content of them.
    URL: http://www-mitpress.mit.edu/City_of_Bits/reviews.html
  }
  Version: World-Wide Web ed.
}
USMARC Record for City of Bits
000 cmm Ia
001 32437789
003 OCoLC
005 19000000003748.0
008 950508s1995 maun d eng d
040 MYG $c MYG
090 TK5105.5 $b .M57 1995b
100 1 Mitchell, William J.
245 10 City of bits $h [interactive multimedia] : $b space, place, and the infobahn / $c by William J. Mitchell.
250 World-Wide Web ed.
256 Computer data.
260 [Cambridge, Mass.] : $b MIT Press, $c 1995.
516 Text (HTML), images (GIF), and video (MPEG, QuickTime).
538 System requirements: Web browser; video viewer such as QuickTime or MPEGPlay required for video applications.
538 Mode of access: Internet. Address: http://mitpress.mit.edu/CityofBits/.
500 Title from title screen.
530 Also available in printed ed.
520 A hyperlinked exploration of the "virtual city" which is now emerging through our burgeoning use of the information superhighway. Re-examines architecture and urbanism in light of our increasingly digital means of communication.
505 0 1. Pulling glass -- 2. Electronic agoras -- 3. Cyborg citizens -- 4. Recombinant architecture -- 5. Soft cities -- 6. Bit biz -- 7. Getting to the good bits.
504 Includes bibliographical references.
650 0 Computer networks.
650 0 Information technology.
650 0 Virtual reality.
650 0 Computers and civilization.
856 7 $2 http $z http://mitpress.mit.edu/City_of_Bits/ $u http://purl.oclc.org/OCLC/OLUC/32437789/1