OCLC Internet Cataloging Project Colloquium
Field Report

Accessing Information on the Internet

Feasibility Study of USMARC Format and AACR2

By Amanda Xu
MIT Libraries

Abstract
Introduction
Internet Search Tools
Metadata Approaches
TEI Header
Dublin Core
URC
USMARC
Conclusion
References
Appendixes: Data Element and Data Structure Mapping

Abstract

When we joined the OCLC Intercat Project, our first concern was the feasibility of using MARC formats and AACR2 for describing and accessing Internet resources of various types. Are there any other information discovery and retrieval standards or techniques that can adequately replace our traditional cataloging tools?

This field report searches for answers via titles that we contributed to the Project, by mapping the data elements and the data structure designed for describing Internet resources among metadata standards such as the Dublin Core Metadata Element Set, the TEI Header, the Uniform Resource Characteristic (URC), and the USMARC format. This report compares the relative flexibility, compatibility, comprehensiveness, reliability, and sophistication of data structure for each of these standards.

The report also evaluates primary search tools currently available on the Internet, such as robot-based search engines, general purpose catalogs, and locally created "classification" schemes (e.g., library Web pages) that arrange resources alphabetically, chronologically, geographically, by subject, or in various combinations thereof. While these engines and catalogs are powerful tools for retrieving massive amounts of data, their search results are usually indiscriminate, so that the user must spend a great deal of time identifying worthwhile and reliable information. The typical library Web page, on the other hand, presents evaluated materials, but usually offers limited access points, and segregates Internet resources from the library's catalog.

Internet resources organized by MARC formats and AACR2 offer important benefits: (1) they have been filtered by library subject selectors to suit the needs of a given user community; (2) they have been controlled formally and concisely via bibliographic description, authority control, and subject analysis; (3) the automated library systems in which they reside have been developed to handle sophisticated searches of very large data quantities. But most important, Internet resources can be integrated with the millions of bibliographic entities already indexed with MARC formats.

This experience helps us to understand the pros and cons of Internet information access standards and technology, with emphasis on the value of USMARC format and AACR2. It also helps us to identify the limitations of, and potential ways to improve, USMARC format and AACR2.

Introduction

When we joined the OCLC Intercat Project, our first concern was the feasibility of using USMARC format and AACR2 for describing and accessing Internet resources of various types. After all, we now work in an environment crowded with powerful Internet search engines, usually available to us free of charge. Under increased pressure on our technical services budget and, like most other research libraries, faced with cataloging backlogs, we at MIT had to ask ourselves whether traditional cataloging approaches really made sense, or were necessary at all, at a time when automatic indexing, though imperfect, might provide our users ENOUGH access to networked materials with little or no expensive cataloger intervention.

At the same time, newly emerging metadata standards, such as the TEI header, the Uniform Resource Characteristic (URC), and the Dublin Core Metadata Element Set, have been specifically developed for electronic information discovery and retrieval. Can they adequately replace our traditional cataloging tools such as USMARC and AACR2?

In this field report, we summarize our findings, first by briefly evaluating the primary Internet search tools, and then by mapping data elements and data structures among these newer metadata standards so as to compare them with USMARC.

Internet Search Tools

A number of Internet search tools are commonly used by the library community. Many libraries, for example, maintain their own Web pages of selected Internet resources. This approach presents filtered and arranged information geared to a specific user community, but it usually offers a poor user interface and limited search options, may require surfing hyperlinks page after page, and segregates Internet resources from the library's catalog. Furthermore, since the creation and maintenance of these Web pages depend entirely on scarce and expensive human resources, the databases produced by this approach are seldom comprehensive.

Another prominent tool is the robot-based search engine, such as Lycos or Harvest. These engines automatically navigate through Web spaces, searching for hyperlinks, retrieving relevant documents, indexing them, and creating a database out of them.
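The crawl-and-index cycle described above can be illustrated with a minimal Python sketch. It is not how Lycos or Harvest is implemented. The in-memory PAGES dictionary, the example.org URLs, and the names LinkAndTextParser and crawl are all assumptions made for illustration; a real robot would fetch pages over the network and apply far more elaborate indexing.

```python
from collections import defaultdict, deque
from html.parser import HTMLParser

# Hypothetical in-memory "Web": URL -> HTML. A real robot would fetch over HTTP.
PAGES = {
    "http://example.org/": '<a href="http://example.org/a">A</a> city of bits',
    "http://example.org/a": '<a href="http://example.org/">home</a> bits and networks',
}


class LinkAndTextParser(HTMLParser):
    """Collects hyperlink targets and lowercased words from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.extend(word.lower() for word in data.split())


def crawl(start_url, pages):
    """Breadth-first traversal that builds an inverted index: word -> set of URLs."""
    index = defaultdict(set)
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        parser = LinkAndTextParser()
        parser.feed(pages[url])
        parser.close()  # flush any text still buffered by the parser
        for word in parser.words:
            index[word].add(url)
        for link in parser.links:
            # Follow only links we can retrieve and have not visited yet.
            if link in pages and link not in seen:
                seen.add(link)
                queue.append(link)
    return index


index = crawl("http://example.org/", PAGES)
```

Because every word of every page is indexed without selection, a query for a common word returns every page that contains it, which is the recall-without-precision behavior discussed next.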

An advantage of this type of search is quantitative recall. One of the drawbacks is lack of precision. With the size of the Internet expanding exponentially, it will become more and more difficult to sift through these massive indiscriminate search results. At the same time, the databases created by these search engines could potentially become bigger than the Internet resources themselves.

Furthermore, none of these search engines adequately addresses the issue of indexing non-textual files such as executable, image, compressed, sound, or moving image files. Such problems lead us to conclude that using metadata--data about data--is an important long-term approach for finding information on the Internet.

Metadata Approaches

Metadata refers to a set of data elements that can be used to describe and represent information objects. At MIT, we mapped data elements for the electronic publication City of Bits into TEI Header, Dublin Core, and URC metadata standards. We compared their flexibility in creating new data elements, compatibility with USMARC records, comprehensiveness in the representation of electronic objects, their reliability for retrieval, and the sophistication of their data structure with that of USMARC. This comparison helped us to determine the feasibility of using AACR2 and USMARC to access information on the Internet.

TEI Header

The Text Encoding Initiative, or TEI, uses SGML as the basis for encoding and interchange of machine-readable texts among research communities. The TEI header, whether directly attached to a TEI-conformant text or not, consists of (1) a file description, which describes the electronic object and its sources, (2) an encoding description which shows how the text was encoded along with editorial decisions made during the markup of the document, (3) a profile description which includes contextual nonbibliographic information, and (4) a revision description which documents each change made to the electronic text.

TEI Header and USMARC Comparison

The encoding scheme of the TEI header is flexible. Its only mandatory data elements are the title statement, publication statement, and source description statement, which are the components of the file description. The header's descriptive elements can be defined and extended locally by modifying the SGML DTD. Encoders can decide which information to include according to their local needs.

The header can either be directly attached to the document it describes or be freestanding. Since the data elements are labeled intuitively, the encoding procedure requires less training than USMARC.

The reliability of TEI headers for information retrieval is variable. When the header is attached to a TEI encoded text, the user can search not only the full text but the SGML tags as well. However, TEI records may be unreliable for retrieval because of TEI's flexibility in encoding. This will affect data exchange among cooperating cataloging institutions unless there is agreement among them regarding the form and content of specific data elements.

The TEI header's data structure is looser and less compact than that of USMARC. In the Publication Statement, for instance, a descriptive label is required for each separate data element, e.g., publication place, publisher, and date of publication, while USMARC uses a single tag, 260, with three subfields to carry the same information. USMARC carries data very concisely and efficiently, so a TEI header may be longer than a USMARC record for the same electronic object.
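The difference in compactness can be made concrete with a small Python comparison. This is only a sketch: the publication data come from the City of Bits records in the appendix (with the identifier element left out), the MARC field is shown in display form rather than in the exchange format, and the helper name overhead is an assumption introduced here.

```python
# Publication data for City of Bits, trimmed from the appendix records.
values = ["Cambridge, Mass.", "MIT Press", "1995"]

tei_publication = (
    "<publicationstmt>"
    "<publisher>MIT Press</publisher>"
    "<pubplace>Cambridge, Mass.</pubplace>"
    "<date>1995</date>"
    "</publicationstmt>"
)

# Display form of the 260 field; the first subfield code is implicit.
marc_260 = "260    [Cambridge, Mass.] : $b MIT Press, $c 1995."


def overhead(encoded, data_values):
    """Characters in the encoded form that are not part of the data values."""
    return len(encoded) - sum(len(value) for value in data_values)


tei_overhead = overhead(tei_publication, values)
marc_overhead = overhead(marc_260, values)
print(f"TEI markup overhead:  {tei_overhead} characters")
print(f"MARC markup overhead: {marc_overhead} characters")
```

The comparison counts only labeling characters, so it says nothing about readability, where the descriptive TEI labels have the advantage.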

The only relationship between bibliographic entities directly recorded in the TEI header is the source description statement, which contains a bibliographic record for the original text. We have discovered no provision in the standard to connect the electronic text to other related works. USMARC records, on the other hand, can document complex and sophisticated relations among bibliographic entities.

Potential for Libraries

The TEI header and USMARC can both be used to describe electronic texts. Encoding in both formats is time-consuming and labor-intensive. The TEI header can describe electronic objects more vividly than USMARC, but USMARC strongly supports bibliographic data sharing and distribution. Libraries will be able to use TEI headers for describing Internet resources provided that (1) the encoding forms and specifications are standardized as strictly as USMARC, and (2) library systems can accommodate automatic mapping from TEI headers to USMARC records.
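The second condition, automatic mapping from TEI headers to USMARC records, might look roughly like the following Python sketch. Several things here are assumptions: the header is a shortened version of the appendix record re-expressed in well-formed XML syntax so that the standard library parser can read it, the crosswalk table TEI_TO_MARC is our own illustration rather than an agreed standard, and the function name tei_to_marc is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened City of Bits header, written as well-formed XML for this sketch.
TEI_HEADER = """
<teiheader>
  <filedesc>
    <titlestmt>
      <title>City of bits: Space, Place, and the Infobahn</title>
      <author>William J. Mitchell</author>
    </titlestmt>
    <editionstmt>World-Wide Web ed.</editionstmt>
    <publicationstmt>
      <publisher>MIT Press</publisher>
      <pubplace>Cambridge, Mass.</pubplace>
      <date>1995</date>
    </publicationstmt>
  </filedesc>
</teiheader>
"""

# Hypothetical crosswalk: TEI element path -> (MARC tag, subfield code).
TEI_TO_MARC = [
    ("filedesc/titlestmt/author", "100", "a"),
    ("filedesc/titlestmt/title", "245", "a"),
    ("filedesc/editionstmt", "250", "a"),
    ("filedesc/publicationstmt/pubplace", "260", "a"),
    ("filedesc/publicationstmt/publisher", "260", "b"),
    ("filedesc/publicationstmt/date", "260", "c"),
]


def tei_to_marc(header_xml):
    """Copy mapped TEI element text into a dict of MARC tag -> subfield list."""
    root = ET.fromstring(header_xml)
    fields = {}
    for path, tag, code in TEI_TO_MARC:
        element = root.find(path)
        if element is not None and element.text and element.text.strip():
            fields.setdefault(tag, []).append((code, element.text.strip()))
    return fields


marc_fields = tei_to_marc(TEI_HEADER)
for tag in sorted(marc_fields):
    print(tag, marc_fields[tag])
```

The sketch copies strings verbatim. It does not invert the author's name, add ISBD punctuation, or consult an authority file, which is why the first condition, standardized encoding forms, matters as much as the mapping itself.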

Dublin Core

The Dublin Core is a recently proposed metadata standard for describing networked resources and assisting in their discovery and retrieval. It is a simple set of thirteen data elements. The Core's flexibility allows these elements to be modified and expanded. It is so intuitive in design that information providers themselves may encode their own documents at the point of creation. Indeed, this is one of the developers' major goals.

Dublin Core and USMARC Comparison

The core elements are designed to assist machine harvesting and indexing in the networked environment. Another goal of the developers is to accommodate mapping the core into USMARC records or other standards, such as TEI.

The thirteen elements originally identified emphasize access over description. But since the source data is so uncontrolled, its reliability for retrieval is questionable.

The data structure of the Dublin Core is simple and unrefined, geared toward supplying a small number of generally applicable elements. Dublin Core represents an advance over unaided machine indexing and searching. Author-supplied metadata will be better than no metadata, but it still is not the same thing as a professionally prepared catalog record.

Potential for Libraries

Information objects on the Internet with attached meta-information like Dublin Core's will facilitate great improvements in machine indexing. However, whether the Dublin Core will provide reliable retrieval depends largely on how information providers implement it.

By definition, the Core's encoding forms and specifications will never be standardized as strictly as USMARC, but it is conceivable that library systems could adapt elements of the Dublin Core for import and expansion into full USMARC records.
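As a rough illustration of such an import step, the Python sketch below maps a flat list of Dublin Core elements onto MARC tags and sets aside anything it cannot map for a cataloger to review. The table DC_TO_MARC is an assumption made for this example and should not be read as the mapping proposed in the Library of Congress discussion paper; the element names follow the City of Bits record in the appendix, and the function name dc_to_marc is hypothetical.

```python
# Hypothetical crosswalk from Dublin Core element names to (MARC tag, subfield).
DC_TO_MARC = {
    "Title": ("245", "a"),
    "Author": ("100", "a"),
    "Publisher": ("260", "b"),
    "Date": ("260", "c"),
    "Subject": ("650", "a"),
    "Identifiers": ("856", "u"),
}

# Flattened subset of the Dublin Core record for City of Bits.
dublin_core_record = [
    ("Title", "City of Bits: Space, Place, and the Infobahn"),
    ("Author", "Mitchell, William J."),
    ("Publisher", "MIT Press"),
    ("Date", "1995"),
    ("Subject", "Computer networks"),
    ("Subject", "Virtual reality"),
    ("Identifiers", "http://www-mitpress.mit.edu:80/City_of_Bits/WWWPreamble.html"),
    ("Language", "English"),
]


def dc_to_marc(record):
    """Return (mapped MARC-like fields, Dublin Core elements left unmapped)."""
    mapped, unmapped = [], []
    for element, value in record:
        if element in DC_TO_MARC:
            tag, code = DC_TO_MARC[element]
            mapped.append((tag, code, value))
        else:
            unmapped.append((element, value))
    return mapped, unmapped


mapped, unmapped = dc_to_marc(dublin_core_record)
for tag, code, value in mapped:
    print(f"{tag} ${code} {value}")
print("Needs cataloger review:", unmapped)
```

The imported fields would serve only as a starting point. Expanding them into a full USMARC record still requires the description, authority work, and subject analysis that the Core leaves uncontrolled.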

URC

Unlike the other standards we have evaluated, the URC focuses overwhelmingly on guaranteeing machine retrieval of electronic resources. Here are some definitions that helped us to understand the alphabet soup surrounding the URC.

URN stands for Uniform Resource Name, a proposal for assigning persistent, unique, location-independent identifiers to networked information objects, similar to the ISBN in the publishing world.

URL stands for Uniform Resource Locator and is the electronic address for a networked resource.

URC stands for Uniform Resource Characteristic, and has been proposed to serve as a connection between URNs and URLs. If a URL changes, authorized users can go into the URC service to modify the URL associated with the URN. The URN stays the same even though the URL may change.
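The relationship among the three can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the proposed URC service: the class URCService and its methods are hypothetical, the URN string is invented because no URN has been assigned, the change of address is supposed for the sake of the example (both addresses appear in the appendix records), and authorization of the user making the change is omitted.

```python
class URCService:
    """Toy resolver that keeps one URC record (a dict) per URN."""

    def __init__(self):
        self._records = {}

    def register(self, urn, url, **metadata):
        # The URC record bundles the current location with descriptive metadata.
        self._records[urn] = {"url": url, **metadata}

    def update_url(self, urn, new_url):
        # The URN is the stable key; only the location changes.
        self._records[urn]["url"] = new_url

    def resolve(self, urn):
        return self._records[urn]["url"]


service = URCService()
urn = "urn:example:city-of-bits"  # hypothetical URN, for illustration only

service.register(
    urn,
    "http://mitpress.mit.edu/City_of_Bits/WWWPreamble.html",
    title="City of Bits: Space, Place, and the Infobahn",
)
old_location = service.resolve(urn)

# Suppose the resource moves: the record is updated, the URN is untouched.
service.update_url(urn, "http://www-mitpress.mit.edu:80/City_of_Bits/WWWPreamble.html")
new_location = service.resolve(urn)
```

A catalog record that cited the URN rather than the URL would not need to be edited when the resource moved.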

The URC may contain meta-information such as author, title, publisher, subject, and so on, which could assist in resource discovery and retrieval. It may also include other data elements, such as electronic signature and review information to ensure the veracity of the resource, an access element for usage restrictions, and a version element for revision history.

The URC service would include bibliographic search capabilities at the URC search site. In the future, it may be possible to connect to a variety of URC servers, along with some dedicated sites such as OCLC and the Library of Congress.

URC and USMARC Comparison

Of the metadata standards that we have reviewed, the URC contains the fewest bibliographic elements (10 so far). Its emphasis, instead, is on assuring machine retrieval, authenticity of resources, and an access-restriction capability.

At this time, data elements in the URC are identified by descriptive labels, such as author, title, and abstract, and the data structure is loose. We are aware of no provision in the proposal for including bibliographic links within URC records.

Potential for Libraries

Both URC and USMARC can be used to document electronic objects. But the meta-information proposed for the URC is far sketchier than that in USMARC.

As with the Dublin Core and TEI header, consistency will be required among cooperating servers regarding which metadata elements should be included in URC records.

Uniquely among these standards, the URC guarantees retrieval of a resource whose name is known, by mapping URNs to URLs; it also accommodates resource validation via digital signatures and seals of approval. Thus the URC can assist librarians in cataloging and archiving quality information.

USMARC

USMARC is actually a complex set of standards for the description, storage, exchange, manipulation, and retrieval of machine-readable bibliographic data. It is highly developed and refined, with individual data elements defined at a granular level.

Originally developed in the 1960s for the description of printed books, USMARC is still being adapted to provide description, access and location information for networked resources. With the introduction of the 856 field, which allows a hyperlink between the MARC record and the electronic text it describes, USMARC has become a feasible standard for the discovery and retrieval of networked resources.

USMARC and Newer Metadata Standard Comparison

The standard is strictly controlled; changes or additions to USMARC take years and painstaking coordination among institutions, industry, and utilities. For example, those of us in the OCLC Intercat Project are familiar with the underscore and tilde problem in the 856 field. The USMARC character set does not include the spacing versions of the tilde and underscore that are widely used in URLs. It is now a full year since MARBI approved a proposal to provide spacing versions of these characters in USMARC, but the change has not yet been implemented because of the implications in a standard that facilitates data exchange among library systems on an international scale. The result is inoperable URLs in MARC records.
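The character-set problem can be illustrated with a short Python check. This is a sketch under assumptions: the set UNSUPPORTED merely stands in for the gap described above and is not the actual USMARC repertoire, the function names are hypothetical, and percent-encoding is shown only as one possible workaround that this report does not evaluate.

```python
# Assumed stand-in for a repertoire that lacks the spacing tilde and underscore.
UNSUPPORTED = {"~", "_"}


def unsupported_characters(url):
    """Return the characters in the URL that the repertoire cannot carry."""
    return sorted(set(url) & UNSUPPORTED)


def percent_encode_unsupported(url):
    """Replace each unsupported character with its %XX escape."""
    return "".join(
        f"%{ord(ch):02X}" if ch in UNSUPPORTED else ch for ch in url
    )


url = "http://mitpress.mit.edu/City_of_Bits/"
problem_characters = unsupported_characters(url)
encoded_url = percent_encode_unsupported(url)
print(problem_characters)
print(encoded_url)
```

Whether an escaped form is acceptable depends on how each library system stores and displays the 856 field, so a check of this kind is at best a stopgap until the character set itself is extended.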

USMARC allows for rich analysis of the content of information objects. It strongly supports both access and description. This is especially true for full-level encoding, which requires professional encoders to follow strictly a standard cataloging code such as the Anglo-American Cataloging Rules, 2nd ed. Rev. (AACR2R), a controlled thesaurus such as the Library of Congress Subject Headings (LCSH), and a standard classification scheme such as the Library of Congress Classification (LCC). Although these cataloging tools can also be applied to the newer standards, none of those standards is intended to apply the tools as strictly and fully as USMARC does. Unlike the newer metadata standards, USMARC includes an authorities format, which facilitates successful retrieval.

Professional involvement in data creation, USMARC's strict encoding guidelines, and the sophistication of well-developed library systems further enhance access control. Automated library systems use USMARC to exchange data with one another. Most recently developed library systems, such as Web-based OPACs and some window-based OPACs, permit hyperlinks from the Electronic Location and Access (856) field to the primary information, which resides on the Internet. With full Z39.50 connections added to this kind of interface, users can not only perform traditional searches, such as keyword and Boolean searches, but also use new features such as simultaneous searching of multiple databases, with merged or de-duplicated results drawn from both print and electronic resources.

USMARC's data structure is compact and economical, but still allows for the encoding of complex and sophisticated relationships among information objects. For instance, it allows encoders to document sequential relationships, referencing relationships, and so on, among bibliographic entities. Thus, users can relate one information object to all its related objects by following the bibliographic links within the record. None of the newer metadata standards supports such detailed and complete tracing of bibliographic relationships.

Potential for Libraries

With the few titles we have cataloged during the Intercat Project, we found that USMARC is indeed a feasible vehicle for metadata on networked resources. In terms of flexibility for creating new data elements, USMARC ranked lowest among the standards we compared; in comprehensiveness for representing electronic objects, it ranked second, behind the TEI header; but in compatibility with existing bibliographic databases, reliability for retrieval, and sophistication of data structure, it ranked highest.

Internet search engines and library Web pages can provide access, more or less successfully, to items on the net. But libraries are about more than this. They are value-added information centers where information in every format should be carefully selected, organized, preserved, and disseminated. If a networked resource is of high quality, of interest to our users, and relevant to our collections, it belongs in our catalog. In this way, we present our users with a more accurate and complete picture of the information available in a given subject area.

The fact that USMARC is labor-intensive, time-consuming, expensive, and sometimes misused does not prevent it from being useful for providing access to information on the Internet. USMARC offers us an immediately available tool for integrating important, relevant, high-quality networked resources into our catalogs.

Conclusion

The emerging standards function differently than USMARC. Dublin Core and URC are still in the proposal stage. Even when they are fully implemented, they will serve only as a basis for full catalog records. The TEI header's data elements are particularly suited for use with networked resources, but the standard lacks a provision for authority control, and this has negative implications for data exchange.

USMARC is such a ubiquitous metadata standard that publishers, vendors, suppliers of automated systems, and indeed every aspect of library operation depend upon it. Although it is difficult to modify the standard, we must continue the legacy represented by millions of USMARC records that are the heart of library service itself.

As we know, USMARC was originally developed for print resources, but it must be updatable on a constant basis to accommodate information in emerging formats. Our cataloging approaches must be adaptable as well.

During our exercise in data element and data structure mapping, we envisioned a future where selected Internet resources that meet the criteria of a certain user community will still be cataloged with USMARC. But we also look forward to a time when automated systems are capable of unassisted metadata conversion from one standard to another. This will help us to avoid duplication of effort and maximize user access to valuable networked resources.

References

Caplan, Priscilla. 1995. "You Call It Corn, We Call it Syntax-Independent Metadata for Document-Like Objects." The Public-Access Computer Systems Review 6, no.4 (1995). URL:http://info.lib.uh.edu/pr/v6/n4/capl6n4.html

Cheong, Fah-Chun. 1996. Spiders for Indexing the Web. In Internet Agents: Spiders, Wanderers, Brokers, and Bots. Indianapolis, Indiana: New Riders Publishing.

Daniel, Ron Jr. 1994. Proposed URC External Representation.
URL: http://www.acl.lanl.gov/URI/ExtRep/urc0.html

Daniel, Ron Jr., Michael Mealling. June 1995. URC Scenarios and Requirements. Internet Engineering Task Force. Internet-Draft, expires Dec. 29, 1995. URL: ftp://ds.internic.net/internet-drafts/draft-ietf-uri-urc-req-00.txt

Gaynor, Edward. 1994. Cataloging Electronic Texts: the University of Virginia Library Experience. Library Resources and Technical Services 38 (4): 403-413.

Giordano, Richard. 1994. The Documentation of Electronic Texts Using Text Encoding Initiative Headers: An Introduction. Library Resources and Technical Services 38 (4): 389-401.

Guenther, Rebecca S. 1994. The Challenges of Electronic Texts in the Library: Bibliographic Control and Access. In Literary Texts in an Electronic Age: Scholarly Implications and Library Services, ed. Brett Sutton. 149-172. Papers Presented at the 1994 Clinic on Library Applications of Data Processing, Apr.10-12, 1994, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign.

IFLA Study Group on Functional Requirements of Bibliographic Records. Draft July 31, 1995. Functional Requirements of Bibliographic Records.

Library of Congress. Network Development and MARC Standards Office. May 1995. Discussion Paper No. 86. Mapping the Dublin Core Metadata Elements to USMARC. URL: gopher://marvel.loc.gov/00/.listarch/usmarc/dp86.doc

Library of Congress. Network Development and MARC Standards Office. 1994. USMARC Specifications for Record Structure, Character Sets, and Exchange Media.

Liu, Jian. Sept. 1995. Understanding WWW Search Tools. URL: http://www.indiana.edu/~librcsd/search/

Patton, Glenn. [Glenn_patton@oclc.org]. "The tilde and the underscore." In [intercat@oclc.org], 20 June 1995.

Sperberg-McQueen, C.M., and Lou Burnard, eds. 1994. TEI P3: Guidelines for Electronic Text Encoding and Interchange. Oxford and Chicago: The Text Encoding Initiative.

Sha, Vianne. 1995. "Cataloging Internet Resources: the Library Approach." The Electronic Library 13, no. 5 (1995).

Vizine-Goetz, Diane, Jean Godby, Mark Bendig. 1995. Spectrum: a Web-based Tool for Describing Electronic Resources. Computer Networks and ISDN Systems 27 (1995) 985-1001.

Weibel, Stuart, Jean Godby, Eric Miller. 1995. OCLC/NCSA Metadata Workshop Report. URL: http://www.oclc.org:5046/oclc/research/conferences/metadata/dublin_core_report.html

Winship, Ian R. 1995. World Wide Web Searching Tools - An Evaluation. VINE (99) 1995, 49-54. URL: http://www.bubl.bath.ac.uk/BUBL/IWinship.html

The USMARC Formats: Background and Principles. URL: ftp://wais.com/pub/protocol/USMARC.txt

Appendixes: Data Element and Data Structure Mapping

TEI Header Record for City of Bits

<teiheader>
  <filedesc>
    <titlestmt>
      <title>City of bits: Space, Place, and the Infobahn</title>
      <author>William J. Mitchell</author>
    </titlestmt>
    <editionstmt>World-Wide Web ed.</editionstmt>
    <publicationstmt>
      <publisher>MIT Press</publisher>
      <pubplace>Cambridge, Mass.</pubplace>
      <idno type=oclc>32437789</idno>
      <date>1995</date>
    </publicationstmt>
    <notesstmt>
       <note>856 7 $u URL:http://mitpress.mit.edu/City_of_Bits/WWWPreamble.html</note>
    </notesstmt>
    <sourcedesc>
      <bibfull>
         <titlestmt>
           <author>Mitchell, William J.</author>
           <title>City of bits</title>
           <title type=sub>space, place, and the infobahn</title>
         </titlestmt>
         <extent>225 p. : ill., maps, plans ; 24 cm.</extent>
         <imprintstmt>
           <pubplace>Cambridge, Mass.</pubplace>
           <publisher>MIT Press</publisher>
           <idno type=isbn>0262133091</idno>
           <idno type=oclc>33278259</idno> 
           <date>1995</date>
         </imprintstmt>
      </bibfull>
    </sourcedesc>
  </filedesc>
  <encodingdesc>NA</encodingdesc>
  <profiledesc>
    <textclass>
      <keywords scheme=lcsh>
         <list>
           <item>Computer networks</item>
           <item>Information technology</item>
           <item>Virtual reality</item>
           <item>Computers and civilization</item>
         </list>
      </keywords>
      <classcode scheme=lc>TK5105.5.M57</classcode>
    </textclass>  
  </profiledesc>
  <revisiondesc>NA</revisiondesc>
</teiheader>

Dublin Core Record for City of Bits

Subject:
     scheme=keywords:    Electronically mediated environments
                         Cyberspace
                         Urbanism
                         Architecture

     scheme=LCSH:        Computer networks
                         Information technology
                         Virtual reality
                         Computers and civilization
                         
Title:                City of Bits: Space, Place, and the Infobahn

Author:                  Mitchell, William J.
Publisher:               MIT Press
OtherAgents:
          otherAgent role=WWW team member: Stevenson, Daniel C.
          otherAgent role=WWW team member: Ehling, Teresa
          otherAgent role=WWW team member: Kalin, Jeffrey T.
          otherAgent role=WWW team member: Schoonover, Regina
          otherAgent role=WWW team member: Beamish, Anne
          otherAgent role=WWW team member: Ishizake, Suguru
          otherAgent role=WWW team member: Urbanowski, Frank
     
Date:               1995
Identifiers:
     scheme=ISBN:   0262133091
     scheme=URL:    
     http://www-mitpress.mit.edu:80/City_of_Bits/WWWPreamble.html
Object type:        book
Form:               Text/HTML, Video/(MPEG, Quicktime), Image/GIF
Language:           English
Source:             type=print ed.: City of Bits: Space, Place, and the Infobahn

URC Record for City of Bits

URC.0 {
  // This is a hypothetical record
  URN: Uniform Resource Name (not yet available)
  Title: City of Bits: Space, Place, and the Infobahn
  Author  {
          Name: Mitchell, William J.
          Email:wjm@mit.edu 
          Phone:617-253-4402
          Facsimile:617-253-9417  
          }

  Subject: electronically mediated environments; cyberspace;
            urbanism; architecture

  Abstract {
          Textual: A hyperlinked exploration of the "virtual
          city" which is now emerging through our burgeoning use of 
          the information superhighway.
           }
  Location {
         URL:
http://mitpress.mit.edu/City_of_Bits/WWWPreamble.html
         Content type: text/html, image/GIF, video/MPEG,
                         Quicktime
         Content-length:   
         Signature (Not yet available) 
         }

  Review {
         //There is no URN available for the media reviews.
         //This URL will lead to the content of them.
         URL: http://www-mitpress.mit.edu/City_of_Bits/
              reviews.html  
    }
  Version: World-Wide Web ed.
}

USMARC Record for City of Bits

000 cmm  Ia  
001 32437789  
003 OCoLC 
005 19000000003748.0 
008 950508s1995    maun       d        eng d 
040    MYG $c MYG  
090    TK5105.5 $b .M57 1995b  
100 1  Mitchell, William J. 
245 10 City of bits $h [interactive multimedia] :  $b space, place, and the infobahn / $c by William J. Mitchell.  
250    World-Wide Web ed. 
256    Computer data. 
260    [Cambridge, Mass.] : $b MIT Press,  $c 1995.  
516    Text (HTML), images (GIF), and video (MPEG, QuickTime). 
538    System requirements: Web browser; video viewer such as
QuickTime or MPEGPlay required for video applications.  
538    Mode of access: Internet. Address:
http://mitpress.mit.edu/CityofBits/.  
500    Title from title screen. 
530    Also available in printed ed. 
520    A hyperlinked exploration of the "virtual city" which is now
emerging through our burgeoning use of the information superhighway.
Re-examines architecture and urbanism in light of our increasingly
digital means of communication.     
505 0  1. Pulling glass -- 2. Electronic agoras -- 3. Cyborg citizens
-- 4. Recombinant architecture -- 5. Soft cities -- 6. Bit biz -- 7.
Getting to the good bits.   

504    Includes bibliographical references. 
650  0 Computer networks. 
650  0 Information technology. 
650  0 Virtual reality. 
650  0 Computers and civilization. 
856 7  $2 http   $z http://mitpress.mit.edu/City_of_Bits/   
$u http://purl.oclc.org/OCLC/OLUC/32437789/1
