D-Lib Magazine
Carol Jean Godby, Jeffrey A. Young, and Eric Childress
Abstract
This paper proposes a model for metadata crosswalks that associates three pieces of information: the crosswalk, the source metadata standard, and the target metadata standard, each of which may have a machine-readable encoding and human-readable description. The crosswalks are encoded as METS records that are made available to a repository for processing by search engines, OAI harvesters, and custom-designed Web services. The METS object brings together all of the information required to access and interpret crosswalks and represents a significant improvement over previously available formats. But it raises questions about how best to describe these complex objects and exposes gaps that must eventually be filled in by the digital library community.
Introduction
Differences between metadata records must be reconciled whenever incompatible descriptions hinder effective searching or database management. The flavor of the problem can be illustrated by the MARC [1] and Dublin Core [2] records shown below:
Here it is easy to recognize a correspondence between the MARC 100 author field and "...are used to 'translate' between different metadata element sets. The elements (or fields) in one metadata set are correlated with the elements of another metadata set that have the same or similar meanings. This is also sometimes called 'semantic mapping.'" Figure 2 shows a portion of the crosswalk [4] proposed by the Library of Congress that maps Dublin Core to MARC elements and is a template for the records shown in Figure 1.
Figure 2 A fragment of a Dublin Core - MARC crosswalk
Crosswalks evolved from the need for online information systems to cope with the metadata standards that have been developed in response to the recent onslaught of digital material, which presents concerns not addressed by standards developed for traditionally published work. For example, since electronic resources are so easily produced, custom-designed metadata standards must be relatively simple and potentially automated; otherwise, the store of electronic resources will grow much faster than our ability to describe them. And because electronic objects are more easily modified than traditionally published materials, the new standards must give prominence to elements that describe lifecycle events such as revision histories and relationships to similar versions. But the online information environment also produces a demand for compatibility with other descriptions: to locate materials in heterogeneous collections, to assemble a rich context for research or learning, or to satisfy some unnamed information need that integrates text, images, raw data, and born-digital genres. And here is where crosswalks are critical. Despite the obvious differences evident even in the simple illustration in Figure 1, the two records have common elements. Both descriptions encode some understanding of author, title, and publisher. Though the correspondences are inexact, they are useful for promoting some degree of interoperability.
The project we describe here is designed to facilitate the use of crosswalks in systems that must resolve differences among metadata standards. Our work proceeds from the hypothesis that usable crosswalks must have the following characteristics:
Though the Library of Congress makes a good start toward achieving these objectives, the crosswalks are scattered across several pages and involve only Dublin Core and MARC-related metadata standards, for which machine-processable encodings are not always available. The Getty Museum [5], UKOLN [6], Northwestern University [7], and the Canadian Heritage Information Network [8] maintain Web sites that collect links to crosswalks involving other standards. But readers who explore the links on these pages discover a broadly defined universe of discourse, not a usable tool that promotes the standardization of crosswalks. In addition to crosswalk tables, such pages contain links to pertinent definitions, high-level and theoretical discussion, and home pages for institutions that have created crosswalks or standards for which crosswalks have been written.
Our project takes these efforts forward in two ways. First, we define a data model that expresses a crosswalk not as a single file, as implied by Figure 2, but as a complex object representing six pieces of information: the table of equivalences, the source metadata standard, and the target metadata standard, each of which may have a machine-processable encoding and a human-readable description. Second, we support the current interest in XML [9] processing by creating XML-encoded metadata records for crosswalk objects, linking them to XML-encoded versions of the relevant standards as well as XSLT [10] expressions of crosswalk tables, and making this data available in a repository that is built entirely from open source software and supports other XML processes such as Open Archives Initiative (OAI) [11] harvesting. The result is a tool that collects the essential information required to interpret and execute metadata crosswalks.
The tool can be searched and browsed by human readers, but it also serves as a platform that can be enhanced with other services that automate the routine and labor-intensive data-processing tasks associated with metadata conversion.
2. A data model for crosswalks
We propose to model the crosswalk object in the Metadata Encoding and Transmission Standard (METS) [12]. As Tennant writes [13], "The roots of METS go back to the beginning of digitization projects in libraries. Once you've scanned a book, what do you have? You have hundreds of individual digital files and no practical way to 'bind' them all back together into an easily navigable whole. This is where METS comes in. METS provides a method to describe the structure of a digital object as well as encapsulate one or more packages of descriptive metadata, rights information, and information about how the item was digitized. METS provides a way to create a neat package of all the relevant files and metadata pertaining to a digital object. More important, it provides a standard way to package a digital object that can then be shared with other libraries, thereby promoting interoperability of digital objects." Our use of METS deviates somewhat from the intent of the designers because a crosswalk record describes a critical piece of infrastructure, not an intellectual property to be curated. Nevertheless, the standard does not prevent this usage and it has many advantages, as we will argue.
Encoded in the XML schema defined by the METS sponsors, the crosswalk object produces a relatively simple but not a trivial record. The essential fragment is shown in Figure 3, which depicts a crosswalk from MARC XML to an OAI encoding of Unqualified Dublin Core; the complete record is available at [14]. As in all METS records, the heart of the crosswalk object is the structural map, the
Figure 3 A crosswalk object encoded as a METS record
Following the METS guidelines [15], we specify a structural map whose type is logical because it conforms to an organization we have devised. Each component of the crosswalk object (the table of correspondences, the metadata source, and the metadata target) can be represented by a human-readable description, a machine-processable file, or both. These are coded as nested
The elements in the structural map have pointers to two other locations in the METS record. For example, the
We could discuss additional details, for example, a
For further information about our proposed crosswalk object, readers can consult our METS profile [16], which we have submitted to the Library of Congress. Compliant records can be rendered into a form more suitable for human consumption with a simple XSLT stylesheet [17].
3. A crosswalk repository
We are exploring potential applications of the Open Archives Initiative, which was designed to promote standards for interoperability. One outcome is a repository of publicly accessible metadata that can be harvested using standard XML protocols and that provides tools for creating sample services, such as customizable views of the data. METS-encoded crosswalk objects are a natural fit in this repository because they represent the intelligence required for metadata conversion and promise to automate a common management task.
With the addition of an SRW/SRU [18] search engine to the OAI repository, the METS record can support structured searches that largely eliminate the confusion produced by the current generation of Web bibliographies of crosswalks. For example, the user can issue high-precision searches to obtain the human-readable documents for all translation targets, the XSLT scripts for crosswalks that mention version 1.1 of Dublin Core as the translation source, or the reference documents for all crosswalks that mention Learning Object Metadata (LOM) [19] and are not .pdf documents.
With the addition of publicly accessible services such as XSLT translators and some additional processing that can be accomplished largely through a series of cascading XSLT scripts, the OAI repository can identify appropriate XSLT scripts and translate a user's data, with no human involvement beyond the initial request. The procedure is shown in Figure 4.
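A translation service of this kind has to discover which schemas a submitted record declares before it can look for matching crosswalks. The following is a minimal sketch in Python of one way that discovery could work, assuming the record names its schema through the root element's namespace and an `xsi:schemaLocation` attribute. The function name `schema_references` is hypothetical, and the sketch does not reproduce the ERRoL-based service itself.

```python
import xml.etree.ElementTree as ET

# Standard namespace for the xsi:schemaLocation attribute.
XSI = "http://www.w3.org/2001/XMLSchema-instance"


def schema_references(xml_text):
    """Return namespace URIs and schema locations declared on the root element."""
    root = ET.fromstring(xml_text)
    refs = []
    # ElementTree writes a namespaced tag as "{namespace}localname".
    if root.tag.startswith("{"):
        refs.append(root.tag[1:].split("}", 1)[0])
    # xsi:schemaLocation holds whitespace-separated namespace/location pairs.
    for token in root.get("{%s}schemaLocation" % XSI, "").split():
        if token not in refs:
            refs.append(token)
    return refs


# Illustrative input: an empty MARC XML record that declares its schema.
SAMPLE = (
    '<record xmlns="http://www.loc.gov/MARC21/slim" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:schemaLocation="http://www.loc.gov/MARC21/slim '
    'http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd"/>'
)

for ref in schema_references(SAMPLE):
    print(ref)
```

Each returned string could then serve as a query term against the repository. A record that declares no namespace or schema location yields an empty list, in which case a service would have nothing to match on.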
The user supplies a URL for the XML-encoded data to be translated. Using the ERRoL functionality [20] developed at OCLC for redirecting URLs to services, the metadata translation service extracts the names of the XML schemas from the data and uses the strings as queries to search the METS repository for crosswalks with formal encodings that match the references in the user's data. If there is a match, the service dynamically constructs a drop-down menu with a list of choices from which the user can select an appropriate crosswalk to translate the data. Once the data is translated, the METS record can be associated with it as a convenient way to document the metadata standards, versions, encodings, and scripts that were used in the conversion process.
Readers are invited to view the records in the prototype version of our service [21]. We have seeded it with records for the Dublin Core-MARC, MARC-MODS [22], and Dublin Core-LOM crosswalks. We welcome participation from the digital library community: to refine the METS crosswalk profile, to suggest or develop services for the repository, and to add more records.
4. Open issues
4.1. The scope of the crosswalk object
The crosswalk object was inspired by the need to enhance the Web-accessible documents that specify tables of equivalences between metadata standards with references that make them fully interpretable and executable. But how do we create a crosswalk object from a document that defines more than one crosswalk? For example, the Library of Congress has a page that describes the relationship between MARC and Dublin Core and defines separate crosswalks for Qualified and Unqualified Dublin Core. The recommendations in our application profile produce a clear answer to this question. A crosswalk document is required; documents defining the metadata source and target specifications are highly recommended.
If these documents exist, the crosswalk is said to be defined; and if human-readable and machine-processable documents exist for each element, the crosswalk is said to be complete. When we attempt to code the above case, the resulting structural map contains two sets of elements that constitute the crosswalk definition, one for Unqualified Dublin Core and one for Qualified Dublin Core. Though the METS specification permits multiple structural maps, machine processes cannot unambiguously identify the ancestor documents that make the definition of the crosswalk object explicit if two maps reside in the same record. Thus we recommend the creation of two METS records to describe the data presented in the Library of Congress page.
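The distinction between a defined and a complete crosswalk can be made concrete with a small data structure. The sketch below is in Python and is not part of the METS profile; the class and method names are hypothetical. It assumes that "defined" means every component has at least one document and that "complete" means all six documents are present.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Component:
    """One part of a crosswalk object; either document may be missing."""
    human_readable: Optional[str] = None       # URL of a description for readers
    machine_processable: Optional[str] = None  # URL of a schema or script

    def has_any(self) -> bool:
        return bool(self.human_readable or self.machine_processable)

    def has_both(self) -> bool:
        return bool(self.human_readable and self.machine_processable)


@dataclass
class Crosswalk:
    table: Component   # the table of equivalences
    source: Component  # the source metadata standard
    target: Component  # the target metadata standard

    def components(self):
        return (self.table, self.source, self.target)

    def is_defined(self) -> bool:
        # Assumed reading: each component has at least one document.
        return all(c.has_any() for c in self.components())

    def is_complete(self) -> bool:
        # All six documents are present.
        return all(c.has_both() for c in self.components())


# Illustrative record; which documents are filled in here is an assumption.
marc_to_oai_dc = Crosswalk(
    table=Component(
        "http://www.loc.gov/marc/marc2dc.html",
        "http://www.loc.gov/standards/marcxml/xslt/MARC21slim2OAIDC.xsl",
    ),
    source=Component(human_readable="http://www.loc.gov/standards/marcxml/"),
    target=Component(human_readable="http://dublincore.org/documents/dces/"),
)
print(marc_to_oai_dc.is_defined(), marc_to_oai_dc.is_complete())
```

In this illustrative record the crosswalk is defined but not complete, because the source and target lack machine-processable files. Keeping one crosswalk per object mirrors the recommendation to create separate records for Qualified and Unqualified Dublin Core.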
But these records clearly do have an affinity. To capture it, we propose the use of
At the highest level is an abstract crosswalk object with only human-readable documents pointing to the abstract descriptions of Dublin Core [2] and the MARC21 Concise Edition [22]. The next level specifies a crosswalk object that makes reference to two widely used versions of the standards: Version 1.1 of Unqualified Dublin Core [23] and MARC XML [24], which consists of an XML schema, as well as documentation and a toolkit that define an ongoing project. At the most concrete level, the transformation from MARC XML to Unqualified Dublin Core 1.1 is realized in three encodings, requiring three separate XSLT scripts: as OAI DC [25], as SRW DC [26], and as RDF DC [27]. Each of the crosswalk objects shown schematically in Figure 5 can be described as a METS record that conforms to our application profile and is linked using the relation element to create a thesaurus-like browsing structure that defines the constellation of MARC-Dublin Core crosswalks, which can be accessed from our OAI repository. Figure 5 also reveals gaps to be addressed by future work. For example, all of the current encodings refer to Unqualified Dublin Core; encodings for Qualified DC and a new set of related METS records may eventually be available.
This example raises an issue about the level of abstraction required to describe crosswalks. Established standards have revision histories, and separate crosswalks are required whenever they refer to different versions and encodings, which result in distinct structural maps. Of course, similar discussions are being conducted about traditional library materials. The confusion surrounding the abstract statements, versions, and encodings of crosswalks echoes the talk of works, expressions, and manifestations in the lively debate about FRBR [28].
4.2. The crosswalk as an XSLT stylesheet
In an earlier report on the metadata translation problem [29], we argued that XSLT might be an inappropriate tool because semantic information is lost in a process that is designed for manipulating style and structure. But given the popularity of the XML paradigm for document processing, it is inevitable that XSLT will have a role in metadata conversion, despite the overhead required for tracking versions and encodings and associating them with semantic interpretations. The METS solution we have proposed permits a loose association of the syntax and semantics of the transform and mitigates some of the problems. Though the result must still be evaluated by a human judge, it represents an improvement over previously available representations of crosswalks.
Nevertheless, our project is neutral on the recommendation of XSLT. We are currently testing a custom application that performs the kinds of fine-grained operations that are cumbersome in XSLT, such as the manipulation of subfields and the detection of errors. The inputs to the system are XML-encoded records and a statement in a scripting language that records the semantic equivalences in a crosswalk more transparently than is possible with the current version of XSLT. Since the application is designed as a Web service, it can be attached as a process to the OAI repository in the same manner as an XSLT interpreter, along with whatever machine-processable files it requires. This example suggests that the record structure for the crosswalk object is a separate issue from the implementation details of metadata translation. If best practices continue to evolve without a major paradigm shift, the METS record profile we have defined should be flexible enough to accommodate many changes.
4.3. The status of the crosswalk as a standard
The repository of crosswalks we have proposed is built entirely from standards and freely available software.
It promises to increase accessibility to a critical piece of infrastructure for the digital library and reduce duplication of effort. These are surely signs of progress, but there is one looming problem: it is not obvious that the crosswalks themselves are standards. Two points of view are now being articulated among practitioners in the digital library community. On the one hand, crosswalks can be viewed as a stopgap measure for solving the problem of heterogeneous data until a single standard emerges. This view implies that mappings will always be imperfect and that crosswalks will always be evolving because the standards are changing and different applications have different standards for precision. In other words, the metadata translation problem may be local and temporary, and crosswalks developed from the effort to solve this problem for a given institution may not be of general interest. For example, Yee and Beaubien [30] conclude from their study of IMS-Content Packaging and METS that both represent containers for metadata, that the translation is lossy, and that information would have greater integrity if one or the other standard could be eliminated. The authors of a study conducted by Eisenhower National Clearinghouse [31], who are developing a large database of records that describe educational resources for the National Science Digital Library, have reservations similar to Yee's. They developed a three-way crosswalk that associates MARC, Dublin Core and LOM but express reservations about its value for other institutions. On the other hand, a more hopeful view is that crosswalks represent an attempt to identify interoperable elements among standards. 
This is the goal of the MARC Standards Office in the Library of Congress, which recognized the issue in the early days of the Dublin Core initiative and has developed simplified versions of MARC that might be more appropriate for digitized materials, as well as crosswalks, XML encodings, and XSLT scripts, all of which are widely used. Yet most standards refer to the commonly understood semantics of intellectual property (an electronic object was created by someone on a particular date, has a Web-accessible location, and so on) that should be identified and synchronized to permit easier access to collections of heterogeneous resources. This line of reasoning implies that crosswalks, as well as the metadata schemas representing the sources and targets of translation, can achieve the status of standards.
Unfortunately, even if this debate is resolved in favor of the more hopeful prospects for crosswalks, the digital library community has much work to do. Anyone who examines the crosswalk records in our repository will discover that this critical material is far from mature, and that progress on the crosswalk problem is not simply a matter of collecting the information available on the Web and assembling it into an easy-to-use application. To test our data model, we examined two crosswalks in detail: LOM to Dublin Core (DC) and MARC to DC. Since the LOM-DC relationship is still experimental, it is perhaps not surprising that all of the crosswalks are semantically different. By contrast, a MARC-DC mapping is cited in every crosswalk bibliography we have encountered. It is achieving maturity and general acceptance, but even this mapping has multiple versions and encodings, not all of which have all six documents required for a complete crosswalk object. Needless to say, none of the LOM-DC crosswalks have this documentation, either. But our system makes it easier to discover these inadequacies and, perhaps, to address them.
About eight months ago, we discussed these issues with Juha Hakala, Director of Information Technology of the National Library of Finland, who, like us, needs effective crosswalks to solve the everyday problem of ensuring consistency in large databases that are built of record streams from multiple sources. As he said, crosswalks merely represent a "proof of concept". Right now, they need to be augmented with robust systems that handle validation, enhancement, and multiple character encodings and allow human guidance of the translation process. But as standards and the supporting software infrastructure become more mature, the proof of concept acquires more and more functionality that can eventually replace major components of a production system. It is this vision that motivates our work.
References
(Links accessed December 5, 2004)
[1] MARC (Machine-Readable Cataloging). <http://www.loc.gov/marc/>.
[2] Dublin Core. <http://dublincore.org/>.
[3] Canadian Heritage Information Network. <http://www.chin.gc.ca/English>.
[4] MARC to Dublin Core Crosswalk. <http://www.loc.gov/marc/marc2dc.html>.
[5] Murtha Baca, ed. Introduction to Metadata: Pathways to Digital Information. <http://www.getty.edu/research/conducting_research/standards/intrometadata/>.
[6] Michael Day. Mapping Between Metadata Formats. UK Office for Library and Information Networking. <http://www.ukoln.ac.uk/metadata/interoperability/>.
[7] Inventory of Metadata Standards and Practices. Digital Library Committee, Northwestern University Library. <http://staffweb.library.northwestern.edu/dl/metadata/standardsinventory/>.
[8] Metadata Standards Crosswalks. Canadian Heritage Information Network. <http://www.chin.gc.ca/English/Standards/metadata_crosswalks.html>.
[9] XML (Extensible Markup Language). <http://www.w3.org/XML/>.
[10] XSLT (XSL Transformations). <http://www.w3.org/TR/xslt>.
[11] Open Archives Initiative. <http://www.openarchives.org/>.
[12] METS: Metadata Encoding and Transmission Standard. <http://www.loc.gov/standards/mets/>.
[13] Roy Tennant, 2004. It's Opening Day for METS. Library Journal. <http://www.libraryjournal.com/article/CA415392>.
[14] The XML version of the METS record is accessible at:
[15] METS: Overview and Tutorial. <http://www.loc.gov/standards/mets/METSOverview.v2.html>.
[16] Our METS profile is accessible at: <http://purl.org/net/mets_crosswalk_profile>.
[17] A human-readable version of the record is accessible at: <http://errol.oclc.org/schemaTrans.oclc.org.html
[18] As Eric Lease Morgan explains in his Ariadne tutorial, SRW (Search/Retrieve Web) and SRU (Search/Retrieve URL) are "brother and sister" protocols for implementing Internet search engines as Web services. <http://www.ariadne.ac.uk/issue40/morgan/>.
[19] Draft Standard for Learning Object Metadata, 2002. Sponsored by the Learning Technology Standards Committee of the IEEE. <http://ltsc.ieee.org/wg12/files/LOM_1484_12_1_v1_Final_Draft.pdf>.
[20] ERRoL (Extensible Repository Resource Locators for OAI Repositories). <http://www.oclc.org/research/projects/oairesolver/default.htm>.
[21] A demo version of our service is accessible at: <http://errol.oclc.org/schemaTrans.oclc.org.html>.
[22] MARC 21 Concise Edition. <http://www.loc.gov/marc/bibliographic/ecbdhome.html>.
[23] Dublin Core Version 1.1. <http://dublincore.org/documents/dces/>.
[24] MARC XML. <http://www.loc.gov/standards/marcxml/>.
[25] MARC XML to OAI DC XSLT stylesheet. <http://www.loc.gov/standards/marcxml/xslt/MARC21slim2OAIDC.xsl>.
[26] MARC XML to SRW DC XSLT stylesheet. <http://www.loc.gov/standards/marcxml/xslt/MARC21slim2SRWDC.xsl>.
[27] MARC XML to RDF XSLT stylesheet. <http://www.loc.gov/standards/marcxml/xslt/MARC21slim2RDFDC.xsl>. The standard reference for RDF (Resource Description Framework) is <http://www.w3.org/RDF/>.
[28] Functional Requirements for Bibliographic Records, Final Report. <http://www.ifla.org/VII/s13/frbr/frbr.pdf>.
[29] Carol Jean Godby, Devon Smith, and Eric Childress, 2003. Two paths to interoperable metadata. Presented at the Dublin Core 2003 Conference (DC-2003), Seattle, Washington, October 2. <http://www.oclc.org/research/publications/archive/2003/godbydc2003.pdf>.
[30] Raymond Yee and Rick Beaubien, 2004. A preliminary crosswalk from METS to IMS content packaging. Library Hi Tech. 22(1): 69-81.
[31] Kimberly S. Lightle and Judith S. Ridgeway, 2003. Generation of XML Records Across Multiple Metadata Standards. D-Lib Magazine. 9 (9): September. <doi:10.1045/september2003-lightle>.
(Corrections to a tag in Figure 1 and data in Figure 2 were made 12/16/04, and Jeffrey Young's email address was also corrected to jyoung@oclc.org. The URL for reference 27 was corrected as well.)
Copyright © 2004 OCLC Online Computer Library Center, Inc.
doi:10.1045/december2004-godby