D-Lib Magazine
Herbert Van de Sompel, Jeffrey A. Young, Thomas B. Hickey
Abstract

The Open Archives Initiative's Protocol for Metadata Harvesting (OAI-PMH) was created to facilitate discovery of distributed resources. The OAI-PMH achieves this by providing a simple, yet powerful framework for metadata harvesting. Harvesters can incrementally gather records contained in OAI-PMH repositories and use them to create services covering the content of several repositories. The OAI-PMH has been widely accepted, and until recently, it has mainly been applied to make Dublin Core metadata about scholarly objects contained in distributed repositories searchable through a single user interface. This article describes innovative applications of the OAI-PMH that we have introduced in recent projects. In these projects, OAI-PMH concepts such as resource and metadata format have been interpreted in novel ways. The result of doing so illustrates the usefulness of the OAI-PMH beyond the typical resource discovery using Dublin Core metadata. Also, through the inclusion of XSL (note 1) stylesheets in protocol responses, OAI-PMH repositories have been directly overlaid with an interface that allows users to navigate the contained metadata by means of a Web browser. In addition, through the introduction of PURL (note 2) partial redirects, complex OAI-PMH protocol requests have been turned into simple URIs that can more easily be published and used in downstream applications.

Introduction

It comes as no surprise that most current implementations of OAI-PMH [1] repositories mainly make descriptive metadata about resources harvestable. This emphasis on descriptive metadata has its origin in the early motivations of the OAI (Open Archives Initiative, note 3) effort that focused on making distributed resources discoverable. The OAI-PMH facilitates this by providing a simple, yet powerful framework for metadata harvesting that allows harvesters to gather metadata held by different repositories into a central location and to make it searchable there.
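The incremental gathering of records mentioned above can be illustrated with a minimal Python sketch that builds OAI-PMH ListRecords requests. The verb and the metadataPrefix, from, and set arguments are defined by the OAI-PMH [1]; the function name, the choice of base URL, and the date are assumptions made only for this illustration, and the sketch does not perform any network access or handle resumption tokens.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix, from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper).

    Supplying from_date restricts the harvest to records created, changed,
    or deleted since that date, which is what makes harvesting incremental.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date
    if set_spec is not None:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Base URL taken from the XTCat examples later in this article;
# the date below is an arbitrary illustrative value.
BASE_URL = "http://alcme.oclc.org/xtcat/servlet/OAIHandler"

full_harvest = list_records_url(BASE_URL, "oai_dc")
incremental_harvest = list_records_url(BASE_URL, "oai_dc", from_date="2003-06-01")

print(full_harvest)
print(incremental_harvest)
```

A harvester would typically issue the first request once and then repeat the second one periodically, each time using the date of its previous harvest.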
Initially, the descriptive metadata provided by OAI-PMH repositories was to a large extent limited to the mandatory unqualified Dublin Core, but an evolution towards the provision of more extensive descriptive metadata, such as MARC21, is becoming apparent. Providing extensive descriptive metadata is possible in the OAI-PMH thanks to the notion of parallel metadata formats that enables repositories to expose metadata about the same resource in multiple metadata formats.

Creative interpretation of what actually constitutes a resource about which an OAI-PMH repository holds metadata, and of the nature of metadata formats used in the OAI-PMH, has led to suggestions that the protocol could be quite useful beyond the traditional domain of resource discovery using Dublin Core metadata and could reach into the realm of state maintenance in distributed systems [2, 3, 4]. As a matter of fact, metadata records in the OAI-PMH are any data that can be validated against a W3C XML Schema (notes 4, 5, 6). Therefore, the OAI-PMH can be a medium for incremental, date-sensitive exchange of any form of semi-structured data. In the Section below entitled "Unconventional OAI-PMH resources and metadata formats", three creative uses of the OAI-PMH notions of resource and metadata format are described in the context of the application in which they are used.

The metadata contained in OAI-PMH repositories is typically gathered by harvesters that process it and make it searchable through a user interface. In these uses of the OAI-PMH, repositories are never directly accessed by end-users; the "customers" of the repositories are robots. The Section entitled "A user interface for OAI-PMH repositories" describes an approach to overlay OAI-PMH repositories with an interface allowing users to directly navigate the repository content. We also show how this approach has been used to make the GSAFD Thesaurus, the OpenURL Registry and the XTCat Thesis Catalog user-accessible.
The Section entitled "PURLs for simple access to OAI-PMH records" describes an approach to use PURL (Persistent URL, note 7) partial redirects to create simple URIs that lead to records in OAI-PMH repositories. These URIs are easier to publish and use in downstream applications than their corresponding OAI-PMH protocol requests.

Unconventional OAI-PMH resources and metadata formats

In this Section, three examples are given of less conventional OAI-PMH repositories that make creative use of the OAI-PMH notions of resource and metadata format. The repositories are described in the following order: the GSAFD (note 8) Thesaurus, the Digital Library Usage Logs, and the OpenURL (note 9) Registry.

The GSAFD Thesaurus

Libraries put a great deal of effort into creating, maintaining, and using thesauri to improve the recall and precision of database searches. In the GSAFD Thesaurus project, the OCLC Office of Research attempts to determine the value of cross-thesaurus linking and improved thesaurus access. The desired enhanced thesaurus services are intended for both machine and human use. As the basis of this effort, the GSAFD Thesaurus is stored as an OAI-PMH repository.

The GSAFD Thesaurus records, which are available in MARC 21 (note 10) format, were downloaded from the American Library Association site [5]. The records were enriched with 7XX fields where a GSAFD term mapped to a term in the Library of Congress Subject Heading (LCSH) file. The exact nature of this mapping process as well as an evaluation of its added value is beyond the scope of this article. The records were then converted to the MARC XML format [6] and subsequently stored in an OAI-PMH repository.

The resources about which the OAI-PMH repository exposes metadata are the concepts represented by thesaurus terms. In the repository, an OAI-PMH item exists per thesaurus term, and its OAI-PMH identifier is the actual thesaurus term. Three OAI-PMH metadata formats are available per OAI-PMH item:
This approach allows the GSAFD Thesaurus to be simultaneously accessed in three modes, all based on the OAI-PMH protocol:
As a result, by storing the GSAFD Thesaurus as an OAI-PMH repository, its content becomes an integral part of the Web infrastructure where it can be seamlessly used by both humans and machines using standard Web tools.

Digital Library Usage Logs

In a recent collaboration between Old Dominion University and the Los Alamos National Laboratory (LANL) aimed at the deployment of recommender systems, logs that describe the usage of the LANL Digital Library are exposed through the OAI-PMH. The Digital Library Usage Log repositories currently cannot be publicly harvested. Because of the specific aim of the project, only actions by which users express a preference for a specific document are selected and ingested into a relational database. In the current set-up, preference is measured implicitly [8], and a log-entry is typically created when a user clicks an OpenURL [9] provided for a document available via the LANL Digital Library services. The content of the database populated by events that express user preferences is exposed as two interlinked OAI-PMH repositories:
The User Repository and the Document Repository are interlinked by means of XLinks (note 11) in the following manner:
It is expected that the usage of the OAI-PMH in the context of this project will provide a scalable and sustainable infrastructure to share continuously updated usage information with a fully autonomous downstream application. That application will mine harvested usage logs and provide recommendations based on patterns derived from the mining activity. Recommendations will be accessible by querying the application using an XML ContextObject as specified in the Draft NISO OpenURL Standard.

The OpenURL Registry

The upcoming NISO OpenURL Standard is a so-called "framework standard". Its nature is inspired by the Bison-Futé model [10] and extends the usability of OpenURL's context-sensitive services concept [10, 11, 12, 13] beyond the scholarly domain in which OpenURL originated. It does so by specifying a framework that enables communities to define and implement their own OpenURL-based service environment. The approach builds on a Registry that is introduced to contain the explicit definitions for core components of the OpenURL Framework as registered by communities. Such core components include, amongst others, Namespaces of Identifiers that can be used to identify resources, Metadata Formats that can be used to describe resources, ContextObject Formats that can be used to express the payload of an OpenURL using a well-defined syntax, and Community Profiles that list the actual choices a Community makes from Registry entries when it actually deploys its own OpenURL environment.

To bootstrap deployment of the new specification in the original OpenURL community, many initial Registry entries provided by the NISO AX Committee (note 12) are relevant for the purpose of open linking in the scholarly information environment. For example:
Two ContextObject Formats that can be used by several Communities to express the payload of an OpenURL have been defined. The first is a Key/Encoded-Value Format [16] that, like OpenURL 0.1, expresses the OpenURL payload as a list of ampersand-delimited key/value pairs. Its definition is based on the aforementioned XHTML Template. The second is the XML Format [17] in which the OpenURL payload is expressed as an XML instance document, the format of which is defined by means of a W3C XML Schema.

Early in the NISO AX Standardization effort, it had been suggested that making the OpenURL Registry OAI-PMH conformant could lead to an environment in which OpenURL Resolvers could easily remain synchronized with the definitions contained in the Registry by regularly polling for updates and harvesting them whenever required [3]. Although the nature of the content of the Registry has significantly evolved since then, the idea of creating an OAI-PMH compliant Registry has been used for the Registry for Trial Use of the Draft NISO OpenURL Standard.

The nature of the OAI-PMH Repository that holds the Registry entries is described below. In order to avoid confusion between OAI-PMH and OpenURL terminology, the following convention is used in this description: an OAI-PMH term is followed by [OAI], and an OpenURL term is followed by [OURL].

The resources [OAI] about which the repository [OAI] contains metadata [OAI] are the concrete entries for core components of the OpenURL Framework that are registered by Communities in order to be able to deploy their OpenURL environment. For example, a resource [OAI] can be the DOI Namespace [OURL], an XML Metadata Format [OURL] to describe book-like objects, or the Key/Encoded-Value ContextObject Format [OURL] used to express the payload of an OpenURL. Each such registered item receives an Identifier [OURL] at registration, and this becomes an identifier [OAI] for the item in the OAI-PMH repository.
Currently, the OAI-PMH repository supports three metadata formats [OAI] with the following metadataPrefixes [OAI] and characteristics:
If required for the purpose of registration, the repository can support additional metadata formats [OAI], as long as they can be defined by means of W3C XML Schema. For example, it is anticipated that Community Profiles listing Registry choices will be registered and will be unambiguously expressed as well-formed XML instance documents that validate against a special-purpose W3C XML Schema. If this happens, the repository can support a fourth metadata format [OAI] to accommodate such Community Profile documents.

In addition to the described interpretations of the OAI-PMH notions of resource and metadata formats, the repository also builds on the OAI-PMH notion of sets. In the OAI-PMH, repositories can optionally implement sets, which group contained items into hierarchical subdivisions of the repository. The repository used for the OpenURL Registry implements a set structure in which every set refers to a core component of the OpenURL Framework. As such, there are, for example, sets to contain registered Namespaces of Identifiers, a set to contain registered Character Encodings, etc.

The described approach to the deployment of the OpenURL Registry effectively facilitates a straightforward synchronization of information that is essential to the functioning of the OpenURL Framework between the Registry and OpenURL Resolvers. But, as will be shown in the next Section, it also enables the creation of a straightforward interface that allows users to navigate the Registry content in a meaningful manner, by the sole use of OAI-PMH requests.

The OpenURL Registry repository can be harvested at OAI-PMH baseURL http://alcme.oclc.org/openurl/servlet/OAIHandler.

A user interface for OAI-PMH repositories

The OAI-PMH was designed to facilitate incremental harvesting of metadata contained in a repository by robots; so far, its uses have largely been restricted to that application area.
However, it is also possible to explore the content of repositories from a user interface that only uses OAI-PMH requests as its navigation mechanism. As will be shown, this approach can be highly attractive for certain repositories.

In order to implement direct and meaningful access from a Web browser to the content of an OAI-PMH repository, a reference to an XSLT stylesheet is introduced in OAI-PMH protocol responses. With such a reference, a protocol response looks as shown in Table 3. When such a response is sent to an automated process such as an OAI-PMH harvester, the stylesheet reference will be ignored and the XML will be processed directly. However, a Web browser receiving the response will use the stylesheet reference to render the response into HTML in the manner specified by the stylesheet. As such, it is possible to create a browser-based user interface to interact directly with an OAI-PMH repository by merely clicking OAI-PMH requests provided in the interface, and by receiving OAI-PMH responses rendered by means of a specified stylesheet. Similarly (and perhaps even more generally useful), an OAI service provider can directly issue a GetRecord request to a record's home repository on behalf of the user so that the home repository has control of the transformation, display, and branding of the records that users see.

The capabilities of a user interface built with this method are rather limited because only OAI-PMH verbs are available for navigation. However, for simple applications or for the navigation of small repositories, the approach can be quite useful, as is illustrated in the following examples.

Figure 1 shows the result of issuing the OAI-PMH GetRecord request to obtain a Dublin Core record from the XTCat Experimental Thesis Catalog. The response, which includes a stylesheet reference, is rendered by the browser. No further external mediation is required for displaying the metadata contained in the response to users.
In addition, several navigational links are provided in the interface that are OAI-PMH requests.
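The stylesheet technique described in this Section can be sketched in a few lines of Python. The sketch below is not the implementation used in the repositories discussed here; it only assumes that the reference takes the form of a standard xml-stylesheet processing instruction placed after the XML declaration. The helper name, the abbreviated sample response, and the stylesheet URL are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def add_stylesheet_reference(response_xml, stylesheet_href):
    """Insert an xml-stylesheet processing instruction into an XML response.

    The instruction is placed directly after the XML declaration when one
    is present, and at the very start of the document otherwise.
    """
    instruction = '<?xml-stylesheet type="text/xsl" href=%s?>' % quoteattr(stylesheet_href)
    if response_xml.startswith("<?xml"):
        end = response_xml.index("?>") + 2
        return response_xml[:end] + "\n" + instruction + response_xml[end:]
    return instruction + "\n" + response_xml


# Abbreviated, made-up OAI-PMH response used only for illustration.
SAMPLE_RESPONSE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">\n'
    '  <responseDate>2003-07-01T00:00:00Z</responseDate>\n'
    '  <request verb="Identify">http://example.org/oai</request>\n'
    '</OAI-PMH>\n'
)

# Hypothetical stylesheet location.
styled = add_stylesheet_reference(SAMPLE_RESPONSE, "http://example.org/oai2html.xsl")
print(styled)

# A harvester-style XML parse still sees the same OAI-PMH root element;
# ElementTree discards processing instructions by default.
root = ET.fromstring(styled.encode("utf-8"))
print(root.tag)
```

The final parse illustrates why the same response can serve both audiences: an XML parser of the kind a harvester would use skips the processing instruction, whereas a Web browser acts on it.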
Figure 2 shows how the OAI-PMH ListIdentifiers verb is used to render terms from the small GSAFD Thesaurus in a user interface. In the GSAFD Thesaurus, terms are treated as OAI-PMH identifiers. As can be seen, the interface allows for further exploration of each thesaurus term. This is achieved by hyperlinking each term with an OAI-PMH GetRecord request for the Z39.19 [7] metadata that describes the term. This approach would not lead to an interesting user interface if the OAI-PMH identifiers were not meaningful in themselves; for example, they could be numbers sequentially assigned to thesaurus terms instead of the terms themselves. In such cases, a ListRecords request could be substituted for ListIdentifiers and the meaningful thesaurus terms could be extracted from the appropriate metadata tag.
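The hyperlinking of identifiers to GetRecord requests can be sketched as follows. In the interfaces described here the work is done by a stylesheet in the browser; the Python below only illustrates the same mapping. The base URL, the metadataPrefix value "z39_19", and the two sample terms are assumptions made for this example and are not taken from the actual repository.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def get_record_links(list_identifiers_xml, base_url, metadata_prefix):
    """Return (identifier, GetRecord URL) pairs for a ListIdentifiers response."""
    root = ET.fromstring(list_identifiers_xml)
    links = []
    for element in root.iter(OAI_NS + "identifier"):
        identifier = element.text.strip()
        query = urlencode({
            "verb": "GetRecord",
            "metadataPrefix": metadata_prefix,
            "identifier": identifier,
        })
        links.append((identifier, base_url + "?" + query))
    return links


# Abbreviated, made-up ListIdentifiers response; the terms are illustrative.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>Adventure fiction</identifier></header>
    <header><identifier>Ghost stories</identifier></header>
  </ListIdentifiers>
</OAI-PMH>"""

# Hypothetical base URL and metadataPrefix.
links = get_record_links(SAMPLE, "http://example.org/gsafd/OAIHandler", "z39_19")
for term, url in links:
    print(term, "->", url)
```

Because the identifiers are the thesaurus terms themselves, the first element of each pair can be used directly as the visible link text.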
Figure 3 shows how the OAI-PMH ListSets request is used in the OpenURL Registry to display a list of the core components of the OpenURL Framework. Again, the interface allows for further exploration of the Registry by means of OAI-PMH requests. For example, the hyperlink provided with each set name is an OAI-PMH ListRecords request for all items in the set that have Dublin Core metadata. Because all entries in the OpenURL Registry have Dublin Core metadata, the result will be a list describing each item of the specified set, i.e., of the specified core component of the OpenURL Framework.
Finally, Figure 4 shows the usage of the OAI-PMH ListMetadataFormats request to allow users of the OpenURL Registry to navigate to either of the two currently existing types of definitions for OpenURL ContextObject Formats or Metadata Formats, namely the Key/Encoded-Value Formats defined by means of the XHTML Template, or the XML Format defined by means of XML Schema. For each definition type, a metadata format [OAI] is available in the repository. The hyperlink provided with each of the listed types is an OAI-PMH ListRecords request for Dublin Core metadata of all Format definitions of the specified type, i.e., with the metadata format [OAI].
PURLs for simple access to OAI-PMH records

OAI-PMH identifiers uniquely identify items in OAI-PMH repositories. They are resolved through the use of the identifier itself, along with an identifier of a metadata format available for that item. The resolution occurs through the submission of rather lengthy OAI-PMH GetRecord requests (e.g., http://alcme.oclc.org/xtcat/servlet/OAIHandler?verb=...).

PURLs [19] are a method for creating and maintaining URLs for digital collections. PURLs do this by offering a level of indirection to URLs that enables a collection owner to change the URL for objects in the collection while maintaining a Persistent URL for publication and access. The PURL system also includes the ability to do "partial redirects" in which only part of the PURL is used for the indirection to an actual URL. This turns out to be an effective technique for creating a name gateway to turn complex OAI-PMH GetRecord requests into cool URLs (note 15).

The proposed scheme [20] for creating OAI-PMH GetRecord PURLs is:

"http://purl.org/oai/" repository-identifier "/" metadataPrefix "/" local-identifier

For OAI-PMH repositories where the identifiers conform to the oai-identifier schema [21], the repository-identifier in the PURL should match the repository-identifier embedded in the oai-identifier. For example, the XTCat [22] Repository has oai-identifiers of the form oai:xtcat.oclc.org:OCLCNo/ocm00006585. The repository-identifier is therefore xtcat.oclc.org and the local-identifier for this particular item is OCLCNo/ocm00006585. Following the proposed PURL scheme, the corresponding cool URL is http://purl.org/oai/xtcat.oclc.org/oai_dc/OCLCNo/ocm00006585, which resolves to the oai_dc GetRecord response shown earlier.
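The proposed PURL scheme can be illustrated with a short Python sketch that derives the PURL from an oai-identifier and, for comparison, builds the lengthier GetRecord request for the same record. The function names are hypothetical; the identifier, metadataPrefix, and base URL are those of the XTCat example above.

```python
def split_oai_identifier(oai_identifier):
    """Split an oai-identifier into (repository-identifier, local-identifier)."""
    scheme, repository_identifier, local_identifier = oai_identifier.split(":", 2)
    if scheme != "oai":
        raise ValueError("not an oai-identifier: %r" % oai_identifier)
    return repository_identifier, local_identifier


def purl_for(oai_identifier, metadata_prefix):
    """Build the PURL for a record, following the proposed scheme."""
    repository_identifier, local_identifier = split_oai_identifier(oai_identifier)
    return "http://purl.org/oai/%s/%s/%s" % (
        repository_identifier, metadata_prefix, local_identifier)


def get_record_request(base_url, oai_identifier, metadata_prefix):
    """Build the OAI-PMH GetRecord request for the same record."""
    return "%s?verb=GetRecord&metadataPrefix=%s&identifier=%s" % (
        base_url, metadata_prefix, oai_identifier)


identifier = "oai:xtcat.oclc.org:OCLCNo/ocm00006585"
purl = purl_for(identifier, "oai_dc")
request = get_record_request(
    "http://alcme.oclc.org/xtcat/servlet/OAIHandler", identifier, "oai_dc")

print(purl)
print(request)
```

The sketch splits on the first two colons only, so a local-identifier that itself contains colons or slashes is preserved unchanged.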
Resolution of such PURLs is achieved by creating a PURL partial redirect of the form:

"http://purl.org/oai/" repository-identifier "/" metadataPrefix "/"

which will be mapped to the destination:

baseURL "?verb=GetRecord&metadataPrefix=" metadataPrefix "&identifier=oai:" repository-identifier ":"

Examples:

/oai/xtcat.oclc.org/oai_dc/ -> http://alcme.oclc.org/xtcat/servlet/OAIHandler?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:xtcat.oclc.org:

/oai/registry.openurl.info/oai_dc/ -> http://www.openurl.info/registry/servlet/OAIHandler?verb=GetRecord&metadataPrefix=oai_dc&identifier=

Note, however, that the OpenURL Registry used in the latter example doesn't use identifiers that conform to the oai-identifier scheme, so the entire identifier must be appended to the PURL rather than a parsed local-identifier.

Appending a local-identifier to a PURL partial redirect has the effect of appending it to the PURL partial redirect's destination, thus completing the identifier parameter in the OAI-PMH GetRecord request. The described technique makes publishing of OAI-PMH GetRecord requests in downstream applications easier and makes handling the requests by humans more straightforward.

Conclusions

This article has introduced some novel ways to use the OAI-PMH. It has been shown that, through the creative interpretation of the OAI-PMH notions of resource and metadata format, repositories with rather unconventional content, such as Digital Library usage logs, can be deployed. These applications further strengthen the suggestion that the OAI-PMH can effectively be used as a mechanism to maintain state in distributed systems. It has also been shown that simple user interfaces can be implemented by the mere use of OAI-PMH requests and responses that include stylesheet references.
For certain applications, such as the OpenURL Registry, the interfaces that can be created in this manner seem to be quite adequate, and hence the proposed approach is attractive if only because of the simplicity of its implementation. The availability of an increasing number of records in OAI-PMH repositories generates the need to be able to reference such records in downstream applications, through URIs (note 16) that are simpler to publish and use than the OAI-PMH HTTP GET requests used to harvest them from repositories. This article has shown that PURL partial redirects can be used to that end.

Acknowledgments

The authors would like to acknowledge the work of Patrick Hochstenbach (Los Alamos National Laboratory) and Johan Bollen (Old Dominion University) on the Digital Library Usage Log repository, as well as the work of Phil Norman (OCLC) on the OpenURL Registry. Many thanks also to Patrick Hochstenbach, Carl Lagoze, and Michael Nelson for their feedback on the draft of this article.

Notes

1. XSL - Extensible Stylesheet Language, <http://www.w3.org/Style/XSL/>.
2. PURL - Persistent URL, <http://purl.org/>.
3. OAI - Open Archives Initiative, <http://www.openarchives.org/>.
4. W3C - World Wide Web Consortium, <http://www.w3c.org/>.
5. XML - Extensible Markup Language, <http://www.w3.org/XML/>.
6. XML schema, <http://www.w3.org/XML/Schema>.
7. URL - Uniform Resource Locator, <http://www.w3.org/Addressing/>.
8. GSAFD - Guidelines on Subject Access to Individual Works of Fiction, Drama, Etc., <http://www.ala.org/Content/ContentGroups/ALCTS1/Cataloging_and_Classification_Section/
9. OpenURL - NISO AX Committee. 2003. The OpenURL Framework for Context-Sensitive Services, Draft Standard. <http://library.caltech.edu/openurl/Public_Comments.htm>.
10. MARC 21 - MAchine-Readable Cataloging (21 refers to the 21st century), <http://www.loc.gov/marc/>.
11. XLink. <http://www.w3.org/TR/xlink/>.
12. NISO AX - National Information Standards Organization, <http://www.niso.org/committees/committee_ax.html>.
13. DOI - Digital Object Identifier, <http://www.doi.org/>.
14. XHTML - Extensible HyperText Markup Language, <http://www.w3.org/MarkUp/>.
15. Cool URLs, "Cool URIs don't change," <http://www.w3.org/Provider/Style/URI.html>.
16. URI - Uniform Resource Identifier, <http://www.w3.org/Addressing>.

References

[1] Lagoze, Carl, Herbert Van de Sompel, Michael Nelson, and Simeon Warner. 2002. The Open Archives Initiative Protocol for Metadata Harvesting - Version 2.0. <http://www.openarchives.org/OAI/openarchivesprotocol.html>.
[2] Van de Sompel, Herbert. 2000. Closing Keynote Address for the Task Force Meeting of the Coalition for Networked Information, San Antonio TX, Fall 2000. <http://www.cni.org/tfms/2000b.fall/handout/HVDS-CNI-2000Ftf.pdf>.
[3] Van de Sompel, Herbert and Donna Bergmark. 2002. A distributed registry for OpenURL metadata schemas with an OAI-PMH conformant central repository. IEEE Proceedings of the 2002 International Conference on Parallel Processing Workshops, 18-21 August 2002, Vancouver CA, pp. 469-472.
[4] Nelson, Michael. 2002. Service Providers: Future Perspectives. Presentation at the 2nd Workshop on the Open Archives Initiative. Geneva, Switzerland, October 2002. <http://agenda.cern.ch/fullAgenda.php?ida=a02333>.
[5] Association for Library Collections & Technical Services. 2003. ALA | MARC 21 Authority Records for GSAFD Genre Terms. <http://www.ala.org/Content/ContentGroups/ALCTS1/
[6] MARCXML. <http://www.loc.gov/standards/marcxml/>.
[7] National Information Standards Organization. 1993. Guidelines for the Construction, Format, and Management of Monolingual Thesauri. <http://www.niso.org/standards/resources/Z39-19.pdf>.
[8] Claypool, Mark, Phong Le, Makoto Wased, and David Brown. 2001. Implicit Interest Indicators. Proceedings of the International Conference on Intelligent User Interfaces, January 14-17 2001, Santa Fe, NM, pp. 33-40.
[9] Van de Sompel, Herbert, Patrick Hochstenbach and Oren Beit-Arie. 2000. OpenURL Syntax Description.
[10] Van de Sompel, Herbert and Oren Beit-Arie. 2001. Generalizing the OpenURL Framework beyond References to Scholarly Works: The Bison-Futé Model. D-Lib Magazine. 7(7/8). <doi:10.1045/july2001-vandesompel>.
[11] Van de Sompel, Herbert and Patrick Hochstenbach. 1999. Reference Linking in a Hybrid Library Environment. Part 1: Frameworks for
[12] Van de Sompel, Herbert and Patrick Hochstenbach. 1999. Reference Linking in a Hybrid Library Environment. Part 2: SFX, a Generic Linking
[13] Van de Sompel, Herbert and Patrick Hochstenbach. 1999. Reference Linking in a Hybrid Library Environment. Part 3: Generalizing the SFX
[14] Johnston, Pete. 2002. Unqualified Dublin Core XML Schema for OAI-PMH. <http://www.openarchives.org/OAI/2.0/oai_dc.xsd>.
[15] NISO Committee AX. 2003. The Z39.88-2003 Matrix Constraint Language. <http://www.openurl.info/registry/dc/ori:fmt:kev:mtx>.
[16] NISO Committee AX. 2003. The Key/Encoded-Value Physical Representation. <http://www.openurl.info/registry/dc/ori:fmt:kev>.
[17] NISO Committee AX. 2003. The XML Physical Representation. <http://www.openurl.info/registry/dc/ori:fmt:xml>.
[18] Berners-Lee, Tim. 1998. Hypertext Style: Cool URIs don't change. <http://www.w3.org/Provider/Style/URI>.
[19] OCLC. 2003. Persistent URL Home Page. <http://purl.org/>.
[20] Powell, Andy, Jeffrey A. Young, Thomas B. Hickey. In press.
[21] Lagoze, Carl, Herbert Van de Sompel, Michael Nelson, and Simeon Warner. 2002. Implementation Guidelines for the Open Archives Initiative for Metadata Harvesting: Specification and XML Schema for the OAI Identifier Format. <http://www.openarchives.org/OAI/2.0/guidelines-oai-identifier.htm>.
[22] OCLC. 2003. XTCat - Experimental Thesis Catalog. <http://alcme.oclc.org/xtcat/>.
Copyright © Herbert Van de Sompel, Jeffrey A. Young, and Thomas B. Hickey
DOI: 10.1045/july2003-young