D-Lib Magazine
Helen Atkins, Director, Database Development, Institute for Scientific Information
DOI-X is a prototype metadata database designed to support DOI lookups. The prototype is intended to address the integration of metadata registration and maintenance with basic DOI registration and maintenance, enabling publishers to use a single mechanism and a single quality-assurance process to register both DOIs and their associated metadata. It also contains the lookup mechanisms necessary to access the journal article metadata, both on a single-item lookup basis and on a batch basis, such as would facilitate reference linking. The prototype database was introduced and demonstrated to attendees at the STM International Meeting and the Frankfurt Book Fair in October 1999. This paper discusses the background for the creation of DOI-X and its salient features.

Introduction

In 1997, the Association of American Publishers (AAP) developed the Digital Object Identifier (DOI) to enable readers to find content on the Internet with a persistent and reliable identifier [1]. Hyperlinking between article bibliographies and the cited articles is a natural application of DOIs. Almost from the inception of the DOI, therefore, its first practical application was deemed to be the development of a DOI lookup service based on a DOI metadata database. Early in 1999, the AAP's Enabling Technology Committee's subcommittee on DOI decided to develop, implement, and evaluate a large-scale prototype of an end-to-end approach to reference linking. The proposed prototype system would enable publishers and others to use journal article metadata to look up DOIs for the purpose of embedding links to the cited articles in journal article reference lists.

At the same time that the AAP was considering the need for a metadata database, other industry players began to consider the optimal method of achieving reference linking within the scholarly literature on a cross-publisher basis. The International DOI Foundation (IDF), for example, created a Metadata Policy Committee, chaired by David Sidman of Wiley, to define the business framework and development model and to recommend policy for a fully operational metadata database. Happily, many of the DOI-X participants also served on the Metadata Policy Committee, and useful information was exchanged between the two groups, each helping the other work through difficult issues. The library community and the university sector also formed a committee under the auspices of NISO, DLF, CNI, NFAIS, and SSP [2] to investigate reference linking. Members included Cliff Lynch of CNI, Bill Arms of CNRI, David Sidman of Wiley, Dale Flecker of Harvard, Priscilla Caplan of the University of Chicago, Evan Owens of the University of Chicago Press, Andy Stevens of Wiley, Jim Ostell of NCBI, Norman Paskin of the IDF, Mary Grace Palumbo of Dawson/Faxon (now RoweCom), Helen Atkins of ISI, and Don Waters of CLIR. The committee's final report, issued in June 1999 [3], agreed that a metadata database approach was the preferred architecture on conceptual, cost, and pragmatic grounds, and set forth the major contenders for a basic journal metadata element set [4].

While monitoring the work of the other committees, the AAP DOI Subcommittee, under the leadership of Howard Ratner of Springer-Verlag, decided to build a real, working prototype of a Metadata Lookup Database, populated with hundreds of thousands of actual metadata records from participating publishers, and with a working metadata registration process integrated with the DOI registration process.
The Corporation for National Research Initiatives (CNRI) agreed to participate in the project in April, and approval to proceed was received from the IDF in May 1999. The real challenge at that point (and also the surest guarantee of success, if done right) lay in recruiting a critical mass of major publishers to join in and take the database to a truly cross-publisher level, and equally, in building the infrastructure that would enable this to happen. The group therefore sought other participants willing to work towards developing a prototype system. In addition to the AAP, IDF, and CNRI, participants included both primary and secondary publishing companies: Academic Press, American Institute of Physics, Elsevier Science/ScienceDirect, Institute for Scientific Information, John Wiley & Sons, RoweCom/Information Quest, and Springer-Verlag. With the ambitious goal of demonstrating a live, operational service at the Frankfurt Book Fair in October 1999, the project was launched in earnest on July 1. The project, named DOI-X [5], proceeded with the stipulation that the metadata deposited and the DOIs retrieved would not be used for production-level service.

Early work involved the development of procedures that would govern how participants would work together, definition of the scope of the project, specification of the metadata to be deposited with DOIs to allow later matching of references and articles, and rules for appropriate use of the metadata. Using the matrix of metadata elements assembled by the NISO/DLF/CNI/NFAIS/SSP working group, the DOI-X team determined its optimal metadata element set for DOI-X, along with the accompanying data input rules. The deposit of metadata into the prototype system and searching of the metadata database began in September and continued until December 31, 1999. It was, in the end, a highly successful prototype, which led to the formation of a new non-profit organization called CrossRef that is putting the system into production [6].

Goals

The overriding goal of the project was to create a prototype system that would support the use of DOIs for journal article reference linking. Within that, there were two major phases of the project: 1) deposit of DOIs and metadata corresponding to articles published by the participating primary publishers, and 2) lookup of DOIs given basic metadata found in journal references, for the purpose of creating links from the references to the original articles. Within the first phase, the objectives were to develop:
The second phase of development required:
Once these were in place, testing of the system from both perspectives (deposit and lookup) could proceed.

Metadata specification

The DOI-X data format was specified in an XML Document Type Definition (DTD) [7] and an accompanying "rules document." The rules document provided DTD documentation and specific constraints that could not be expressed in XML (e.g., ISO date format, or the limitation of journal titles to 256 characters) [8]. The DTD was designed to capture, in discrete records, the metadata about the full text of a journal article, an abstract, or a bibliographic record. (Allowing deposit of DOIs and metadata for secondary database records could have enabled the creation of links to bibliographic records, possibly with abstracts, for older articles not yet put online by publishers. In the end, however, we elected not to take deposits for bibliographic records in the prototype because some searches would return multiple hits: DOIs for the full text of an article as well as DOIs for bibliographic records describing the article in third-party databases. We felt this would be confusing in the prototype stage, especially since we would not have time to develop an interface that would differentiate the various kinds of records for the user.)

Within the DTD, journal article metadata was grouped into sub-elements for journal data and article data. A root element for a batch submission allowed data about the submission itself (from whom, on what date, etc.) to be followed by one or more elements containing journal article metadata. The DTD was designed with the intention that the journal article metadata element would be augmented with analogous container elements for conference proceedings and other publishing genres. The group briefly considered whether to define the metadata specification in the Resource Description Framework (RDF) [9], which provides a higher-level mapping layer than XML. Though there might be some cost in not implementing RDF at the outset, it was clearly sensible to postpone this decision, given the known overhead of RDF versus its more limited payback in the very near term. In addition, since the DOI-X field set was to be mapped to the INDECS standard [10] at the end of the DOI-X project, creating an RDF version would be just another implementation detail at that point.

Submission: Collection, Validation and Feedback

Metadata batches for bundles of articles were assembled by participating publishers according to the DTD and submitted to a centralized collection service. The XML batches were submitted via HTTP POST to a named HTTP server. From the HTTP server, the batches were passed onwards to a process written in Java, which parsed and validated the XML file and notified the submitter in real time, via an HTML response, as to whether or not the XML was valid and had been accepted. If the validation step failed, the batch was rejected, and the contributor was expected to correct and resubmit it [11]. For security purposes, the submission process captured and verified a login and password via HTTP basic authentication before validating the XML. The XML files themselves contained publisher-defined batch IDs, which provided a mechanism for feedback, and timestamps, which ensured that replacements were treated correctly. If the submitter wished, each DOI record could carry its own timestamp. Validation in this prototype phase was kept to the minimal syntax check required for the system to operate properly.
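The submission protocol, in other words, amounted to an authenticated HTTP POST of an XML batch followed by an immediate accept-or-reject response. The following minimal sketch illustrates that flow in Python using the requests library; the collection URL, the login credentials, and all of the XML element names are illustrative assumptions and are not taken from the actual DOI-X DTD or service.

    # Minimal sketch of a DOI-X-style batch submission over HTTP POST with
    # basic authentication. Endpoint, credentials, and element names are
    # hypothetical; the real DOI-X DTD [7] defines its own structure.
    import requests

    batch_xml = """<?xml version="1.0"?>
    <doi_batch batch_id="publisher-1999-10-15-001" timestamp="19991015120000">
      <journal_article>
        <journal_title>Journal of Important Results</journal_title>
        <issn>1234-5678</issn>
        <volume>12</volume><issue>3</issue><first_page>245</first_page>
        <publication_year>1999</publication_year>
        <author>Smith</author>
        <doi>10.9999/jir.1999.12.245</doi>
        <url>http://publisher.example.com/jir/12/3/245</url>
      </journal_article>
    </doi_batch>"""

    # POST the batch; the server parses and validates the XML and replies in
    # real time with an HTML page saying whether the batch was accepted.
    response = requests.post(
        "http://meta.example.org/submit",                # hypothetical collection URL
        data=batch_xml.encode("utf-8"),
        headers={"Content-Type": "application/xml"},
        auth=("publisher_login", "publisher_password"),  # HTTP basic authentication
    )

    if response.ok:
        print("Batch accepted for loading:", response.text[:200])
    else:
        print("Batch rejected; correct the XML and resubmit:", response.status_code)

As described above, a rejected batch was simply corrected by the contributor and resubmitted through the same channel.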
It was thought that more thorough validation would be essential in a full-scale implementation to ensure the integrity of the metadata database. In particular, no attempt was made to validate against the "rules" accompanying the DTD, and no attempt was made to validate the various data types contained within the XML files. Further enhancements were anticipated in the area of security. It was assumed that each metadata batch could be signed by the submitter using PGP (Pretty Good Privacy). Consideration was also given to whether the submission of the whole batch should be encrypted in some way, either by using a secure channel (e.g., Secure Sockets Layer) or by PGP file encryption. However, due to the prototype nature of the project, no encryption was actually implemented.

Upon successful submission of a batch by a participating publisher, the batch file was passed to two database systems: the DOI Directory, for registration of DOIs and corresponding URLs, and a metadata database (MDDB), for storage of the DOIs and associated metadata. The DOI Directory is based on CNRI's Handle System technology and has been previously described [12]. The MDDB is described in the next section. Upon successful loading into these two systems, the submitter received a diagnostic email within 24 hours of the initial submission.

Implementation of database and loader

The metadata database and loader were implemented using off-the-shelf relational database technology (Oracle 8i) and custom Perl scripts to transfer files and load the database. Conceptually, the database consisted of one table with two fields to hold the DOI and associated metadata (though additional fields in the table were used to keep track of creation dates and timestamps). The metadata was stored directly in XML format and indexed using the Oracle interMedia full-text search feature (also known as the ConText cartridge). Using a full-text XML index is desirable because it allows maximum flexibility in the metadata field set without necessitating changes in the database schema. For example, additional elements could be added to the DTD, or the DTD could even be changed entirely, without adding table columns or affecting the database design (though a re-indexing operation would still be necessary). Thus, new document types such as conference papers or books could be added to the database with minimal effort.

The loading of the database was carefully coordinated with the loading of the DOI Directory (which makes use of the CNRI Handle System [13]) so that both databases were kept in sync. This two-phase commit process ensured that a loading failure in either system was rolled back in both, and thus the two systems would always contain consistent data. The loading scripts performed a number of functions, including:
These scripts were all automated to run daily, and any system errors were automatically emailed to the system administrators. The most complex issue in the loading scripts was the correct handling of special characters in author names and article titles, including diacritical marks, hyphens, symbols, and other non-standard characters. Although all metadata was loaded as XML (and thus could have made use of Unicode character entities), it was important that the search interface (described below) allow for a variety of representations of special characters. Thus, all special characters were "down-converted" to their base letter in the database indexes. For example, the letter ü (lowercase "u" with umlaut) was stored in the database in its XML-encoded form but was indexed as a simple "u". Then, assuming a similar down-conversion was applied to the query terms, all the various representations of ü could successfully be matched. Thus, for example, the name Müller was indexed as Muller and was matched regardless of whether a query gave the name as Muller, as Müller, or with the ü expressed as an XML character entity or numeric character reference.

Lookup

Two types of lookup mechanisms were designed for searching the metadata database: interactive and batch. The interactive interface allowed an individual user to visit a Web page, fill out a form with various citation fields (author, journal title, volume, page, etc.), and then search for DOIs based on the submitted fields. The returned results would be all of the DOIs that fit the requested criteria, along with their associated full metadata. The main intended uses for this interface were for authors or journal editors to validate citations, and for publishers to do spot checks on the loading of their respective metadata.

A batch interface was provided to allow publishers to do wholesale lookups of large quantities of citations. Thus, for a given article typically containing twenty citations, all twenty of the citations could be looked up in one submission. The batch query input was specified as a simple text file containing one line for each citation to be matched. Each line contained the basic citation data in a specific format, as shown in Figure 1, and was similar in concept to the format used to query the PubMed database [14]. The allowed fields on each line were ISSN, journal title, author name, volume, issue, page, year, and item type. The batch was submitted via an HTTP POST operation to a specific Web address, which then returned a text file containing the complete item metadata plus the DOI (if found), plus a diagnostic message.
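A hedged sketch of such a batch lookup follows. The field order reflects the list given above (ISSN, journal title, author name, volume, issue, page, year, item type) and the pipe-delimited format mentioned later in this paper; the exact layout shown in Figure 1, the query URL, and the sample values are assumptions rather than the actual DOI-X specification.

    # Sketch of a batch DOI lookup in the spirit of the DOI-X batch interface.
    # Field order and delimiter are taken from the prose description; the URL,
    # credentials, and sample citations are hypothetical.
    import requests

    citations = [
        # (ISSN, journal title, author, volume, issue, page, year, item type)
        ("1234-5678", "J Important Res", "Smith", "12", "3", "245", "1999", "full_text"),
        ("0000-1111", "Ann Hypothetical Sci", "Muller", "7", "", "101", "1998", "full_text"),
    ]

    # One line per citation, fields separated by the pipe character.
    query_body = "\n".join("|".join(fields) for fields in citations)

    response = requests.post(
        "http://meta.example.org/batch_query",           # hypothetical lookup URL
        data=query_body.encode("utf-8"),
        auth=("publisher_login", "publisher_password"),
    )

    # The service returns one line per citation: the submitted metadata, the
    # DOI (if found), and a diagnostic message describing the match.
    for line in response.text.splitlines():
        print(line)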
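Returning to the special-character handling described in the loading section above, the down-conversion of accented letters to their base letters can be illustrated with Unicode normalization. The prototype's loading scripts were written in Perl; the Python sketch below only illustrates the idea and is not the actual implementation. The same transformation would be applied both to indexed metadata and to incoming query terms such as the author field in the batch query example.

    # Illustration of "down-converting" accented characters to base letters,
    # so that e.g. Müller indexes and matches as Muller. This is a sketch of
    # the idea using Unicode normalization, not the prototype's Perl code.
    import unicodedata

    def down_convert(text: str) -> str:
        decomposed = unicodedata.normalize("NFKD", text)   # split base letters from accents
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    # Applied to both the indexed metadata and the query terms, any
    # representation of the umlaut reduces to the same indexed form.
    assert down_convert("Müller") == "Muller"
    print(down_convert("Müller"))   # -> Muller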
Figure 1. Example of batch query input and output.

The purpose of the diagnostic message was to indicate the closeness of the match. For example, for a given citation, the database might find an exact match, a close match, no match, or multiple matches. The following table lists some examples of diagnostic messages and their meanings:
The batch interface was designed with the intention that it could be integrated straightforwardly into various publisher production systems, thus allowing the citation lookup process to become part of the standard production workflow for all journal articles.

Results

This section summarizes the collective experiences of the five primary and two secondary publishers who participated in the prototype.

Conversion from in-house format to DOI-X DTD

For most publishers, the DOI-X DTD was found to be easy to use for submitting journal article metadata to the database and for registering DOIs. All participating publishers already had their metadata in a tagged format (SGML or XML), and thus it was straightforward for each participant to write a conversion program to convert their internal DTD to the common one used in this prototype (a sketch of this kind of mapping appears at the end of this section).

Usability of data rules

The data typing rules were deemed suitable for submitting journal article metadata to the database, though the application of the rules was voluntary and open to interpretation. Because there was no validation against the rules, data inaccuracies were accepted by the system.

Data upload

The HTTP-based push protocol for metadata submission proved to be straightforward and easy to automate for all participants, though several felt that it would be more efficient and effective to use FTP because of their existing expertise with it. It was also agreed that a pull protocol should be investigated, both to avoid contention at the push target (thus improving scalability) and to increase efficiency for multiple collection efforts (i.e., laying out the data once for multiple pick-ups as opposed to pushing multiple times to multiple locations). It was also thought that the feedback process could be improved to provide full validation results in real time, and possibly to include the fragment of the data in which an error is located, to assist in a data provider's investigation and resolution.

Batch size

The various participants found the batch size restriction of 10 MB to be either acceptable or a major headache. To facilitate large submissions, the collection process itself might be modified to split submissions into manageable batches so that the publisher would not need to do so. This function would be particularly useful for large backfills of metadata into the database.

Workflow integration - metadata submission

Most of the participants found it easy to incorporate the metadata submission procedures into their existing systems, both for legacy data and as an ongoing process. The turnaround time between submission of data and receipt of feedback was found to be adequate, though most felt it would be better to get a response in real time rather than by email, in order to allow an automated system to match submissions to corresponding error messages.

Workflow integration - query mechanism

Incorporating the batch query procedures into existing systems was tried by a few participants, with varying results. For some it slowed the process down, from an entirely local process to one requiring the Internet, whereas for others it provided new functionality where none existed before. All found the process either equivalent to or harder than their current systems, but all agreed that having a standard that will work for the entire universe of reference linking will be a marked advantage over existing systems with limited functionality.
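As a rough illustration of the conversion step mentioned at the start of these results, the sketch below maps a hypothetical in-house XML article header onto a DOI-X-style deposit record. Every element name on both sides is invented for the example; the real DOI-X DTD [7] defines its own names and structure, and real conversion programs also had to handle author lists, special characters, and the batch-level wrapper element.

    # Sketch of an in-house-to-DOI-X-style conversion. All element names on
    # both sides are hypothetical and used only to show the shape of the task.
    import xml.etree.ElementTree as ET

    in_house = ET.fromstring("""
    <article-header>
      <jrn-title>Journal of Important Results</jrn-title>
      <issn>1234-5678</issn>
      <vol>12</vol><iss>3</iss><fpage>245</fpage><year>1999</year>
      <surname>Smith</surname>
      <doi>10.9999/jir.1999.12.245</doi>
    </article-header>""")

    # Simple field-to-field mapping from internal tag names to deposit tags.
    mapping = {
        "journal_title": "jrn-title", "issn": "issn", "volume": "vol",
        "issue": "iss", "first_page": "fpage", "publication_year": "year",
        "author": "surname", "doi": "doi",
    }

    record = ET.Element("journal_article")
    for target, source in mapping.items():
        ET.SubElement(record, target).text = in_house.findtext(source, default="")

    print(ET.tostring(record, encoding="unicode"))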
Based on their experience with the prototype, participants would definitely consider adopting DOI-X for reference linking in a production environment. They agreed that DOI-X provides a functional and entirely satisfactory framework for reference linking in the most general case. The method for handling ambiguous queries was not as helpful as it could have been. Participants suggested that it would have been extremely helpful to list the multiple results and return full records, so that matching software on the publisher side could use this information to pick the correct record (see "Resolving ambiguity" below). The batch query interface was a definite success, and it certainly facilitated the resolution of journal article citations to DOIs. However, the query engine did not allow for "smart queries," which meant that if any field was supplied but did not match the deposited record (rather than being left blank), the result would be a failed lookup. For example, if a query contained either erroneous information or additional information not in the original deposited record in any one field, even if all other fields were accurate, no match would be found. It was recognized that smart, or fuzzy, matching capabilities would be an essential component of a production-level system.

Statistics for records

During the prototype, there were 578,686 database queries, with 69,149 (or 12%) resulting in matches. This figure would clearly have been higher if the system had implemented smart queries.

Integration of raw DOI-X metadata into very large systems

Some of the participants who already maintain their own citation matching engines with large internal bibliographic databases found it more useful to receive a private feed of the raw XML metadata so that it could be integrated with their existing systems (instead of querying the central DOI-X database over the Internet). Although not all participants who might have adopted this method actually had the time to do so, those who did found that a local copy of the metadata was useful for large-scale dynamic reference linking. Using internal matching systems with the raw data resulted in high match rates for at least one participant. In addition, maintaining a local copy of the database enabled processing of huge numbers of citations in real time. The procedure used to distribute the raw metadata to participants (FTP) was deemed adequate for a prototype, but a more structured process would be needed for efficient management, so that users could determine when they had collected all of the metadata and when new metadata was available for testing.

Going Forward

During the prototype, several potential next steps were identified. Some of these are now under consideration by the CrossRef project.

New elements

An element for "et al." and rules governing its use need to be added to the DTD. It was also recommended that there be an element for publisher name, since a given publisher might use an external agency to register its DOIs (in which case the agency, rather than the publisher, would properly belong within the registrant element).

Identification of journal titles

Redundant journal-level metadata should be removed from the article record and stored centrally. There is no need to store the same abbreviated names of a journal every time an article published in that journal is recorded. Removing redundant journal data from article records would reduce the quantity of data that would need to be submitted. Journal identification could also be achieved algorithmically or by database lookup.
An algorithm could match journal abbreviations with full titles. For example, "JCC" would match Journal of Computational Chemistry. (The more difficult cases arise when journals in different disciplines share the same abbreviation.) A database of journal title abbreviations would improve lookup performance. Such a database could be founded on the relevant data already submitted according to the DOI-X DTD. Correct journal-title identification is key to successful DOI retrieval, and participating publishers would be encouraged to use a standard set of abbreviations to facilitate accuracy. A journal-title database could be implemented as a front end to the MDDB query interface and should have a formal journal metadata data model; a DTD and a submission and update process would have to be specified.

Addition of other publishing genres

The DTD, and thereby the whole system, may easily be extended to include publishing genres other than journal articles. Prime candidates are conference proceedings and major reference works (encyclopedias). Books, government and corporate technical reports, Web documents, preprints/e-prints, and series should also be considered. Capturing metadata in particular genres would increase the hit rate significantly in those disciplines. For example, including conference proceedings in the database would significantly increase the hit rate for references in the field of computer science [15]. When the data specification is mature, its element set could be registered as a namespace (though this would depend on whether reference linking was implemented in a closed or an open system).

Data validation

The rules document should be integrated with the DTD. Data may need to be validated beyond the capacity of a DTD-based parser (as opposed to an XML-Schema-based parser). Datatypes that need to be validated include, in no particular order: URLs, email addresses, dates, ISSNs, CODENs, PIIs, SICIs, and DOIs.

Architecture

One aspect of the prototype was to use it to estimate costs and personnel requirements for a production-level service. In terms of CPU and bandwidth, hardware and ISP costs for the collection mechanism are on a par with those for a moderately robust Web server. Storage costs will depend on the quantity of records but should not be excessive. The hardware for a large-scale deployment of the MDDB and query interface (including loading and querying functions) would require at least two production machines (for failover protection), plus a smaller development machine. One issue for consideration is whether it would be important to use a distributed architecture for collection and loading and, if so, what kind. A central collection facility, reasonably hardened, might be the best approach for feeding a single query service. A much larger collection effort, especially one feeding a collection of possibly heterogeneous databases, would probably be better served by a distributed architecture. In such a scenario, different collection strategies could be considered, including a pull protocol (as previously noted) and a staged collection process, in which intermediate collection points could distribute the collection load, at least for the XML validation step.
Personnel costs for the MDDB, for a robust 24x7 system, would amount to a minimum of two full-time positions: a half-time database administrator to oversee the Oracle database; a half-time system administrator to oversee the hardware, operating system, and network; a half-time customer service representative to help publishers with metadata submissions and queries; and a half-time developer to implement new features and improvements. From the depositing publisher's perspective, it might take one half-time person for a publisher of 50,000 pages per year to monitor the process and resolve ambiguous and no-match results. This could be less as the system becomes more robust.

Resolution of partially incorrect queries

In the DOI-X prototype, if any of the fields in a query were incorrect rather than left blank, no results would be returned from the metadata database. Matching against partially incorrect queries was deemed by all participants to be an essential feature of a full production system.

Resolving ambiguity

Being able to select from among similar results when an incomplete query is submitted is essential. When more than one record matches the query, the full records of the partial matches need to be returned to aid in determining which of several is correct. However, by submitting a minimal query, a user could receive a great number of records and thus use the database for resource discovery (record trawling). There was concern that resource discovery lay outside the terms under which content providers were sharing this data, and that therefore a limit (probably 5) should be put on the number of records returned in response to an incomplete query.

Queries in XML syntax

There was some interest in implementing, in the future, an XML syntax for querying the database (instead of the pipe-delimited format). XML would be easier to extend as the scope of the repository broadened or changed to include other publishing genres, and it might also increase the likelihood of interoperability with other related or similar systems. In addition, there are many tools for parsing and manipulating XML that could be used on both the sending and the receiving end of the queries. Having the query results returned as XML would be especially useful for post-processing and, in the case of ambiguous results, would facilitate returning all of the matches, at least up to some reasonable limit. However, one unresolved issue in implementing an all-XML query interface is the lack of an industry-standard XML query language.

Usage guidelines

Guidelines demonstrating best practice, but not contractual rules, need to be developed. There are four areas that might benefit from guidelines:
Conclusion

Sharing bibliographic metadata for DOI lookup proved to be thoroughly feasible, though not trivial. Participants devoted considerable effort to making the project a success, and that effort attests to the scientific and commercial imperative for reference linking processes within the infrastructure of electronic publishing. As a proof of concept, we believe DOI-X will promote widespread registration of DOIs across the publishing industry. From disparate competitor systems have grown the beginnings of publishing-industry standards for metadata encoding, rooted in experience, and all participants have benefited from a prototype production process as well as a prototype for the resulting reference links. Members of the CrossRef project are now well positioned to extend this model to a broader data specification and a production-strength architecture.

Acknowledgments

The authors gratefully thank Larry Lannom, Catherine Rey, Jane Euler, Mike Casey, Craig Van Dyck, Tim Ingoldsby, Bernie Rous, Mary Grace Palumbo, and Ed Pentz for their contributions.

References

[1] The home page for information on the DOI is <http://www.doi.org>.
[2] National Information Standards Organization (NISO), Digital Library Federation (DLF), Coalition for Networked Information (CNI), National Federation of Abstracting and Indexing Services (NFAIS), and the Society for Scholarly Publishing (SSP).
[3] The final report of the NISO/DLF/CNI/NFAIS/SSP initiative on reference linking can be found at <http://www.lib.uchicago.edu/Annex/pcaplan/reflink.html>.
[4] See Appendix 1, attached. The main contenders for a basic element set included the NFAIS element set, which had been proposed to the IDF earlier in 1998; the Wiley Metadata Database (based primarily on Dublin Core), which had been demonstrated at Frankfurt in October 1998; the Dublin Core 1.0 set itself; the PubMed/PubRef element set; D-Lib Magazine's own element set (for which Bill Arms and CNRI had obtained separate, non-IDF funding to try externalizing as a lookup mechanism); and the "INDECS/DOI Kernel" standard proposed by Norman Paskin in one of his "for comment" metadata papers.
[5] The home page for information on the DOI-X prototype is <http://meta.doi.org>.
[6] The home page for information on CrossRef is <http://www.crossref.org>.
[7] The XML DTD is located at <http://dx.doi.org/10.1000/6382-2>.
[8] The XML DTD rules document is located at <http://dx.doi.org/10.1000/6382-3>.
[9] The RDF (Resource Description Framework) specification is located at <http://www.w3.org/RDF>.
[10] The home page for INDECS is located at <http://www.indecs.org>.
[11] The DOI-X Batch Upload Specification can be found at <http://dx.doi.org/10.1000/6382-4>.
[12] A description of the DOI resolution system can be found at <http://dx.doi.org/10.1000/100>.
[13] The home page for the Handle System is <http://www.handle.net>.
[14] The PubMed citation matcher is located at <http://www.ncbi.nlm.nih.gov/htbin-post/PubMed/wgetids>.
[15] Private communication, Bernie Rous, ACM.

Appendix 1 - Comparison Matrix for Possible Metadata Element Set (as of 6/10/99)
Copyright © 2000 Helen Atkins, Catherine Lyons, Howard Ratner, Carol Risher, Chris Shillum, David Sidman, and Andrew Stevens.
DOI: 10.1045/february2000-risher