D-Lib Magazine
July/August 2005

Volume 11 Number 7/8

ISSN 1082-9873

Border Crossings

Reflections on a Decade of Metadata Consensus Building

 

Stuart L. Weibel
Senior Research Scientist
OCLC Research
<weibel@oclc.org>


In June of this year, I performed my final official duties as part of the Dublin Core Metadata Initiative management team. It is a happy irony to affix a seal on that service in this journal, as both D-Lib Magazine and the Dublin Core celebrate their tenth anniversaries. This essay is a personal reflection on some of the achievements and lessons of that decade.

The OCLC-NCSA Metadata Workshop took place in March of 1995, and as we tried to understand what it meant and who would care, D-Lib Magazine came into being and offered a natural venue for sharing our work [16]. I recall a certain skepticism when Bill Arms said "We want D-Lib to be the first place people look for the latest developments in digital library research." These were the early days in the evolution of electronic publishing, and the goal was ambitious. By any measure, a decade of high-quality electronic publishing is an auspicious accomplishment, and D-Lib (and its host, CNRI) deserve congratulations for having achieved their goal. I am grateful to have been a contributor.

That first DC workshop led to further workshops, a community, a variety of standards in several countries, an ISO standard, a conference series, and an international consortium. Looking back on this evolution is both satisfying and wistful. While I am pleased that the achievements are substantial, the unmet challenges also provide a rich till in which to cultivate insights on the development of digital infrastructure.

The Achievements

When we started down the metadata garden path, the term itself was new to most. The known Web was less than a million pages, people tried to bribe their way into sold-out Web conferences, and the term 'search engine' was as yet unfamiliar outside of research labs. The OCLC-NCSA Metadata Workshop brought practitioners and theoreticians together to identify approaches to improve discovery. In two and a half days, an eclectic Gang of 52 (we affectionately described ourselves as 'geeks, freaks, and people with sensible shoes') brought forward a core element set upon which many resource description efforts have since been based.

The goal was simple, modular, extensible metadata – a starting place for more elaborate description schemes. From the thirteen original elements we grew to a core of fifteen, and later elaborated the means for refining those rough categories. In recent years much work has been done on the modular and extensible aspects, as application profiles have emerged to bring together terms from separate vocabularies [9].
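As a minimal sketch of what refining those rough categories can look like in practice, the following Python fragment lists the fifteen core elements and reduces a record that uses refinements to its simple form. The refinement table holds only a handful of DCMI terms for illustration, and the function name `dumb_down` and the record layout are assumptions of this sketch, not part of any DCMI specification.

```python
# The fifteen elements of the Dublin Core Metadata Element Set.
CORE_ELEMENTS = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]

# A few element refinements and the core element each one narrows.
# Illustrative subset only; the layout is an assumption of this sketch.
REFINES = {
    "alternative": "title",
    "abstract": "description",
    "created": "date",
    "modified": "date",
    "isPartOf": "relation",
    "spatial": "coverage",
}


def dumb_down(record):
    """Reduce a refined record to core elements only.

    `record` is a list of (term, value) pairs. A refinement is replaced
    by the core element it refines; terms that are neither core elements
    nor known refinements are dropped.
    """
    simple = []
    for term, value in record:
        core = REFINES.get(term, term)
        if core in CORE_ELEMENTS:
            simple.append((core, value))
    return simple


record = [
    ("title", "Border Crossings"),
    ("created", "2005-07"),
    ("abstract", "A personal reflection on a decade of Dublin Core."),
    ("hypotheticalTerm", "ignored"),  # unknown term, so it is dropped
]
simple_record = dumb_down(record)
```

A consumer that understands only the fifteen elements can still use `simple_record`: the creation date survives as a plain date, at the cost of the finer distinction the refinement carried.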

A Consensus Community

The workshop series coalesced as a community of people from many countries and many domains, drawn by the appeal of a simple metadata standard. Openness was the Prime Directive, and early progress was often marked by the contentious debate of consensus building. But our belief that value would emerge from many voices informed our deliberations, and still does. Not without difficulty: in one early meeting, participants spent an hour of scarce plenary time talking about Type before realizing that the librarians and the computer scientists had been talking about completely different concepts. Crossing borders is often difficult.

This open, inclusive approach to problem solving helped the Dublin Core community to frame the metadata conversation for the past decade. The Dublin Core brand has been for some years the first link returned for the Google search term "metadata", and for a time, it outranked all other results for the search "Dublin" (as of this writing, it is #6). With only moderate irony, we might say "I feel lucky!"

Process

As a workshop series evolved into a set of standards and a community, the need for rules and governance evolved as well. DCMI developed a process for evaluating proposed changes and bringing them into conformance with the overall standard [5]. The DCMI Usage Board is composed of knowledgeable, experienced metadata experts from five countries who exercise editorial guidance over the evolution of DCMI terms and their conformance with the DCMI Abstract Model [13].

This model itself is among the most important of the achievements of the Initiative, representing as it does the convergence of theory and practice over a decade of vigorous debate and practical implementation. It emerged from early intuition and experience, informed by an evolving sense of grammatical structure [2,6] and further refined by a long co-evolution with the W3C's Resource Description Framework (RDF) and the Semantic Web.

At a higher level, DCMI has a Board of Trustees [1], who oversee operations and do strategic planning, and an Affiliate Program and governance structure that distributes the cost of the initiative and assures that the needs of stakeholders are accommodated [3]. At the time of this writing, there are four national DCMI Affiliates and several more in discussion.

Internationalization

The global nature of the Web demands commitment to internationalization. The difficulties of achieving system interoperability in multiple languages are immense, and still only partially solved (anyone used IRIs recently?). Nonetheless, DCMI has succeeded in attracting translations of its basic terms in 25 languages and offers a multilingual registry infrastructure of global reach [14]. The venues for the workshops and conferences have been chosen to make the Initiative accessible to people in as many places as possible. Workshops and conferences are held in the Americas, Europe, and Austral-Asia on a rotating basis, and Dublin Core principals have given talks on every continent save Antarctica. This policy of international inclusion has been a philosophic mainstay for the Initiative, attracting long-term participation from around the world.

Where we were confused

Confusions and unmet challenges are both interesting and instructive. A few of these are historical curiosities, and interesting mostly as a source of wry humility. Others represent unsolved dilemmas that remain prominent challenges for the metadata world in general.

Author-created Metadata

The idea of user-created metadata is seductive. Creating metadata early in the life cycle of an information asset makes sense, and who should know the content better than its creator? Creators also have the incentive of their work being more easily found – who wouldn't want to spend an extra few minutes with so much already invested?

The answer is that almost nobody will spend the time, and probably the majority of those who do are in the business of creating metadata-spam. Creating good quality metadata is challenging, and users are unlikely to have the knowledge or patience to do it very well, let alone fit it into an appropriate context with related resources. Our expectations to the contrary seem touchingly naïve in retrospect.

The challenge of creating cost-effective metadata remains prominent. As Erik Duval pointed out in his DC-2004 keynote, 'Librarians don't scale' [7]. We need automated (or at least, hybrid) means for creating metadata that is both useful and inexpensive.

What is metadata for?

Another naïve assumption was that metadata would be the primary key to discovery on the Web. While one may quibble about the effectiveness of unstructured search for some purposes, it is the dominant idiom of discovery for Web resources, and may be expected to remain so. What then, is metadata for?

There are many answers to this question, though given the high stakes in the search domain, expect these answers to shift and weave for the foreseeable future. Searching the so-called 'dark web' remains a function of gated access, and metadata is a central feature of such access. One might simply say – harvest and index. OCLC's exposure of WorldCat assets in search engines such as Google and Yahoo is exemplary of this approach [11]. Indexed metadata terms connect users to the location of the physical assets via holdings records, but it is reasonable to ask... would simple, full-text indexing of these assets be better still? We may argue the fine points today but in the future, we'll know the answer, for the day of digitization is fast upon us.

Structured metadata remains important in organizing and managing intellectual assets. The Canadian Government's approach to managing electronic information illustrates this strategy [4]. Metadata becomes the linkage relating content, legislative mandates, reporting requirements, intended audience, and various other management functions. One does not achieve this sort of functionality with unstructured full text.

The International Press Telecommunications Council is exploring embedding Dublin Core in their new generation of news standards [17]. No domain is more digital now than this one. If you want to know the value of structured metadata, look to the requirements and business cases in such communities [10].

Similarly, in the management of intellectual property rights, well-structured data is essential, and as these requirements become ubiquitous, the creation and management of metadata will be central to the story.

Metadata for images is a critical use. Association of images with text makes them discoverable. When the asset is a stand-alone image, metadata is the primary avenue by which it can be accessed. Picture Australia is an early and enduring (and widely copied) model in this area, showing how a photo archive can become a primary cultural heritage asset through the addition of systematic search tools and Web accessibility [12].

There is much talk these days of taxonomies, their strengths and their deficiencies, and in fact the emergence of 'folksonomies' hints at a sea change in the use of vocabularies to improve organization and discovery [9]. The Dublin Core community has struggled with the role of controlled vocabularies, how to declare and use them, and how important (or impotent?) they might be. The notion that uncontrolled vocabularies – community-based, emergent vocabularies – might play an important role in aggregation and discovery occasions a certain discomfort for those schooled in formal information management. Whether it is just the latest fad, or an important emerging trend, remains to be seen.

A Major Unmet Challenge

Entropy is an arrow. In the absence of constant care and fussing, our greatest successes break down. Failures, however, remain potent without much attention, retaining their power to impede.

One of the yet-unsolved problems in the metadata community is the railroad gage dilemma. The first editor of D-Lib, Amy Friedlander, introduced me to the notion of train gages as metaphor for interoperability challenges [8]. Last year I rode that metaphor from Beijing to Ulan Bator, Mongolia. A cursory knowledge of Asian history reminds us that relations between Mongolia and China have been less-than-cordial from time to time, and this history remains manifest at the Gobi border crossing today. In the dark of night, the Beijing train of the Trans-Siberian Railway pulls into a longhouse of clanking and switching as the entire train is raised on hydraulic jacks. Chinese bogeys (wheel carriages) are rolled out, and Mongolian bogeys of a different gage are rolled in. Border guards with comically high hats (and un-comical sidearms) work their way through the train cars in the manner of border guards everywhere. After a couple of hours, the train is rolling through the Gobi anew.

It is a fascinating display of technological diplomacy – a kind of Maginot line that helps those on both sides of the border sleep better. These images belong to a Bogart movie or a Clancy novel, but their abstraction pervades the metadata arena.

Stacked bogeys, ready to be rolled into use. Photo by Stuart Weibel.

A railroad car raised on one of dozens of hydraulic jacks that raise an entire train at once for the exchange of bogeys. Photo by Stuart Weibel.

We load our metadata into structures in one domain and when we cross borders we unload it, repackage it, massage it to something slightly different, and suffer a measure of broken semantics in the bargain. We're running on different gages of track, manifested in different data models, slightly divergent semantics, and driven by related, but meandering, often poorly-understood functional requirements. Crosswalks are the hydraulic jacks – quieter, but no more efficient than the clanking and grinding in the train longhouse.
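The cost of a crosswalk can be sketched in a few lines of Python. The source field names below are hypothetical (loosely in the spirit of a learning-object record, not taken from IEEE LOM or any real schema), as are the mapping table and the `crosswalk` function. The sketch shows only that unloading and repackaging can map several distinct source fields onto one target element and leave others with nowhere to go.

```python
# Hypothetical source-schema fields mapped to simple Dublin Core elements.
# None means the target vocabulary has no counterpart in this sketch.
CROSSWALK = {
    "main_title": "title",
    "author": "creator",
    "illustrator": "contributor",
    "editor": "contributor",
    "typical_age_range": None,
    "interactivity_level": None,
}


def crosswalk(record, mapping):
    """Translate a dict of source fields into (converted, lost).

    `converted` maps each target element to a list of values; `lost`
    lists the source fields whose values could not be carried across.
    """
    converted = {}
    lost = []
    for field, value in record.items():
        target = mapping.get(field)
        if target is None:
            lost.append(field)
        else:
            converted.setdefault(target, []).append(value)
    return converted, lost


source = {
    "main_title": "Introduction to Metadata",
    "author": "A. Author",
    "illustrator": "B. Artist",
    "editor": "C. Editor",
    "typical_age_range": "12-15",
}
converted, lost = crosswalk(source, CROSSWALK)
```

Here the illustrator and the editor both arrive as undifferentiated contributors, and the age range is dropped altogether: a small instance of semantics broken at the border.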

Metadata standards specify the means to make (mostly) straightforward assertions about resources. Many of these assertions are as simple as attribute-value pairs. Others are more complex, involving dependencies or hierarchies. None are so complicated that they cannot be accommodated within a common formal model. Yet we do not have such a model in place. Why?

  • NIH (Not Invented Here) Syndrome is often blamed for disparities that emerge in solutions from separate domains targeted at similar problems. Certainly our propensity to like our own ideas better than those of others plays a role, but my view is that it is not such a large role.
  • Developments take place in parallel. It is unusual to have the luxury of waiting to see how another group is approaching a particular problem before tackling it yourself. It is quite hard enough to know what is happening in one's own community, let alone to follow related developments in others, whose differences in terminology obscure what we need to know.
  • The functional requirements of various metadata standards are often ambiguous and always focused slightly differently. DCMI focuses on simple, extensible, high-level metadata. IEEE LOM (Learning Object Metadata) also concerns itself with discovery metadata, but focuses more strongly on educational process descriptors. MPEG is about media, where technical image metadata is central, and intellectual property rights management is crucial. MODS is grounded firmly in the legacies of MARC (and the world's largest installed base of resource discovery systems).
  • The cost of collaboration – in intellectual as well as financial terms – is high. People have to know and trust one another, which generally requires face-to-face engagement: transporting ourselves and our ideas to other time zones, surviving frequent-flyer flus, finding the means to support travel costs, and missing our children's baseball games.
  • The problems are more complicated than we imagine at the outset. The recent approval of the Dublin Core Abstract Model by DCMI is the culmination of a journey that began almost at the outset of the Initiative. Early attempts, under the guise of the DC Data Model Working Group, rank among my most contentious professional experiences. To borrow from the oldest joke of the Dismal Profession, put all the data modelers in the world end to end, and you won't reach a conclusion (we did, but it took ten years to manage it).

The idea of achieving similar consensus across communities with their own legacies of such conflict is daunting in the extreme, though recent discussions on this topic with colleagues in another metadata community remind me that hopefulness and optimism are as much a part of our domain as contention [18].

Collaboration and consensus in the digital environment

The Web demands an international, multicultural approach to standards and infrastructure. The costs in time and treasure are substantial, and the results are uncertain. Paying for collaboration that spans national boundaries, language barriers, and the often-divergent interests of different domains is a major part of these challenges. Doing this while sustaining forward progress and attracting a suitable mix of contributors, reviewers, implementers, and practitioners, is particularly difficult.

A recent presentation by Google's Adam Bosworth, referenced in the Blandiose blog [15], makes for provocative reading for those debating the costs and benefits of heavy-weight versus light-weight standards. The tension between these approaches sharpens designers and practitioners (and especially, entrepreneurs), to the eventual benefit of users. Any standards activity ignores this balancing act at its peril.

As we try to foment change and react to it at once, we are like Escher's Hands – designing the future as it, in turn, designs us... except that there are often implements other than pencils in those hands. Ever try explaining what you do for a living to your mother? In the Internet standards arena, conveying an appropriate balance of glee, terror, satisfaction, frustration, and pure wonder is no easy task. I just tell her I'm not a real librarian, but I play one on the Internet. It seems enough.

Acknowledgements

I wish to acknowledge my personal debt to uncountable colleagues in the Dublin Core community, and my deep sense of gratitude for the opportunity to have played the role I have. The patience, forbearance, and generosity of OCLC management in supporting my efforts, and DCMI in general, have been singular and essential.

Thomas Baker reviewed and improved this manuscript with several insightful suggestions.

Amy Friedlander and Bonnie Wilson, successive editors of D-Lib, have made me look better than I am in these pages for 10 years. Congratulations to them and to all who have helped make this journal (and its authors) what they are.

References and Notes

[1] About the Initiative
DCMI Website, accessed June 23, 2005
<http://dublincore.org/about/>.

[2] Baker, Thomas
"A Grammar of Dublin Core"
D-Lib Magazine, October 2000
Volume 6 Number 10
<doi:10.1045/october2000-baker>.

[3] DCMI Affiliate Program
DCMI Website, accessed June 23, 2005
<http://dublincore.org/about/affiliates/>.

[4] Committee of Federal Metadata Experts Metadata Action Team,
Council of Federal Libraries.
Government of Canada Metadata Implementation Guide For Web Resources
3rd edition - July 2004
<http://www.collectionscanada.ca/6/37/s37-4016-e.html>.

[5] DCMI Usage Board
DCMI Usage Board Mission and Principles
DCMI Website, June 11, 2003
<http://dublincore.org/usage/documents/mission/>.

[6] DCMI Usage Board
DCMI Grammatical Principles
DCMI Website, 2003-11-18
<http://dublincore.org/usage/documents/principles/>.

[7] Duval, Erik and Wayne Hodgins
"Making metadata go away: Hiding everything but the benefits"
Keynote address at DC-2004
Shanghai, China, October 2004
<http://students.washington.edu/jtennis/dcconf/Paper_15.pdf>.

[8] Friedlander, Amy
Emerging Infrastructure: The Growth of Railroads
Infrastructure History Series, CNRI, 1995
<http://www.cnri.reston.va.us/series.html#rail>.

[9] Mathes, Adam
Folksonomies - Cooperative Classification and Communication Through Shared Metadata
Computer Mediated Communication - LIS590CMC Graduate School of Library and Information Science, University of Illinois Urbana-Champaign.
December 2004
<http://www.adammathes.com/academic/computer-mediated-communication/folksonomies.html>.

[10] News Architecture Version 1.0 Metadata Framework Business Requirements
IPTC Standards Draft, 2005
<http://iptc.org/pdl.php?fn=DRAFT-NAR_1.0-spec-NMDF-BusReq_34.pdf>.

[11] Open WorldCat Program
OCLC Website, accessed June 23, 2005
<http://www.oclc.org/worldcat/open/default.htm>.

[12] Picture Australia
Hosted by the National Library of Australia
Website accessed June 23, 2005
<http://www.pictureaustralia.org/>.

[13] Powell, Andy; Mikael Nilsson, Ambjörn Naeve, and Pete Johnston.
DCMI Abstract Model. DCMI Website, 2005-03-07
<http://dublincore.org/documents/abstract-model/>.

[14] Wagner, Harry and Stuart Weibel
"The Dublin Core Metadata Registry: Requirements, Implementation, and Experience"
Journal of Digital Information
Accepted for publication, May, 2005.

[15] "Web of Data"
Blandiose blog, 2005-04-21
<http://www.blandiose.org/index.php?s=bosworth&submit=Search>.

[16] Weibel, Stuart
"Metadata: The Foundations of Resource Discovery"
D-Lib Magazine, July 1995
Volume 1 Number 1
<doi:10.1045/july95-weibel>.

[17] Wolf, Misha
DC in XHTML2
Semantic Web and DC-General Mailing lists, June 7, 2005
<http://lists.w3.org/Archives/Public/semantic-web/2005Jun/0058.html>.

[18] The author has been party to discussions with Erik Duval and Wayne Hodgins of the IEEE LOM effort centered around the possibility of cross-standard data modeling that might promote convergence among various metadata activities. The means and methods for carrying such work forward are presently undetermined.

Copyright © 2005 OCLC Online Computer Library Center, Inc.

doi:10.1045/july2005-weibel