
D-Lib Magazine
December 1998

ISSN 1082-9873

Languages for Dublin Core


Thomas Baker
Asian Institute of Technology
Computer Science and Information Management
P.O. Box 4, Klong Luang
Pathum Thani 12120, Thailand
thomas.baker@cs.ait.ac.th

Over the past three years, the Dublin Core Metadata Initiative has achieved a broad international consensus on the semantics of a simple element set for describing electronic resources. Since the first workshop in March 1995, which was reported in the very first issue of D-Lib Magazine [Weibel 1995], Dublin Core has been the topic of perhaps a dozen articles here. Originally intended to be simple and intuitive enough for authors to tag Web pages without special training, Dublin Core is being adapted now for more specialized uses, from government information and legal deposit to museum informatics and electronic commerce.

To meet such specialized requirements, Dublin Core can be customized with additional elements or qualifiers. However, these refinements can compromise interoperability across applications. There are tradeoffs between using specific terms that precisely meet local needs versus general terms that are understood more widely. We can better understand this inevitable tension between simplicity and complexity if we recognize that metadata is a form of human language. With Dublin Core, as with a natural language, people are inclined to stretch definitions, make general terms more specific, specific terms more general, misunderstand intended meanings, and coin new terms. One goal of this paper, therefore, will be to examine the experience of some related ways to seek semantic interoperability through simplicity: planned languages, interlingua constructs, and pidgins [Baker 1997b].

The problem of semantic interoperability is compounded when we consider Dublin Core in translation. All of the workshops, documents, mailing lists, user guides, and working group outputs of the Dublin Core Initiative have been in English [Dublin Core Metadata Initiative]. But in many countries and for many applications, people need a metadata standard in their own language. In principle, the broad elements of Dublin Core can be defined equally well in Bulgarian or Hindi. Since Dublin Core is a controlled standard, however, any parallel definitions need to be kept in sync as the standard evolves. Another goal of the paper, then, will be to define the conceptual and organizational problem of maintaining a metadata standard in multiple languages.

In addition to a name and definition, which are meant for human consumption, each Dublin Core element has a label, or indexing token, meant for harvesting by search engines. For practical reasons, these machine-readable tokens are English-looking strings such as Creator and Subject (just as HTML tags are called HEAD, BODY, or TITLE). These tokens, which are shared by Dublin Cores in every language, ensure that metadata fields created in any particular language are indexed together across repositories. As symbols of underlying universal semantics, these tokens form the basis of semantic interoperability among the multiple Dublin Cores.
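The role of the shared tokens can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record values, the table layout, and the function name `build_index` are hypothetical, and the Indonesian labels are the ones reported for DC-Indonesian in the next section of this article.

```python
from collections import defaultdict

# Human-readable labels per language, keyed by the shared indexing token.
# The table structure is an illustrative assumption, not part of Dublin Core.
LABELS = {
    "en": {"Creator": "creator", "Subject": "subject"},
    "id": {"Creator": "kreator", "Subject": "subjek"},
}

# Hypothetical records from two repositories. Whatever language the cataloger
# worked in, each field is stored under the shared token.
RECORDS = [
    {"lang": "en", "fields": [("Creator", "Lassila, Ora")]},
    {"lang": "id", "fields": [("Creator", "Contoh, A."), ("Subject", "metadata")]},
]

def build_index(records):
    """Group field values by shared token, ignoring each record's language."""
    index = defaultdict(list)
    for record in records:
        for token, value in record["fields"]:
            index[token].append(value)
    return dict(index)

index = build_index(RECORDS)
print(index["Creator"])  # values from both repositories, indexed together
```

The language-specific labels are needed only when a field is shown to a person; the index itself never consults them.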

As long as we limit ourselves to sharing these indexing tokens among exact translations of a simple set of fifteen broad elements, the definitions of which fit easily onto two pages, the problem of Dublin Core in multiple languages is straightforward. But nothing having to do with human language is ever so simple. Just as speakers of various languages must learn the language of Dublin Core in their own tongues, we must find the right words to talk about a metadata language that is expressible in many discipline-specific jargons and natural languages and that inevitably will evolve and change over time.

DC-Multilingual

Translating the two-page, fifteen-element standard known as Unqualified Dublin Core [RFC 2413] has proven to be relatively easy. Typical in this regard is the experience of the cataloging experts who created DC-Indonesian (in Bahasa Indonesia, which uses the Latin alphabet). Some elements had obvious translations: format is identical in English and Indonesian. Some elements were translated uncontroversially with English loanwords: subject as subjek; contributor as kontributor (the giver of an idea, as opposed to penyumbang, a charitable donor); and description with deskripsi (which is more general than ringkasan isi for summaries of text or menggambarkan for images). The term creator is usually rendered in Indonesian as pencipta, but to many people this evokes a deity, analogously to Schöpfer in German, skapare in Swedish, and arguably even creator in English. After rejecting some overly specific alternatives, the committee opted for a loanword that covered the proper scope with none of the distracting cultural baggage -- kreator [Prabawa 1998].

Problems such as these must be resolved in the element definitions. This is especially crucial in cases where local concepts divide along lines much different from those of DC-English. To date, the translators of Dublin Core have reported few such conflicts, but one problem with legal implications deserves mention. For historical reasons, the Romance languages use equivalents of the French term éditeur to refer both to "editors" (in Dublin Core a type of contributor) and "publishers" (an element of the Core). Although I am told this has created some difficulties with the wording of legal deposit laws, it has not been raised as a serious problem with regard to Dublin Core.

Discussions of such issues have taken place over the past year or two in many languages. The creation of a DC-Arabic was coordinated by Hachim Haddouti, a researcher of the Bavarian Research Center for Knowledge-Based Systems (FORWISS), with the involvement of colleagues in Germany, Canada, Tunisia, USA, Morocco, United Arab Emirates, and Jordan [DC-Arabic]. In his view, "DC-Arabic constitutes a milestone in promoting metadata in the Arabic language and Arabic-language computing. Reactions from my colleagues at research libraries, documentation centers, software firms, and publishers in the Arabic-speaking world confirm that this standard is addressing a recognized need."

In Japan, the University of Library and Information Science in Tsukuba has begun a large-scale digital library project based on Dublin Core [DC-Japanese]. A DC-Chinese in "Big5" encoding (traditional characters) has been posted at the Department of Library and Information Science at Fu-Jen University in Hsin-Chuang, Taiwan [DC-Chinese (Big5)]. Another DC-Chinese, in "GB" encoding (simplified characters), has been proposed by a master's student at the Asian Institute of Technology [Wu 1998].

Several Asian initiatives to translate Dublin Core have involved centers for scientific information. DC-Korean was undertaken by researchers at the Korea Research Information Center together with colleagues from Yonsei University, Ewha Womans University, and Chung Ang University [DC-Korean]. In Thailand, a translation was prepared at the Technical Information Access Center of the National Science and Technology Development Agency in consultation with colleagues from the Library Association of Thailand and the Asian Institute of Technology [DC-Thai]. DC-Indonesian was discussed and approved by members of the National Library of Indonesia, the University of Indonesia, the Center for Scientific Documentation and Information of the Indonesian Institute of Sciences, and the Library Development Coordination Unit of the Directorate General of Higher Education [Prabawa 1998].

In Southern Europe, work on Dublin Core has often begun on the initiative of government research institutions. DC-Italian was created at the Istituto di Elaborazione dell'Informazione [Institute for Information Processing] of the Italian National Research Council in collaboration with the Istituto Centrale per il Catalogo Unico (ICCU), which maintains a union catalog of Italian libraries. A DC-Spanish has been posted by RedIRIS, an academic network under Spain's Scientific Research Council [DC-Spanish]. DC-French was created at the National Institute for Research in Computer Science and Control (INRIA) [DC-French]. The Greek Dublin Core has been an initiative of the Foundation for Research and Technology -- Hellas (FORTH) in Heraklion, Crete [DC-Greek].

In several countries, Dublin Core has been adopted for use in national repositories or directories of online resources. The National Library of Portugal may adopt Dublin Core for the deposit and preservation of Portuguese digital publications and is promoting the standardization of Dublin Core sub-elements for Portuguese materials [DC-Portuguese]. Having mandated that all public documents be published online, the government of Denmark is teaching thousands of administrators how to create and manage metadata as an integral part of their workflow. A national metadata standard based on Unqualified Dublin Core (with four added sub-elements under Title and Relation) forms a common template for the Danish Library Center, Danish National Library Authority, legal deposit libraries, and the Danish State Information Service. This template is used by online publishers to register their documents for inclusion in the national bibliography and for storage on a legal deposit server in the Royal Library.

Following the Danish example, the government of Finland has begun projects to implement legal deposit, archive online publications, and add Dublin Core metadata to public documents [DC-Finnish]. In the Nordic Metadata Projects, researchers from these countries have collaborated extensively with colleagues to promote regional metadata strategies. There is a DC-Norwegian, and work is underway on DC-Icelandic and DC-Swedish [DC-Norwegian, DC-Swedish]. As a practical matter, all of the tools and documents of the Nordic Metadata Project are published in English [Nordic Metadata Projects]. Koninklijke Bibliotheek, the national library of the Netherlands, has created a DC-Dutch for a project to compile a Directory of Netherlands Online Resources [DC-Dutch].

There are over thirty projects using Dublin Core in Germany. The biggest of these involve Germany's learned societies, Humboldt University, the Lower Saxony State and University Library in Göttingen, Die Deutsche Bibliothek, the Bavarian State Library, the Library of the Max Planck Institute for Human Development, the German Library Institute in Berlin, and the Southwest Consortium of University Libraries [Information and Communication Commission of the Learned Societies, Metadata Projects at German Libraries, German Metadata Registry, Meta-Lib Project]. Die Deutsche Bibliothek will host the next full Dublin Core workshop in October 1999. A German translation of Dublin Core has been posted at the Max Planck Institute for Human Development and Education [DC-German]. In practice, however, many projects have created their own local versions, which are partly in English, partly in German, and partly mixed. These local variants of Dublin Core are being tracked in a comparative database at the Lower Saxony State and University Library [MetaForm].

Work has started on a DC-Burmese. Its first users will be the academics, research librarians, and independent scholars of the Burma Archives Project, which aims at encouraging the collection, preservation, and indexing of materials related to contemporary Burma. Of particular interest are materials related to the Burmese democratic movement of the 1980s and 1990s, from posters, photographs, pamphlets, diaries, correspondence, and memoirs to records from political parties, labor unions, student organizations, and ethnic associations. Having a Dublin Core in Burmese will help the project train Burmese in the techniques of preservation while making these materials available for wider usage.

Shortly before publication deadline, we received news that the Software Research and Development Center of the Middle East Technical University in Ankara is planning to prepare a DC-Turkish. A team of researchers at the School of Information, Library, and Archive Studies at the University of New South Wales plans to work with documentalists from Cambodia on a DC-Khmer.

Expressions, versions, or translations?

It seems natural to talk about DC-Arabic as the Arabic-language version of Dublin Core, and this is what many of us have been doing. But this term has become the object of debate since the Dublin Core workshop of 2-4 November at the Library of Congress. At that workshop, it became clear that part of the Dublin Core community wants to focus on clarifying the semantics of a simple element set that has been relatively stable since December 1996 and already forms the basis of numerous implementation projects. Others, however, wish to redesign parts of that element set in order to address weaknesses that have emerged from implementation experience and to make the elements more useful for potential new users in the publishing and entertainment industries. As a result, the existing element set has been dubbed Dublin Core Version 1.0; a refined version of the same will be called Version 1.1; and work is beginning on a more fundamentally revised Dublin Core Version 2.0. If the current DC-Japanese is now the Japanese-language version of DC-English Version 1.0, does this make it a version of a version?

One obvious alternative is translation. The current DC-Japanese is in fact a translation of DC-English Version 1.0. Indeed, all of the Dublin Cores mentioned above are translations of the English reference description at http://purl.org/DC/. None of them change the scope of definitions or add new "Dublin Core" elements beyond the canonical fifteen. Everyone involved in this process to date has recognized that the Dublin Core in English is the canonical result of an international process. Everyone also recognizes the need to have one version from which all others derive; a DC-Ukrainian should be prepared directly from DC-English and not from DC-Russian. Moreover, everyone recognizes that translations which exactly render an original wording may not be as effective as translations which use some poetic license to convey the intended meanings to a particular audience. Nevertheless, the term has drawbacks. As one participant points out, "translation transmits the idea that DC is an English thing, which can be negative for its acceptance and probably even for its understanding". In an age of automatic translators, available even as pocket gadgets, the word translation evokes a process that is largely mechanical and one-way. What national library, some ask, will want to define its role primarily as custodian of a translation?

Yet none of the alternatives are clearly better. It is natural to talk about Dublin Core expressed in Estonian, though it seems awkward to speak of Dublin Core in its Estonian expression. Nor does it seem entirely satisfactory to talk about the Malay manifestation, Finnish format, Armenian edition, Arabic adaptation, Danish description, Romanian release, Vietnamese variant, or Russian rendering. Many of these terms have points of theory in their favor, or even precedents in library science. But unfortunately they are all a bit unusual and would require some explanation. Further complicating matters is the question of what to call manifestations of Dublin Cores in multiple software formats such as XML, GIF, PostScript, LaTeX, MSWord, and HTML. Final agreement on standard names for these phenomena -- in English, of course -- will bring us full circle to the problem of how to translate these terms and explain their distinctions in other languages.

Many discussants do agree that version becomes appropriate when the line is crossed between translation and adaptation. For example, if a DC-French were to modify Dublin Core with unique qualifiers or definitions for specific local uses, then one might no longer have just a French translation of Dublin Core Version 1.0, but a new French version that is "based on" Dublin Core Version 1.0 for the purposes of software interoperability. This distinction may apply to Site Preview Format (SPF), an adaptation of DC-English for use on Netscape's Netcenter [Miller 1998]. SPF uses the indexing tokens shared by all Dublin Cores but replaces four of the official labels -- Description, Relation, Subject, and Publisher -- with four nearly equivalent labels of its own -- Content Summary, Related Items, Category, and Provider. Such changes may have little practical effect on the resulting resource descriptions, but SPF is clearly not simply a translation of Dublin Core.
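The SPF case can be illustrated with a small Python sketch in which one record, stored under the shared tokens, is displayed with two different label sets. The four label pairs are those named above; the record values and the function name `render` are hypothetical, and nothing here is taken from the SPF specification itself.

```python
# Display labels keyed by the shared indexing token.
DC_LABELS = {
    "Description": "Description",
    "Relation": "Relation",
    "Subject": "Subject",
    "Publisher": "Publisher",
}
# The four SPF substitutions named in the text.
SPF_LABELS = {
    "Description": "Content Summary",
    "Relation": "Related Items",
    "Subject": "Category",
    "Publisher": "Provider",
}

def render(record, labels):
    """Return display lines; tokens without a label fall back to the token itself."""
    return [f"{labels.get(token, token)}: {value}" for token, value in record]

# A hypothetical record, stored under the shared tokens.
record = [("Subject", "metadata"), ("Publisher", "Example Press")]
print(render(record, DC_LABELS))
print(render(record, SPF_LABELS))
```

The stored record is identical in both cases; only the wording presented to the user changes, which is why such substitutions have little practical effect on the descriptions themselves.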

In Germany, Ralf Schimmer of the MetaForm Project has noticed that similar adaptations of Dublin Core (he calls them manifestations) have been created locally for specific projects. These typically deviate somewhat from the standard: "One definitely sees that they belong to the same language, but a closer examination reveals certain unique and idiosyncratic characteristics of the kind one sees in dialects. Despite the common language they invoke and from which (in this case) they claim legitimacy, subtle semantic shifts creep in. In the end, these could make understanding between Dublin Core applications as difficult as between different dialect regions. The Dublin Core label and brand name do not guarantee uniformity of contents and semantics."

Such differences of dialect or usage are an inevitable feature of natural languages. Even within one language, differences can be great between the jargons of scientific and cultural fields. There were moments during the first Dublin Core workshops when people realized they had debated at length on the basis of widely different understandings of fundamental terms such as type, which evokes different things to librarians and computer programmers. Even experienced bibliographers can find themselves at a loss about how to apply traditional concepts such as author and publisher to the description of new electronic genres, such as Web pages.

Indeed, Dublin Core itself could prove to be a new type of genre within the electronic environment. John Kunze suggests that Dublin Core could cease to be "versionable" in the sense of controlled software releases (e.g., 1.0, 1.1, 2.0) if it were to evolve into a larger, dynamically evolving set of elements in common use. Such a system would more closely resemble controlled vocabularies and natural languages, which do not in practice need version numbers to fulfill their functions. The evolution of such a metadata language would need to be marked in only a rough sense (e.g., by year) for the benefit of specialists, which could free the term version for variants of the element set in multiple languages.

Dublin Core as a planned language

The Dublin Core Metadata Initiative began as a hall conversation at the Second International World Wide Web Conference in October 1994 -- the year when the Internet first made the cover of popular magazines. The number of Web pages was doubling every few weeks, and it was becoming harder to find anything, even with new full-text indexing services such as Lycos. Many people agreed that the Web needed a catalog, that there would never be enough librarians to handle it all, and that existing library standards were generally too complex for ordinary people to learn. To promote semantic interoperability between specialized communities on the Web, then, the first Dublin Core workshop drew up a short list of metadata elements that would efficiently yield "simple descriptions of data in a wide range of subject areas" [Weibel 1995].

It is tempting to compare those early 1990s with the 1870s, when the expansion of international trade and colonial empires and the explosion of literature in print posed similar problems of information management and semantic interoperability. It was a decade that saw the creation of library associations, international librarianship, and the Dewey Decimal Classification. It also saw the first of several dozen proposals for artificial or planned languages.

At the time, the diversity of national languages seemed an obstacle to progress, but returning to Latin was not an option, and agreement on English or French seemed politically unlikely. Many scholars and practitioners of that time believed that the interests of international understanding and of science would be served by devising a universal auxiliary language that was easy enough for everyone to learn. The first of these to achieve much success was Volapük or "World Speak" (1879), a morphologically complex synthesis of German, English, and Latin. Its decline coincided with the rise of Esperanto (1887), a simpler language with a Slavic flavor. Most such languages were syntheses of existing natural languages, using Western European word roots and simplified grammars.

Typically, these planned languages were created by a single author, working in isolation, then adopted by a small circle of followers. As in the Dublin Core movement, however, early users of planned languages disagreed about whether to accommodate new words or constructions. Debates often reflected the conflicting requirements of everyday needs versus the demands of specialists. The Volapük movement split over a conflict between its inventor, who wanted it to be as subtle and expressive as a natural language, and followers who wanted to improve its chances of adoption by making it simpler. The Esperanto movement likewise argued over issues such as its use of the circumflex, and factions broke off to promote alternative versions, such as Ido (1907) and Novial (1928).

The Dublin Core movement has experienced an analogous tension between Minimalists, who value the intuitive simplicity of its fifteen broadly-defined elements, and Structuralists, who see the elements as a complexifiable set of primitives for richer and more specialized resource description. It is useful, therefore, to consider how a planned language of the sort proposed in the late nineteenth century might actually have succeeded. Theorists speculate that it would need a critical mass of speakers, official endorsement by governments, and widespread use in mass media. Ideally, it would have an international board to maintain standards, review proposals, and control the language's continued evolution. This institutional control from above would need to be loose enough to allow speakers to coin new terms and constructions for expressing their everyday experiences [Eco 1995]. Other scholars point out that planned languages are by their nature closed in design, rather strictly governed by rules, linguistically unnatural, and ill-suited to change. Natural languages, in contrast, are versatile and open-ended. To achieve success, they suggest, language designers need to provide for people's propensity to change or create rules, adapt systems, and negotiate meanings. And to understand such processes, the designers would do well to examine how communities of users interact spontaneously to create pidgins [Laycock and Mühlhäusler 1994].

Pidgin metadata and its creolization

Pidgins are makeshift, hybrid languages that arise when speakers of different languages in regular but superficial contact must work together or conduct trade. They usually have small vocabularies (borrowed largely from a socially dominant group), little inflection, and loose word order. Emphasis is achieved with reduplication and gestures. In the absence of grammatical precision, speakers must sometimes resort to elaborate circumlocutions, and usage is inconsistent between speakers. Interpretation can depend heavily on context; the sentence Sista fo hospitu bin luk wom (from Cameroonian pidgin) can mean either "The nurse at the hospital saw worms" or "The nurse saw worms at the hospital" [Schneider 1966]. Historically, pidgins have arisen among hired or slave workers on ethnically mixed plantations, though "pidginization" constantly occurs today in vacation resorts, port cities, and immigrant communities.

As a pidgin becomes more valuable to its users, for example to conduct business, it stabilizes, its vocabulary expands, and it becomes flexible enough to be used as a speaker's primary language. Researchers have found that when children are raised using a pidgin as their mother tongue during the critical period before adolescence, they use their instinctive language skills to improvise grammatical subtleties, transforming their parents' crude pidgins into grammatically richer, more expressive creoles. Creoles are bona fide languages, with subtle grammatical markers and consistent word orders. Creoles acquire prepositions, extensive vocabularies, and a syntax less dependent on context [Pinker 1994].

By analogy, the Dublin Core Element Set arose when natives of different resource description communities -- from librarians, archivists, scientists, humanities scholars, and geographers to the makers of Internet standards -- suddenly needed to interoperate on the Web and so created a process to negotiate a metadata hybrid. The metaphor of the "virtual tourist" who uses Dublin Core as a phrase book for browsing collections in unfamiliar fields over the Web is especially fitting because real-life tourists are naturally inclined to pidginize (e.g., "Nix spreken Deutsch. Zwei beer, okay?"). This melding of diverse resource description conventions yielded Dublin Core in its unqualified, minimalist form.

If the elements of unqualified Dublin Core constitute the metadata pidgin's small vocabulary, then its simple grammar and syntax are provided by the META tags of HTML 4.0, which typically are embedded in the headers of Web pages. Consider a Web page at the (hypothetical) address http://www.w3.org/Home/Lassila [Lassila and Swick 1998]:

<HTML>
<HEAD>
<TITLE>Ora Lassila's Home Page</TITLE>
<META NAME="DC.Creator" CONTENT="Lassila, Ora">
</HEAD>
<BODY>
<P>This would be the body of Ora Lassila's Home Page.</P>
</BODY>
</HTML>

The Creator field here refers implicitly to its context -- the document at http://www.w3.org/Home/Lassila in which it is embedded. One might then translate this metadata statement to read: This page was created by Ora Lassila. Or, by expanding the implied context, as The resource http://www.w3.org/Home/Lassila has creator Ora Lassila.
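
How a harvester might recover such a statement can be sketched with Python's standard `html.parser` module. This is a minimal illustration, not a description of any actual search engine: the class and function names are hypothetical, and the page address has to be supplied by the caller because the embedded field does not state it.

```python
from html.parser import HTMLParser

class DublinCoreMetaParser(HTMLParser):
    """Collect (element, content) pairs from META tags whose name starts with 'DC.'."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not attribute values.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if name.upper().startswith("DC."):
            self.elements.append((name[3:], attributes.get("content") or ""))

def statements(html_text, url):
    """Expand each embedded field into an explicit statement about the resource at url."""
    parser = DublinCoreMetaParser()
    parser.feed(html_text)
    return [f"The resource {url} has {element.lower()} {content}"
            for element, content in parser.elements]

PAGE = """<HTML>
<HEAD>
<TITLE>Ora Lassila's Home Page</TITLE>
<META NAME="DC.Creator" CONTENT="Lassila, Ora">
</HEAD>
<BODY>
<P>This would be the body of Ora Lassila's Home Page.</P>
</BODY>
</HTML>"""

print(statements(PAGE, "http://www.w3.org/Home/Lassila"))
```

The sketch makes the pidgin's dependence on context concrete: the subject of the statement is never written in the page and must come from wherever the page was found.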

A creolized Dublin Core, in contrast, would draw a richer vocabulary from controlled lists of semantically refined sub-elements and qualifiers. Its grammar would be provided by the Resource Description Framework (RDF), a new and more sophisticated conceptual model and encoding syntax for expressing metadata on the Web. RDF allows resource descriptions to reference multiple schemas and externally maintained vocabularies, further broadening its expressive scope. Its grammatical repertoire of subjects, predicates, and objects, along with capabilities such as reification (i.e., referencing statements as wholes), significantly expands its expressive capabilities. Consider the following RDF statement:

<rdf:RDF
    xmlns:rdf="http://www.w3.org/TR/WD-rdf-syntax#"
    xmlns:a="http://description.org/schema/">
    <rdf:Description>
      <rdf:subject resource="http://www.w3.org/Home/Lassila" />
      <rdf:predicate resource="http://description.org/schema#Creator" />
      <rdf:object>Ora Lassila</rdf:object>
      <rdf:type resource="http://www.w3.org/TR/WD-rdf-syntax#Statement" />
      <a:attributedTo>Ralph Swick</a:attributedTo>
    </rdf:Description>
</rdf:RDF>

Note that this statement explicitly gives a subject (http://www.w3.org/Home/Lassila), a predicate ("has creator"), and an object (Ora Lassila). It then defines this subject-predicate-object unit as a statement, the whole of which is attributed to Ralph Swick. In other words, it says: Ralph Swick says that Ora Lassila is the creator of the resource http://www.w3.org/Home/Lassila. It also uses XML namespace declarations (the lines that begin with xmlns) to cite the standards documents for its own RDF format (http://www.w3.org/TR/WD-rdf-syntax) and for its hypothetical metadata schema (http://description.org/schema/).
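
The reading given above can be reproduced mechanically. The following Python sketch uses the standard `xml.etree.ElementTree` module to pull the four parts out of the example and rebuild the sentence. It treats the example as plain XML and is not an RDF parser; the function name is hypothetical, and the handling of the `#Creator` fragment is an assumption made only for display.

```python
import xml.etree.ElementTree as ET

# Namespace URIs exactly as declared in the example.
RDF_NS = "http://www.w3.org/TR/WD-rdf-syntax#"
SCHEMA_NS = "http://description.org/schema/"

RDF_XML = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/TR/WD-rdf-syntax#"
    xmlns:a="http://description.org/schema/">
    <rdf:Description>
      <rdf:subject resource="http://www.w3.org/Home/Lassila" />
      <rdf:predicate resource="http://description.org/schema#Creator" />
      <rdf:object>Ora Lassila</rdf:object>
      <rdf:type resource="http://www.w3.org/TR/WD-rdf-syntax#Statement" />
      <a:attributedTo>Ralph Swick</a:attributedTo>
    </rdf:Description>
</rdf:RDF>"""

def read_reified_statement(xml_text):
    """Return the subject, predicate, object, and attribution of the first Description."""
    root = ET.fromstring(xml_text)
    description = root.find(f"{{{RDF_NS}}}Description")
    return {
        "subject": description.find(f"{{{RDF_NS}}}subject").get("resource"),
        "predicate": description.find(f"{{{RDF_NS}}}predicate").get("resource"),
        "object": description.find(f"{{{RDF_NS}}}object").text,
        "attributed_to": description.find(f"{{{SCHEMA_NS}}}attributedTo").text,
    }

s = read_reified_statement(RDF_XML)
# Use the fragment after '#' as a display name for the predicate (an assumption).
element = s["predicate"].rsplit("#", 1)[-1].lower()
sentence = (f'{s["attributed_to"]} says that {s["object"]} is the {element} '
            f'of the resource {s["subject"]}.')
print(sentence)
```

Unlike the META example, nothing here depends on where the description was found: subject, predicate, object, and the source of the claim are all stated in the markup.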

RDF syntax can support a wide range of metadata applications beyond resource description, such as push channel definitions, ratings, site maps, and digital signatures, so it is seen by many content and service providers -- from Netscape and CNN to the New York Times and Amazon.com -- as a crucial enabling technology for electronic commerce [Netscape 1997]. Dublin Core can provide key descriptive elements for many such applications, such as news feeds from wire services. In particular, the entertainment industry is interested in using Dublin Core elements and RDF grammar to construct statements that express business deals about the use of entertainment content and trade publications. For example, RDF can be used to express the following (hypothetical) agreement:

ASCAP declares that Joan Quincy composed the song "Rising Tide" and registered it for copyright in the USA in 1995, and that ASCAP has exclusively acquired the right to administer on behalf of Joan Quincy and her publisher Diva Songs, Inc the right for worldwide public performance from the time of composition until further notice.

Suppose the Harry Fox Agency were to act as the worldwide agent for Diva Songs for the right to record the song; Joan Quincy were to make a recording of her own song; and Nakamura Records were to acquire a license to use this song on a compact disc. One might then formulate a statement like the following:

Nakamura Records says that it has acquired from JASRAC [the Japanese copyright society] acting as agent for the Harry Fox Agency acting as agent for Diva Songs Inc, the mechanical reproduction right in respect of the song "Rising Tide" by Joan Quincy for inclusion in the CD "Californian Sunsets" for distribution only in Japan.

Statements could become even more elaborate if they were to factor in complications such as the rights of individual band members or ownership rights for any pre-existing samples used in a recording. Changes in ownership or representation anywhere in a chain of such statements could affect thousands of related agreements and licenses. Such complex dependencies have hitherto been tracked by hand, so the entertainment and publishing industries are highly motivated to perfect a language for expressing such metadata statements in ways that are automatically parsable, legally binding, and that ensure efficient flows of royalties and payments.

Interlinguas, switching languages, registries, and dictionaries

Some members of the Dublin Core community see the value of its fifteen broad elements less as a native pidgin or creole for direct use in creating catalog records than as a standardized switching language between richer, domain-specific descriptive languages such as MARC (for library catalogs), GILS (for government information), and Z39.50 (for distributed databases). Nobody seems to know the origin of this term, but everyone understands that it has to do with establishing a common reference point of widely understood semantics to which a variety of other schemas can map many-to-one, thus achieving semantic interoperability over multiple systems at little cost and effort [Koch, Hakala, and Husby 1996]. Nobody questions that in practice, the semantics of this focal point must be expressed in English.
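
The many-to-one mapping behind a switching language can be sketched as a pair of crosswalk tables. Everything schema-specific in this Python example is hypothetical: the two local schemas and their field names are invented for illustration and are not actual MARC, GILS, or Z39.50 fields. Only the Dublin Core element names come from the standard.

```python
# Crosswalks from hypothetical local schemas to Dublin Core elements.
# Several local fields may map to the same element (many-to-one).
CROSSWALKS = {
    "library": {
        "main_author": "Creator",
        "imprint_name": "Publisher",
        "topical_heading": "Subject",
        "geographic_heading": "Subject",
    },
    "government": {
        "originator": "Creator",
        "controlled_term": "Subject",
        "abstract": "Description",
    },
}

def to_dublin_core(schema, record):
    """Map a local record to Dublin Core elements; unmapped fields are dropped."""
    mapping = CROSSWALKS[schema]
    result = {}
    for field, value in record.items():
        element = mapping.get(field)
        if element is not None:
            result.setdefault(element, []).append(value)
    return result

lib = to_dublin_core("library", {
    "main_author": "Lassila, Ora",
    "topical_heading": "metadata",
    "geographic_heading": "Finland",
    "shelf_mark": "Z699",  # no Dublin Core equivalent in this crosswalk
})
gov = to_dublin_core("government", {
    "originator": "Example Agency",
    "controlled_term": "metadata",
})
print(lib)
print(gov)
```

Both records can now be searched through the same Creator and Subject elements. The cost is visible too: the distinction between topical and geographic headings disappears, and the shelf mark is lost, which is the tradeoff between local precision and wide understanding discussed at the start of this article.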

Such a construct has analogs in the area of natural language processing. The EuroWordNet project aims at supporting cross-language retrieval by linking comprehensive ontologies of Spanish, English, Dutch, and Italian words to a central InterLanguage or interlingua that holds a superset of all concepts found in the component languages [EuroWordNet]. Similarly, a ten-year project coordinated by the United Nations University in Tokyo aims at mapping the more than one hundred languages of the United Nations to a common Universal Networking Language -- an intermediate encoding of grammar and semantics from which, in theory, it will be possible to generate equivalent sentences in any of the other languages [Universal Networking Language]. Like the switching language, both of these constructs involve central reference points that in theory are language-neutral but in practice express their universal semantics in English.

The notions of switching language and interlingua are compatible with the need for metadata registries to express schemas in both human- and machine-readable formats and offer authoritative usage guidelines, local extensions, crosswalks to related schemas, and lists of legal values. There is a convergence of interest in such registries not just from the digital library community, for resource discovery, but from other service providers in government, business, and education and for a wide range of applications. An ecology of registries could emerge that reflects a diversity of organizational motives, market forces, and user communities [Baker and Lynch 1998].

Metadata registries could use constructs such as interlinguas to link diverse ontologies of elements among themselves. In addition to prescribing certified elements and promoting standards of good practice, registries could also function as metadata dictionaries by monitoring patterns of usage, whether "correct" or not, and describing alternative definitions in actual use [Kunze 1996]. Mediating between these functions could be something like the Usage Panel of the American Heritage Dictionary, whose 173 writers, critics, and scholars help the dictionary editors find a balance between descriptions of actual usage and prescriptions of preferred forms, while evaluating potential entries against "the fundamental linguistic virtues -- order, clarity, and conciseness" [Nunberg 1992].

Focal points such as switching constructs, interlinguas, registries, and dictionaries can provide metadata languages with the public forum that any language needs in order to grow and evolve. But interoperability over time can be ensured only if such constructs are supported by social processes that allow user communities to negotiate global meanings while adapting them to local needs. Ideally, such a process would allow a user of DC-Korean to propose a local extension, defined in English, for addition to the global standard. The process would then see this proposal through a formal review, vote of approval, and incorporation into the canon.

The Working Group on Dublin Core in Multiple Languages

The Working Group on Dublin Core in Multiple Languages grew out of a break-out group at the Canberra workshop of March 1997 at which we agreed that versions of the Dublin Core in multiple languages could interoperate by sharing a set of globally valid, machine-readable tokens [Baker 1997a]. The mission of the working group today is to coordinate the development of Dublin Core as a multilingual metadata standard by addressing the related issues of policy and application [Dublin Core in Multiple Languages].

For starters, we need to develop a common vocabulary and shared understanding of the status of these various Dublin Cores. The discussion of translation versus version summarized above is not entirely closed. Until now, we have been calling them versions, but since the current DC-English has been declared to be Version 1.0 and work has begun both on Version 1.1 and on Version 2.0, this term will no longer serve. Clearly, we will need to approach these issues in a modular fashion. Translations of the two-page document that defines the fifteen elements of Version 1.0 are unambiguously translations. Henceforth, local Dublin Cores should specify the version of DC-English on which they are based (e.g., DC-Finnish Version 1.0 would be a translation of DC-English Version 1.0).

In practice, however, many local schemas contain additional elements outside the scope of Dublin Core itself. For example, Netscape's Site Preview Format includes Title of Channel, Text of Content, and Image. Rather than shoehorning such extended semantics into the "catch-all" element of Dublin Core, Description -- as some implementors have done -- Netscape follows the good practice of deriving its SPF schema from multiple namespaces. Dublin Core elements within the schema point to the Dublin Core namespace, while the local extensions point to a local namespace for SPF.
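The idea of drawing one schema from multiple namespaces can be illustrated with a small Python sketch. The Dublin Core namespace is the one cited in this article; the SPF namespace URI, the prefixes, the sample values, and the helper name expand are assumptions made for the example and do not describe Netscape's actual format.

```python
# The Dublin Core namespace URI is the one cited in this article; the SPF
# namespace URI is a hypothetical placeholder.
NAMESPACES = {
    "dc": "http://purl.org/dc/elements/1.0/",
    "spf": "http://example.org/spf/",  # hypothetical
}


def expand(qualified_name, namespaces):
    """Turn 'prefix:Element' into a full, globally unique identifier."""
    prefix, _, local_name = qualified_name.partition(":")
    if prefix not in namespaces or not local_name:
        raise ValueError(f"unknown or malformed name: {qualified_name!r}")
    return namespaces[prefix] + local_name


# Hypothetical record mixing shared elements with a local extension.
record = {
    "dc:Title": "An example channel",
    "dc:Description": "A short summary of the channel",
    "spf:Image": "logo.gif",  # local extension, kept out of dc:Description
}

expanded = {expand(name, NAMESPACES): value for name, value in record.items()}
```

An application that understands only Dublin Core can then keep the entries whose identifiers begin with the Dublin Core namespace and ignore the rest, without guessing at the meaning of the local extensions.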

The principle of modularity, however, poses challenges for multilinguality. The Dublin Core Data Model Working Group has quite sensibly proposed that the Dublin Core community avoid reinventing wheels. Where reasonable, it should "beg and borrow" sets of terms, controlled vocabularies, and metadata schemas from communities that already maintain them. For example, an organization called the Internet Mail Consortium promotes "vCard", a core set of metadata elements covering email and Web addresses, identification photos, telephone numbers, and company logos -- a "Dublin Core", as it were, for business cards [vCard]. Rather than repropose identical sub-elements for Dublin Core itself, Dublin Core should officially point to the external namespace where this schema is held. How one in practice would maintain local versions of these external vocabularies and schemas for use in non-English-speaking environments is a question we have done no more than pose.

From the standpoint of implementors in other languages, of course, the translation of a two-page schema is just a beginning. As Shigeo Sugimoto has pointed out, it is significantly more expensive for non-English-speaking countries to follow the evolution of the standard because supporting materials must also be produced in the local language. User guides must be translated -- or written from scratch. Crosswalks must be maintained to legacy schemas such as the many variants of the MARC format for libraries. Metadata creation tools must be localized with links to local classification schemes and subject headings.

So far, we have freely invited metadata-using institutions in many countries to translate Dublin Core into their own languages. More than once, a translation begun by a single person has been adopted by a national institution, so we see no reason to discourage initiatives by individuals. All such efforts spread awareness of the standard, thereby increasing the number of potential users and hence the value of Dublin Core for cross-language interoperability. Regional linguistic differences or institutional rivalries could eventually result in the creation of more than one Dublin Core for any given language, but we see no need to discourage this either; the creators and users of metadata in each country or language area will eventually vote with their templates.

In the medium term, however, we may need a process for evaluating these versions, with peer review for translation quality and verification of an institution's commitment to maintaining the standard on an ongoing basis. Dublin Cores that meet certain criteria could perhaps obtain official endorsement or certification. Peer review could also help maintain the modular distinction between pure Dublin Core and local extensions or adaptations. Through the mechanism of a distributed registry, certified Dublin Core elements in various languages could be offered for automatic loading into browser templates or metadata editors.

As a first step in this direction, we have implemented at the Asian Institute of Technology a simple registry of Dublin Core in about twelve languages [Xu 1998]. In this design, a central registry (to be maintained at http://purl.org/DC/ after January 1999) holds a list of available translations, which are held in RDF format on local servers in places like Bangkok, Tsukuba, and Paris. Conceptually and practically, these local schemas share the namespace http://purl.org/dc/elements/1.0/ along with its shared indexing tokens. Users who request to see Dublin Core in Thai can choose to view the schema directly in HTML, if their browser supports the proper font; to receive the schema as a bitmap image; or, if their browser supports Java, to view the schema using the "Multilingual HTML" system developed at the University of Library and Information Science in Tsukuba, Japan, which wraps files in Thai, Korean, and other languages with the glyphs and applets necessary for properly displaying their fonts [Multilingual HTML].
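The registry design described above can be approximated by the following Python sketch. It is not the implementation reported in [Xu 1998]: the schema locations, the in-memory stand-in for the local servers, the display labels, and the function names are all assumptions made for illustration. Only the shared namespace is taken from the text.

```python
# Shared namespace cited in the article; every translation reuses its tokens.
DC_NAMESPACE = "http://purl.org/dc/elements/1.0/"

# Central registry: language code -> location of the locally held schema.
# These URLs are hypothetical placeholders.
CENTRAL_REGISTRY = {
    "en": "http://example.org/dc/en.rdf",
    "de": "http://example.org/dc/de.rdf",
    "fr": "http://example.org/dc/fr.rdf",
}

# Stand-in for the schemas held on local servers: token -> display label.
# Labels are illustrative, not the wording of the official translations.
LOCAL_SCHEMAS = {
    "http://example.org/dc/en.rdf": {"Title": "Title", "Subject": "Subject"},
    "http://example.org/dc/de.rdf": {"Title": "Titel", "Subject": "Thema"},
    "http://example.org/dc/fr.rdf": {"Title": "Titre", "Subject": "Sujet"},
}


def identifier(token):
    """Return the language-independent identifier all translations share."""
    return DC_NAMESPACE + token


def label_for(token, language, fallback="en"):
    """Look up the display label of a shared token in one language.

    Falls back to the English schema when the language is not registered
    or its schema does not define the token.
    """
    fallback_schema = LOCAL_SCHEMAS[CENTRAL_REGISTRY[fallback]]
    location = CENTRAL_REGISTRY.get(language, CENTRAL_REGISTRY[fallback])
    schema = LOCAL_SCHEMAS[location]
    return schema.get(token, fallback_schema[token])


german_label = label_for("Title", "de")
unregistered_label = label_for("Title", "th")  # not registered in this sketch
shared_id = identifier("Title")
```

Because records are indexed under the shared token rather than the local label, a record created with a German template and one created with a French template would remain comparable in this model. Only the labels shown to users differ.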

Starting simple

One engineering maxim holds that "Large successful systems start as small successful systems" [Kelly and Reiss 1998]. As Freeman Dyson has pointed out, computers only really took off when engineers began to build them small and fast, shortening the cycle of design iteration. In part this is because the underlying problem is not one of theory, but of practice. There was a theory of flight, but it did not help the Wright brothers build an airplane. Nor were bicycles designed by theory, but by trial and error; indeed, theory still cannot really explain why they work as they do. Projects that start big are doomed to fail, he believes, because you never have time to fix all the bugs [Brand 1998].

If there is any truth to this, then Dublin Core offers a useful starting point for building a metadata infrastructure that covers multiple languages. That starting point is a two-page set of definitions that can quickly and easily be translated into any modern language. Since these translations are, for all practical purposes, semantically identical, linking them into a distributed registry is straightforward. We can proceed with the complexification of this model -- adding sub-elements and qualifiers as they are approved by the Dublin Core community, or making links to semantically related schemas such as GILS or MARC -- incrementally and as the needs arise.

To encourage participation worldwide, we must focus on promoting and adapting open standards such as RDF for expressing schemas, formatting records, and eventually for managing more complex semantic relations within a diverse ecology of metadata registries and ontologies. The open source software movement, which has achieved such visible success with products such as the Linux operating system and the Apache Web server, provides an inspiring model of implementor communities dedicated to the free propagation of tools. Each effort to localize Dublin Core for a particular community or language stands to benefit from the free availability of metadata templates and editors, Java utilities, user guides, and crosswalks to common element sets.

In its simpler form (versions 1.0 and 1.1), Dublin Core could stabilize as a metadata pidgin. A natural-language precedent for this is Tok Pisin, which has stabilized in an extended form, short of full creolization, as the lingua franca of 1.5 million people and the language of government in Papua New Guinea. At the same time, an emerging infrastructure of metadata registries, more sophisticated metadata grammars and data models, and a broadening consensus on the use of controlled vocabularies, qualifiers, and multiple namespaces could lead to the development of a more complex metadata language based on Dublin Core. This process of creolization is likely to be driven in part by the needs of new user communities, such as news agencies, Internet portals, online retailers, and the entertainment industry.

The design of Web-based processes to support these developments could address some of the reasons why universalist language projects have failed in the past. In contrast to Esperanto, Dublin Core stands to acquire a critical mass of users. It already has the official endorsement of some governments. Its use on the Web will give it wide exposure in a mass medium. And the formalization of a maintenance agency for Dublin Core will provide a forum for controlling its evolution. To remain relevant and useful, Dublin Core will need to evolve and grow like any other language. This will require processes that strike a balance between institutional control from above and natural change in usage from below. The challenge will be to design these processes so that speakers of all languages can participate in the definition of global semantics. The reward for this effort will be better access to resources across disciplines and languages worldwide.

Acknowledgements

Many thanks to the editors of D-Lib Magazine and to John Kunze (University of California, San Francisco) for their comments on earlier drafts; to Michael Kasper (Amherst College) for help with the literature on pidgins; and to Godfrey Rust (INDECS Project) for advice on rights-management statements. The "virtual tourist" metaphor was coined by Ricky Erway of the Research Libraries Group.

Formal publications cited

[Baker 1997a] Thomas Baker, "Metadata semantics shared across languages: Dublin Cores in languages other than English", [break-out group report from the Fourth Dublin Core workshop in Canberra], March 1997, http://purl.org/dc/groups/languages/mr19970303.htm.

[Baker 1997b] Thomas Baker, "Dublin Core in multiple languages: Esperanto, interlingua, or pidgin?" Proceedings of the International Symposium on Research, Development and Practice in Digital Libraries 1997, Tsukuba (Japan): University of Library and Information Science, http://www.DL.ulis.ac.jp/ISDL97/proceedings/thomas/thomas.html.

[Baker and Lynch 1998] Thomas Baker and Clifford A. Lynch, "Summary review of the Working Group on Metadata", in: Peter Schäuble and Alan F. Smeaton, A research agenda for digital libraries: summary report of the series of joint NSF-EU Working Groups on Future Directions for Digital Libraries Research, [DELOS Working Group Report (ERCIM-98-W004)], Paris: European Research Consortium for Informatics and Mathematics.

[Baker and Weibel 1998] Thomas Baker and Stuart Weibel, "Dublin Core in Thai and Japanese: Managing universal metadata semantics", Digital Libraries (ISSN 1340-7287), March 1998, Tsukuba (Japan): University of Library and Information Science.

[Brand 1998] Stewart Brand, "Freeman Dyson's brain", Wired 6.02 (1998): 130-177.

[Eco 1995] Umberto Eco, The search for the perfect language, Oxford: Blackwell, 1995.

[Kelly and Reiss 1998] Kevin Kelly and Spencer Reiss, "One huge computer", Wired 6.08 (1998): 128-170.

[Koch, Hakala, and Husby 1996] Traugott Koch, Juha Hakala, Ole Husby, "Report from the Metadata Workshop II, Warwick, UK, April 1-3, 1996", NORDINFO-Nytt 1996:2, pp.40-48, http://www.ub2.lu.se/tk/warwick.html.

[Kunze 1996] John Kunze, "A unified element vocabulary for metadata", http://www.ckm.ucsf.edu/personnel/jak/dist.html.

[Lassila and Swick 1998] Ora Lassila and Ralph R. Swick, eds. "Resource Description Framework (RDF) model and syntax specification", [version used for this paper: 8 October 1998], Paris and Cambridge: International World Wide Web Consortium, http://www.w3.org/TR/WD-rdf-syntax.

[Laycock and Mühlhäusler 1994] Donald C. Laycock and Peter Mühlhäusler, "Language engineering: special languages", An encyclopaedia of language, London: Routledge.

[Miller 1998] Eric Miller, "Netscape's SPF Site Preview Format", [PowerPoint presentation], Dublin (Ohio): OCLC Office of Research, http://www.oclc.org/oclc/research/projects/core/workshops/dc6conference/pp/ws-dc6-implementor-miller-netscape.ppt.

[Netscape 1997] Netscape Communications Corporation, "Netscape works with W3C and leading content providers to drive new specification for organizing, describing and navigating information on Internet, intranets, and desktops", [press release], http://www.netscape.com/newsref/pr/newsrelease488.html.

[Nunberg 1992] Geoffrey Nunberg, "Usage in The American Heritage Dictionary: the place of criticism", The American Heritage Dictionary of the English Language, Third Edition, Boston: Houghton Mifflin Company.

[Pinker 1994] Steven Pinker, The language instinct, New York: Harper Collins.

[Prabawa 1998] Bagus Tri Prabawa, Local and global interoperability of metadata: the Dublin Core in Indonesian, [masters thesis], Bangkok: Asian Institute of Technology.

[Schneider 1966] Gilbert D. Schneider, West African Pidgin-English, [possibly a PhD thesis], Athens (Ohio): The Hartford Seminary Foundation [possibly self-published].

[Weibel 1995] Stuart Weibel, "Metadata: the foundations of resource description", D-Lib Magazine, July 1995, http://www.dlib.org/dlib/July95/07weibel.html.

[Wu 1998] Wu Xiaoyun, Interactive query formulation for multilingual information retrieval, [masters thesis], Bangkok: Asian Institute of Technology.

[Xu 1998] Xu Bo, A distributed registry of Dublin Core metadata in multiple languages, [masters thesis], Bangkok: Asian Institute of Technology.

Evolving Web pages cited

[Dublin Core Metadata Initiative] http://purl.org/DC/

[Dublin Core in Multiple Languages] http://purl.org/DC/groups/languages.htm

[DC-Arabic] http://www.forwiss.tu-muenchen.de/~haddouti/DC_arabic.html

[DC-Chinese (Big5)] http://dimes.lins.fju.edu.tw/dublin/

[DC-Dutch] http://www.konbib.nl/coop/donor/rapporten/DCsimpleformat.html

[DC-Finnish] http://linnea.helsinki.fi/meta/dcref-fin.html

[DC-French] http://www-rocq.inria.fr/~vercoust/DOCS/DC-french.html

[DC-German] http://www.mpib-berlin.mpg.de/DOK/metatagd.htm

[DC-Greek] http://www.ics.forth.gr/~sarantos/dublincore.html

[DC-Japanese] http://www.DL.ulis.ac.jp/DC/

[DC-Korean] http://www.kric.ac.kr/~dc_kor/dc-korean.html

[DC-Norwegian] http://www.bibsys.no/meta/dc/dcref.html

[DC-Portuguese] http://bruxelas.inesc.pt/~jlb/publica/metadata/elementos_dublin_core.htm

[DC-Spanish] http://www.rediris.es/metadata/dublin_core_elements.es.html

[DC-Swedish] http://www.sics.se/~preben/DC/dcref-swe.html

[DC-Thai] http://server.tiac.or.th/dublin.pdf

[EuroWordNet] http://www.let.uva.nl/~ewn/

[Information and Communication Commission of the Learned Societies] http://www.mathematik.uni-osnabrueck.de/ak-technik/

[Metadata Projects at German Libraries] http://www.dbi-berlin.de/projekte/einzproj/meta/meta03.htm

[German Metadata Registry] http://www.mpib-berlin.mpg.de/dok/metadata/gmr/gmr1e.htm

[MetaForm] http://www2.sub.uni-goettingen.de/metaform

[Meta-Lib Project] http://www.dbi-berlin.de/projekte/einzproj/meta/meta00.htm

[Multilingual HTML] http://mhtml.ulis.ac.jp/

[Nordic Metadata Projects] http://linnea.helsinki.fi/meta/

[RFC 2413] ftp://ftp.isi.edu/in-notes/rfc2413.txt

[Universal Networking Language] http://www.ias.unu.edu/research_prog/science_technology/universalnetwork_language.html

[vCard] http://www.imc.org/pdi/

Copyright © 1998 Thomas Baker

hdl:cnri.dlib/december98-baker