Over the past several years, we have been engaged in a number of efforts examining the role, format, composition, and architecture of metadata for networked resources. During this time, we have noticed the tendency to be led astray by comfortable, but somewhat inappropriate, models in the non-digital information environment. Rather than pursuing familiar models, there is the need for a new model that fully exploits the unique combination of computation and connectivity that characterizes the digital library.
In this paper, we describe an extension of the Warwick Framework [WF1] that we call Distributed Active Relationships (DARs). DARs provide a powerful model for representing data and metadata in digital library objects. They explicitly express the relationships between networked resources, and even allow those relationships to be dynamically downloadable and executable. The DAR model is based on the following principles, which our examination of the "data about data" definition has led us to regard as axiomatic:
Coordinating metadata development across all those domains is impossible. Therefore the creation, administration, and enhancement of individual metadata forms should be left to the relevant communities of expertise. Ideally this would occur within a framework that supports interoperability across data and domains. The Warwick Framework (WF) provides just such a modular approach to metadata.
The Warwick Framework originated from an attempt, at the Second Invitational Metadata Workshop [WW], to define an extension mechanism for the Dublin Core Metadata Element Set [DC] in order to prevent unrestricted growth in its complexity. Named after the site of the workshop in Warwick, the WF tackles the extension problem by aggregating typed metadata packages into containers. The WF defines three types of package:
Figure 1 illustrates a simple example of a Warwick Framework Container. The container contains three logical packages of metadata. The first two, a Dublin Core record and a MARC record, are physically in the container. The third metadata package, which defines the terms and conditions for access to a content object, is referenced in the container indirectly via a URI.
The framework is a simple concept, but it has important implications for interoperation and as the basis for long-lived metadata systems. By factoring complex descriptions into simpler components, interoperation can be addressed at a component level, rather than at an "all or nothing" monolithic level. The framework also allows lowest-common-denominator descriptions, such as the Dublin Core, to exist beside complex descriptions from specialized communities, such as MARC. Thus, members of the same community can exchange their rich descriptions in preference to more general ones. System evolution is facilitated since, as new purposes for datasets emerge, new metadata schemas and formats can be developed. Instances of those can be added as new packages to the container(s) associated with the dataset. New handlers can be added to utilize the new packages, and this can occur without significant disruption to the metadata system architecture as a whole.
To meet this need, we defined a new abstraction called the Warwick Framework Catalog (WFC). A WFC is a list of assertions about individual packages and the relationships between packages. Example relations are one package acting as a digital signature, bibliographic description, or access control specification for another package.
(bibliographic-description package-1 package-2)
(terms-for-accessing package-1 package-3)
(derived-via-transformation package-1 package-6 package-5)
(digital-signature package-1 package-4)
(digital-signature package-6 package-7)
Listing 1 illustrates an example Warwick Framework Catalog. It shows that package-2 is a bibliographic description of package-1, while package-3 provides the terms for gaining access to package-1. Relations need not be binary; for example, we might state that package-5 is derived from package-1 by a transformation that is specified in package-6. The same relation might hold between different sets of resources, as shown by the digital-signature relation in the last two lines. Figure 2 shows a simple Warwick Framework Container with a relationship package.
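A catalog of this kind can be read mechanically. The following Python sketch shows one possible way to do so. It assumes that every assertion is a flat parenthesized list whose first token names the relation; the names `Assertion`, `parse_catalog`, and `relations_about` are illustrative only and are not part of the Warwick Framework.

```python
import re
from typing import NamedTuple


class Assertion(NamedTuple):
    relation: str     # e.g. "digital-signature"
    arguments: tuple  # package names, in the order given in the catalog


def parse_catalog(text):
    """Parse flat s-expressions such as "(rel a b)" into Assertions."""
    assertions = []
    for body in re.findall(r"\(([^()]*)\)", text):
        tokens = body.split()
        if tokens:
            assertions.append(Assertion(tokens[0], tuple(tokens[1:])))
    return assertions


def relations_about(catalog, package):
    """Return every assertion that mentions the given package."""
    return [a for a in catalog if package in a.arguments]


LISTING_1 = """
(bibliographic-description package-1 package-2)
(terms-for-accessing package-1 package-3)
(derived-via-transformation package-1 package-6 package-5)
(digital-signature package-1 package-4)
(digital-signature package-6 package-7)
"""

catalog = parse_catalog(LISTING_1)
for assertion in relations_about(catalog, "package-6"):
    print(assertion.relation, assertion.arguments)
```

A receiver could run such a query on the first package of a container to decide how to treat the remaining packages, for instance by locating the signature that covers a transformation before applying it.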
The WFC could be provided as the first package in a container, and would provide enough information to the receiver to allow proper treatment of the remaining packages. Although the example above uses an s-expression syntax, the WFC is essentially another conceptual model that can be expressed in a number of ways. The key contribution of the WFC is that it leads to some far-reaching generalizations to the Warwick Framework. Those generalizations are described in the next two sections.
A better approach is to consider the information architecture as a collection of inter-related resources. While these resources may have a type, such as PostScript, HTML, or a Java program, this type is orthogonal to whether the resource is acting as data vs. metadata in some context. That contextual information is specified by the relationships between the resources. We can model these inter-related resources using directed graphs, where nodes represent the resources and the labeled arrows between nodes represent the relationships. Since a resource may be related to many other resources, nodes may have many arcs originating from or terminating at them. Looking at the direction of an arrow, it is easy to see whether a resource is playing the role of data or metadata in the context of that particular relationship. We can easily accommodate such a model by generalizing the Warwick Framework so that it may contain any resources, not just those considered "metadata". Thus, we can use the Warwick Framework Catalog to specify the relationships between various resources, both inside and outside the container.
As a simple example (we will use more complex examples later), assume that the relationship arcs are uni-directional and that the only relationship they specify is "has-metadata". Figure 3 shows a set of resource nodes and relationship arcs that correspond to the Siskel and Ebert movie review mentioned earlier. For the moment, ignore the three overlapping ovals in the figure. As illustrated, certain resource nodes have both outgoing and incoming arcs; thus they are "data" in one context and "metadata" in another. For example, the Siskel and Ebert review is metadata for the movie "Men in Black", but the review has metadata of its own (it is acting as "data" relative to a Dublin Core record and a Terms and Conditions specification).
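The role-by-context idea can be made concrete with a small Python sketch. It encodes only the three "has-metadata" arcs stated in the text; the node labels and the `roles` function are hypothetical names chosen for illustration.

```python
from collections import defaultdict

# (source, relation, target): each arc points from data to its metadata.
ARCS = [
    ("movie:men-in-black", "has-metadata", "review"),
    ("review", "has-metadata", "dublin-core-record"),
    ("review", "has-metadata", "terms-and-conditions"),
]


def roles(arcs):
    """Map each node to the set of roles it plays across all arcs."""
    result = defaultdict(set)
    for source, _relation, target in arcs:
        result[source].add("data")      # outgoing arc: described by the target
        result[target].add("metadata")  # incoming arc: describes the source
    return dict(result)


node_roles = roles(ARCS)
for node in sorted(node_roles):
    print(node, sorted(node_roles[node]))
```

The review node ends up with both roles, which is the point of the model: "data" and "metadata" are properties of a relationship, not of a resource.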
We can take a different perspective on Figure 3 and formulate three digital library resources, which can be found through resource discovery and accessed using unique identifiers (such as URLs and URNs). Each of these resources aggregates data and related metadata. These aggregations, shown by the overlapping ovals, are:
In generalizing the Warwick Framework as a digital object container, we emphasize two features and then introduce a significant extension.
First, recall that the Warwick Framework places no locality restriction on the packages that it "contains". A package may either be physically in a container or indirectly referenced via a URI (thus, it might be located anywhere in the global information space). This is demonstrated in Listing 2, in which the relationships in a Warwick Framework Catalog refer to resources using URIs as well as internal package references. Figure 5 illustrates a digital object container that references, through the relationship catalog, a component of an external digital object. One interesting manifestation of this is that a container, or digital object, may actually have no physically contained data sets, but may act merely as a logical container with only relationships that reference remote data sets.
Second, the example in Figure 3 illustrates only one simple type of uni-directional relation, the "has-metadata" relation. However, as we have emphasized throughout our work on the Warwick Framework, the notion that something "is metadata" does little to convey its actual meaning and, therefore, such a simple relationship should be avoided. The Warwick Framework Catalog can include a variety of relationships with much richer semantics, such as "terms-for-accessing", "bibliographic-description", and the like.
(bibliographic-description package-1 URI-1)
(terms-for-accessing package-1 URI-2)
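The absence of a locality restriction can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the `Container` class and its methods are hypothetical, the `example.org` and `urn:example` identifiers are placeholders standing in for URI-1 and URI-2, and no network access is attempted.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class Container:
    packages: dict = field(default_factory=dict)  # label -> physically held content
    catalog: list = field(default_factory=list)   # (relation, argument, argument, ...)

    def is_local(self, reference):
        return reference in self.packages

    def remote_references(self):
        """Catalog arguments that carry a URI scheme and are not held locally."""
        refs = []
        for _relation, *arguments in self.catalog:
            for ref in arguments:
                if not self.is_local(ref) and urlparse(ref).scheme:
                    refs.append(ref)
        return refs

    def is_purely_logical(self):
        """True when the container holds relationships but no physical packages."""
        return not self.packages and bool(self.catalog)


container = Container(
    packages={"package-1": "report body"},
    catalog=[
        ("bibliographic-description", "package-1", "http://example.org/dc-record"),
        ("terms-for-accessing", "package-1", "urn:example:terms"),
    ],
)
logical = Container(
    catalog=[("bibliographic-description",
              "http://example.org/report", "http://example.org/dc-record")]
)
print(container.remote_references())
print(logical.is_purely_logical())
```

The second container has no physically contained data sets at all; it exists only as a set of relationships over remote resources.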
Up to this point, we have assumed that the relationships in the Warwick Framework Catalog are identified with simple names, which might be listed in some registry. A more general solution is to let the relationship names be URIs. This provides a scoping mechanism to preclude name clashes. More interestingly, it opens up the possibility of making the relations into resolvable first-class resources in their own right. These "relation resources" might have their own "metadata", including access controls and descriptions. In this scenario, the simple relationship arcs illustrated in Figure 3 become nodes in their own right, with possible relationships to other data nodes. In the next section, we extend this notion even further by describing executable relationships that enable dynamic and interpretable data and metadata.
The best way to describe the motivation and use of DARs is to apply them to a well-known problem: rights management. Managing intellectual property rights for digital library objects is complex, and we refer the reader to [GLAD] for a more thorough treatment of the subject. At one end of the spectrum, rights management metadata may be a simple textual description, such as that used in "shrink-wrap" licenses. At the other end, there are complex access control schemes that may involve interaction and negotiation with authentication services, billing services, agents, etc. Any reasonable architecture for networked information management must accommodate the full set of rights management possibilities.
One approach to this problem is executable rights management metadata. The metadata returned to a client could be an executable object, or a handle to an executable object using distributed object technology such as CORBA. Using this executable metadata, the client may present, obtain, or negotiate the proper certificates or authorization to access the content of the digital object. During this process, the executable metadata may contact other services that are necessary to obtain the certification or authorization.
Figure 6 illustrates a Distributed Active Relationship that manages the access rights to a resource. In this case, the rights management scheme is based on the notion of an access control list. Note the separation between the access control list, in the package labeled P2, and the mechanism for its enforcement, which is in the external relation object. Also note that the relation object is a digital object in its own right, referenced in the relationship catalog via a global identifier, URN1. The activation package in the figure stands for an executable component of the relation that would be invoked when a client accesses the content in the package labeled P1. The description package in the relationship container might be some textual description of the relationship. Section 6.1 describes one possible implementation of such a rights management mechanism.
An important component of this rights management scheme, and of the DAR concept in general, is that the executable aspect of the DAR is external to the resource being accessed and to the repository containing the resource. This level of modularization maximizes code reuse and extensibility. It also means that not all contingencies and consequences need be anticipated before an object is released. Rather, a rights holder may add to or subtract from the metadata as circumstances change and new services become available. Section 6.1 describes a digital library repository architecture that implements this scheme for rights management.
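The separation between policy data and enforcement mechanism can be sketched in Python. This is an illustration under stated assumptions rather than the mechanism of any particular repository: `RELATION_REGISTRY` is a hypothetical stand-in for resolving a relation's global identifier to its activation component, `urn:example:acl-relation` is a placeholder identifier, and the labels P1 and P2 follow the description of Figure 6.

```python
from dataclasses import dataclass

# Hypothetical registry: relation identifier -> executable activation component.
# In a real system this lookup would resolve a global identifier over the network.
RELATION_REGISTRY = {}


def acl_activation(acl_package, user):
    """Enforcement mechanism: grant access only to users named in the ACL data."""
    return user in acl_package.get("allowed", [])


RELATION_REGISTRY["urn:example:acl-relation"] = acl_activation


@dataclass
class DigitalObject:
    packages: dict  # package label -> content
    catalog: list   # (relation identifier, guarded package, policy package)

    def access(self, label, user):
        """Return a package's content after running every relation that guards it."""
        for relation_id, guarded, policy in self.catalog:
            if guarded == label:
                activate = RELATION_REGISTRY[relation_id]
                if not activate(self.packages[policy], user):
                    raise PermissionError(f"{user} may not access {label}")
        return self.packages[label]


obj = DigitalObject(
    packages={"P1": "content", "P2": {"allowed": ["alice"]}},
    catalog=[("urn:example:acl-relation", "P1", "P2")],
)
print(obj.access("P1", "alice"))
```

Because the enforcement code lives outside the object, a rights holder could change the list in P2, or point the catalog at a different relation, without touching the content in P1.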
Another consequence of the DAR model is that metadata packages can be virtual or dynamic [LAG]. That is, the package data may exist only as the result of a computation on some other resource. For example, we might state that both MARC and Dublin Core descriptions of a resource are available. The Dublin Core description could be computed on demand from the MARC description. Active relationships can capture the dependency of the virtual Dublin Core package on the MARC package. This is similar in concept to, and could be applied to, the notion of "Just-in-time Conversion" addressed in [PW]. For this purpose, a single underlying format, such as a scanned image, could be associated with several different DARs that convert the object on demand to a variety of formats such as JPEG, GIF, or OCR-ed text.
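A virtual package of this kind might be sketched in Python as follows. The record is a deliberately simplified MARC-like dictionary with invented values, the three-field crosswalk (245$a to Title, 100$a to Creator, 260$b to Publisher) is only a fragment of what a real MARC to Dublin Core mapping would cover, and `VirtualPackage` is a hypothetical name.

```python
# Simplified MARC-like record: tag -> {subfield code: value}. Values are invented.
MARC_RECORD = {
    "100": {"a": "Doe, Jane"},
    "245": {"a": "An example title"},
    "260": {"b": "Example Press"},
}

# Illustrative fragment of a crosswalk: (MARC tag, subfield) -> Dublin Core element.
MARC_TO_DC = {
    ("245", "a"): "Title",
    ("100", "a"): "Creator",
    ("260", "b"): "Publisher",
}


def marc_to_dublin_core(marc):
    """Compute a Dublin Core description from the MARC-like record."""
    dc = {}
    for (tag, code), element in MARC_TO_DC.items():
        value = marc.get(tag, {}).get(code)
        if value is not None:
            dc[element] = value
    return dc


class VirtualPackage:
    """A package whose content is computed on demand from another package."""

    def __init__(self, source_label, transform):
        self.source_label = source_label  # the package this one depends on
        self.transform = transform        # the executable part of the relationship

    def materialize(self, packages):
        return self.transform(packages[self.source_label])


packages = {"marc": MARC_RECORD}
dublin_core = VirtualPackage("marc", marc_to_dublin_core)
print(dublin_core.materialize(packages))
```

Nothing is stored for the Dublin Core package; a later change to the MARC record is reflected the next time the package is materialized.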
While the DAR model is intriguing, there are three problem areas that must be addressed in practical implementations.
FEDORA is based on the concepts of the Kahn/Wilensky Framework [KWF]. It uses the abstraction of a Warwick Framework container to aggregate potentially distinct resources into a digital object. Using DARs, we can then provide disseminations from that digital object. As mentioned in section 5, some of those disseminations may be virtual, that is, computed on the fly, as opposed to having been stored. The DARs that identify disseminations and the resources they draw upon are known as Interfaces, since they provide different ways of accessing the content. For example, the digital object for a technical report might have both a "PostScript" interface and an "HTML" interface.
Another key point in the Kahn/Wilensky architecture is the need for protection of intellectual property. FEDORA implements a DAR-based scheme for doing this known as an Enforcer. An Enforcer is an object that guards the implementation of an Interface with respect to input and output. In other words, a request for a specific dissemination from a digital object (e.g., a PostScript page) may require the invocation of a terms and conditions "machine", for example one that enforces access control list restrictions, as illustrated in Figure 6. The output of the dissemination, the PostScript page, may be filtered through the same Enforcer to add a digital signature.
Figure 7 illustrates a FEDORA digital object that aggregates three MIME-typed datasets (known as DataStreams in FEDORA). There are three Interfaces associated with this digital object that allow clients to access the PostScript content (by page or entirely) and to access MARC or Dublin Core bibliographic descriptions (by field or entirely). Note that the Dublin Core metadata is derived from the physically stored MARC record. Finally, the Interface for accessing the PostScript content is protected by an Enforcer, which in this case is an access control list mechanism that uses access control list data stored as data in the digital object.
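The way an Enforcer guards both the input and the output of an Interface can be sketched in Python. This is not FEDORA's actual API: the class and method names are hypothetical, pages are modeled as plain strings, and the HMAC tag is only a stand-in for a real digital signature, which would require public-key cryptography.

```python
import hashlib
import hmac


class PostScriptInterface:
    """Disseminates pages from a stored DataStream (one string per page here)."""

    def __init__(self, pages):
        self.pages = pages

    def get_page(self, number):
        return self.pages[number - 1]  # pages are numbered from 1


class AclEnforcer:
    """Guards an Interface: checks the requester on input, tags the result on output."""

    def __init__(self, interface, acl, key):
        self.interface = interface
        self.acl = set(acl)  # access control list data held in the digital object
        self.key = key       # secret used for the stand-in "signature"

    def get_page(self, user, number):
        if user not in self.acl:  # input side: terms and conditions check
            raise PermissionError(f"{user} is not on the access control list")
        page = self.interface.get_page(number)
        # Output side: attach an integrity tag to the dissemination.
        tag = hmac.new(self.key, page.encode(), hashlib.sha256).hexdigest()
        return page, tag


guarded = AclEnforcer(
    PostScriptInterface(["page one", "page two"]),
    acl=["alice"],
    key=b"demo-key",
)
page, tag = guarded.get_page("alice", 2)
print(page, tag[:16])
```

The Interface itself knows nothing about access control, so the same PostScript Interface could be reused behind a different Enforcer, or behind none.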
RDF has four components: the modeling facility, the serialization syntax, schema definitions, and rule definitions. Currently, a public draft for the modeling facility and syntax has been released [RDF], and the schema working group has just been chartered. The model and syntax draft will be revised in the near future to add a typing mechanism similar to that of modern object-oriented programming languages, once the interactions between typing and schemas have been specified.
Similar to the approach discussed in section 4, RDF models are directed graphs. Nodes represent web resources, arcs state that certain properties (such as "Author") are associated with a node, and arcs terminate either at a node or at a string. As an example, Figure 8 shows a model for some simple Dublin Core bibliographic information associated with a web page. Listing 3 shows the serialized version of that model.
<?namespace href="http://www.purl.org/Metadata/DublinCore/" as="DC"?>
<?namespace href="http://www.w3.org/Schemas/RDF/" as="RDF"?>
<RDF:Serialization>
  <RDF:Assertions href="http://www.acl.lanl.gov/~rdaniel/">
    <DC:Creator>Ron Daniel Jr.</DC:Creator>
    <DC:Publisher>Los Alamos National Laboratory</DC:Publisher>
  </RDF:Assertions>
</RDF:Serialization>
One of the key features of RDF is its pervasive use of URIs. The namespace declarations in Listing 3 provide one indication of this. Tag names like DC:Creator expand to a URI, such as http://www.purl.org/Metadata/DublinCore, and the identifier "Creator". This gives us scoped names and allows name space definitions to be fetched from the network.
In order to implement DARs in RDF, we extend the namespace definition slightly by allowing scoped tag names to expand to a URI such as http://www.purl.org/Metadata/DublinCore/Creator. (The XML name space is only now being specified [XML2] and neither blesses nor precludes this extension.) With this extension, the arcs in RDF correspond to DARs. For example, the DC:Creator arc in Figure 8 can be expressed as a DAR through the 3-tuple scheme shown in Listing 4.
(http://www.purl.org/Metadata/DublinCore/Creator    - the arc type
 http://www.acl.lanl.gov/~rdaniel/                  - the source of the arc
 "Ron Daniel Jr.")                                  - the dest. of the arc
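The expansion of a scoped tag name into a single URI, and the resulting 3-tuple, can be sketched in Python. The namespace URIs are those declared in Listing 3; the function names `expand` and `as_dar` are illustrative, and simple concatenation of the namespace URI and the local name is assumed as the expansion rule.

```python
# Prefix -> namespace URI, as declared in Listing 3.
NAMESPACES = {
    "DC": "http://www.purl.org/Metadata/DublinCore/",
    "RDF": "http://www.w3.org/Schemas/RDF/",
}


def expand(qualified_name, namespaces=NAMESPACES):
    """Expand a scoped name such as "DC:Creator" into one URI."""
    prefix, _, local = qualified_name.partition(":")
    if not local or prefix not in namespaces:
        raise ValueError(f"cannot expand {qualified_name!r}")
    return namespaces[prefix] + local


def as_dar(arc_name, source, destination):
    """Express an RDF arc as an (arc type, source, destination) 3-tuple."""
    return (expand(arc_name), source, destination)


dar = as_dar("DC:Creator", "http://www.acl.lanl.gov/~rdaniel/", "Ron Daniel Jr.")
for part in dar:
    print(part)
```

Since the arc type is now an ordinary URI, it can in principle be resolved like any other resource, which is what allows a relation to carry a description or an executable component.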
Thus, RDF seems to provide the facilities needed to construct an active metadata system. Work is underway at Los Alamos to investigate this possibility. That effort is currently considering the issue of how to efficiently handle executable relations.
Assume we have a repository similar to that of FEDORA, and that we wish to implement enforcers and interfaces. We can pick a particular form of executable content (such as Java class files) to support in our system. Determining the meaning of an executable relationship and deciding whether to run it remains a problem. As mentioned earlier, blindly executing all relationships would be foolish due to performance and security considerations. We can use RDF's typing system to indicate that particular relations are subclasses of known relationships such as "Enforcer" or "Interface". The security manager of our repository could look at the type of all DARs. Only those that are subclasses of known, pre-approved types would be executed. Therefore we can implement a security manager in our repository that will only execute relations when they are of a known type, giving us some indication of their meaning.
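The type-based check can be sketched in Python. Here the Python class hierarchy stands in for RDF's typing system, which is an assumption made purely for illustration; `SecurityManager`, `AclEnforcer`, and `UnknownRelation` are hypothetical names, and a real repository would also need to sandbox the code it agrees to run.

```python
class Relation:
    """Base type for every relation in this sketch."""

    def execute(self, request):
        raise NotImplementedError


class Enforcer(Relation):
    """Known, pre-approved relation type."""


class Interface(Relation):
    """Known, pre-approved relation type."""


class AclEnforcer(Enforcer):
    """A subclass of a known type, so its general meaning is understood."""

    def __init__(self, allowed):
        self.allowed = set(allowed)

    def execute(self, request):
        return request["user"] in self.allowed


class UnknownRelation(Relation):
    """Executable, but not a subclass of any approved type."""

    def execute(self, request):
        return True


class SecurityManager:
    def __init__(self, approved_types):
        self.approved_types = tuple(approved_types)

    def may_execute(self, relation):
        return isinstance(relation, self.approved_types)

    def run(self, relation, request):
        if not self.may_execute(relation):
            raise PermissionError(f"refusing to run {type(relation).__name__}")
        return relation.execute(request)


manager = SecurityManager([Enforcer, Interface])
print(manager.run(AclEnforcer(["alice"]), {"user": "alice"}))
```

The check says nothing about whether an approved relation is well behaved; it only narrows execution to relations whose type gives the repository some indication of their intended meaning.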
This foundation has proven very useful in the design of FEDORA, where it allowed a graceful and promising integration of such divergent notions as the Kahn/Wilensky Digital Library architecture and downloadable code (e.g. Java applets). We are particularly interested in the capabilities of the new Resource Description Framework to facilitate the construction of systems based on DARs.
[WW] Metadata Workshop II, http://www.oclc.org:5046/oclc/research/conferences/metadata2/
[DC] Dublin Core Metadata Element Set Resource Page, http://purl.oclc.org/metadata/dublin_core/
[ARMS] Arms, William Y., Key Concepts in the Architecture of a Digital Library, D-Lib Magazine, July 1995, http://www.dlib.org/dlib/July95/07arms.html
[GLAD] Gladney, H.M. and J.B. Lotspiech, Safeguarding Digital Library Contents and Users: Assuring Convenient Security and Data Quality, D-Lib Magazine, May 1997, http://www.dlib.org/dlib/may97/ibm/05gladney.html
[LAG] Lagoze, Carl, From Static to Dynamic Surrogates: DataStream Discovery in the Digital Age, D-Lib Magazine, June 1997, http://www.dlib.org/dlib/june97/06lagoze.html
[PW] Price-Wilkin, John, Just-in-time Conversion, Just-in-case Collections: Effectively leveraging rich document formats for the WWW, D-Lib Magazine, May 1997, http://www.dlib.org/dlib/may97/michigan/05pricewilkin.html
[DL] Daniel Jr., Ron and Carl Lagoze, Distributed Active Relationships in the Warwick Framework, Proceedings of the 1997 IEEE Metadata Conference, September, 1997, http://computer.org/conferen/proceed/meta97/papers/rdaniel/rdaniel.pdf
[KWF] Kahn, Robert and Robert Wilensky, A Framework for Distributed Digital Object Services, Corporation for National Research Initiatives, http://www.cnri.reston.va.us/cstr/arch/k-w.html
[W3R] Press Release, W3C announces RDF, http://www.w3.org/Press/RDF
[XML] Extensible Markup Language (XML), World Wide Web Consortium, http://www.w3.org/XML/
[RDF] Lassila, Ora and Ralph R. Swick, Resource Description Framework (RDF) Model and Syntax, World Wide Web Consortium, http://www.w3.org/TR/WD-rdf-syntax/
[XML2] Bray, Tim and Dave Hollander and Andrew Layman (eds.), "Name Spaces in XML", W3C XML Working Group White Paper 15-October-1997, http://www.textuality.com/xml/xml-names.html