Digital Object Repository Server: A Component of the Digital Object Architecture

Sean Reilly
Corporation for National Research Initiatives
<sreilly@cnri.reston.va.us>

Robert Tupelo-Schneck
Corporation for National Research Initiatives
<schneck@cnri.reston.va.us>

Abstract

The Digital Object Architecture defines three primary components: an identifier system, metadata registries, and digital object repositories. The identifier system is the widely used Handle System and the CNRI metadata registries are now in use in several projects. This paper introduces the Digital Object Repository Server (DORS), the most recent instantiation of CNRI's repository work. DORS includes an open, flexible, secure and scalable protocol and software suite that provides a common interface for interacting directly with all types of Digital Objects. It has been implemented and tested as server software and provides a trustworthy network interface for invoking operations on objects.

Introduction

The Digital Object Architecture, as described in the seminal 1995 Kahn-Wilensky paper "A Framework for Distributed Digital Object Services," [1] defines three primary components: an identifier system, metadata registries, and digital object repositories. The Handle System has been in wide use for over a decade and has proven to be a scalable, reliable and secure identifier resolution system. CNRI's Digital Object Registry is evolving and is in current use in the registration of learning objects [2] and is being considered for use in a large experimental network environment [3]. In earlier work [4], CNRI designed and developed several versions of repository software and funded Cornell University to build a repository to a defined interface that could interoperate with the then existing CNRI repository. In recent years, CNRI has implemented a more streamlined version of the repository software known as Digital Object Repository Server (DORS) that provides the flexibility, scalability and security to serve as the foundation for a long-term information infrastructure for the Internet. We use the term 'repository' here as a shorthand reference to digital object management services, to include storage but primarily focused on deposit, access, and long-term management. Every DORS instance will use one form of storage or another, as explained below, but one of our primary goals has been to separate out, and make independent, the specifics of that storage from the services provided by the Digital Object Repository Servers. This paper describes DORS and its role in the Digital Object Architecture.

DORS includes an open, flexible, secure and scalable protocol and software suite that provides a common interface for interacting directly with all types of Digital Objects. It has been implemented and tested as server software [5] and provides a trustworthy network interface for invoking operations on objects. All operations use digital object identifiers (aka handles) to identify (1) the target object, (2) the operation being performed on the object, and (3) the entity requesting the operation. DORS normally has a storage module directly attached, but it can also make use of networks to access storage services located elsewhere. Its primary purpose is to perform operations on objects, regardless of where, how or if those objects are actually stored. For objects that have a stored representation, specified and implemented operations provide access to the generic data structure that is assumed for every object. This data structure is machine-independent, parsable, and contains at a minimum a unique persistent identifier and one or more "elements" consisting of named byte sequences. Those elements may themselves be digital objects, although the current implementation does not yet support this capability. This structure is flexible enough to represent most types of information such as audio, video, images, books, articles and more complex information types.
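The generic data structure described above can be pictured with a minimal sketch. It assumes nothing about the actual DORS implementation: the class name, method names and the identifier "example/article-1" are hypothetical placeholders, and the only properties taken from the text are a persistent identifier and a set of named byte sequences.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class DigitalObject:
    """Hypothetical in-memory model: an identifier plus named byte sequences."""

    identifier: str
    elements: Dict[str, bytes] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Every object must carry a persistent identifier.
        if not self.identifier:
            raise ValueError("a digital object needs a persistent identifier")

    def set_element(self, name: str, data: bytes) -> None:
        # Store an immutable copy of the byte sequence under the given name.
        self.elements[name] = bytes(data)

    def get_element(self, name: str) -> bytes:
        return self.elements[name]


# "example/article-1" is a made-up placeholder, not a registered handle.
article = DigitalObject("example/article-1")
article.set_element("content", b"%PDF-1.4 ...")
article.set_element("thumbnail", b"\x89PNG ...")
```

Because each element is just a named byte sequence, the same shape can hold a PDF, an image or a video without the model itself knowing anything about those formats.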

The DOR server provides several network interfaces for performing operations on Digital Objects: the Digital Object Protocol (DOP), HTTP, and DOP-over-TLS (Transport Layer Security, aka SSL). The various interfaces each have their own benefits in terms of security, resilience to firewall blocking, compatibility with proxy servers, or ubiquitous client software. As with the handle resolution protocol, redundancy is built into DOP, along with strong individual and group authentication. Redundancy is supported by a mirroring system in which each DORS communicates with other DORS to keep the object storage in sync. Clients that cannot connect to one server will automatically try another within the same defined service until they are able to establish a connection. A service consists of one or more DORS that share the same repository handle and are expected to hold the same set of objects. Authentication is based on either secret or public/private keys and X.509 certificates.
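The client-side failover behaviour described above might look roughly like the following sketch. The function name, the server names and the injected connect callable are all assumptions made for illustration; a real client would obtain the list of servers for a service by resolving the repository handle.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def connect_to_service(servers: Iterable[str], connect: Callable[[str], T]) -> T:
    """Try each server of one service in turn and return the first connection."""
    errors: List[str] = []
    for address in servers:
        try:
            return connect(address)
        except OSError as exc:
            # Remember the failure and fall through to the next mirror.
            errors.append(f"{address}: {exc}")
    raise ConnectionError("no server in the service was reachable: " + "; ".join(errors))


def fake_connect(address: str) -> str:
    # Stand-in for a real network connection; the first mirror is "down".
    if address == "dors-a.example.org":
        raise OSError("connection refused")
    return f"connected to {address}"


result = connect_to_service(["dors-a.example.org", "dors-b.example.org"], fake_connect)
```

Since every server in a service is expected to hold the same set of objects, the caller does not need to care which mirror answered.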

By providing a formal interface through which objects can be securely and reliably accessed, stored and replicated, the CNRI DOR provides a distributed platform for maintaining and preserving digital information.

Digital Object Repository

We have designed and implemented a protocol and software suite that provides the foundation for an extensible, scalable and secure architecture centered on Digital Objects (DOs) and object identifiers. Our Digital Object Repository (DOR) and the associated client API work in concert with the Handle System to provide a secure, reliable system for creating and interacting with Digital Objects at multiple levels of abstraction. At the most fundamental level, a Digital Object (DO) is an abstract entity, expressed as a sequence of bits or bytes, or a set of such sequences, that has a unique persistent identifier. DOs may be accessed from a Digital Object Repository (DOR) through which operations on the object can be performed. At higher levels of abstraction, a DO can represent specific data structures, provide services, or act as a proxy to external entities.

Persistent Identifiers

At the core of the Digital Object Architecture's design is the idea that all objects should have a persistent identifier that can be resolved to locate and, given permissions, interact with the object, regardless of changes in ownership, location, data format, security level or protocol used to access the object, i.e., such variables are not an intrinsic part of the identifier itself. Clients can interact with objects by establishing a connection to the DORS and submitting operation requests that contain the object identifier (ID) and the identifier for the operation they wish to invoke. Objects can contain data structures by defining, implementing and exposing operations that operate on those data structures. For example, to establish an array data structure an object should support operations to set and retrieve values in the array as well as the length of the array. Both the objects and the operations are represented as handles in order to prevent collisions in operation identifiers while allowing an open namespace for defining different operations.
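As a rough illustration of the array example above, the sketch below exposes an array through three operations that are looked up by identifier. The operation identifiers and class are hypothetical; in a real deployment the operation IDs would be registered handles, which is what keeps independently defined operations from colliding.

```python
from typing import Callable, Dict, List

# Hypothetical operation identifiers, standing in for registered handles.
OP_ARRAY_SET = "example.ops/array.set"
OP_ARRAY_GET = "example.ops/array.get"
OP_ARRAY_LENGTH = "example.ops/array.length"


class ArrayObject:
    """An object that behaves as an array only through its exposed operations."""

    def __init__(self, identifier: str) -> None:
        self.identifier = identifier
        self._values: List[str] = []
        self._operations: Dict[str, Callable[..., object]] = {
            OP_ARRAY_SET: self._set,
            OP_ARRAY_GET: self._get,
            OP_ARRAY_LENGTH: self._length,
        }

    def _set(self, index: int, value: str) -> None:
        # Grow the array with empty values until the index exists.
        while len(self._values) <= index:
            self._values.append("")
        self._values[index] = value

    def _get(self, index: int) -> str:
        return self._values[index]

    def _length(self) -> int:
        return len(self._values)

    def perform(self, operation_id: str, *params: object) -> object:
        # Dispatch purely on the operation identifier supplied by the client.
        try:
            handler = self._operations[operation_id]
        except KeyError:
            raise LookupError(f"unsupported operation: {operation_id}") from None
        return handler(*params)


playlist = ArrayObject("example/playlist-7")
playlist.perform(OP_ARRAY_SET, 2, "example/track-42")
```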

Using persistent identifiers for all interactions with objects means that DORS can serve as an access point for interacting with objects as opposed to being a container for objects. Indeed, the objects may be located in one or more storage systems in close proximity to the object server or at a remote site or even use storage managed by some other organization. The use of handles allows objects to be securely resolved, easily migrated or replicated to different locations or storage systems, transferred to different owners and accessed using multiple protocols. Using handles helps ensure that the access protocol, data type, owner and current location aren't part of the object's identifier.

Object Identifiers as Primary Keys

Because the Digital Object Architecture is based on the use of persistent identifiers, components can be designed to use the object IDs for all interactions with objects. Whether the object changes location or uses a different access protocol (say https vs. http) the object ID stays the same, and interactions with the object are consistent from the user or application point of view.

Knowing that the identifier for an object is persistent makes it safe to use those identifiers in resources that refer to the object. Metadata and citations can reference the object ID without worrying whether the current location/URL of the object will change, whether due to a leased DNS name expiring, or the server administrator switching from ASP to JSP or from one cloud service to another. RDF files and other resources that specify relationships between objects will be much more useful if the object IDs they reference have a higher probability of not breaking over time. Not having to rely on HTTP redirects to determine the new location (or even identifier) of an object results in a namespace that is much more stable and requires little or no added redirection to access objects or normalize identifiers.

Using handles as persistent identifiers creates a namespace for digital objects that is shared across all repository servers and is not sensitive to the context in which references occur. In a traditional HTTP-based repository one might see a reference such as "/pid:1234/metadata.xml", which references the object relative to the current repository. This will work within the context of that repository and provide the benefit that the resource may be resolved from either the file system or as an HTTP URL; however, the reference will break as soon as the resource containing the reference is moved to a different context. By ensuring that all references to digital objects are persistent and non-context-sensitive, we ensure that moving a resource to a different location preserves the references both to and from that resource. The shared namespace also facilitates references between objects in different repositories, allowing, for example, metadata for an object to reside in multiple locations.
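The difference between a context-relative reference and a persistent one can be shown with a small sketch. The host names and the identifier "example/metadata-1234" are invented, and the dictionary is only a stand-in for identifier resolution; it is not the Handle System API.

```python
from urllib.parse import urljoin

# A context-relative reference, as in the example above.
relative_ref = "/pid:1234/metadata.xml"

# The same reference resolves differently depending on where the referring
# resource lives, so it breaks when that resource moves to another repository.
at_repo_a = urljoin("http://repo-a.example.org/objects/index.html", relative_ref)
at_repo_b = urljoin("http://repo-b.example.org/archive/index.html", relative_ref)

# A persistent identifier is resolved through a shared lookup instead.
locations = {"example/metadata-1234": "http://repo-a.example.org/pid:1234/metadata.xml"}
persistent_ref = "example/metadata-1234"


def resolve(identifier: str) -> str:
    """Return the current location recorded for a persistent identifier."""
    return locations[identifier]


before_move = resolve(persistent_ref)
# The object migrates; only the resolution record changes, not the reference.
locations["example/metadata-1234"] = "http://repo-b.example.org/store/1234/metadata.xml"
after_move = resolve(persistent_ref)
```

Any document that cites persistent_ref keeps working after the move, whereas a document that embedded relative_ref would need to be rewritten.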

Many repositories currently use handles to resolve to the current URL for an object as well as to reference other digital objects, whether they are local or not. DORS goes one step further by establishing a protocol in which all operations are defined by their IDs and are performed on a digital object referenced by its ID. In other words, the digital object ID can be resolved to find the repository server through which the object may be accessed; moreover, once a client is communicating with that repository server, the object ID is used in all interactions with that object. This simple concept increases the prominence of a persistent global object ID and reduces the role of the 'current location' URLs that currently dominate most repositories.

Uniform Interfacing to Structured Data or Services

The DOR's role as an access point for performing abstract operations on objects allows it to act as an interface for nearly any kind of structured data collection. Arrays, mappings and lists are the simplest of data structures that can be represented by enabling a certain set of operations in a DOR.

CNRI has recently implemented and deployed DORS instances that allow access to information in selected network storage services, as well as access to files from a Subversion repository, all represented as digital objects. In each case, the DORS maps a set of standard operations (for example, getting attributes, setting attributes, storing data, getting data) to the native operations of the selected storage systems.
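One way to picture this mapping is an adapter interface with one method per standard operation and one implementation per storage system. The sketch below is an assumption-laden illustration, not the DORS storage API: the class and method names are hypothetical, and a plain directory and an in-memory dictionary stand in for the network storage services and Subversion repository mentioned above.

```python
import json
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict, Tuple
from urllib.parse import quote


class StorageAdapter(ABC):
    """Hypothetical interface: the four standard operations named in the text."""

    @abstractmethod
    def set_attribute(self, object_id: str, name: str, value: str) -> None: ...

    @abstractmethod
    def get_attribute(self, object_id: str, name: str) -> str: ...

    @abstractmethod
    def store_data(self, object_id: str, element: str, data: bytes) -> None: ...

    @abstractmethod
    def get_data(self, object_id: str, element: str) -> bytes: ...


class MemoryStorage(StorageAdapter):
    def __init__(self) -> None:
        self._attributes: Dict[Tuple[str, str], str] = {}
        self._data: Dict[Tuple[str, str], bytes] = {}

    def set_attribute(self, object_id: str, name: str, value: str) -> None:
        self._attributes[(object_id, name)] = value

    def get_attribute(self, object_id: str, name: str) -> str:
        return self._attributes[(object_id, name)]

    def store_data(self, object_id: str, element: str, data: bytes) -> None:
        self._data[(object_id, element)] = data

    def get_data(self, object_id: str, element: str) -> bytes:
        return self._data[(object_id, element)]


class DirectoryStorage(StorageAdapter):
    """Maps the same operations onto files: one directory per object."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)

    def _object_dir(self, object_id: str) -> Path:
        # Percent-encode the identifier so that "/" cannot escape the root.
        path = self.root / quote(object_id, safe="")
        path.mkdir(parents=True, exist_ok=True)
        return path

    def _load_attributes(self, object_id: str) -> Dict[str, str]:
        path = self._object_dir(object_id) / "attributes.json"
        return json.loads(path.read_text()) if path.exists() else {}

    def set_attribute(self, object_id: str, name: str, value: str) -> None:
        attributes = self._load_attributes(object_id)
        attributes[name] = value
        (self._object_dir(object_id) / "attributes.json").write_text(json.dumps(attributes))

    def get_attribute(self, object_id: str, name: str) -> str:
        return self._load_attributes(object_id)[name]

    def store_data(self, object_id: str, element: str, data: bytes) -> None:
        (self._object_dir(object_id) / ("element-" + quote(element, safe=""))).write_bytes(data)

    def get_data(self, object_id: str, element: str) -> bytes:
        return (self._object_dir(object_id) / ("element-" + quote(element, safe=""))).read_bytes()


def round_trip(storage: StorageAdapter) -> Tuple[str, bytes]:
    """Client code that is identical for every backing store."""
    storage.set_attribute("example/report", "title", "Annual report")
    storage.store_data("example/report", "content", b"report bytes")
    return storage.get_attribute("example/report", "title"), storage.get_data("example/report", "content")


memory_result = round_trip(MemoryStorage())
with tempfile.TemporaryDirectory() as tmp:
    directory_result = round_trip(DirectoryStorage(Path(tmp)))
```

The round_trip function never learns which storage system it is talking to, which is the interoperability property the text attributes to the standard operation set.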

With a standard DORS interface available at different repository servers and with a standard set of internal operations to map to different types of storage systems, we gain interoperability across many different types of information systems. This common interface is flexible enough to be applied to either static repositories such as those accessed through FTP or basic HTTP protocols, or services with more advanced features such as versioning, replication or flexible access control. It is also possible to make dynamically generated data accessible as digital objects, by either mapping the dynamic data into a standard DO data structure or by defining additional operations to access the dynamic data.

By providing a secure, efficient, flexible and non-proprietary interface to structured data, the DOR could do for heterogeneous repositories what TCP/IP did for previously non-interconnected computer networks. That is, the Digital Object Architecture and DOR implementation can bring interoperability to a currently disparate set of information architectures.

Uniform Object Structure

The structure of most information currently on the Internet consists of a few bits of metadata (e.g., filename, timestamp, MIME type) and a single sequence of bytes (e.g., a file). With the CNRI DORS we have implemented a set of operations that expose a digital object structure that is flexible, scalable and extensible. This data structure allows an object to have the following:

This relatively simple data structure allows for the simple case, but is sufficiently flexible and extensible to incorporate a wide variety of possible structures, such as an object with extensive metadata, or a single object which is available in a number of different formats. This object structure is general enough that existing services can easily map their information-access paradigm onto the structure, thus enhancing interoperability by providing a common interface across multiple and diverse information and storage systems. An example application of the DO data model is illustrated in Figure 1.

Associating Metadata with Digital Objects

A repository of structured data requires some mechanism to associate metadata with the relevant object. One approach is to embed metadata in each object as is done in HTML, PDF or Word files. Another is to keep a registry of metadata in a secondary database that is part of the access mechanism, such as a file that maps file extensions to MIME types, as is done in most web servers. Additional metadata often comes from sources such as file system timestamps of the data.

There are benefits to metadata that is essentially contained within each object. Keeping the metadata with the object provides the flexibility to move or copy objects to a different context without losing information in the process. An archive of objects that does not include metadata for those objects would not be a complete archive. However, many current systems maintain the metadata for their objects in separate databases or services, which raises the possibility that linkages between the different types of information can be severed and thus the connections will be lost. The structure of objects in the DOR provides a mechanism for easily associating metadata with each object. Meanwhile, the use of external metadata is also enhanced by the shared namespace of persistent object identifiers, which provides for greater scalability and portability in references between an object and any external metadata.

A kind of metadata that is often overlooked is that relating to access control. Who or what is allowed to access an object is often defined at the repository level or as a combination of object attributes and repository settings. Information as to who can access the object would be part of the metadata, as when a document is marked as 'classified' or 'top secret'. In the DOR, access rights are specific to each object, but can also default to the access control of the relevant repository. We believe this provides a balance between extremely fine-grained access control and ease of maintenance.
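The fallback from object-specific rights to repository defaults could be sketched as follows. The per-operation granularity, the class name and all identifiers are assumptions made for the example; the text only states that rights are specific to each object and can default to those of the repository.

```python
from typing import Dict, Set

# Hypothetical operation identifiers.
OP_READ = "example.ops/read"
OP_WRITE = "example.ops/write"


class AccessPolicy:
    """Object-level rules take precedence; otherwise repository defaults apply."""

    def __init__(self, repository_default: Dict[str, Set[str]]) -> None:
        self.repository_default = repository_default
        self.per_object: Dict[str, Dict[str, Set[str]]] = {}

    def set_object_rights(self, object_id: str, operation_id: str, callers: Set[str]) -> None:
        self.per_object.setdefault(object_id, {})[operation_id] = set(callers)

    def is_allowed(self, caller_id: str, operation_id: str, object_id: str) -> bool:
        object_rules = self.per_object.get(object_id, {})
        if operation_id in object_rules:
            allowed = object_rules[operation_id]
        else:
            # No rule on the object itself: fall back to the repository level.
            allowed = self.repository_default.get(operation_id, set())
        return caller_id in allowed


policy = AccessPolicy({
    OP_READ: {"example/alice", "example/bob"},
    OP_WRITE: {"example/alice"},
})
# Restrict one sensitive object without touching the repository defaults.
policy.set_object_rights("example/secret-report", OP_READ, {"example/alice"})
```

Most objects need no rules of their own, which keeps maintenance light, while individual objects can still be locked down when required.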

Digital Objects can accommodate existing file formats. It is possible to keep multiple representations of an object's data within each object along with descriptions of each representation. Each representation can be tagged with information such as the MIME type or the Uniform Type Identifier (UTI) for that representation. [6] This allows the DOR to work across typing systems, and potentially even to dynamically translate data formats on the fly, transparently to clients.
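A minimal sketch of type-tagged representations follows. The names are hypothetical and the selection rule (first acceptable type wins) is an assumption chosen for simplicity; it is meant only to show how a tag such as the MIME type lets a server pick a suitable representation for a client.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class Representation:
    mime_type: str
    data: bytes


def choose_representation(
    representations: Dict[str, Representation], accepted: Sequence[str]
) -> Optional[str]:
    """Return the name of the first representation whose type the client accepts."""
    for wanted in accepted:
        for name, representation in representations.items():
            if representation.mime_type == wanted:
                return name
    return None


# One object, three representations of the same content.
report = {
    "print": Representation("application/pdf", b"%PDF-1.4 ..."),
    "web": Representation("text/html", b"<html>...</html>"),
    "plain": Representation("text/plain", b"Annual report ..."),
}

chosen = choose_representation(report, ["image/png", "text/html", "application/pdf"])
```

A dynamic format translation, as suggested in the text, could be added at the point where no stored representation matches, but that step is not shown here.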

Strong Authentication Based on PKI and Handles

Strong authentication on the Internet is largely non-existent. Where it does exist, it might consist of sending a username and password over a TLS/SSL connection that is dependent upon the other side of the connection having a DNS name that was verified by one of at least 50 different certificate authorities, few if any of which are generally known to end users. This method of authentication often unnecessarily shares the user's password with the other side of the connection and either involves a password that is used on many other unrelated sites or requires a browser extension or separate application for managing all of the user's passwords. Technologies such as OpenID and client X.509 certificates improve matters somewhat, but have their own limitations. OpenID is intrinsically web-based and cannot be used for non-HTTP services (such as IM, Skype, email, file sharing, etc.). Although the technology can support it, average users simply do not have an easy path towards establishing identity using X.509 certificates, and, if they did, there are few services that would allow them to use that identity for access purposes.

A solution is needed that combines the ease of implementation and global namespace of OpenID with the (mostly) protocol-agnostic public-key-based security of X.509 or certificate-based identities. The Handle System provides a global namespace of securely resolvable identifiers that easily fulfills the role of a global persistent identity management system and PKI. By establishing an OpenID interface to handle-based public key (or secret key) authentication, digital object identifiers could be used to natively authenticate with DORs as well as any web site that accepts OpenID authentication, thus ensuring interoperability with the growing OpenID-enabled landscape as well as non-HTTP-based services.

Automatic Replication

The Digital Object Repository provides for automatic replication of data across multiple repository servers. This provides for redundancy of data storage and thus safety from data loss or lack of access due to hardware failure. Replication also ensures that the system scales to large numbers of clients. Since it is an intrinsic part of the system, there is no need for an external tool or special configuration on the part of either server administrators or clients to take advantage of replication.
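To make the idea of built-in mirroring concrete, the sketch below uses a simple change log that mirrors pull from each other. This is an assumed technique chosen for illustration, not a description of how DORS replication is actually implemented; the class and method names are hypothetical, and conflict resolution between concurrent writers is deliberately left out.

```python
from typing import Dict, List, Tuple

# (sequence number, object ID, element name, data)
Change = Tuple[int, str, str, bytes]


class MirrorServer:
    """Each server logs its own writes; peers pull entries they have not seen."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.store: Dict[Tuple[str, str], bytes] = {}
        self.log: List[Change] = []
        self.last_seen: Dict[str, int] = {}  # peer name -> last sequence applied

    def write(self, object_id: str, element: str, data: bytes) -> None:
        self.store[(object_id, element)] = data
        self.log.append((len(self.log) + 1, object_id, element, data))

    def changes_since(self, sequence: int) -> List[Change]:
        return [change for change in self.log if change[0] > sequence]

    def pull_from(self, peer: "MirrorServer") -> int:
        """Apply the peer's new changes locally and return how many were applied."""
        applied = 0
        for sequence, object_id, element, data in peer.changes_since(
            self.last_seen.get(peer.name, 0)
        ):
            # Pulled changes are not re-logged, so they are not echoed back.
            self.store[(object_id, element)] = data
            self.last_seen[peer.name] = sequence
            applied += 1
        return applied


primary = MirrorServer("dors-a")
mirror = MirrorServer("dors-b")
primary.write("example/report", "content", b"version 1")
first_sync = mirror.pull_from(primary)
second_sync = mirror.pull_from(primary)
```

In this sketch the synchronization runs between servers only, which matches the point made above that neither administrators nor clients need an external tool to benefit from replication.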

Digital Object Protocol

At its core, a Digital Object Repository Server is a network service through which clients can interact with DOs by performing operations on specified digital objects. A client, once connected to a DOR and (optionally) authenticated, can submit operation requests. Each operation request identifies the target object, the operation to be performed and the entity requesting it, each by its identifier, and carries the parameters and input stream for the operation.

Upon receiving an operation request, the DOR will determine if the user, or caller, has permission to perform the specified operation, authenticating the user if necessary. If authentication fails or the user does not have permission, then the request is refused. If the request succeeds, the repository performs the specified operation. The operation may be a standard operation on the common data structure (e.g. getting or setting the value of an attribute, or reading or writing the contents of a data element), or it may be an operation specific to this repository, e.g., a special computation of some sort that most repositories would not be capable of performing. The parameters specify features of the operation; and the input and output streams are a potentially bi-directional communication as the operation occurs. For example, the set-attribute operation takes the name and value for the attribute as parameters, and uses the output stream to report success or failure. The read-data-element operation takes the name of the data element as a parameter, and returns its contents over the output stream. This communication format is highly general and allows for a wide variety of potential uses.
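The request-handling flow described above can be sketched as follows. The request fields are inferred from the surrounding prose, and the operation identifiers, the response strings written to the output stream and the injected authorization callable are all invented for the example; none of this reflects the actual DOP wire format.

```python
import io
from dataclasses import dataclass, field
from typing import BinaryIO, Callable, Dict, Tuple

# Hypothetical operation identifiers, standing in for registered handles.
OP_SET_ATTRIBUTE = "example.ops/set-attribute"
OP_READ_ELEMENT = "example.ops/read-data-element"

# (caller ID, operation ID, object ID) -> allowed?
Authorizer = Callable[[str, str, str], bool]


@dataclass
class OperationRequest:
    caller_id: str
    object_id: str
    operation_id: str
    parameters: Dict[str, str] = field(default_factory=dict)


class Repository:
    def __init__(self, is_allowed: Authorizer) -> None:
        self.is_allowed = is_allowed
        self.attributes: Dict[Tuple[str, str], str] = {}
        self.elements: Dict[Tuple[str, str], bytes] = {}

    def handle(self, request: OperationRequest, output: BinaryIO) -> bool:
        # Step 1: refuse the request if the caller lacks permission.
        if not self.is_allowed(request.caller_id, request.operation_id, request.object_id):
            output.write(b"refused\n")
            return False
        params = request.parameters
        # Step 2: perform the operation named by its identifier.
        if request.operation_id == OP_SET_ATTRIBUTE:
            self.attributes[(request.object_id, params["name"])] = params["value"]
            output.write(b"ok\n")  # success is reported on the output stream
            return True
        if request.operation_id == OP_READ_ELEMENT:
            key = (request.object_id, params["name"])
            if key not in self.elements:
                output.write(b"no such element\n")
                return False
            output.write(self.elements[key])  # contents go over the output stream
            return True
        output.write(b"unsupported operation\n")
        return False


repo = Repository(lambda caller, operation, obj: caller == "example/alice")
repo.elements[("example/report", "content")] = b"report bytes"

out = io.BytesIO()
ok = repo.handle(
    OperationRequest("example/alice", "example/report", OP_READ_ELEMENT, {"name": "content"}),
    out,
)
```

A repository-specific operation would simply be one more identifier that this particular server knows how to dispatch, while other servers would answer it as unsupported.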

Clients can connect to a DORS through multiple interfaces. The native Digital Object Protocol is the most general, but requires an application that speaks the protocol. It is also possible to interact with a DORS through a simple HTTP interface, which can be used from a standard web browser. Finally, DOP messages can be sent over a secure TLS connection, including that provided by HTTPS. This last option offers much of the full strength of native DOP even through firewalls, and is widely supported by proxy servers.

Conclusion

The Digital Object Repository provides a flexible, scalable, reliable, and secure framework for maintaining and accessing digital information. By using persistent, globally unique identifiers, as provided by the Handle System, it offers enhanced flexibility in how objects are stored, moved, replicated, and referenced. The Handle System PKI provides strong authentication. Metadata encapsulation allows metadata to easily travel with an object whether it be to local copies, mirrors, or a new primary storage location. A uniform generic interface to objects in terms of performing operations, along with a uniform data structure, allows the repository to provide access to a wide variety of structured data and services. Replication is built into the system and provides redundancy and scalability.

Additional information and a freely available implementation of the DORS can be found at dorepository.org.

Acknowledgements

The authors would like to acknowledge the contributions of the following individuals to the design and development of the DORS: Dr. Robert Kahn for the original Digital Object Architecture concept, as well as significant input into the design of the current implementation; and Christophe Blanchi for the previous incarnation of the CNRI Repository, and input into the current version.

References

Arms, William Y., Christophe Blanchi, Edward A. Overly, "An Architecture for Information in Digital Libraries", D-Lib Magazine, February 1997. [doi:10.1045/february97-arms]

Blanchi, Christophe, Sandra Payette, Carl Lagoze, Edward A. Overly, "Interoperability for Digital Objects and Repositories: The Cornell/CNRI Experiments", D-Lib Magazine, May 1999. [doi:10.1045/may99-payette]
