D-Lib Magazine
July/August 1998
ISSN 1082-9873
Directions for Defense Digital Libraries
Ronald L. Larsen
US Defense Advanced Research Projects Agency (DARPA)
Arlington, Virginia USA
rlarsen@darpa.mil
Introduction
The role of the Department of Defense (DoD) has shifted dramatically in the 1990's. In the forty years prior to 1990, the DoD engaged in three domestic and seven foreign deployments. In contrast, the Department engaged in six domestic and 19 foreign deployments between 1990 and 1996, most of which were for peacetime activities such as humanitarian assistance and disaster relief. This represents a 16-fold increase in the rate of deployment. Whereas major deployments in the pre-1990 period occurred over substantial periods of time, allowing adequate time for preparation, the post-1990 deployments are characterized more by their rapid and transient nature. Analysts and planners face increasing demands to interpret rapidly unfolding situations and to construct alternative responses. The Department now speaks in terms of an "OODA" loop (Observe, Orient, Decide, Act) and strives to increase the rate at which this loop can be traversed, in order to be more responsive to the increasing frequency and pace of events. Much of the activity conducted within the OODA loop is itself information intensive, for placing the specific task at hand within a regional or situational context amplifies the need for a timely, accurate, and comprehensive world view. Digital library technologies are critical to addressing these information intensive, time-critical situations effectively.
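The 16-fold figure follows from the deployment counts above. A minimal sketch of the arithmetic, assuming the 1990 to 1996 period is counted as six years (an assumption; counting seven calendar years gives a ratio closer to 14):

```python
# Deployment counts come from the paragraph above; the period lengths
# (40 and 6 years) are assumptions about how the spans are counted.
pre_1990_deployments = 3 + 7      # domestic + foreign, forty years prior to 1990
post_1990_deployments = 6 + 19    # domestic + foreign, 1990 to 1996

pre_1990_rate = pre_1990_deployments / 40    # deployments per year
post_1990_rate = post_1990_deployments / 6   # deployments per year

rate_increase = post_1990_rate / pre_1990_rate
print(f"{pre_1990_rate:.2f} -> {post_1990_rate:.2f} deployments per year, "
      f"about {rate_increase:.1f}x")
```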
The 1990's has also witnessed the explosive growth of the World Wide Web (WWW), expanding dramatically the information which is accessible and potentially usable to analysts and strategists. But identifying, acquiring, and interpreting that information which is critical to understanding a particular event or situation in a timely manner, out of the sea of data of which the WWW is composed, poses enormous problems. This is a problem of major interest to the DoD. While an exponentially growing volume of information is potentially available to one with copious time and substantial diligence, DARPA's digital library research intends to make this information accessible and usable to those confronting difficult decisions without the luxury of time. This is, in large part, the motivation behind DARPA's Information Management (IM) program (http://www.darpa.mil/ito/research/im/index.html).
DARPA's Information Management Program
The IM program envisions digital libraries within a global information infrastructure, in which individuals and organizations can efficiently and effectively identify, assemble, correlate, manipulate, and disseminate information resources towards their problem-solving ends, regardless of the medium in which the information may exist. It makes no assumptions about commonality of language or discipline between problem solver and the information space, but, instead, provides tools to navigate and manipulate a multilingual, multidisciplinary world. It does assume, however, that task context, user values, and information provenance are critical elements in the information seeking process.
The IM program seeks major advances in acquiring and effectively using distributed information resources to provide the Defense analyst with a comprehensive ability to assess a rapidly changing situation. Its products are scalable, interoperable middleware:
- to manage exponentially growing information resources,
- to focus the analyst's attention on highly relevant materials,
- to organize information for rapid exploitation in unpredictable circumstances, and
- to provide superior ability to evaluate all aspects of a given situation to inform rapid decision processes appropriately.
The accelerating pace of world events and the expansion of the WWW together underscore the urgency of developing adaptive technology that can rapidly acquire, filter, organize, and manipulate large collections of multimedia and active digital objects across a global distributed network, so that analysts can investigate and assess time-critical, multifaceted situations. DoD's information management requirements typically push the current boundaries by two orders of magnitude in quantitative parameters such as numbers of coordinated repositories, sizes of collections, sizes of objects, and timeliness of response. In addition, qualitative improvement is required in the creation, correlation, and manipulation of information from multiple disciplines and in multiple languages.
Directions and Challenges
The DARPA IM program is narrower and more sharply defined than related federal research programs with which it collaborates (e.g., NSF's Knowledge and Distributed Intelligence [KDI] program (http://www.nsf.gov/pubs/1998/nsf9855/nsf9855.htm) and Digital Libraries Initiative, Phase 2 [DLI-2] program (http://www.nsf.gov/pubs/1998/nsf9863/nsf9863.htm)). The specific directions and challenges addressed by the IM program relate to the role of digital libraries in situation (or context-dependent) understanding for real-time planning and decision making. As such, the research that the program supports has implications for other contexts (e.g., medical, business) in which rapid but informed decisions based on multiple and dispersed sources of information are critical. The primary areas of interest, described in more detail below, are information retrieval, information space navigation and visualization, automated categorization and correlation, scalability, and interoperability.
Today's information retrieval systems rely largely on indexing the text of documents. While this can be effective in bounded domains in which the usage and definition of words are shared, performance suffers when materials from multiple disciplines are represented in the same collection, or when disparate acquisition or selection policies are active. Rather than being the exception, however, this is typically the rule (especially on the Web). Techniques for mapping between structured vocabularies begin to address this problem for disciplines which are fortunate enough to have a formalized vocabulary (http://www.sims.berkeley.edu/research/metadata/).
Techniques are needed that can look beyond the words, however, to the meaning and the concepts being expressed. Automated techniques for collection categorization are required, and some success has been recently reported using statistical approaches on large corpora (http://www.canis.uiuc.edu/interspace/).
Query languages and tools seek to identify materials in a given collection which are similar to the characteristics expressed in a given query. But these characteristics focus on the information artifact and have yet to consider non-bibliographic attributes which might serve to focus a search more tightly, such as types of individuals who have been reading specific material, the value they associated with it, and the paths they traversed to find it (http://scils.rutgers.edu/baa9709/).
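One way to picture a query that weighs non-bibliographic attributes is sketched below. It is only an illustration: the record layout, the `contextual_score` function, the role labels, and the equal 0.5 weights are all hypothetical and are not drawn from any of the cited projects.

```python
def contextual_score(item, query_terms, analyst_role):
    """Blend a bibliographic match with the value assigned by similar readers."""
    # Bibliographic part: fraction of query terms found in the item's keywords.
    text_match = len(query_terms & item["keywords"]) / len(query_terms)
    # Usage part: mean rating (0.0 to 1.0) from earlier readers in the same role.
    peer_ratings = [r["value"] for r in item["readings"] if r["role"] == analyst_role]
    peer_value = sum(peer_ratings) / len(peer_ratings) if peer_ratings else 0.0
    # Equal weighting is an arbitrary assumption made for this sketch.
    return 0.5 * text_match + 0.5 * peer_value


# Hypothetical collection: both items match the query terms equally well.
items = [
    {"title": "Flood insurance law", "keywords": {"flood", "relief"},
     "readings": [{"role": "legal", "value": 0.9}]},
    {"title": "Regional flood report", "keywords": {"flood", "relief"},
     "readings": [{"role": "logistics", "value": 0.9}]},
]
query = {"flood", "relief"}
ranked = sorted(items, key=lambda i: contextual_score(i, query, "logistics"),
                reverse=True)
print([i["title"] for i in ranked])
```

Because the two items are indistinguishable on keywords alone, the ordering here is decided entirely by who has read and valued each one.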
The navigational metaphor has become ubiquitous for information seeking in the network environment, but highly effective and facile tools for visualizing and navigating these complex information spaces remain to be fully developed. Incorporation of concept space and semantic category maps into visualization tools is a promising improvement. Concept spaces and semantic category maps are illustrative of statistically-based techniques to automatically analyze collections, to associate vocabulary with topics, to suggest bridging terms between clusters of documents, and to portray the clustered document space in a multidimensional, navigable space, enabling both high level abstraction and drill-down to specific documents. (The previously-identified URL for the Interspace project at the University of Illinois, [http://www.canis.uiuc.edu/interspace/], provides more detail on these techniques.) Additional approaches, including alternatives to the navigational metaphor, are needed.
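The statistical idea behind a concept space can be sketched with raw term co-occurrence counts. This is a simplified stand-in, not the Interspace project's actual algorithm; the function names and the sample documents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_concept_space(documents):
    """Count how often each pair of terms appears in the same document."""
    cooccurrence = Counter()
    for text in documents:
        terms = sorted(set(text.lower().split()))
        for a, b in combinations(terms, 2):
            cooccurrence[(a, b)] += 1
    return cooccurrence


def related_terms(cooccurrence, term, top_n=3):
    """Return the terms that most often co-occur with `term`."""
    scores = Counter()
    for (a, b), count in cooccurrence.items():
        if a == term:
            scores[b] += count
        elif b == term:
            scores[a] += count
    return [t for t, _ in scores.most_common(top_n)]


# Hypothetical mini-collection, one short document per string.
docs = [
    "flood relief airlift",
    "flood relief shelter",
    "airlift logistics fuel",
    "flood damage",
]
space = build_concept_space(docs)
print(related_terms(space, "flood", top_n=1))
```

In this toy collection "airlift" co-occurs with both "flood" and "logistics", which is the kind of term that could bridge two document clusters.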
Scalability and interoperability continue to be major challenges. The objective is to build scalable repository technology that supports the federation of thousands of repositories, presenting to the user a coherent collection consisting of millions of related items, and to do this rigorously across many disciplines. As the size and complexity of information objects increase, so also does the bandwidth required to utilize these objects. Real-time interactivity is required for the time-critical assessment of complex situations, pushing the bandwidth requirements yet higher. As this capability emerges, broadband interoperability becomes feasible, in which the user's inputs are no longer constrained to a few keystrokes, with the return channel carrying the high volume materials. Research is required to explore the nature of such broadband interoperability and the opportunities it brings to raise the effectiveness of the information user.
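Federation, in its simplest form, means sending one query to many repositories and presenting the merged results as a single collection. The sketch below assumes in-memory repositories and a naive term-overlap score; the repository names, items, and functions are hypothetical, and a real federation would also need to handle network failures, timeouts, and differing vocabularies.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq


def search_repository(repository, query):
    """Score each item in one repository by how many query terms it contains."""
    terms = set(query.lower().split())
    hits = []
    for title in repository["items"]:
        score = len(terms & set(title.lower().split()))
        if score:
            hits.append((score, repository["name"], title))
    return hits


def federated_search(repositories, query, limit=5):
    """Query every repository in parallel and merge the hits into one ranked list."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        per_repo = pool.map(lambda r: search_repository(r, query), repositories)
        merged = [hit for hits in per_repo for hit in hits]
    return heapq.nlargest(limit, merged)


# Two hypothetical repositories standing in for thousands.
repos = [
    {"name": "repo-a", "items": ["Flood maps of the river delta", "Harbor charts"]},
    {"name": "repo-b", "items": ["News reports on delta flood relief", "Crop statistics"]},
]
results = federated_search(repos, "delta flood")
for score, name, title in results:
    print(score, name, title)
```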
The analyst's attention has become the critical resource. The technological objective is to get the most out of the analyst's attention in the least amount of time by providing a powerful array of tools and automated facilities. The analyst's job, by definition, is to rapidly and effectively understand the full dimensions of an unfolding situation. Real-time correlation and manipulation of a broad array of information resources is critical to this task. Correlation of geographical information (e.g., maps and aerial imagery) with event-related materials (e.g., documents and news reports) is becoming increasingly important. The "GeoWorlds" project (http://www.isi.edu/geoworlds/) is integrating geographically-oriented digital library technology with scalable collection analysis to demonstrate this evolving approach to crisis management in a collaborative setting.
Deriving a comprehensive and accurate assessment of an international situation currently also draws heavily on the skills of translators and linguists. Translingual aids have the dual potential of enabling analysts to perform substantial filtering of multilingual information (thus relaxing their reliance on translators), while concurrently focussing the precious skills of translators on those tasks where their skills are essential.
It will come as little surprise to library professionals that the IM research agenda can be broadly structured into context- or task-independent repository-based functions and user- or usage-dependent analysis activities. This is, after all, largely the way libraries have traditionally divided their activities. The DARPA IM program further decomposes each of these two areas into three tracks:
Repository functions:
- Registration and security provides the registration, access controls, and rights management facilities required to support Defense-related applications in an open network environment.
- Classification and federation advances the capability to automate the acquisition, classification, and indexing of information resources among distributed repositories.
- Distributed service assurance addresses the vital concerns of matching user interaction styles and needs to system performance capabilities. This work also pushes the boundaries of interactivity over broadband networks.
Analysis activities:
- Semantic interoperability strives to extend the analyst's ability to interact with diverse information from distributed sources at the conceptual level.
- Translingual interaction builds on recent successes in machine translation to provide the information user the facility for identifying and evaluating the relevance and value of foreign language materials to a particular query, without assuming the user has any proficiency in the foreign language.
- Information visualization and filtering focusses on the development of improved tools for visualizing and navigating complex multidimensional information spaces, and on user-customizable value-oriented filters to rank information consistent with the context of the task being performed.
IM Program Objectives
Nine objectives quantitatively and qualitatively characterize the goals of the IM Program:
1. Advance the technologies supporting federated repositories from the present state, characterized by the Networked Computer Science Technical Reference Library (NCSTRL), in which more than a hundred independent repositories are federated using custom software, to a state where generic software is commonly available and supports thousands of distributed, federated repositories (http://www.ncstrl.org/).
2. Enlarge the effective collection capacity of a typical repository from thousands to millions of digital objects, including scalable indexing, cataloging, search, and retrieval.
3. Support digital objects as large as 100 megabytes, and as small as 100 bytes.
4. Reduce response times for interaction with digital objects to sub-second levels, striving for 100 milliseconds, where possible. High duplex bandwidth coupled with low response time provides the opportunity to explore new modes of interacting with information, referred to here as broadband interoperability, in which the traditional query could be reconceived to include a much richer user and task context.
5. Expand the user's functional capabilities to interact with networked information from the present play and display facilities of the WWW, to the correlate and manipulate requirements of a sophisticated information user engaged in network-based research and problem solving.
6. Raise the level of interoperability among users and information repositories from a high dependence on syntax, structure and word choice, to a greater involvement of semantics, context, and concepts.
7. Extend search and filtering beyond bibliographic criteria, to include contextual criteria relating to the task and the user.
8. Reduce language as a barrier to identifying and evaluating relevant information resources by providing translingual services for query and information extraction.
9. Advance the deployed base of general purpose content extraction beyond forms and tagged document structures to include extraction of summary information (e.g., topics) from semi-structured information sources.
Conclusion
Perhaps one of the biggest mixed blessings confronting the Defense analyst is the reality that information resources are growing exponentially in number and size, and that they are increasingly coupled to accurate situation understanding and mission effectiveness.
The analyst's attention has become the critical resource. The objective of the DARPA Information Management program is to provide the technological capability to get the most out of the analyst's attention in the least amount of time. The Information Management program strives to broadly increase the Defense analyst's ability to work with a diverse and distributed array of information resources in order to understand and develop an appropriate response to time-critical, crisis-driven situations. In short, the program envisions the rigor and organization normally associated with a research library to be virtually rendered and extended in the networked world of distributed information.
hdl:cnri.dlib/july98-larsen