D-Lib Magazine, June 1996
A full-fledged, end-to-end digital library project must include a target collection whose contents are to be preserved and maintained over time, a user test bed, intermediaries that help determine the layout and organization of the materials, and finally technologies to support these activities. PARC is participating as an industrial partner in several of the NSF/DARPA/NASA-sponsored digital library projects, working most closely with those of Stanford and UC Berkeley. These projects provide an opportunity for the exploration and demonstration of how support technologies can be used in the digital library context.
This article presents a sample of PARC digital library technologies and introduces a new PARC project, UPrint1.
Much of the technical research at PARC centers around the interface between paper and electronic documents. Since its invention nearly two thousand years ago, paper has served as one of our primary communications media. Its inherent physical properties make it easy to use, transport, and store, and cheap to manufacture. However, much research is still needed to build bridges between paper and the digital world. The final report for a recent digital library workshop [Lynch and Garcia-Molina] begins by noting that ``Digital libraries will, for the foreseeable future need to span both print and digital materials.''
The problem of capturing and converting scanned paper materials has received considerable attention at PARC, from a number of perspectives. We will illustrate the range of activities with three examples: document image decoding, document image summarization, and the Paper User Interface and Protofoil.
Document Image Decoding

Consider an entry in a scanned table of California dams and the corresponding SQL code for a relational database entry:
INSERT INTO dams
  ( dam_name, dwr_num, owner, county,
    stream, loc_section, loc_township, loc_range, loc_base_mer,
    national_id, dam_type, stor_capacity, drainage_area, reserv_area,
    parapet_code, parapet_ht, crest_elev, crest_len, height,
    total_frbd, oper_frbd, crest_width, volume, year_comp,
    lat, long )
VALUES
  ( 'FOLSOM', '9000-148', 'U S BUREAU OF RECLAMATION',
    'SACRAMENTO', 'AMERICAN RIVER', 24, '10N', '7E', 'MD',
    'CA10148', 'GRAV', (1010000,1245834), (1885.0,4882.15),
    (11450,4634), NULL, (NULL,NULL), (480.5,146.5), (26670,8129),
    (275,84), (62.5,19.1), (5.1,1.6), (36,11), (13970000,10680763),
    '1956-01-01', (38.0,42.5), (121.0,9.4) );

It is clear that the names of the database record fields and the representations of the fields in the image are highly specific to this particular document. A body of information about fish species, for example, would have a completely different logical and typographic structure.
The motivation for creating the type of semantic markup illustrated above, which goes beyond simple document structure tags such as titles, sections, and paragraphs, is to make the content of scanned documents directly usable by applications such as the relational database shown above.
The Document Image Decoding (DID) approach to creating document-specific tags is to support the automatic generation of custom recognizers from declarative specifications, in a manner analogous to the way LEX and YACC generate character-string parsers from language grammars. The overall vision is summarized below.
The input to a decoder generator is a document model that describes aspects of the document content and appearance relevant to extracting the desired information. Typical elements of this model include specifications of the language, information structure, layout, character shape and degradation processes. The decoder generator converts the model into a specialized document recognition procedure that implements maximum a posteriori (MAP) image decoding with respect to the model. In the current implementation, the specialized decoders are in-line C programs that are compiled and linked with a support library.
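To make the decoder-generator idea concrete, here is a minimal, hypothetical Python sketch (DID itself emits compiled C decoders that operate on page images, not on text): a declarative field specification is "compiled" into a specialized extraction function, loosely analogous to the way a DID document model is compiled into a decoder. The model format, field names, and input below are invented for illustration.

import re

# Hypothetical declarative model: each field names a pattern that locates
# its value in already-transcribed text. A real DID model instead describes
# language, layout, character shapes, and image degradation.
DAM_TABLE_MODEL = {
    "dam_name":  r"^Name:\s+(?P<value>.+)$",
    "county":    r"^County:\s+(?P<value>.+)$",
    "year_comp": r"^Completed:\s+(?P<value>\d{4})$",
}

def generate_decoder(model):
    # "Compile" the declarative model into a specialized recognizer,
    # analogous to a decoder generator emitting C from a document model.
    compiled = {field: re.compile(pat, re.MULTILINE) for field, pat in model.items()}
    def decode(text):
        record = {}
        for field, pattern in compiled.items():
            match = pattern.search(text)
            if match:
                record[field] = match.group("value")
        return record
    return decode

decode_dam_entry = generate_decoder(DAM_TABLE_MODEL)
print(decode_dam_entry("Name: FOLSOM\nCounty: SACRAMENTO\nCompleted: 1956"))
# -> {'dam_name': 'FOLSOM', 'county': 'SACRAMENTO', 'year_comp': '1956'}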
The DID project is actively involved
with the UC Berkeley Environmental
Digital Library Project and has been developing specialized decoders
for documents in the project testbed. These advanced structured document
examples include a table of
California Dams and descriptions of Delta Fish
Species. Additional information about the application of DID to the
digital library can be found in [Kopec].
Document Image Summarization
The Document Image Summarization (DIMSUM) activity is developing methods
for creating summaries of scanned documents without performing optical
character recognition (OCR). DIMSUM is motivated by the observation
that performing OCR is often much more computationally demanding than preparing a summary from the recognized text; avoiding OCR thus eliminates the dominant cost of summarizing scanned documents.
The DIMSUM strategy is to directly extract images of sentences and phrases
that together communicate a sense of the document. To identify a set of
summarizing excerpts, word boxes are extracted from the images and then
word-box equivalence classes are formed. Based on word proximity and
statistics on word frequency within the document, a set of summarizing
excerpts is constructed. For example, DIMSUM has been used to produce a five-sentence summary of a three-page article on rocket engine development.
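The following toy Python sketch conveys the flavor of the excerpt-selection step, using word strings in place of the word-box equivalence classes that DIMSUM derives from page images (DIMSUM itself never recognizes the text); the scoring heuristic, stop list, and sample text are all invented for illustration:

from collections import Counter
import re

def summarize(text, n_sentences=5):
    # Score each sentence by the document frequency of its content words and
    # return the top scorers in their original order. DIMSUM performs the
    # analogous computation on word-box equivalence classes in page images.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    stop = {"the", "a", "an", "of", "to", "and", "in", "is", "are", "that", "for"}
    freq = Counter(w for w in re.findall(r"[a-z]+", text.lower()) if w not in stop)
    def score(sentence):
        tokens = [t for t in re.findall(r"[a-z]+", sentence.lower()) if t not in stop]
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)
    top = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(top[:n_sentences])]

print(summarize("Rockets burn fuel quickly. The fuel tank dwarfs the payload. "
                "Engine tests are loud. Fuel costs dominate engine tests.", 2))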
One application of DIMSUM is the generation of single-page summary sheets that can be used later for retrieval of the
full document. Additional information can be found in [Chen and
Bloomberg].
Paper User Interface and Protofoil
The Paper User Interface and Protofoil
are two approaches to providing interfaces that
bridge the paper and digital worlds. The Paper User Interface [Johnson et al.] moves the user interface beyond the workstation and onto paper itself. This
technology uses paper forms, often placed as cover sheets in
front of paper documents, to invoke electronic behavior or to control
how documents are processed during and after scanning. Because paper
forms eliminate the necessity of using a workstation interface for
many tasks, users can perform a wide range of operations remotely,
often in a manner that decouples their time from the system's
activity. For example, a user on a plane trip can fill out forms for processing documents, then fax the stack of documents (separated by forms) to a paper server at home for storage and/or distribution to others.
To support exploration of the paper user interface, a paper infrastructure was developed that provides mechanisms for defining and managing forms, for processing incoming streams of digital images from scanners and faxes, and for invoking document services provided in the electronic world.
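As a rough illustration of the dispatch portion of such an infrastructure, the following Python sketch routes batches of scanned pages to document services according to the cover-sheet form that precedes them. The form decoding itself (finding marks on the scanned form image) is stubbed out, and all names and services are invented:

def store(pages):
    print("storing %d page(s)" % len(pages))

def distribute(pages):
    print("distributing %d page(s)" % len(pages))

SERVICES = {"STORE": store, "DISTRIBUTE": distribute}

def process_stream(scanned_items):
    # scanned_items arrive in order as ("form", action) or ("page", image)
    # tuples; each form governs the pages that follow it, up to the next form.
    action, pages = None, []
    for kind, payload in scanned_items + [("form", None)]:  # sentinel flushes the last batch
        if kind == "form":
            if action is not None and pages:
                SERVICES[action](pages)
            action, pages = payload, []
        else:
            pages.append(payload)

process_stream([("form", "STORE"), ("page", "p1"), ("page", "p2"),
                ("form", "DISTRIBUTE"), ("page", "p3")])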
The second interface for linking paper and digital documents is
Protofoil [Rao et al. 94], a system for
storing, retrieving, and manipulating scanned paper documents using an
electronic filing cabinet metaphor. Protofoil is intended to support
the filing needs of the individual information worker and has been
deployed and evaluated at a lawyer's office as part of an extensive
ethnographically-motivated design study. [See D-Lib Magazine,
May, 1996 for more information about PARC work practices research
related to digital libraries.] Protofoil allows users to store and
retrieve document images and to invoke various document services on
the stored documents. The system consists of three major components:
(i) software for scanning documents and interpreting paper user interface forms as instructions (described above), (ii) a database for
storing and archiving the document images and associated descriptions
or auxiliary renderings, and (iii) a graphical user interface for
retrieving and manipulating stored documents. Protofoil integrates
components from a number of other PARC projects including the Text
Database (TDB) [Cutting et al. 91]
statistical content analysis engine.
Information Access and Visualization
The emergence of digital libraries promises to provide massive amounts
of diverse types of digital information. This brings to the forefront
the necessity of building new, more powerful user workspaces for
finding and using information in this increasingly rich and varied
world.
Researchers at PARC have developed a variety of theoretical and computational tools for search over and retrieval from large collections of online documents (usually natural language texts, but increasingly multimedia), as well as for helping the user understand and navigate the contents of the collections.
Some of the work in this area has recently been discussed in a digital library context in another publication [Rao et al. 95]. The discussion below focuses on a somewhat different subset of this work, at different levels of detail. The main areas are navigation of retrieval results, question answering, information visualization, and analytical and empirical characterization of information-intensive work.
If a user of an information access system issues a query that retrieves a very large number of documents, that user cannot be expected to have the time and patience to read through a large set of titles. Instead, the information access system should provide the user with tools to facilitate the assimilation of the results. One possibility is to help the user reformulate the query by suggesting alternative terms. Another possibility, explored in the examples below, is to provide tools to aid the user in the navigation of the retrieval results.
Scatter/Gather uses the metaphor of a dynamic table-of-contents to help the user navigate a large collection of documents. Initially the system uses document clustering to automatically scatter the collection into a small number of coherent document groups, and presents short summaries of the groups to the user. Based on these summaries, the user selects one or more of the groups for further study. The selected groups are gathered, or unioned, together to form a subcollection. The system then reapplies clustering to scatter the new subcollection into a new set of document groups, and these in turn are presented to the user. With each successive iteration the groups become smaller, and therefore more detailed.
The document clustering algorithm is optimized for speed, to encourage interaction, rather than to guarantee accuracy. The current system uses a linear-time clustering algorithm for ad hoc document collections and a constant-time algorithm for stable, preprocessed collections. The linear-time algorithm can organize 5000 short documents in under one minute on a SPARC20 workstation.
The cluster summaries are designed to impart general topical information. Clusters are summarized by presenting their size, a set of topical terms, and a set of typical titles. The topical terms are extracted from the document profiles, or weighted bag-of-words representations, of the documents included in the cluster and are intended to reflect the terms of greatest importance in that cluster. The typical titles are the titles of documents closest to the cluster centroid.
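The following toy Python sketch illustrates one Scatter step: documents are grouped by a simple k-means over bag-of-words profiles, and each group is summarized by its size and most frequent terms. The real system's linear- and constant-time algorithms, term weighting, and summaries are considerably more sophisticated; the data and parameters below are invented for illustration.

from collections import Counter
import math, random

def profile(doc):
    # Bag-of-words document profile (raw term counts stand in for weights).
    return Counter(doc.lower().split())

def cosine(p, q):
    dot = sum(p[t] * q[t] for t in p if t in q)
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def scatter(docs, k, iters=5):
    # One Scatter step: cluster docs into k groups; summarize each by size
    # and topical terms. Gather = union selected groups and scatter again.
    profiles = [profile(d) for d in docs]
    centroids = [profiles[i] for i in random.sample(range(len(docs)), k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for i, p in enumerate(profiles):
            best = max(range(k), key=lambda c: cosine(p, centroids[c]))
            groups[best].append(i)
        for c, members in enumerate(groups):
            if members:
                merged = Counter()
                for i in members:
                    merged.update(profiles[i])
                centroids[c] = merged
    return [(len(g), [t for t, _ in centroids[c].most_common(5)], g)
            for c, g in enumerate(groups)]

random.seed(0)
docs = ["bank fraud indictment", "bank loan rates", "delta fish species", "delta fish habitat"]
for size, terms, members in scatter(docs, 2):
    print(size, terms, members)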
Here we demonstrate the use of Scatter/Gather on the TIPSTER collection of over 1 million newswire, newspaper, magazine and government articles, dating mainly from the late 1980's. We also make use of one of the TREC queries and its associated relevance judgments. For this query, the task is to find all documents that discuss the following abbreviated version of Topic 87: Criminal Actions Against Officers of Failed Financial Institutions.
We formulated a query containing the terms bank financial institution failed criminal officer indictment and instructed the system to retrieve the 500 top-ranked documents according to a standard weighting algorithm; these were then gathered into five clusters, whose sizes and topical terms we examined.
Cluster 4 stands out for the purposes of the query in that it contains terms pertaining to fraud, investigation, lawyers, and courts. Since we know the system has retrieved documents that pertain to financial institutions, we can assume that the legal terms occur in the context of financial documents. It turns out that out of these 500 retrieved documents, only 21 had been judged relevant to the query by the TREC judges, and 15 of these relevant documents appear in Cluster 4. The user can now select this cluster (or several clusters) and re-scatter it to see its contents in more detail, or access the documents directly by clicking on their titles. See [Hearst and Pedersen], [Cutting et al. 92b] and [Cutting et al. 93] for more information.
TileBars [Hearst] provides a compact visualization of the distribution of query terms, grouped into term sets, across the text segments of each retrieved document. Consider the TileBars for the most relevant-looking cluster from the results of a Scatter/Gather. The ranking reflects criteria specific to this interface: documents are ranked first by overlap (how many segments have hits for all term sets), second by total number of hits, and third by the ranking from a similarity search. The number shown is the original similarity-search ranking.
Each large rectangle indicates a document, and each square within the document represents a coherent text segment. The darker the segment, the more frequent the term (white indicates 0 hits; black indicates 8 or more; the frequencies of all the terms within a term set are added together). The top row of each rectangle corresponds to the hits for Term Set 1, the middle row to hits of Term Set 2, and the bottom row to hits of Term Set 3. The first column of each rectangle corresponds to the first segment of the document, the second column to the second segment, and so on.
In this example we can see at a glance that all three topics are discussed in at least one segment in the first 16 documents, but that the last four documents discuss only Xerox and research, with no discussion of business or technical transfer. We can also see the relative lengths of the documents and how strongly the three topics overlap within them. The score next to each title shows what a standard ranking algorithm would produce.
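To make the encoding concrete, here is a small Python sketch that renders a TileBar-like display in ASCII, assuming per-segment hit counts have already been computed. The shading characters (six levels standing in for the grayscale ramp) and the data are invented for illustration:

SHADES = " .:oO@"  # ' ' = 0 hits; '@' = 5 or more (the real display saturates at 8)

def tilebar(hits_per_termset):
    # hits_per_termset: one row per term set, one column per text segment;
    # each entry is the summed frequency of all terms in that term set.
    rows = []
    for row in hits_per_termset:
        rows.append("".join(SHADES[min(h, len(SHADES) - 1)] for h in row))
    return "\n".join(rows)

# A document with six segments and a query with three term sets.
print(tilebar([[2, 3, 0, 0, 1, 0],    # Term Set 1
               [0, 4, 5, 0, 0, 0],    # Term Set 2
               [0, 0, 1, 2, 0, 0]]))  # Term Set 3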
A version of TileBars has been implemented in Java as part of the UC Berkeley Digital Library project and can be experimented with at http://elib.cs.berkeley.edu/tilebars.html.
For example, in answer to the question What New York City borough was the setting for Saturday Night Fever?, the MURAX question-answering system [Kupiec] responds as follows:
Article: Travolta, John
Article: Brooklyn
Evidence is knitted together from noun phrases taken from two different articles. Brooklyn was suggested as the answer because it appears in conjunction with the phrase Saturday Night Fever in the Travolta article, and with borough and New York City in the Brooklyn article.
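A toy Python sketch of this evidence-knitting idea follows; the miniature corpus, candidate set, and substring-based scoring are all invented for illustration, and MURAX's actual linguistic analysis of noun phrases is far richer:

ARTICLES = {
    "Travolta, John": "starred in the film Saturday Night Fever, set in Brooklyn",
    "Brooklyn": "Brooklyn is a borough of New York City",
}
QUESTION_PHRASES = ["Saturday Night Fever", "borough", "New York City"]
CANDIDATES = ["Brooklyn", "Manhattan"]

def evidence(candidate):
    # Count question phrases that co-occur with the candidate, article by article.
    score, support = 0, []
    for title, text in ARTICLES.items():
        if candidate != title and candidate not in text:
            continue
        for phrase in QUESTION_PHRASES:
            if phrase in text:
                score += 1
                support.append((title, phrase))
    return score, support

for c in CANDIDATES:
    print(c, evidence(c))
# Brooklyn accumulates support from both articles; Manhattan from neither.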
Question answering of this type should play an increasingly important
role as more and more information becomes available electronically.
Information Visualization

The Information Visualization project has for many years been exploring the application of interactive graphics and animation technology to the problem of visualizing and making sense of large information sets. This work is based on the premise that many complex information tasks can be simplified by offloading complex cognitive tasks onto the human perceptual system.
The Information Visualizer (IV) [Robertson
et al.] is based on a 3D Rooms metaphor to establish a
large workspace containing multiple task areas. In addition, other
novel building blocks were developed to support a new user interface
paradigm.
The IV architecture has enabled the development of a set of animated information visualizations for hierarchical information, including the Cone-Tree, the Perspective Wall, and the Table Lens.
Many of these techniques use the display non-uniformly, assigning a large number of pixels to a focus area while retaining contextual cues in less detail. These techniques allow the display of a larger number of items than could previously be put on the screen. For example, the top 600 nodes of the Xerox organization chart can be seen all at the same time, even though the chart otherwise requires an 80-page paper document.
The Butterfly Citation Browser

An application of IV of particular relevance to digital libraries is Butterfly, an Information Visualizer application for collections of scholarly papers, which have rich patterns for visualization including people, time, place, and citation relationships [Mackinlay et al.]. Butterfly
allows the user to quickly navigate citation relationships among
scholarly papers (available from DIALOG's Science Citation databases)
by interacting with virtual 3D objects representing papers and their
citation relationships. Network information often involves slow access that conflicts with the use of highly interactive information visualization. Butterfly addresses this problem by integrating search, browsing, and access management via four techniques described in [Mackinlay et al.].
Experience with the Butterfly implementation has allowed the proposal of a general information access approach, called Organic User Interfaces for Information Access, in which a virtual landscape grows under user control as information is accessed automatically.

Analytical and Empirical Characterization of Information-Intensive Work

Information retrieval has often been studied as if it were a self-contained problem (e.g., library automation). Yet from the user's point of view, information retrieval is almost always part of some larger process of information use. Researchers at PARC are engaged in a set of empirical and theoretical studies to characterize information-access-intensive work in a way that leads to the design and evaluation of digital library and related systems.
There are currently three thrusts. First is the characterization
of accessible information in terms of its cost structure [Card and Pirolli]. Information retrieval can be
thought of as just the rearrangement of this cost structure. Second is the application of concepts from optimal foraging theory in biology to information access. Based on cost and benefit parameters, this allows us to understand the ecology of various information strategies, such as direct retrieval vs. automatic dissemination [Pirolli and Card]. Third is the
development of a theory of sensemaking
[Russell et al.], which articulates the methods by which raw
information is combined to produce new information products and
insights. All three of these components are being explored in the
context of the World Wide Web and information access.
Middleware

Middleware refers to software and systems architecture that helps knit together the various pieces of a distributed information delivery system, including information repositories, search and retrieval tools, and client-side workspace tools such as visualizations.
Important emerging issues are the need to develop a network-oriented
object-oriented mechanism for system interoperability, and interfaces
that make use of this mechanism to expose the capabilities of the
underlying sub-systems.
PARC work in interface and middleware development for digital library integration has focused on three components: the Inter-Language Unification System (ILU), the Document Management Alliance (DMA), and the Metro research project.
The Inter-Language Unification System (ILU)

There are currently two primary distributed object-oriented programming paradigms available as open standards in the commercial market: Microsoft's Distributed Component Object Model (DCOM) and OMG's Common Object Request Broker Architecture (CORBA). While
there are many differences between the two, and debates about their
relative merits are intense, there is no question that both are
powerful enough to serve as the infrastructure on which to build
digital library systems. The Inter-Language Unification System
(ILU) is an implementation of the CORBA standard that was
developed at PARC and is distributed freely over the Internet. (See
ftp://ftp.parc.xerox.com/pub/ilu/ilu.html.)
ILU is a multi-language object interface system designed to provide
interconnection between components of a distributed system. The object
interfaces provided by ILU hide implementation distinctions between
different languages, between different address spaces, and between
operating systems. For example, ILU allows modules
written in Common Lisp, C++, and Python to be combined.
It also automatically provides networking to interconnect parts of the
system running on different machines, thereby relieving application
programmers of the need to write networking code. Finally, it can be
used to define and document interfaces between the modules of
non-distributed and distributed programs using ILU's Interface
Specification Language. Currently several of the NSF-sponsored
Digital Library Initiative projects are making use of ILU.
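As a conceptual illustration only (this is not ILU's API, and ILU generates its stubs from ISL interface descriptions rather than by hand), the following Python sketch shows the surrogate-object pattern such systems provide: a client calls what looks like a local object, and the call is carried across a process boundary to the real object. The wire format, class names, and port are invented.

import json, socket, threading, time

class Calculator:                       # the real object, living in the "server" process
    def add(self, a, b):
        return a + b

def serve_once(obj, port):
    # Accept one connection, decode one request, invoke the method, reply.
    srv = socket.socket()
    srv.bind(("localhost", port)); srv.listen(1)
    conn, _ = srv.accept()
    request = json.loads(conn.recv(4096))
    result = getattr(obj, request["method"])(*request["args"])
    conn.sendall(json.dumps(result).encode())
    conn.close(); srv.close()

class Surrogate:
    # Client-side stub: method calls become network requests. ILU generates
    # such stubs (for C, C++, Common Lisp, Python, ...) from one interface spec.
    def __init__(self, port):
        self.port = port
    def __getattr__(self, method):
        def call(*args):
            s = socket.create_connection(("localhost", self.port))
            s.sendall(json.dumps({"method": method, "args": args}).encode())
            reply = json.loads(s.recv(4096)); s.close()
            return reply
        return call

threading.Thread(target=serve_once, args=(Calculator(), 9090), daemon=True).start()
time.sleep(0.2)                         # let the server start listening
print(Surrogate(9090).add(2, 3))        # prints 5; the call crossed the "network"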
The Document Management Alliance (DMA)

Xerox joined with Novell in May 1994 to propose a standard framework for distributed document repositories called Document Enabled Networking (DEN). This standard was both competitive with and complementary to the standard proposed by the Shamrock Document Management Coalition, so in April
1995 they were merged to produce the Document Management Alliance
(DMA), a standards group
under the auspices of the Association for Information and Image
Management (AIIM). The proposed DMA
standard specifies an object system and interfaces for
repository-based document management, including storage and retrieval
of documents of various types, version management services, catalog
information and search services, and integration of services over
multiple repositories. All objects in DMA are self-describing using
mechanisms built into the standard.
The DMA standard is currently in draft form, with a version 1 release
expected by third quarter of 1996. PARC and other Xerox
researchers are actively involved with DMA standards and development
activities. In particular, the 1996 AIIM trade show in Chicago
featured a demonstration of DMA-based interoperability between a
number of client and repository systems, one of which was developed by
Xerox researchers in Webster, New York. In addition, Xerox research
in El Segundo, California, has been involved in porting the XSoft DMA
middleware (available free to DMA members) from Win32 to Unix, and is
currently working on extending the middleware to support distributed
repositories.
A DMA-related effort at PARC is the Metro research project,
which provides ILU-based interfaces and middleware that allow the
integration of multiple search services over DMA-style repositories.
The integrated services may provide quite different analyses of the
underlying repository documents (e.g., image vs. text retrieval);
Metro allows each of the services to be developed independently and
then integrates their operation and search interfaces.
The Metro implementation
allows for arbitrary distribution of the service and repository system
elements, as well as for optimizations in cases where they are
co-resident. In the future, Metro functionality is intended to include
query persistence (result set updates as repository contents change).
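The following is a hedged sketch of the kind of integration Metro enables, with an invented two-document repository and trivial scoring functions standing in for independently developed search services (real Metro services are separate ILU objects over DMA-style repositories):

# Two "services" analyze different aspects of the same repository documents.
REPOSITORY = {
    "doc1": {"text": "dam safety report", "caption": "photo of Folsom dam"},
    "doc2": {"text": "delta fish species", "caption": "smelt illustration"},
}

def text_search(query):
    return {d: sum(w in f["text"] for w in query.split()) for d, f in REPOSITORY.items()}

def caption_search(query):
    return {d: sum(w in f["caption"] for w in query.split()) for d, f in REPOSITORY.items()}

def federated(query, services):
    # Fan the query out to each service and merge per-service scores
    # into one ranking (a simple sum here).
    totals = {}
    for service in services:
        for doc, score in service(query).items():
            totals[doc] = totals.get(doc, 0) + score
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

print(federated("dam", [text_search, caption_search]))  # doc1 ranks first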
UPrint1

Currently when books are published, a large number of them are printed all at once and distributed to bookstores and other merchants, with the remainder stored in warehouses. If publishers greatly overestimate the number of books desired, they must absorb the expense
of unsold books. On the other hand, print runs using standard
printing equipment are expensive unless done in bulk.
The UPrint1 project is exploring whether it is practical to
print books one at a time, at the time the customer wishes to purchase
them [Bruce et al.]. (This assumes that the
trend toward digitizing printed material will not make books obsolete,
but rather will make bookstores obsolete. Printed books will
stay around because paper has high contrast and resolution and books
don't require batteries or power cords.) This may be more practical
today than in the past due to the recent availability of high-end,
high-volume publishing systems such as the Xerox DocuTech Network
Publisher. If this effort does prove to be practical, it may
dramatically change the book publishing industry. For instance, books
would not need to go out-of-print, and could be customized when
printed to have large type, use a special kind of paper, and so on.
As another example, small-run, specialized books, such as academic
monographs, should become less expensive to produce.
The UPrint1 project makes use of an online, or virtual, bookstore.
After making a selection at the virtual bookstore, one copy is printed
for the customer, either at home if an appropriate printing device is
available, or at a print shop.
Some advantages of a virtual bookstore over a physical one are: its contents are searchable, browsing can be done in multiple ways rather than being limited to a static shelf arrangement, a larger
selection of books can be available, the bookstore is accessible
anytime from anywhere, and it can provide auto-recommendation from an
appropriate community of experts. On the other hand, a physical
bookstore provides a place for people to meet and talk, allows for the
reading and browsing of physical copies of the books, and often has a
coffee shop next door.
A virtual bookstore without a UPrint1 service is no different from
existing web bookstores in which users find what they want and
publishers fill their orders from a warehouse and ship the physical
books. Thus books that are out of print or temporarily sold out are unavailable without a mechanism like UPrint1. Furthermore, UPrint1 avoids the lag time between ordering and receiving the book.
There are many aspects to making UPrint1 a reality. Currently the
project is focusing on the question of how to entice people to visit
an online or virtual bookstore that has no physical books. One idea
is to give the visitor to a virtual bookstore a recommendation service
they can't get at a physical bookstore. As a sample scenario, imagine
you visit the bookstore and search for books on the Java programming
language. You will get many hits: how do you know which book you
should buy? If you have previously indicated some of your favorite
computer books, the system can match you with readers of similar
taste, and suggest a Java book that these readers found helpful.
Work on this aspect of the project is currently ongoing.
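A minimal sketch of this matching idea follows, using a simple similar-reader heuristic; the ratings, titles, and scoring function are invented for illustration, and the project's actual recommendation method is not described here:

# Each reader's ratings of computer books, on a 1-5 scale (invented data).
RATINGS = {
    "ann":  {"K&R C": 5, "Java in a Nutshell": 4, "Design Patterns": 5},
    "bob":  {"K&R C": 5, "Design Patterns": 4, "Thinking in Java": 5},
    "carl": {"Gardening Basics": 5, "Java in a Nutshell": 1},
}

def similarity(a, b):
    # Crude agreement score over commonly rated books (0 if none in common).
    common = set(a) & set(b)
    return sum(4 - abs(a[t] - b[t]) for t in common) if common else 0

def recommend(visitor_ratings, k=1):
    # Find the k most similar readers and suggest books they rated highly
    # that the visitor has not yet rated.
    scored = sorted(RATINGS.items(),
                    key=lambda kv: similarity(visitor_ratings, kv[1]),
                    reverse=True)
    picks = {}
    for _, neighbor in scored[:k]:
        for title, rating in neighbor.items():
            if title not in visitor_ratings and rating >= 4:
                picks[title] = max(picks.get(title, 0), rating)
    return sorted(picks, key=picks.get, reverse=True)

# A visitor who liked K&R C and Design Patterns gets a Java book suggestion.
print(recommend({"K&R C": 5, "Design Patterns": 5}))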
Summary

This article has surveyed some of the recent activities at PARC related to digital libraries, focusing on document capture, information access and visualization, and middleware. The set of specific projects described here is representative of the range of ongoing activities, but is by no means an exhaustive list of all relevant PARC projects. Missing, for example, are significant activities in web-based authoring, high-resolution displays, AAA (authentication, authorization, accounting), multilingual technology, and electronic commerce-based document services. Information about some of these can be found via the PARC home page.

Acknowledgements

Bill Janssen, Ramana Rao, and Hinrich Schütze contributed material to this article. Kris Halvorsen and Larry Masinter read and commented on early drafts.

References
[Bruce et al.] R. Bruce, J. Foote,
D. Goldberg, J. Pedersen, K. Petersen, ``UPrint1,'' Xerox PARC
Internal Report 1996.
[Card and Pirolli] S. K. Card and P. Pirolli.
``The Cost-of-Knowledge Characteristic Function: Display Evaluation for
Direct-Walk Dynamic Information Visualizations,'' Proceedings of the
ACM SIGCHI Conference on Human Factors in Computing Systems, April
1994.
[Chen and Bloomberg]
F. Chen and D. Bloomberg,
``Extraction of Thematically Relevant Text from Images,''
Fifth Annual Symposium on Document Analysis and Information Retrieval,
April 15 - 17, 1996, Las Vegas, Nevada.
[Cutting et al. 91] D. Cutting, J. Pedersen,
and P.-K. Halvorsen.
``An Object-Oriented Architecture for Text Retrieval,'' Proceedings
of RIAO'91.
[Cutting et al. 92] D. Cutting, J. Kupiec,
J. Pedersen, and P. Sibun.
``A Practical Part-of-Speech Tagger,'' Proceedings of Applied
Natural Language Processing, Trento, Italy, 1992.
[Cutting et al. 92b] D. Cutting, D. Karger,
J. Pedersen, and J.W. Tukey. ``Scatter/Gather: A Cluster-based Approach
to Browsing Large Document Collections,'' Proceedings of the 15th
Annual International ACM/SIGIR Conference, 1992.
[Cutting et al. 93] D. Cutting, D. Karger, and
J. Pedersen. ``Constant Interaction-Time Scatter/Gather Browsing of
Large Document Collections,'' Proceedings of the 16th Annual
International ACM/SIGIR Conference, Pittsburgh, PA, 1993.
[Hearst] M. A. Hearst,
``TileBars: Visualization of Term Distribution Information in Full Text
Information Access,'' Proceedings of the ACM SIGCHI Conference
on Human Factors in Computing Systems, Denver, CO, ACM, May 1995.
http://www.acm.org/sigchi/chi95/Electronic/documnts/papers/mah_bdy.htm
[Hearst and Pedersen] M. A. Hearst and J.O. Pedersen
``Re-examining the Cluster Hypothesis:
Scatter/Gather on Retrieval Results,'' Proceedings of the
19th Annual International ACM/SIGIR Conference, Zurich, 1996.
[Johnson et al.]
W. Johnson, S. K. Card, H.D. Jellinek, L. Klotz, R. Rao,
``Bridging the Paper and Electronic Worlds: The Paper User Interface,''
Proceedings of INTERCHI, ACM, April 1993. pp. 507-512.
[Kaplan and Kay] R. Kaplan and M. Kay.
``Regular Models of Phonological Rule Systems,''
Computational Linguistics, 20 (3), pp. 331-378, September 1994.
[Kopec]
G. Kopec, ``Document image decoding in the Berkeley digital
library project,'' in Document Recognition III, L. Vincent and J. Hull,
editors, Proc. SPIE vol. 2660, pp. 2--13, 1996.
[Kopec and Chou] G. Kopec and P. Chou, ``Document
image decoding using Markov source models,'' IEEE. Trans. Pattern Analysis
and Machine Intelligence, vol. 16, no. 6, June, 1994.
[Kupiec] J. Kupiec. ``MURAX: A Robust
Linguistic Approach For Question-Answering Using An On-Line
Encyclopedia,'' Proceedings of the 16th Annual International ACM/SIGIR Conference,
Pittsburgh, PA, 1993.
[Kupiec et al.] J. Kupiec, J. Pedersen, and F. Chen,
``A Trainable Document Summarizer,'' Proceedings of the 18th Annual
International ACM/SIGIR Conference, Seattle, WA, 1995.
[Lynch and Garcia-Molina]
C. Lynch and H. Garcia-Molina,
``Interoperability, Scaling, and the Digital Libraries Research Agenda: A Report
on the May 18-19, 1995 IITA Digital Libraries Workshop,'' Reston, VA,
August 22, 1995.
http://www-diglib.stanford.edu/diglib/pub/reports/iita-dlw/main.html
[Mackinlay et al.] J. D. Mackinlay,
R. Rao and S. K. Card. ``An Organic User Interface For
Searching Citation Links,'' Proceedings of the ACM SIGCHI Conference
on Human Factors in Computing Systems, Denver, CO, May 1995.
http://www.acm.org/sigchi/chi95/Electronic/documnts/papers/jdm_bdy.htm
[Pirolli and Card] P. Pirolli and S. K. Card,
``Information Foraging in Information Access Environments,''
Proceedings of the ACM SIGCHI Conference
on Human Factors in Computing Systems, Denver, CO, ACM, May 1995.
http://www.acm.org/sigchi/chi95/Electronic/documnts/papers/ppp_bdy.htm
[Rao et al. 94] R. Rao, S. K. Card,
W. Johnson, L. Klotz, and R. Trigg, ``Protofoil: Storing and Finding
the Information Worker's Paper Documents in an Electronic File
Cabinet,'' Proceedings of the ACM SIGCHI Conference on Human
Factors in Computing Systems, April 1994.
[Rao et al. 95]
R. Rao, J. O. Pedersen, M. A. Hearst, et al., ``Rich Interaction in
the Digital Library,'' Communications of the ACM, 38
(4), 29-39, April 1995.
[Robertson et al.]
G. G. Robertson, S. K. Card, J. D. Mackinlay.
``Information Visualization Using 3D Interactive Animation,''
Communications of the ACM, v.36, n.4, 1993.
[Russell et al.]
D. M. Russell, M. J. Stefik, P. Pirolli, and S. K. Card, ``The
Cost Structure of Sensemaking,'' Proceedings of ACM InterCHI '93,
April 1993.
[Sahami et al.] M. Sahami, M. Hearst, and
E. Saund, ``Applying the Multiple Cause Mixture Model to Unsupervised
Text Category Assignment,'' Proceedings of the 13th International
Conference on Machine Learning, Bari (Italy), July 3-6th, 1996.
[Schütze] H. Schütze,
``Dimensions of Meaning'', Proceedings of Supercomputing,
pages 787-796, Minneapolis MN, 1992.
[Schütze et al.]
H. Schütze, D. Hull and J. O. Pedersen, ``A Comparison
of Classifiers and Document Representations for the Routing Problem,''
Proceedings of the 18th Annual International ACM/SIGIR
Conference, pages 229-237, 1995.
ftp://parcftp.xerox.com/pub/qca/sigir95.abs.html
Copyright © 1996 Xerox Corporation. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, and that this copyright notice and the title of this publication and its date appear. To copy otherwise, or to republish, requires a fee and/or specific permission.