Recent Developments in GALEN II

Evolution of a Digital Library for the Health Sciences

John A. Kunze
Manager, Advanced Technology Group
jak@ckm.ucsf.edu

Brian N. Warling
Manager, Digital Library Operations
warling@library.ucsf.edu

Library and Center for Knowledge Management
University of California, San Francisco
San Francisco, CA 94143-0840

D-Lib Magazine, March 1996

ISSN 1082-9873



Introduction

GALEN II is the digital library of the University of California, San Francisco (UCSF). GALEN II is based on the World Wide Web, which provides us with the first real distributed network technology on which to build a digital library for the health sciences. The goal of GALEN II is to provide UCSF faculty, staff and students with the tools they need to create, disseminate, organize and locate biomedical information. GALEN II also provides the UCSF community with seamless, integrated access to the world's biomedical knowledge base, including databases, locally generated information and the published scientific literature.

Recent Advances

In August 1995, the Library released GALEN II, version 1. This release primarily included a number of important "infrastructure" elements, such as context-sensitive help, directed comments, and fielded searching. New features included on-line information and consulting requests, a growing set of Internet-based "knowledge resources," and an interface to the ARCHIE service.

To meet its goal of facilitating the dissemination of biomedical information, GALEN II also provides a publication platform for the UCSF and other relevant communities. In July 1995, the now infamous Brown and Williamson Company tobacco documents were published through GALEN II. Earlier in 1995, the UCSF Investigators' Handbook, an important campus resource, was published. The latest GALEN II publication is Trials Search: California HIV Clinical Trials. More electronic publications are planned for the future.

Providing access to vital biomedical knowledge resources is another important component of GALEN II. Using collection development guidelines specific to electronic resources, a team of librarians selects important resources in the biomedical fields. Criteria such as authenticity, accuracy, currency, and utility to the UCSF community are used in evaluating resources. Selected resources are categorized and then added to the GALEN II Knowledge Resources section along with evaluative information. Future challenges involve streamlining the content management task and facilitating resource selection and evaluation among our peer institutions.


A World Wide Web Interface to MEDLINE

The next version of GALEN II -- version 1.5 -- is scheduled for release in Spring 1996. The focus for this version is the development of a forms-based interface to the University of California's MELVYL® MEDLINE® PLUS database utilizing the Z39.50 protocol. MELVYL MEDLINE PLUS is a bibliographic database containing citations and abstracts to articles from over 4000 journals in medicine, life science, and health administration. It contains records from the MEDLINE and Health Planning & Administration (HEALTH) databases, both produced by the U.S. National Library of Medicine. The MELVYL system is maintained by the UC Division of Library Automation (DLA), and provides the UC community with access to a number of databases besides MEDLINE, including PsycINFO and INSPEC. Creating an interface to MELVYL MEDLINE PLUS was an easy decision, since UCSF is a graduate health sciences campus, and 80% of all MELVYL system use from UCSF involves the MEDLINE database.

The current MELVYL character-based search interface, while powerful, is difficult to use, especially for the novice and occasional user. The learning curve is rather steep. Through input from the faculty, we have learned that a simpler interface to MEDLINE is a highly desired feature. Faculty would also like direct access to electronic versions of journals. One of the most exciting aspects of this project is the direct linking from retrieved MEDLINE citations to the full articles via links to the journals in the Red Sage system.

For over two months at the end of 1995, a team consisting of librarians and programmers met on a weekly basis to develop the functional specifications for the interface. The task was divided into three main components: (1) query formulation, (2) interface design, and (3) display formats. Given the relatively short development time frame and the potential for unforeseen technical roadblocks, the team decided to limit the system to a core set of features. There were a number of important pieces of functionality that did not make it into the final specification, such as automatic stemming, search set manipulation and current awareness. These features will be phased in at later times. User testing should reveal any limitations in the current design and also aid in prioritizing the development and incorporation of new features.


Under the Hood: Technical Challenges

Creating a smoothly functioning World Wide Web ("the web") interface to MEDLINE presents several technical challenges. A network query language (protocol) has to be selected that complements the elegant but limited Hypertext Transfer Protocol (HTTP), the dominant protocol on the web. A translation mechanism must be created that converts HTTP requests into database searches and then converts search results into HTTP responses. Finally, a mechanism has to be written that translates citation records from their native database format into the web's information format, Hypertext Markup Language (HTML). This section describes the GALEN II approach to each of these challenges.

The Choice of Z39.50

A database the size and complexity of MEDLINE demands a significant programming commitment from the builder of the user interface. Since the two main direct avenues to MEDLINE are through (a) remote login (telnet) to a user-driven command-line interface and (b) a software-driven Z39.50 interface, it made sense to choose the more predictable Z39.50 interface, if only to protect our programming investment.

The computer-to-computer information retrieval protocol, Z39.50, is especially useful in this situation. Unlike the telnet interface, the Z39.50 interface has precisely defined, machine-readable outputs (e.g., records, diagnostics) that convey more than enough structural and semantic information to provide foundation support for a variety of user interfaces. Because software does not tolerate output changes in the user-optimized telnet interface as well as human beings do, the telnet interface is not a stable foundation.

At the same time, Z39.50 has user-oriented features that make it preferable to a typical remote database query language. One of these features is server-side result sets, which allow users to defer the network transfer of search results until, and especially unless, they are wanted. Another feature is Explain, which returns both human- and machine-readable descriptions of remote system objects such as databases, indexes, and record formats. There are other Z39.50 features that support special library functions in a standardized way, such as term list scanning and document ordering.

Another benefit of using Z39.50 is that the programming cost for an interface can be amortized by re-applying it to other Z39.50 databases on the network. Once the user interface that interoperates with MEDLINE via Z39.50 is built, only a small fraction of extra effort is needed to re-use it against another database.

The Web Supplies Most of a User Interface

The web, of course, has its own unique advantages, making a marriage of it and Z39.50 especially appealing. One of the primary advantages of the web is that almost every user with Internet access also has a web browser (a user-interface for accessing the web). This means that without any software distribution cost, a web (HTTP) server can cause a relatively sophisticated user interface to appear on any library patron's computer that connects to it.

All that is required in the simplest case is a knowledge of HTML, the web's standard notation for expressing document text, together with links to other documents (text + links = hypertext). For more complex applications, HTML forms are required. HTML forms can be viewed as a very simple way to specify more advanced user interface components such as dialog boxes, radio buttons, check boxes, and pull down menus.

While the graphical layout of these components is relatively easy to achieve and test in HTML, connecting user manipulation of them (e.g., "checking" a box or entering some text) to HTTP server actions requires programming. Each manipulation by the user is recorded in the HTML form until a special "submit" button causes the "filled-out" form to be sent to the server. What happens then depends on whether or not there is a server gateway.
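
As a rough illustration of what reaches the server when the user presses "submit", the sketch below encodes and decodes a form body with Python's standard library. The field names (author, title, limit_english) and their values are hypothetical and are not taken from the GALEN II forms.

```python
from urllib.parse import parse_qs, urlencode

def encode_form(fields):
    """Encode form fields the way a browser does when a form is submitted."""
    return urlencode(fields)

def decode_form(body):
    """Decode a submitted form body into a dict holding one value per field."""
    parsed = parse_qs(body, keep_blank_values=True)
    return {name: values[0] for name, values in parsed.items()}

# Hypothetical filled-out form: two text boxes and one checked check box.
submitted = encode_form({
    "author": "smith j",
    "title": "asthma in children",
    "limit_english": "on",
})
fields = decode_form(submitted)
```

Here `submitted` is the single string the browser sends, and `fields` is the per-field view that a server-side program would work from.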

The Gateway: Stateless vs. Stateful Protocols

Servers do not normally respond to the filled-out form directly because the actions required are highly non-standard (unlike the Z39.50 case) and may be specified arbitrarily by the interface builder. Instead, the usual practice is to program the actions in a module external to the server, known as a gateway. The only other option -- modifying the server code directly -- can improve response time, but makes it harder to keep up with new releases of base server software and generally will not work with any other server software.

For these reasons, GALEN II uses the gateway option. In particular, it uses the Common Gateway Interface (CGI) so that it may interoperate with any conforming server base. This means that the MEDLINE program modules will run on a wide variety of servers, and that upgrading the GALEN II server can be done without disrupting the MEDLINE interface.

Our gateway's job is to perform two kinds of translation. It takes user input from a filled-out HTML form, converts it into Z39.50 queries or retrieval requests, and sends them on to the MELVYL MEDLINE server. It also takes Z39.50 responses from the MELVYL server, converts them into HTML forms and sends them back to the browser for display to the user. Put another way, the GALEN II gateway translates HTTP protocol messages into Z39.50 protocol messages, and vice versa.
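
The first kind of translation can be sketched as follows. This is a minimal illustration, not the GALEN II code: the form field names are hypothetical, the numbers are Bib-1 Use attribute values (1003 for author, 4 for title, 1016 for any), and the output is a prefix-notation string modeled on the textual query syntax used by some Z39.50 toolkits. A real gateway would pass an equivalent structure to its Z39.50 client code rather than send this string over the network.

```python
# Hypothetical mapping from form field names to Bib-1 Use attribute values.
USE_ATTRIBUTES = {
    "author": 1003,   # Bib-1 Use attribute: Author
    "title": 4,       # Bib-1 Use attribute: Title
    "keyword": 1016,  # Bib-1 Use attribute: Any
}

def form_to_query(fields):
    """Build a prefix-notation query string from filled-in form fields.

    Empty and unrecognized fields are skipped; the remaining terms are
    combined with AND.
    """
    terms = []
    for name in sorted(fields):
        value = fields[name].strip()
        if name in USE_ATTRIBUTES and value:
            cleaned = value.replace('"', "")  # drop quotes so the term stays one token
            terms.append('@attr 1=%d "%s"' % (USE_ATTRIBUTES[name], cleaned))
    if not terms:
        raise ValueError("no search terms supplied")
    # In prefix notation, n terms need n - 1 leading @and operators.
    return "@and " * (len(terms) - 1) + " ".join(terms)

query = form_to_query({"author": "smith j", "title": "asthma", "keyword": ""})
```

Keeping the field-to-attribute mapping in a table means a new search field can be supported by adding one entry rather than new code.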

One of the disadvantages of the web is the stateless nature of HTTP, which is explained as follows. A server using a stateless protocol (such as HTTP) treats each request as if from a client with which it has never communicated before; in other words, it maintains no memory, or state, regarding the client. In contrast, a stateful protocol (such as Z39.50) is conducted over a session for which the server keeps track of things like user identification and search results as they accumulate over the course of the session.

A particular technical challenge was how to make a series of stateless HTTP fetch requests connect to the same Z39.50 session and have the HTTP responses reflect the continuity of the corresponding series of stateful Z39.50 operations. In GALEN II this is done by making each HTTP request/response contain an HTML form into which a session identifier is inserted using a so-called "hidden field". The semantics of hidden fields in an HTML form dictate that they are not displayed to the user but are simply sent back unchanged when the filled-out form is sent to the server. The HTTP server effectively asks the browser to remember the context in which the form was created, and to remind the server when the next request is sent in.
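
The hidden-field technique can be sketched as below. This is an illustration under stated assumptions rather than the GALEN II implementation: the in-memory SESSIONS table stands in for whatever links a session identifier to a live Z39.50 session, and the form action /cgi-bin/search and the field names are hypothetical. Because a CGI program normally runs once per request, a real gateway would need to keep this table somewhere that outlives a single request.

```python
import html
import secrets

# Session id -> per-session state (a stand-in for a live Z39.50 session).
SESSIONS = {}

def open_session():
    """Create a new session and return its identifier."""
    session_id = secrets.token_hex(8)
    SESSIONS[session_id] = {"searches": []}
    return session_id

def render_form(session_id):
    """Return an HTML form that carries the session id in a hidden field."""
    return (
        '<form method="POST" action="/cgi-bin/search">\n'
        '<input type="hidden" name="session" value="%s">\n'
        '<input type="text" name="keyword">\n'
        '<input type="submit" value="Search">\n'
        "</form>" % html.escape(session_id, quote=True)
    )

def handle_request(fields):
    """Resume the session named by the hidden field, or start a new one."""
    session_id = fields.get("session")
    if session_id not in SESSIONS:
        session_id = open_session()
    state = SESSIONS[session_id]
    if fields.get("keyword"):
        state["searches"].append(fields["keyword"])
    # The response form echoes the id so the next request can find the session.
    return session_id, render_form(session_id)

first_id, first_form = handle_request({})
second_id, _ = handle_request({"session": first_id, "keyword": "asthma"})
```

The second call reaches the same session as the first only because the browser returned the hidden field unchanged.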

The session's state is thus passed back and forth instead of being held at the server. While this preserves server simplicity, it merely shifts the burden onto the protocol interaction. The solution is not especially satisfactory and underscores a weakness of HTTP, yet it works well enough to obtain the tremendous leverage of using existing web browser interfaces.

Generating HTML on-the-Fly

Almost all of the gateway responses consist of HTML forms that are created by program code after a request is received. While this means that the exact HTML responses cannot be known ahead of time, GALEN II uses HTML form templates that specify most of the response in advance but contain substitutable SGML-style entities. To generate a response form, a template is selected (e.g., for search results) and scanned, replacing entity references with appropriate values, including hidden fields. The response form is then sent back to the browser.
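
Template scanning of this kind might look like the following sketch. The entity names (query, hits, session) and the template text are invented for illustration. The sketch assumes that unknown entities and the standard HTML character entities are passed through unchanged.

```python
import html
import re

# Matches SGML-style entity references such as &query;
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")

# Standard HTML character entities that must pass through untouched.
RESERVED = {"amp", "lt", "gt", "quot"}

def fill_template(template, values):
    """Replace each known entity reference with its HTML-escaped value."""
    def substitute(match):
        name = match.group(1)
        if name in RESERVED or name not in values:
            return match.group(0)  # leave the reference exactly as written
        return html.escape(str(values[name]), quote=True)
    return ENTITY.sub(substitute, template)

# Hypothetical search-results template with a hidden session field.
TEMPLATE = (
    "<h2>Results for &query;</h2>\n"
    "<p>&hits; citations found &amp; listed below.</p>\n"
    '<input type="hidden" name="session" value="&session;">'
)

page = fill_template(
    TEMPLATE, {"query": "asthma <children>", "hits": 42, "session": "abc123"}
)
```

Escaping the substituted values keeps user-supplied text, such as the search terms, from being interpreted as markup in the response.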

At the heart of form generation is a program module that converts MARC records into HTML suitable for display or into text for downloading. This currently involves a table-driven procedure that steps through each field of the MARC record and consults a rule set that specifies how to display each field and combine it with other fields. The output is a complex stream of text and HTML table codes needed to display the rich set of information carried in a MEDLINE record.
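
A much-simplified sketch of a table-driven conversion is shown below. The rule table, the choice of tags, and the sample record are assumptions made for illustration: the tags follow common MARC usage (100 author, 245 title, 520 summary, 650 subject, 773 host item), which may differ from the tags in MELVYL MEDLINE records, and a record is reduced here to a list of (tag, text) pairs.

```python
import html

# Hypothetical rule set: MARC tag -> (display label, separator for repeats).
RULES = {
    "245": ("Title", " "),
    "100": ("Author", "; "),
    "773": ("Source", " "),
    "650": ("Subjects", "; "),
    "520": ("Abstract", " "),
}

# Order in which rows are displayed, independent of field order in the record.
DISPLAY_ORDER = ["245", "100", "773", "650", "520"]

def marc_to_html(record):
    """Render a record, given as (tag, text) pairs, as an HTML table."""
    grouped = {}
    for tag, text in record:      # step through each field of the record
        if tag in RULES:          # consult the rule set; skip fields with no rule
            grouped.setdefault(tag, []).append(text)
    rows = []
    for tag in DISPLAY_ORDER:
        if tag in grouped:
            label, separator = RULES[tag]
            cell = html.escape(separator.join(grouped[tag]))
            rows.append("<tr><th>%s</th><td>%s</td></tr>" % (label, cell))
    return "<table>\n%s\n</table>" % "\n".join(rows)

# Invented sample record, not real MEDLINE data.
sample = [
    ("100", "Smith J"),
    ("245", "Asthma in children"),
    ("650", "Asthma"),
    ("650", "Child"),
    ("035", "12345"),
]
table = marc_to_html(sample)
```

Because the display decisions live in the rule table, changing how a field is labeled or combined does not require changing the conversion loop.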


Future Initiatives and Challenges

GALEN II will continue to evolve as campus needs change and as internal and external pressures force us to re-examine how best to provide important information services. A number of new GALEN II products are on the horizon. In mid-1996, the Library/CKM, in partnership with the University of California Press, will publish the electronic version of The Cigarette Papers, an important new book that will analyze the content of the Brown and Williamson Company tobacco documents. The print version will be published by UC Press. More electronic publications are also anticipated. The Library and Center for Knowledge Management will also be making decisions concerning the next major phase of GALEN II development. One possible new feature is a web interface to the Library's on-line public access catalog. User evaluation is another vital component that will be addressed in the near future, as will the system's overall design.

MELVYL is a registered trademark of the Regents of the University of California.
Copyright © 1996 Regents of the University of California



hdl://cnri.dlib/march96-warling