The Future is a Complex Place
Imagine an information landscape in which all the items we wanted in digital form were already converted. What would remain for us to work on? One obvious answer is the thorny problem of heterogeneous information -- even in an environment in which objects have been encoded as sequences of bits with order and structure. Why?
One reason is purely practical: the continued existence of legacy systems and data. Already, vaults, stacks, and discs contain information in a variety of digital formats of varying obsolescence -- from satellite telemetric imagery to census data to bibliographic records -- while the information technologies are themselves growing explosively. What is experimental today seems to become standard tomorrow, and then dated, if not obsolete, in a year or two. Dedicated word processors had hardly supplanted electric typewriters when they were themselves replaced by word processing programs running on personal computers. And the story seems to have been re-told in everything from chip design to spreadsheet and database management programs. Thus, a price of rapid change is continuing creation of legacy systems and data.
A second reason is the heterogeneity of the objects themselves. One definition of information is a set of data that possesses meaning, and meaning can be highly context dependent. Consider some of the issues that arise in a digital library of art and architecture. Through virtual reality technologies, we may be able to represent a cathedral's observable characteristics: height, footprint, mass, facades, elevations, decorative features, and so on. But there are extrinsic descriptions, such as "English Perpendicular Gothic", that are not encoded in the bits but that are critical to humans who may wish to search for the representation. This distinction between meaning that we recognize in objects and attributes that physically define the object poses an important set of issues for digital libraries, particularly distributed, interoperable libraries containing objects from which users will assemble collections. Thus, even if the storage formats are platform independent, objects encoded according to uniform rules will, nonetheless, possess heterogeneous characteristics that may matter very much to the humans who use them.
Such extrinsic information as "English Perpendicular Gothic" has traditionally been captured in cataloging records associated with the representation of the object, but detailed cataloging is a labor-intensive, value-added service. Cataloging is one of several tools developed for coping with similarity of content; others are thesauri for standardized indexing of materials and formulating queries, collection management strategies that isolate collections of related subjects (such as an art and architecture library), and the socialization of higher education, where we learn the vocabularies and values of our chosen subject domains. Embedding thesauri, such as the Art and Architecture Thesaurus, in indexing and information retrieval systems is one proposed solution, but thesauri still must be updated and maintained because, as Caroline Arms points out, the content of these domains continues to evolve, as does the language in which we describe them. Research at the National Library of Medicine clearly shows that maintenance is neither easy nor trivial even in a domain in which the language is relatively controlled. A casual search on the phrase "infantile paralysis" in HyperDoc's excellent collection of historic photographs returned 25 responses. A second search on the word "polio" recovered 5 different items, and a third search on the technically correct term for the disease, "poliomyelitis", yielded another 20 matches.
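The three searches above suggest what an embedded thesaurus might do for a user. What follows is only a minimal sketch, not a description of how HyperDoc or the Art and Architecture Thesaurus actually works: the table of terms, the sample records, and the function names `expand` and `search` are all hypothetical, invented for illustration. It assumes each record carries a set of index terms and that the thesaurus maps each variant to a single preferred term.

```python
# Hypothetical thesaurus: each variant term maps to one preferred term.
# These entries are illustrative assumptions, not a real vocabulary.
PREFERRED = {
    "infantile paralysis": "poliomyelitis",
    "polio": "poliomyelitis",
    "poliomyelitis": "poliomyelitis",
}

# Hypothetical records: record id -> index terms assigned by a cataloger.
RECORDS = {
    1: {"infantile paralysis", "hospitals"},
    2: {"polio", "vaccination"},
    3: {"poliomyelitis", "laboratories"},
    4: {"cathedrals"},
}


def expand(term):
    """Return every term that shares a preferred term with `term`."""
    key = term.lower()
    preferred = PREFERRED.get(key)
    if preferred is None:
        return {key}  # unknown term: search on it unchanged
    return {t for t, p in PREFERRED.items() if p == preferred}


def search(records, term, use_thesaurus=True):
    """Return sorted ids of records indexed under any of the query terms."""
    terms = expand(term) if use_thesaurus else {term.lower()}
    return sorted(rec_id for rec_id, index in records.items() if index & terms)


if __name__ == "__main__":
    print(search(RECORDS, "polio", use_thesaurus=False))
    print(search(RECORDS, "polio"))
```

Without the thesaurus, a search on "polio" finds only the record indexed under that exact word; with it, the same query also reaches the records indexed under "infantile paralysis" and "poliomyelitis". The sketch also makes the maintenance burden visible: every new variant that enters the language has to be added to the table by hand.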
This messy issue of meaning is related to the decontextualization of objects that necessarily occurs when the information is stored. Compared with storage of a physical print document, deconstruction of the virtual or electronic item begins to occur when the object is stored as a sequence of code and data -- as happens on my personal computer every day. To recover a story exactly as I may have written and displayed it requires users to have the same or an interoperable word processor (or browser). In distributed libraries, metadata, which may help users reassemble and re-discover the original object, can likewise be stripped out and stored elsewhere.
At the same time, the process of storing the object must somehow come to terms with the context of the original. In the examples given by Andreas Paepcke in this issue and Jon Hujsak in January, project staff found interim information from earlier projects useful. This kind of information, which may range from web annotations to penciled sketches, places a curious burden on the user. Suppose an engineer pursuing design ideas for spacecraft comes across individually stored designs and log entries about O-rings. Files dated 1985 would not include analyses of the Challenger accident on January 28, 1986, and web annotations, notes, logs, and other project-related ephemera are by their very nature incomplete. The full collection may not include the post-disaster reports, or perhaps the search did not turn them up. Presumably, this fictive researcher would think to pursue performance data, since engineering design phases are fairly well established. But similar taxonomies of phases of work may not exist for all domains. Moreover, a searcher might not even look for performance data if the context of the later re-use (say, marketing) differs from the context in which the objects were created.
This notion of context is related to the idea behind terms, such as "English Perpendicular Gothic", that capture agreed-upon collections of characteristics. It is closer to the preservationists' concept of "provenance", which means the circumstances of an item's creation, discovery, or use. But even then, provenance information might serve only to alert users to a potential interpretive issue or set of issues. Still, the concept of provenance does suggest a class of human-computer interactions to guide users to know what questions to ask.
Such processes of re-use, re-discovery, and reinterpretation of documents are commonplace to lawyers, manuscript catalogers, and editors, and the notion of context may yet become a discussion item in the digital library research agenda. Still, the "wetware" in front of the machine counts, and in our first story this month, Joshua Lederberg suggests that the need to select, vet, and evaluate information is likely to become more important in the electronic world where information of all sorts is proliferating. As an editor, I hope that he is right. As a former college professor, I know that the values of critical thinking can be taught. And from my catbird seat at the revolution, I think that heterogeneous information will continue to be a feature of the landscape.
Amy Friedlander
Editor
hdl://cnri.dlib/may96-friedlander