Nigel Kerr
Digital Library Production Services
nigelk@umich.edu
Harlan Hatcher Graduate Library, Room 301
University of Michigan
Ann Arbor, Michigan
The Humanities Text Initiative (HTI) is an SGML creation and support unit at the University of Michigan and is part of the university's Digital Library Production Service (DLPS). Created in 1994, the HTI has its origins in the University Library's 1988 efforts to create an Internet-based "textual analysis" capability through a service then known as UMLibText. With a wider variety of collections and a broader user base than UMLibText, the HTI was designed as an umbrella organization for the creation and maintenance of online text, and as a mechanism for furthering the University's capabilities in the area of electronic text. Since its creation, the HTI has amassed perhaps the Internet's largest and certainly richest collection of materials in SGML (as of the date this article was written, there are almost 2 million pages of encoded text using 14 different DTDs online). As well as creating text collections available to the Internet community and working with scholars on the creation of new electronic editions, the HTI supports the delivery of externally created SGML collections and has collaborated with publishers and with other academic operations to design and build local access mechanisms for their titles. In 1996, the HTI launched the SGML Server Program, which leverages Michigan's years of experience with electronic texts to assist in the development of SGML support at other academic institutions.
Works were selected for inclusion in the American Verse Project after consulting standard bibliographies, anthologies, and histories of American literature, including the 1993 Columbia History of American Poetry, Spiller's Literary History of the United States, Waggoner's American Poets from the Puritans to the Present, and Matthiessen's 1950 Oxford Book of American Verse. These were supplemented by specialized bibliographies of writing by American women and people of color. The list was expanded to include poets of special interest to American literary historians from Michigan's Department of English, who emphasized the extent of current scholarly interest in eighteenth- and nineteenth-century popular poetry, poetry by women, and African-American poetry. A list of nearly 400 American authors of poetry was assembled.
Working from this list, a survey of books held by the University Library was made and an electronic bibliography of print and electronic versions was constructed. Several hundred titles from the Michigan collection were evaluated to determine whether they were within the scope of the project; texts were selected and prioritized based on their scholarly interest and their physical properties (e.g., extent of deterioration and "scanability").
The volumes selected for inclusion in the American Verse Project are scanned without being disbound. Currently, the HTI is using the Xerox 620 scanner and its Xerox Scan Manager software for batch scanning; a Fujitsu 3096 scanner and BSCAN have also been used. TypeReader is the software package primarily used for optical character recognition (OCR); it has worked very well for the recognition of the older typefaces in the nineteenth-century material and has an unobtrusive, easy-to-use proofing interface. ScanWorx, a UNIX package available from XIS, has been used less frequently; because it can be trained to recognize non-standard characters, such as long s, it is useful for the oldest volumes in the collection. Prime OCR, a package developed by Prime Recognition that uses up to five OCR engines to dramatically improve OCR accuracy, is being evaluated for possible use. The HTI gives a great deal of attention to accuracy in the digitization process, on the assumption that access to reliable electronic texts is most important.
After a volume is in electronic form, automated routines are run to provide a first layer of SGML markup, identifying obvious text structures (lines of poetry, page breaks, paragraphs) and possible scanning errors, such as omitted pages. Careful manual markup occurs in the next stage, using SoftQuad's Author/Editor SGML editing package and the TEI's "TEILite" DTD. The HTI's encoding staff clear up ambiguous markup left unresolved or introduced by the automatic tagging process and add markup too sophisticated for automated routines.
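The automated first pass described above might look something like the following Python sketch. It is an illustration only, not the HTI's actual routine: the `[page N]` marker format, the function name `pretag`, and the choice of `<lg>`, `<l>`, and `<pb>` tags are assumptions made for the example. Blank lines are treated as stanza breaks, each non-blank line becomes a line of verse, and a gap in the page-number sequence is reported as a possible omitted page.

```python
import re
from html import escape

# Hypothetical page marker assumed to be left in the OCR output, e.g. "[page 12]".
PAGE_MARK = re.compile(r"^\[page (\d+)\]$")


def pretag(text):
    """Return (sgml, warnings) for a plain-text transcription of verse."""
    out = []
    pages = []
    in_stanza = False
    for raw in text.splitlines():
        line = raw.strip()
        marker = PAGE_MARK.match(line)
        if marker:
            # Record the page number and emit an empty page-break element.
            pages.append(int(marker.group(1)))
            out.append(f'<pb n="{marker.group(1)}">')
            continue
        if not line:
            # A blank line closes the current stanza, if one is open.
            if in_stanza:
                out.append("</lg>")
                in_stanza = False
            continue
        if not in_stanza:
            out.append("<lg>")
            in_stanza = True
        # Escape &, <, > so stray characters cannot be read as markup.
        out.append(f"<l>{escape(line, quote=False)}</l>")
    if in_stanza:
        out.append("</lg>")
    # Flag any jump in the page sequence as a possible scanning omission.
    warnings = [
        f"possible omitted page(s) between {a} and {b}"
        for a, b in zip(pages, pages[1:])
        if b != a + 1
    ]
    return "\n".join(out), warnings
```

Output of this kind would still need the manual review described above; the sketch cannot tell a stanza break from a paragraph break or recognize headings, epigraphs, or notes.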
After encoding, a formatted, printed copy of the text is used to proof against the original volume and for review of the markup by a senior encoder. All images found in the original volume are scanned and indicated in the encoded text; an image of the title page and verso is also included. Finally, full bibliographic information, including a local call number and the size of the electronic file, is included in the header of the electronic text. The header is reviewed by a cataloger and a record for the electronic text is created for Michigan's online library catalog.
Although some of the texts for the CME have been scanned and OCRed by HTI staff, Middle English presents more problems in this area than does American poetry. Two characters no longer used in English, yogh and thorn, appear frequently throughout Middle English texts. In addition, some editions preserve the scribal abbreviations used in manuscripts -- letters with curls on them or bars through them. Although ScanWorx can be trained to recognize these characters, doing so makes the OCR process and subsequent proofing tedious and dramatically increases the amount of time spent in making corrections to the text. Most texts being created for the CME have been sent out for contract keyboarding, with excellent (1 error in 20,000 characters) results.
The encoding process for the CME is again similar to that for American Verse, with a few adaptations for the differences in material. Texts are automatically pretagged before being passed to the encoding staff. The TEILite DTD is used, but with additions from the TEI's DTD for the transcription of primary resources. These additions are necessary to encode the changes in scribal hand, additions, and deletions that have been made in Middle English manuscripts over time, which are captured in the editions selected for inclusion. Proofing, markup review, imaging, and review of the header complete the text creation process for the CME.
After the texts have been completed, they are moved to the HTI's web server and indexed using OpenText's search engine software. Search interfaces appropriate to the collection are written as HTML forms, and HTI's programmers write "middleware" CGI programs that mediate between the HTML forms, the search engine, and the SGML text. The middleware takes the search information submitted on the web page, formats it into a syntax understood by the search engine, and submits the search. It then retrieves the text found by the search engine, transforms the SGML into HTML, and sends it to the user's web browser. Once the middleware is in place, the collections are tested to make sure they work as expected and their availability is announced.
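The middleware flow just described (form input, query formatting, search, SGML-to-HTML conversion) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the HTI's code: the `region:term` query syntax is invented for the example and is not OpenText's, the `search` function is an in-memory stand-in for a real search engine, and the tag-to-HTML rules cover only the `<lg>`, `<l>`, and `<pb>` elements.

```python
import re


def build_query(form):
    """Format submitted form fields as a query string (invented syntax)."""
    term = form["term"].strip()
    region = form.get("region", "l")  # default: search within lines of verse
    return f"{region}:{term}"


def search(query, corpus):
    """Stand-in search engine: return elements of the region containing the term."""
    region, term = query.split(":", 1)
    name = re.escape(region)
    pattern = re.compile(
        rf"<{name}>[^<]*{re.escape(term)}[^<]*</{name}>", re.IGNORECASE
    )
    return [m.group(0) for doc in corpus for m in pattern.finditer(doc)]


# Hypothetical display rules mapping SGML elements to HTML.
RULES = [
    (re.compile(r"<lg>"), "<p>"),
    (re.compile(r"</lg>"), "</p>"),
    (re.compile(r"<l>(.*?)</l>"), r"\1<br>"),
    (re.compile(r'<pb n="(\d+)">'), r"<hr><!-- page \1 -->"),
]


def sgml_to_html(sgml):
    """Apply each display rule in turn to an SGML fragment."""
    for pattern, replacement in RULES:
        sgml = pattern.sub(replacement, sgml)
    return sgml


def handle_request(form, corpus):
    """Run one search request end to end and return an HTML page."""
    hits = search(build_query(form), corpus)
    body = "\n".join(sgml_to_html(hit) for hit in hits) or "No matches."
    return f"<html><body>\n{body}\n</body></html>"
```

For example, calling `handle_request({"term": "captain"}, corpus)` on a corpus of encoded poems would return a page containing each matching line followed by `<br>`. Keeping query formatting, searching, and display conversion in separate functions mirrors the separation of concerns in the paragraph above, so a different search engine or display style could be swapped in without touching the other steps.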
Preparing the American Verse and CME texts for Internet delivery is relatively straightforward; the HTI staff knows the DTD and the way in which the volumes have been encoded, the SGML is already validated, and the material and its potential uses are already familiar. Writing search interfaces and middleware for these collections is simplified by the knowledge that has been gained by working the texts through the entire creation process.
However, in addition to serving texts that have been created in-house, the HTI also delivers SGML created by outside sources, such as publishers (Grolier, InSo, McGraw-Hill, Oxford University Press) or electronic text vendors (Chadwyck-Healey, Intelex). In these cases, the processes involved in creating search interfaces and middleware for these texts are still much the same as for American Verse or CME, though the staff must first familiarize themselves with the contents of the work, its DTD, how the DTD was applied, and the potential uses of the text. For example, the HTI worked with the DLPS to offer the University of Michigan community access to Physician's GenRx, a reference work of pharmaceutical information published by Mosby. The work itself, the field it covers, and the audience it serves were unfamiliar to the HTI staff; before development work could begin they had to acquaint themselves with the printed version of the text and consult with librarians in the health sciences to get a better understanding of the contents and the potential uses of Physician's GenRx. Then the DTD and the text could be assessed for encoding structure, likely fields for restricted searches, and reasonable portions for retrieval -- is an average article so large it will crash a user's browser if delivered whole? Will it be possible to allow the user to browse the brand names of drugs? Are there cross-references to other articles that need to be linked? Questions like these need to be answered for every new collection of SGML the HTI receives from an outside source.
The staff tries to create interfaces for all the collections that will make the most of the users' familiarity with their browser and general principles of online searching, regardless of content. While certain restrictions and features will not be available in every collection due to differences in content and encoding, the basic look and operation of the search forms is consistent to the user. When dealing with texts created by others, the reviewing and testing phase is often longer and involves more input from other librarians, faculty, and interested users before the text is made available and announced to the Michigan user community.
The Collaboratory for the Humanities is a sub-unit of the HTI. Funded by a 1995 University of Michigan Presidential Initiatives grant, the Collaboratory allows the HTI staff to work extensively with three faculty fellows (one each from English, History, and Classical Studies) on the creation of new electronic editions in SGML. The Collaboratory also provides technical assistance and support for other Michigan faculty members working on text transcription and creation projects. Beyond the Collaboratory, HTI staff have assisted faculty at other institutions with DTD creation and adaptation and encoding assistance as they have worked on their own projects involving various levels of SGML and text-processing sophistication.
In order to provide online support for SGML-encoded text and reference collections to other institutions, the SGML Server Program (SSP) was formed. Operated by the HTI and DLPS staff, the SSP provides participants with the middleware to work with the OpenText search engine and assistance with processing and indexing the SGML collections covered by the program. Institutions can choose to access collections on a server located at the University of Michigan, or to have the texts installed on their own servers. Ongoing development, training, documentation, and online support are also offered to SSP participants.
The HTI also provides SGML services to external organizations; The Medieval Review (TMR) and the Human Relations Area Files (HRAF) are two examples. The Medieval Review (formerly the Bryn Mawr Medieval Review) is an electronic journal of scholarly book reviews freely available to the Internet community; the Human Relations Area Files collection is a large corpus of anthropological texts, distributed to members in annual installments since the 1940s and available to organizations paying annual dues. Each has its own special needs.
TMR consists of articles from reviewers, submitted by the editors of the journal. As documents, they are fairly simple: bibliographic information about the work reviewed, information about the reviewer, and the review itself, with varying kinds of other bibliographic references. Previously these had been available as plain text files via a listserv and through a gopher server. With the advice and assistance of HTI staff, the backlog of articles was converted to SGML using a fairly straightforward version of the TEILite DTD. New articles are encoded in original SGML using the same DTD by the TMR staff. The database and middleware are rather similar to American Verse and the CME, with a few significant variations. The formatting and display are unique to TMR's editors' desires, and as TMR is not located at the University of Michigan, the editors submit the articles to the database via an email robot that receives and validates their articles. Like other HTI collections, TMR is browsable and searchable and is available via the web as both SGML and HTML.
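A submission robot of the kind mentioned above could be approximated as follows. This Python sketch is hypothetical throughout: it assumes the article arrives as the plain-text body of a single-part email, and it substitutes a simple tag-balance check for real validation against a DTD, which would require an SGML parser. The function names and the set of empty elements are assumptions made for the example.

```python
import re
from email import message_from_string

# Matches start and end tags; declarations such as <!DOCTYPE ...> are skipped.
TAG = re.compile(r"<(/?)([A-Za-z][\w.]*)[^>]*>")
# Elements assumed to have no end tag in this sketch.
EMPTY = {"pb", "lb"}


def check_balance(sgml):
    """Return a list of tag-nesting problems (a stand-in for DTD validation)."""
    stack, problems = [], []
    for match in TAG.finditer(sgml):
        closing = match.group(1) == "/"
        name = match.group(2).lower()
        if name in EMPTY:
            continue
        if not closing:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            problems.append(f"unexpected </{name}>")
    problems += [f"unclosed <{name}>" for name in stack]
    return problems


def process_submission(raw_message):
    """Parse one emailed submission and decide whether to accept it."""
    message = message_from_string(raw_message)
    problems = check_balance(message.get_payload())
    status = "accepted" if not problems else "rejected"
    return {"from": message["From"], "status": status, "problems": problems}
```

A real robot would also need to handle attachments, authenticate the sender, and reply with the list of problems; because SGML permits omitted end tags where the DTD allows them, a balance check like this one would reject some valid documents.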
HRAF is somewhat different. Some years ago, the staff of HRAF decided to begin to provide what had long been a paper, then microfiche, resource in an electronic form. SGML seemed to suit their needs; their data consists of numerous indexed texts and images for each installment of the HRAF. After extensive consultation with HRAF's DTD expert and testing and revision of the user interface, the web eHRAF will go online this summer for member institutions. The interface and display have been discussed and agreed upon so that searches and formatting suit HRAF's understanding of how the works are used. The eHRAF process continues each quarter with a review of the interface and usage by patrons, and an annual installment of new materials.
The Feasibility of Wide-area Textual Analysis Systems in Libraries: A Practical Analysis, presented at Literary Text in an Electronic Age: Scholarly Implications and Library Services, the 31st Annual Clinic on Library Applications of Data Processing (University of Illinois at Urbana-Champaign), April 10-12, 1994. Published in the Proceedings of the Clinic. Also available on the WWW at http://jpwilkin.hti.umich.edu/pubs/dpc.html
Text Files in Libraries: Present Foundations and Future Directions, Library Hi Tech, Consecutive Issue 35 (1991) 7-44. Pre-publication version available on the WWW at http://jpwilkin.hti.umich.edu/pubs/hi-tech.htm
Text Files in RLG Academic Libraries: A Survey of Support and Activities, Journal of Academic Librarianship 171 (March 1991) 19-25.
Warner, Beth Forest, and David Barber. Building the digital library: the University of Michigan's UMLib Text Project, Information Technology and Libraries 13 (March 1994) 20-4.
hdl:cnri.dlib/july97-powell