SGML Creation and Delivery

The Humanities Text Initiative

Christina Kelleher Powell
Humanities Text Initiative
sooty@umich.edu

Nigel Kerr
Digital Library Production Services
nigelk@umich.edu

Harlan Hatcher Graduate Library, Room 301
University of Michigan
Ann Arbor, Michigan

D-Lib Magazine, July/August 1997

ISSN 1082-9873

What is the HTI?
Text Creation in SGML
- The American Verse Project
- Corpus of Middle English Prose and Verse
SGML Delivery via the Web
SGML Services to Others
Future Directions
Related Readings

What is the HTI?

The Humanities Text Initiative (HTI) is an SGML creation and support unit at the University of Michigan and is part of the university's Digital Library Production Service (DLPS). Created in 1994, the HTI has its origins in the University Library's 1988 efforts to create an Internet-based "textual analysis" capability through a service then known as UMLibText. With a wider variety of collections and a broader user base than UMLibText, the HTI was designed as an umbrella organization for the creation and maintenance of online text, and as a mechanism for furthering the University's capabilities in the area of electronic text. Since its creation, the HTI has amassed perhaps the Internet's largest and certainly richest collection of materials in SGML (as of the date this article was written, there are almost 2 million pages of encoded text using 14 different DTDs online). As well as creating text collections available to the Internet community and working with scholars on the creation of new electronic editions, the HTI supports the delivery of externally-created SGML collections and has collaborated with publishers and with other academic operations to design and build local access mechanisms for their titles. In 1996, the HTI launched the SGML Server Program, which leverages Michigan's years of experience with electronic texts to assist in the development of SGML support at other academic institutions.

Text Creation in SGML

The creation of electronic versions of printed texts is one of the primary activities of the HTI. Text creation is centered around two main areas of activity -- American poetry and Middle English. An examination of the processes involved in these creation projects provides an excellent overview of HTI activities. A detailed description follows; a flowchart itemizing the steps in the process is also available.

The American Verse Project

In 1995, the staff of the HTI began work on the American Verse Project, an electronic archive of American poetry. The majority of the works are from the nineteenth and early twentieth centuries, although a few eighteenth century works will be included. The full text of each volume is converted to electronic form, marked up in SGML using the Text Encoding Initiative (TEI) Guidelines for Electronic Text Encoding and Interchange, and made available via the World Wide Web in both SGML and automatically generated HTML. The collection is both searchable and browsable. Users who simply want to scan the list of available texts and read a poem can, and many do; some of the works included are difficult to find outside of large academic libraries or are in very poor condition and do not circulate, and their availability on the web is a great boon to readers and researchers. The ability to search the collection is useful for activities as simple as finding a poem that begins with the line "Thou art not lovelier than lilacs" or as complex as comparing examples of flower imagery in early American poems in general.

Works were selected for inclusion in the American Verse Project after consulting standard bibliographies, anthologies, and histories of American literature, including the 1993 Columbia History of American Poetry, Spiller's Literary History of the United States, Waggoner's American Poets from the Puritans to the Present, and Matthiessen's 1950 Oxford Book of American Verse. These were supplemented by specialized bibliographies of writing by American women and people of color. The list was expanded to include poets of special interest to American literary historians from Michigan's Department of English. They emphasized the extent of current scholarly interest in eighteenth and nineteenth century popular poetry, poetry by women, and African-American poetry. A list of nearly 400 American authors of poetry was assembled.

Working from this list, a survey of books held by the University Library was made and an electronic bibliography of print and electronic versions was constructed. Several hundred titles from the Michigan collection were evaluated to determine whether they were within the scope of the project; texts were selected and prioritized based on their scholarly interest and their physical properties (e.g., extent of deterioration and "scanability").

The volumes selected for inclusion in the American Verse Project are scanned without being disbound; currently, the HTI is using the Xerox 620 scanner and its Xerox Scan Manager software for batch scanning; a Fujitsu 3096 scanner and BSCAN have also been used. TypeReader is the software package primarily used for optical character recognition (OCR); it has worked very well for the recognition of the older typefaces in the nineteenth century material and has an unobtrusive, easy to use proofing interface. ScanWorx, a UNIX package available from XIS, has been used less frequently; because it can be trained to recognize non-standard characters, such as long s, it is useful for the oldest volumes in the collection. Prime OCR, a package developed by Prime Recognition that uses up to five OCR engines to dramatically improve OCR accuracy, is being evaluated for possible use. The HTI gives a great deal of attention to accuracy in the digitization process, with the assumption that access to reliable electronic texts is most important.

After a volume is in electronic form, automated routines are run to provide a first layer of SGML markup, identifying obvious text structures (lines of poetry, page breaks, paragraphs) and possible scanning errors, such as omitted pages. Careful manual markup occurs in the next stage, using SoftQuad's Author/Editor SGML editing package and the TEI's "TEILite" DTD. Ambiguous markup unresolved or introduced by the automatic tagging process is cleared up and markup too sophisticated for automated routines is done by the HTI's encoding staff.
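The automated first pass described above can be pictured with a minimal sketch. It is not the HTI's routine: the bracketed page marker, the function name pretag, and the rule that a blank line ends a stanza are all assumptions made for illustration. The element names (lg, l, pb) are those the TEI Guidelines provide for line groups, verse lines, and page breaks.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical marker assumed to be left in the OCR output at each page start.
PAGE_MARK = re.compile(r"^\[page (\d+)\]$")

def pretag(text):
    """Return (sgml, warnings) for the OCR text of a verse volume.

    Assumption: blank lines separate stanzas and every other line is a verse line.
    """
    out, warnings = [], []
    in_stanza = False
    last_page = None
    for raw in text.splitlines():
        line = raw.strip()
        mark = PAGE_MARK.match(line)
        if mark:
            if in_stanza:               # close an open stanza before the page break
                out.append("</lg>")
                in_stanza = False
            page = int(mark.group(1))
            if last_page is not None and page != last_page + 1:
                # Non-consecutive numbers suggest a page was skipped in scanning.
                warnings.append(f"page gap: {last_page} -> {page}")
            out.append(f'<pb n="{page}">')   # SGML empty element, no end tag
            last_page = page
        elif not line:
            if in_stanza:
                out.append("</lg>")
                in_stanza = False
        else:
            if not in_stanza:
                out.append("<lg>")
                in_stanza = True
            out.append(f"<l>{escape(line)}</l>")
    if in_stanza:
        out.append("</lg>")
    return "\n".join(out), warnings
```

A pass like this only proposes structure. Anything it cannot decide, such as a heading that looks like a verse line, is left for the manual markup stage, and the warnings list gives the encoder a place to start checking for omitted pages.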

After encoding, a formatted, printed copy of the text is used to proof against the original volume and for review of the markup by a senior encoder. All images found in the original volume are scanned and indicated in the encoded text; an image of the title page and verso is also included. Finally, full bibliographic information, including a local call number and the size of the electronic file, is included in the header of the electronic text. The header is reviewed by a cataloger and a record for the electronic text is created for Michigan's online library catalog.

Corpus of Middle English Prose and Verse

The processes used for the American Verse Project generally apply to the Corpus of Middle English Prose and Verse (CME), the HTI's other major text creation project. For the CME, the Middle English Dictionary's Plan and Bibliography is primarily used as a selection tool, though texts of special interest to Middle English scholars at Michigan not in the Plan and Bibliography are also included.

Although some of the texts for the CME have been scanned and OCRed by HTI staff, Middle English presents more problems in this area than does American poetry. Two characters no longer used in English, yogh and thorn, appear frequently throughout Middle English texts. In addition, some editions preserve the scribal abbreviations used in manuscripts -- letters with curls on them or bars through them. Although ScanWorx can be trained to recognize these characters, it makes the OCR process and subsequent proofing tedious and dramatically increases the amount of time spent in making corrections to the text. Most texts being created for the CME have been sent out for contract keyboarding, with excellent (1 error in 20,000 characters) results.

The encoding process for the CME is again similar to that for American Verse, with a few adaptations for the differences in material. Texts are automatically pretagged before being passed to the encoding staff. The TEILite DTD is used, but with additions from the TEI's DTD for the transcription of primary resources. These additions are necessary to encode the changes in scribal hand, additions, and deletions that have been made in Middle English manuscripts over time, which are captured in the editions selected for inclusion. Proofing, markup review, imaging, and review of the header complete the text creation process for the CME.

SGML Delivery via the Web

The electronic texts created by the HTI are made available via the World Wide Web in SGML (for people with SGML browsers like Panorama, which is available as both a helper application and a plug-in for Netscape). They are also delivered as HTML created dynamically from the SGML when it is requested by a user; the HTI does not store static versions of the texts in HTML, only in SGML. To allow the text collections to be searched and individual works retrieved to be displayed in HTML, the HTI staff prepares the collection, indexes it, and creates the necessary web interfaces and programs.

After the texts have been completed, they are moved to the HTI's web server and indexed using OpenText's search engine software. Web pages with search interfaces appropriate to the collection are written as HTML forms, and HTI's programmers write "middleware" CGI programs that mediate between the HTML forms, the search engine, and the SGML text. The middleware takes the search information submitted on the web page, formats it into a syntax understood by the search engine, and submits the search. It then retrieves the text found by the search engine, transforms the SGML into HTML, and sends it to the user's web browser. Once the middleware is in place, the collections are tested to make sure they work as expected and their availability is announced.
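The middleware steps described above (build a query from the form, submit it, convert the returned SGML to HTML) can be sketched as follows. This is an illustration under stated assumptions, not the HTI's code: the query syntax is invented and is not OpenText's, the tag mapping is a hypothetical three-entry table, and the search engine is represented by whatever callable is passed in.

```python
import html
import re

# Hypothetical mapping from TEI Lite element names to HTML start/end strings.
TAG_MAP = {"lg": ("<p>", "</p>"), "l": ("", "<br>"), "head": ("<h3>", "</h3>")}

def build_query(form):
    """Turn submitted form fields into a query string.

    The syntax here is invented for illustration; it is not OpenText's.
    """
    term = form["term"].strip()
    region = form.get("region", "").strip()
    query = f'"{term}"'
    return f"{query} within {region}" if region else query

def sgml_to_html(sgml):
    """Swap known start and end tags for HTML; drop tags not in TAG_MAP."""
    def swap(match):
        closing, name = match.group(1), match.group(2).lower()
        if name not in TAG_MAP:
            return ""
        return TAG_MAP[name][1 if closing else 0]
    return re.sub(r"<(/?)([A-Za-z][\w.]*)[^>]*>", swap, sgml)

def handle_request(form, search):
    """search is any callable that takes a query string and returns SGML fragments."""
    hits = search(build_query(form))
    body = "\n<hr>\n".join(sgml_to_html(hit) for hit in hits)
    title = html.escape(form["term"])
    return (f"<html><head><title>Results for {title}</title></head>"
            f"<body>\n{body}\n</body></html>")
```

Passing the search function in as a parameter keeps the sketch independent of any particular engine. Because only the SGML is stored, a change to the tag mapping changes the HTML of every text the next time it is requested.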

Preparing the American Verse and CME texts for Internet delivery is relatively straightforward; the HTI staff knows the DTD and the way in which the volumes have been encoded, the SGML is already validated, and the material and its potential uses are already familiar. Writing search interfaces and middleware for these collections is simplified by the knowledge that has been gained by working the texts through the entire creation process.

However, in addition to serving texts that have been created in-house, the HTI also delivers SGML created by outside sources, such as publishers (Grolier, InSo, McGraw-Hill, Oxford University Press) or electronic text vendors (Chadwyck-Healey, Intelex). In these cases, the processes involved in creating search interfaces and middleware for these texts are still much the same as for American Verse or CME, though the staff must first familiarize themselves with the contents of the work, its DTD, how the DTD was applied, and the potential uses of the text. For example, the HTI worked with the DLPS to offer the University of Michigan community access to Physician's GenRx, a reference work of pharmaceutical information published by Mosby. The work itself, the field it covers, and the audience it serves were unfamiliar to the HTI staff; before development work could begin they had to acquaint themselves with the printed version of the text and consult with librarians in the health sciences to get a better understanding of the contents and the potential uses of Physician's GenRx. Then the DTD and the text could be assessed for encoding structure, likely fields for restricted searches, and reasonable portions for retrieval -- is an average article so large it will crash a user's browser if delivered whole? Will it be possible to allow the user to browse the brand names of drugs? Are there cross-references to other articles that need to be linked? Questions like these need to be answered for every new collection of SGML the HTI receives from an outside source.

The staff tries to create interfaces for all the collections that will make the most of the users' familiarity with their browser and general principles of online searching, regardless of content. While certain restrictions and features will not be available in every collection due to differences in content and encoding, the basic look and operation of the search forms is consistent to the user. When dealing with texts created by others, the reviewing and testing phase is often longer and involves more input from other librarians, faculty, and interested users before the text is made available and announced to the Michigan user community.

SGML Services to Others

This model for creating and delivering SGML has been useful and flexible. Collections of SGML, generated in-house or elsewhere, can be provided through the World Wide Web, a readily-accessible medium for most computer users. The HTI is in a position to leverage its experience to assist others beyond the University of Michigan library to achieve similar ends.

The Collaboratory for the Humanities is a sub-unit of the HTI. Funded by a 1995 University of Michigan Presidential Initiatives grant, the Collaboratory allows the HTI staff to work extensively with three faculty fellows (one each from English, History, and Classical Studies) on the creation of new electronic editions in SGML. The Collaboratory also provides technical assistance and support for other Michigan faculty members working on text transcription and creation projects. Beyond the Collaboratory, HTI staff have assisted faculty at other institutions with DTD creation and adaptation and encoding assistance as they have worked on their own projects involving various levels of SGML and text-processing sophistication.

In order to provide online support for SGML-encoded text and reference collections to other institutions, the SGML Server Program (SSP) was formed. Operated by the HTI and DLPS staff, the SSP provides participants with the middleware to work with the OpenText search engine and assistance with processing and indexing the SGML collections covered by the program. Institutions can choose to access collections on a server located at the University of Michigan, or to have the texts installed on their own servers. Ongoing development, training, documentation, and online support are also offered to SSP participants.

The HTI also provides SGML services to external organizations; The Medieval Review (TMR) and the Human Relations Area Files (HRAF) are two examples. The Medieval Review (formerly the Bryn Mawr Medieval Review) is an electronic journal of scholarly book reviews freely available to the Internet community; the Human Relations Area Files collection is a large corpus of anthropological texts, distributed to members in annual installments since the 1940s and available to organizations paying annual dues. Each has its own special needs.

TMR consists of articles from reviewers, submitted by the editors of the journal. As documents, they are fairly simple: bibliographic information about the work reviewed, information about the reviewer, and the review itself, with varying kinds of other bibliographic references. Previously these had been available as plain text files via a listserv and through a gopher server. With the advice and assistance of HTI staff, the backlog of articles was converted to SGML using a fairly straightforward version of the TEILite DTD. New articles are encoded directly in SGML by the TMR staff using the same DTD. The database and middleware are rather similar to those for American Verse and the CME, with a few significant variations. The formatting and display follow the TMR editors' preferences, and as TMR is not located at the University of Michigan, the editors submit the articles to the database via an email robot that receives and validates their articles. Like other HTI collections, TMR is browsable and searchable and is available via the web as both SGML and HTML.

HRAF is somewhat different. Some years ago, the staff of HRAF decided to begin to provide what had long been a paper, then microfiche, resource in an electronic form. SGML seemed to suit their needs; their data consists of numerous indexed texts and images for each installment of the HRAF. After extensive consultation with HRAF's DTD expert and testing and revision of the user interface, the web eHRAF will go online this summer for member institutions. The interface and display have been discussed and agreed upon so that searches and formatting suit HRAF's understanding of how the works are used. The eHRAF process continues each quarter with a review of the interface and usage by patrons, and an annual installment of new materials.

Future Directions

Despite the HTI's many years of experience (or perhaps because of them), there are still many avenues to pursue in the realm of creation and delivery of SGML. Among the many ways to improve the functionality of the online collections, the HTI will be exploring those described below.

Cross-Collection Searching

Many users desire the ability to search across several different, though related, collections at once; for instance, the various collections of English verse, drama, and prose at the HTI. Users would like to view matches of a given search in those different databases without having to visit the individual search pages for each collection. Comparing results from different databases and viewing works that are related (but happen to be located in different databases because of their different literary genre or country of origin) are just two of the possible benefits. This is also a potential boon for inter-institutional searching; a single middleware program could visit a number of similar databases living on different machines at different sites. Databases of poetry encoded using the TEILite DTD, but residing on servers at different institutions, for example, might be able to be searched by one interface that is aware of all the available collections. The HTI staff intends to pursue the technical and interface issues behind these problems.
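One way to picture the single middleware program imagined above is a fan-out: the same query is sent to every collection, and the answers are gathered under the collection's name. The sketch below is an assumption-laden illustration, not a design the HTI has adopted. Each collection is represented by a hypothetical callable, which in practice might wrap a local index or a request to another institution's server.

```python
from concurrent.futures import ThreadPoolExecutor

def search_all(query, collections, timeout=10):
    """Send one query to every collection and gather the answers.

    collections maps a collection name to a callable(query) -> list of hits.
    Returns (results, failures), both keyed by collection name.
    """
    results, failures = {}, {}
    workers = max(1, len(collections))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in collections.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result(timeout=timeout)
            except Exception as exc:
                # One unreachable site should not sink the whole search.
                failures[name] = str(exc)
    return results, failures
```

Keeping results grouped by collection leaves the interface free to decide how to present them, whether as separate lists per database or as one merged list. That is one of the interface questions the approach raises.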

Multiple Representations of Data

Once data has been retrieved from an SGML database, there is not really any one "proper" way to format or display it to the user. A format for display must be chosen at some time; currently, the HTI staff decides on one display format in HTML that is used for a collection. If information about a user could be collected (by having the user set preferences that are passed to the middleware, or by determining that a user comes from an institution that has set its own group preferences), one of many possible formats could be chosen. The formats could range from simple HTML to more complex displays reflecting more structure and incorporating related information. This would allow public service librarians and user interest groups to define and request special treatments of the data for their institutions' particular needs, or allow individual users to tailor their results to suit their interests.
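The preference scheme suggested above could work roughly as in the following sketch, in which an individual user's choice overrides an institution's, and the collection default applies when neither is set. The format names, the preference key "display", and the shape of a hit are all hypothetical. The hit's fields are assumed to be already safe to place in HTML.

```python
# Hypothetical display formats; each turns one retrieved hit into HTML.
FORMATS = {
    "simple": lambda hit: f"<p>{hit['text']}</p>",
    "structured": lambda hit: f"<h3>{hit['title']}</h3>\n<p>{hit['text']}</p>",
}

def choose_format(user_prefs, institution_prefs, default="simple"):
    """Pick a format name: user preference first, then institution, then default."""
    for prefs in (user_prefs, institution_prefs):
        name = (prefs or {}).get("display")
        if name in FORMATS:      # unknown or missing names fall through
            return name
    return default

def render(hit, user_prefs=None, institution_prefs=None):
    """Format one hit using whichever display format applies."""
    return FORMATS[choose_format(user_prefs, institution_prefs)](hit)
```

For example, a hit requested from an institution that has asked for the "structured" display would be rendered with its title as a heading, while a user at the same institution who has set "simple" would see only the text.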

More Focus on Humanities

As the Digital Library Production Service at Michigan has grown, the HTI's position within it and its relationship to other units in the DLPS have been refined. In the future, there will be greater opportunities for the HTI to focus more completely on work with texts in the humanities. Support for reference works such as encyclopedias and dictionaries, collaboration with social sciences groups like HRAF, and creation of strictly image-based databases will decreasingly be the work of the HTI. This will allow more collaboration on faculty-initiated text-based projects, such as the transcription and translation of manuscripts and creation of other resources in the humanities. One such project is the Middle English Compendium (MEC), which has just received funding from the National Endowment for the Humanities. The development of the MEC is a joint effort on the part of the DLPS, the University of Michigan Press, the English Department, and the Office of the Vice President for Research, and will involve the creation of an electronic version of the Middle English Dictionary (MED), a HyperBibliography (or electronic bibliography) of the MED, and an associated network of computer-based medieval resources, including a large collection of Middle English texts. Work on the MEC will bring together the HTI's experience in converting print resources to electronic form, collaborating with publishers, creating new resources in SGML, and delivering them via the web as an integrated package. A demonstration website for the MEC is available.

Copyright © 1997 Christina Kelleher Powell, Nigel Kerr