The Perseus Project and Beyond: How Building a Digital Library Challenges the Humanities and Technology

D-Lib Magazine
January 1998

ISSN 1082-9873

Gregory Crane
Editor-in-Chief
Associate Professor of Classics
Tufts University
Medford, Massachusetts
gcrane@tufts.edu

For more than ten years, the Perseus Project has been developing a digital library in the humanities. Initial work concentrated exclusively on ancient Greek culture, using this domain as a case study for a compact, densely hypertextual library on a single, but interdisciplinary, subject. Since it has achieved its initial goals with the Greek materials, however, Perseus is using the existing library to study the new possibilities (and limitations) of the electronic medium and to serve as the foundation for work in new cultural domains: Perseus has begun coverage of Roman and now Renaissance materials, with plans for expansion into other areas of the humanities as well. Our goal is not only to help traditional scholars conduct their research more effectively but, more importantly, to help humanists use the technology to redefine the relationship between their work and the broader intellectual community.
Introduction
The Perseus Project is a digital library that has been under continuous development since the spring of 1987.[1] Our initial goal was to assemble a critical mass of heterogeneous materials focused on a single, reasonably discrete subject -- the classical Greek world. From the beginning, however, our long-term goals extended beyond Greek and even beyond the Greco-Roman world to the general problems of humanistic study and research. We see in the digital environment a possibility to realize more fully the vision which animated the creation of public libraries in the print world: our long-term goal must be to make accessible, both physically and intellectually, to every human being on this planet the complete record of humanity. But, of course, as John Maynard Keynes pointed out, in the very long run we are all dead. We all focus on smaller projects and on tasks that we can actually accomplish, but it is necessary to keep the long-term goal in mind -- in our day-to-day work -- to have any hope of realizing that larger vision. Keeping the future in mind in this manner affects our work at the Perseus Project in two fundamental ways. First, we design every piece of our digital library to work as smoothly as possible with every other piece and with any other electronic collections. Second, we try, insofar as this is possible, to avoid drawing lines between general and specialist, research and pedagogical, disciplinary and interdisciplinary. Even if we cannot be all things to all people, we in the humanities can certainly be more things to more people than we have been in recent times.
Even now, as our modest digital library on ancient Greek culture finds its way into homes, schools and offices where traditional scholarly publications have not reached, we can see by the patterns of use and the mail that we receive the stirrings of a vast audience, hungry for ideas and for that practice of thought to which we, professional academics, have been privileged to dedicate our lives. Ten-year-olds read about the ancient Olympics; military officers at foreign posts read Thucydides; bankers examine Greek vases during lunchtime pauses in their work, and adult learners in the kitchens of rural homes look up words in our electronic Greek lexicon as they work their way through Plato. Our experience is not unique: our colleagues with World Wide Web (WWW) sites on Gender in Antiquity, Galileo and other topics report evidence of a similarly widening audience and with it a quickening of society's intellectual life. Elementary and secondary education do not figure in the mission statement for the Library of Congress, but its American Memory Project has drawn such broad interest from the general population and from K-12 education that the Library of Congress has found itself challenged to redefine its role in society.
Perhaps a million people will not soon learn to read Sophocles in classical Greek, but we know that the suite of Greek and Latin texts, linguistic analysis tools, and lexica now available on the Perseus Web site is already allowing a wider range of people to read ancient languages than would otherwise be the case. As we further develop the tools and resources, and as we develop a far-flung community of people interested in this subject, more people will pursue the arduous and rewarding task of studying Greek and Latin. Greek and Latin are, of course, important only as examples of an intellectual pursuit that we can make much more widely accessible. If people do not choose classical languages, they may pursue Shakespeare, African Art, Chinese history or some other topic. Some who may not have the time or, initially, the inclination to take advantage of such resources will surely scorn such pursuits, but others will be deeply moved that their friends, loved ones and children can now exploit these new resources. The problem with "high culture" in a democratic society is not the concern that few can participate fully -- after all, the percentage of Americans who could hold their own on an NBA court or NFL field is virtually zero. Some, however, fear that the institutions of "high culture" hold those outside in contempt and that "high culture" is simply a subterfuge whereby a self-interested elite excludes others. The National Endowment for the Humanities (NEH), for example, costs the individual taxpayer virtually nothing. It drew fire and suffered terrible cutbacks not because it represented a significant drain on the budget or even simply because humanists were critical of American society as a whole (who isn't?), but because many American citizens felt that the NEH was supporting humanists who had no interest in and felt nothing but disdain for them.
We need to rethink the fundamental relationship between the subjects that we study and the rest of society. We need to do this for political reasons, so that society does not cut us off entirely, although political pressures have arisen only to the extent that we have failed in our larger mission. Insofar as we are humanists, we are not developing knowledge of immediate use, such as new medical procedures, managerial techniques -- or even methods for structuring digital libraries. Insofar as we are humanists, we are not even pursuing Truth for its own sake (like basic scientists): we may, like Plato, believe in the beauty of transcendent truth, but for the humanist as humanist, transcendent truth is a means to an end. Insofar as we are humanists, our work only assumes meaning when we touch and improve the lives of human beings. We may associate this improvement with some religious program (as in Islamic and Christian thought), with political enlightenment (as in traditional Marxism and much contemporary criticism) or with some secular intellectualist model (as with Aristotle). We may castigate one another, each claiming that the other is really using the notion of improvement to advance a self-interested and deceitful program -- and so we should, since human beings, left to their own devices, can and will subvert any system of ideas -- but we are all trying to have some sort of tangible, constructive impact on others and, indeed, ideally on those very much "Other" in outlook than ourselves.
To a large extent, we have allowed the constraints of print to define our sense of ourselves and our aspirations. If a traditional academic publication finds its way only into research libraries or, if we are lucky, becomes "adopted" in formal academic courses, then we have every reason to write primarily for our colleagues and those invested in our fields. Why write for the wider audience when the big chains like Waldenbooks will almost never make our publications available? Worse, there are even strong currents of resentment in academia against those who do reach a wider audience: a friend reported to me that, during a faculty meeting, a colleague leaned over to whisper condolences on the fact that the History Book Club had made his recent book a selection of the month, thus supposedly reducing its intellectual credibility. Such attitudes are unacceptable. They reflect a leveling coercion, whereby the many seek to restrain and exclude the few. Of course, we should criticize our colleagues if they reduce their ideas to "edutainment" and nothing more. We need to challenge our audience -- we want people to exercise their brains. The path can be steep but it must be passable and it must begin where the greatest possible number can find it.
It is easy to visualize at least one model for a new kind of library. In the fall of 1997, A&E produced "Hail Caesar" week with shows on five Roman leaders for its signature "Biography" series. Each of these videos was well written, informative, and entertaining, but each canned presentation was, in some sense, a dead end, for the video format did not provide its audience with access to further resources. Viewers should be able to pause the show, click on each visual included to learn more about what they were seeing, query each and every statement in the voice-over to get the sources for these sweeping statements, and call up different points of view. There should be a seamless set of links connecting the finished video presentation, a range of complementary or competing interpretative materials, and a library that contains every text, every inscription, every map and every piece of primary evidence upon which our knowledge of Julius Caesar, for example, is based. For modern figures, such an archive may be impractical, but it is much more within reach for earlier characters such as Caesar.
It is as easy for the scholars to criticize video productions as slick and superficial as it is for the producers to castigate historians and literary critics for burying their ideas in turgid prose that only a PhD could endure. Each criticism is equally valid, because professors and producers have each adapted themselves to the limitations of broadcast and academic media. Professors and producers must both learn and re-adapt if they are to take full advantage of the new media. Producers need to design spectacular documentaries that serve as entry points to further thought: the 51-minute broadcast video on Julius Caesar may rest on top of several hours of additional video and thousands of links available to the interactive user. Professors need to reexamine the ways in which they write and organize their ideas. Technical terms and jargon are sometimes necessary to facilitate conversation, but we often, consciously or not, use "in-terms" instead of conventional English to demonstrate the facility with which we "talk the talk." The clever use of jargon may ingratiate us with peer reviewers and facilitate publication, but it can virtually eliminate our potential audience. It is easier in the short run to write for one's colleagues, but if we only, or even primarily, write for each other, we may find ourselves with no research fellowships, no library budgets, no students and no field.
Work at the Perseus Project
At present, our funded work is advancing along the following complementary paths.
Software Development
While several of us spend at least part of our time programming in a variety of languages, Perseus has never had more than one programmer on its staff and, until 1994, had never had any staff member with formal training in computer science. We do as little software development as possible -- we have since the earliest stages of our work seen ourselves as providers of content. Software development is an extraordinarily time-consuming and expensive process, and virtually all user software is, by the standards of scholarship, ephemeral, needing constant updates and refinement. Our programming tasks break down into the following categories:
1) Data Structuring and Conversion: This is by far the most important job that we do, since the underlying organization of any information constrains what people can and cannot do with it. The relational databases, TEI-conformant SGML texts, images, and other well-defined products will outlast any given delivery system. Much of our work has gone into structuring data -- whether that data has been created for this project or derives from a preexisting source (e.g., print text, archaeological plan or drawing, existing slides). This work can range from standard database programming to elaborate analysis of preexisting data. The latter work can be challenging but immensely productive: we have been able to infer enough of the underlying (and thus unmarked) structure from complex reference works (e.g., a 40-megabyte Greek-English Lexicon with more than 500,000 source citations) so that the electronic version becomes, in effect, a fundamentally more useful work than its print counterpart.
2) Core Domain-Specific Software Tools: We do not have programming expertise to waste and we cannot invest time writing custom software that provides only marginal benefits over what is commercially available. Before Perseus began, I worked on a multi-lingual text editor. When the Macintosh appeared, I realized that, although the Mac had its own limited operating system and dealt with fonts (rather than languages), and although our multilingual editor was, in theory, superior to MacWrite, the advantages of our approach did not justify the labor required to create a whole new editor. We have steered clear of solving generic problems ever since, realizing that by the time we delivered finished products, others would (and should) probably have released their own solutions. This strategy, although incurring many short-term disadvantages, has been extraordinarily helpful to us over the years. While we are many things to many people, we are not a software production shop, and this clarity of role has allowed us to avoid many tasks that others are more inclined to perform.
The one elaborate piece of software that we have developed and maintained for more than ten years illustrates the kind of programming that members of a discipline need to do for themselves. We developed a rule-based system to analyze the morphology of inflected Greek words -- no small task, since a Greek word can appear in millions of different forms and since even literary Greek contains a number of dialectal variations. While a good deal of work in linguistics has focused on morphology in recent years, none of the computational models that we surveyed a decade ago were suited to mine from Greek words as much morphologically encoded information as we wanted. The system that we developed required expert knowledge not only of software design but also of scholarship on classical Greek. To determine what the analyzer should do, we needed to have a strong vision in our own minds of what our users -- from introductory students of classical Greek to professional classicists -- would want to do. The system that we produced has been a strategic resource for our work, allowing us to generate links of various types between words that have no obvious relationship (e.g., oisô and ênenkon are both forms of the verb ferô, "to carry"). By limiting the programming tasks that we undertook, we were able to concentrate our limited resources on this and similar, if less elaborate, tasks which no commercial firm would soon provide.
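To make the phrase "rule-based" concrete, the following is a minimal sketch only, not the Perseus analyzer. It assumes transliterated input, a two-entry stem table and a six-entry ending table invented for the illustration, and the hypothetical function name analyze. Real Greek morphology (accents, augment, contraction, dialect forms, suppletive aorists such as ênenkon) needs far richer rules than this.

```python
# Toy rule-based analyzer for transliterated Greek verb forms.
# The stem and ending tables are tiny illustrative assumptions,
# not the rules used by the Perseus morphological analyzer.

# stem -> (dictionary entry, tense/mood/voice label)
STEMS = {
    "fer": ("ferô", "present indicative active"),
    "ois": ("ferô", "future indicative active"),
}

# ending -> person/number label (thematic primary active endings)
ENDINGS = {
    "ô": "1st person singular",
    "eis": "2nd person singular",
    "ei": "3rd person singular",
    "omen": "1st person plural",
    "ete": "2nd person plural",
    "ousi": "3rd person plural",
}


def analyze(form):
    """Return every (lemma, description) pair that the rules allow."""
    analyses = []
    for stem, (lemma, tense) in STEMS.items():
        if not form.startswith(stem):
            continue
        ending = form[len(stem):]
        if ending in ENDINGS:
            analyses.append((lemma, f"{ENDINGS[ending]} {tense}"))
    return analyses


print(analyze("oisô"))
print(analyze("oisete"))
```

Because both stems point to the same dictionary entry, forms as unlike as fereis and oisô resolve to ferô, which is the property that lets a library link inflected forms to a lexicon.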
3) Delivery Systems: Data has little value if its audience lacks the software to use it. In an ideal world, we would use an existing package and do no programming at all. In practice, we could probably achieve this goal if we were willing to expect our audience to invest several thousand dollars in several software packages -- but this is unacceptable if our goal is to expand the audience for classics in particular and the humanities in general. We therefore decided early on to adapt existing software so that we could provide a basic delivery environment that would run on low-end machines and that would not drive up the cost to our audience.
In our earliest planning (1985-1987), we had assumed that we would work with Unix workstations. We were ourselves comfortable with Unix but we used Unix only because it provided the only viable environment for our work. Few humanists or students would have access to these machines, but we felt that we needed to start, at least, with this environment until either Unix became more popular in our target audience or some better delivery system appeared.
The appearance of Hypercard and of the Mac II in 1987 constituted a huge breakthrough for our work. Macs, because of their ability to handle languages, became very popular among our colleagues. Hypercard, for all its ultimately unrealized potential, provided us with a free delivery environment that we could enhance with XCMDs that we wrote ourselves. Yale University Press published for us Hypercard-based CD-ROM versions of the Greek digital library in 1992 and 1996.
Although we used Hypercard as a delivery vehicle, we did none of our real work in or for Hypercard. Creating a new version of Perseus required porting the data from a variety of more specialized formats (e.g., SGML texts, relational databases, the GRASS Geographic Information System, Postscript files etc.) into the Hypercard environment. Because we always saw the data as our primary focus, we insisted on maintaining a strict separation of data from delivery environment.
The emphasis on data structure (rather than software development) and the separation of data from delivery system required a great deal of work in the early stages of development, but it paid off handsomely later. When the World Wide Web emerged, we were able to create a Web version of Perseus, providing access to major portions of our library within a week, and creating a reasonably full version within less than a year, even while we were still working on Perseus 2.0 -- despite the fact that we had not planned (i.e., budgeted) for this work. Hypercard and Web Perseus each has distinct strengths and weaknesses which reflect the characteristics of Hypercard and the Web (and can provide the basis for an interesting analysis), but each provides most of the core functionality that we require.
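The separation of data from delivery system can be sketched in a few lines. This is an illustration under stated assumptions rather than a description of Perseus internals: the record, its field names, the identifier "vase-001" and both renderer functions are hypothetical. The point is only that a new delivery format means a new renderer, while the data stays untouched.

```python
# One delivery-neutral record, two independent renderers.
# The record contents and field names are illustrative assumptions.
from html import escape

record = {
    "id": "vase-001",  # hypothetical identifier
    "title": "Red-figure amphora",
    "description": "Athletes & trainers",
}


def render_html(rec):
    """Render the record for a Web delivery system."""
    ident = escape(rec["id"])
    title = escape(rec["title"])
    text = escape(rec["description"])
    return f'<div id="{ident}"><h1>{title}</h1><p>{text}</p></div>'


def render_plain(rec):
    """Render the same record for a text-only delivery system."""
    return rec["title"] + "\n" + rec["description"]


print(render_html(record))
print(render_plain(record))
```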
A new software envelope for the Perseus digital library is being developed in Tcl/Tk, a platform independent language developed by John Ousterhout and maintained now by Sun Microsystems. This will allow us to create a version of the Perseus digital library that not only runs under Windows, Unix, and the Macintosh OS, but that is also "Web-aware": users will be able to retrieve information either from their desktop, a local intranet or the central Perseus WWW site. Those working intensively with one part of the library (e.g., the texts and lexica or the 10,000 images of Greek vases) will be able to store that segment on their local disk for consistently rapid performance, calling on networked servers for less commonly used resources.
Creating Better Tools for Disciplinary and Interdisciplinary Work
The grand hopes articulated at the start of this article arise in part from the experience that the needs of specialized and general users are often complementary. In computer design as a whole, the rise in popularity of graphical user interfaces -- once scorned by many -- illustrates that the same tools can serve novice and expert, but we at Perseus have experienced similar effects within our own discipline. We now aggressively explore those ways in which we can create cross-over resources that serve more than one audience simultaneously better than the more rigid tools that preceded them.
One example struck us early in our work. When we commissioned detailed new color photography of museum objects, we were ostensibly serving the needs of specialists in art history -- who else, we were asked, could possibly need one hundred views of a single complex Greek vase, painted with many complex figures and scenes? Some of those who worked primarily with texts initially objected to the investment that we made in our image archive. When the work was done, some of the same critics objected to the inclusion of art objects for which we had not done new photography and for which we did not have the same large number of pictures. But the detailed photography has proven invaluable both to students and to experts. Conversely, some of the art historians grumbled that we included complete texts, rather than a selection, and questioned the need for Greek original as well as English. Some of these subsequently found the combination of Greek, English and dictionary lookup tools extremely useful as they found themselves better able to work with a wide range of Greek texts than before. Most importantly, the detailed photography and the elaborate text system both have proven invaluable to students and general readers. Meeting the demands of the specialists allowed us to develop a robust and solid foundation that ultimately served a far wider audience.
In recent years, we have begun work in the history of science, one of the most challenging and fascinating areas of research in the humanities. This extraordinarily demanding field requires not only a full range of linguistic and cultural expertise but also a thorough understanding of the mathematical and/or scientific principles involved. The student of Euclid or Archimedes should thus, ideally, be both a mathematician and a classicist. Even if one is able to bridge this divide, it is virtually impossible for the historian of mathematics (or astronomy or biology or logic or any other discipline) to master the entire range of cultural areas that longitudinal studies would require. One example of this work is an edition of Euclid that we have published on the WWW that includes both an elaborate commentary and preexisting images and links to Java scripts (see for example one proof from book five) with dynamic geometric diagrams from David Joyce's Web-based edition of Euclid. Further work and collaborations with historians of science are being planned.
Since classical background and a knowledge of Latin underlie much of European science, art, literature and culture through at least the eighteenth century, our digital libraries for Greece and Rome provide the logical platform on which to build general resources for those who wish to juxtapose a Renaissance or early modern work with classical sources. We have begun work on an edition of Shakespeare's Julius Caesar that links the play not only to its immediate source (North's translation of Plutarch) but also to classical sources on Caesar (e.g., his own accounts of the Gallic Wars and Civil Wars, Cicero's letters describing Caesar's rule and the aftermath of his assassination).
Translating and Transmuting the Form of Print Resources
It is important to do more than simply replicate the printed page in electronic form -- especially since the printed page has evolved to maximize the strengths and minimize the weaknesses of the codex form. On the other hand, an electronic document must match, insofar as this is possible, the functionality of print. Many electronic publications, for example, are not true exemplars of the genre because they still assume that the reader has access to a fuller printed version. Following preexisting practice of other textual databases in classics, we therefore chose not to include variant readings with our texts: the added utility did not merit the substantial increase in costs; those who needed the variants would be precisely those users who would have access to print libraries. The result is that many of our texts remain attached by a thin electronic umbilical cord to their paper originals.
In one case, however, we paid close attention to the problem of variants, encoding all the information that came from our printed source. Our on-line edition of the works of Christopher Marlowe, still under development by its editor, Hilary Binda, not only records the variants from the more than twenty different editions of the play but makes these variants serve a function that is not feasible in print. A reader can reconstruct any of these editions dynamically and see highlighted those places where edition X differs from edition Y. Rigorous electronic representation of the information from the printed page converts ink descriptions into active procedural knowledge that a computer can manipulate. The books are thus not only transferred but transformed into something qualitatively new.
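One way to picture variants as "active procedural knowledge" is the sketch below. It is an assumption-laden illustration, not the encoding used in the Marlowe edition: the segment-list data model, the edition sigla "A" and "B", the variant spellings and all function names are invented for the example.

```python
# A text stored as segments. A plain string is shared by every edition;
# a dict maps each edition's siglum to that edition's reading.
# The sigla and the variant spellings are invented for illustration.
TEXT = [
    "Was this the face that ",
    {"A": "launched", "B": "launcht"},
    " a thousand ",
    {"A": "ships", "B": "shippes"},
    "?",
]


def reconstruct(text, edition):
    """Rebuild the running text of one edition from the segment list."""
    return "".join(
        seg if isinstance(seg, str) else seg[edition] for seg in text
    )


def highlight(text, x, y):
    """Rebuild edition x, bracketing each place where it differs from y."""
    parts = []
    for seg in text:
        if isinstance(seg, str):
            parts.append(seg)
        elif seg[x] != seg[y]:
            parts.append("[" + seg[x] + "]")
        else:
            parts.append(seg[x])
    return "".join(parts)


print(reconstruct(TEXT, "B"))
print(highlight(TEXT, "A", "B"))
```

With twenty editions instead of two, the same two functions still apply; only the dictionaries grow.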
In most instances, we enter the entire text of a print document, but it is not always clear how much tagging is worth the effort -- a determined worker can spend years using TEI-conformant tags to add structure to a given document. Determining the appropriate level of tagging is the major decision that any editor must make, and there are no simplistic answers. The editor in charge of a grant-funded, time-sensitive project to encode twenty million words has different constraints from the editor who, as a part of his or her personal research, sets out to spend years creating a new electronic edition of a two-hundred page prose work or a two-thousand line play. The proficient editor of the large-scale project must learn how to tag as much content automatically as possible and should have access to elaborate programming expertise. The editor of the small-scale project might well also benefit from programming skills but will be in a much better position to do a great deal of work by hand. And, of course, the first editor can lay the groundwork for the second: editors will use many of the texts that we have encoded into TEI-conformant SGML form for Perseus as the foundations for subsequent hand-crafted editions.
While we need to rethink the way in which we write if we are to reach the new audience that the Web opens up for us, we can already add quite a bit of functionality to a properly tagged TEI-conformant document when it is published in HTML form on the Web. Collaborating with Professor Leonard Muellner of Brandeis University, we created TEI-conformant SGML versions of well known books by Gregory Nagy. We were able to add two categories of links to these.
First, we added to these documents a certain level of "intelligence" particular to our discipline. The reader puzzling through any of the quoted Greek can click on individual inflected Greek forms (e.g., oisete) to determine their morphological form (e.g., 2nd person plural future indicative active) and then call up their proper dictionary entry (e.g., ferô, "to carry, bear"). Such links make the quoted Greek a good deal more accessible to non-specialists with some knowledge of Greek. (Of course, we could also use the tagging structure so that readers could hide or display the Greek quotes at will, tailoring the document to their immediate needs.) We are able to fashion links between inflected forms (e.g., oisete) and their dictionary entries (e.g., ferô) because we invested years in developing the rule-based system for the analysis of Greek morphology (already mentioned above). This illustrates one instance where experts in the given discipline need to provide elements of the technological infrastructure.[2]
A second layer of structure has nothing to do with classics per se but reflects the application of consistent reference schemes. We tagged all of the citations to primary materials in each book. Thus, if Nagy cites a particular line in Pindar's first Olympian Ode, the reader can call that text up in the Perseus digital library by clicking on the citation -- thus, in effect, automating a task that could in theory be performed by hand (assuming that a text of Pindar was at hand).
Much more importantly, however, the links are now bi-directional. Thus, the reader of an on-line Olympian 1 can now see all those places where Nagy's books (or any properly tagged documents) cite Olympian 1.
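Bi-directional links of this kind follow mechanically once citations are tagged consistently: the forward links can simply be inverted. The sketch below is a minimal illustration, assuming made-up document names, citation strings written in one canonical form, and the hypothetical function name invert; it does not describe the Perseus implementation.

```python
from collections import defaultdict

# Forward links: each document lists the passages it cites.
# Document names and citation strings are illustrative placeholders.
CITATIONS = {
    "nagy-book-1": ["Pindar, Olympian 1.1", "Homer, Iliad 1.1"],
    "nagy-book-2": ["Pindar, Olympian 1.1"],
}


def invert(citations):
    """Build a map from each passage to the sorted documents citing it."""
    index = defaultdict(set)
    for doc, passages in citations.items():
        for passage in passages:
            index[passage].add(doc)
    return {passage: sorted(docs) for passage, docs in index.items()}


BACKLINKS = invert(CITATIONS)
print(BACKLINKS["Pindar, Olympian 1.1"])
```

The inversion only works if every author's reference to a passage resolves to the same key, which is why the consistent reference scheme matters more than any particular software.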
Such automatic links will have a profound impact on what scholars do. The reader of the Iliad or Hamlet will soon find a bewildering array of links pointing into a given passage. Filtering software will help, but we will also need to change the way we write: scholars who want readers to go backwards from Hamlet to their publication will have to label their links in some way so that automatic filtering systems will be better able to rank the value of their publication for interested readers. And, of course, the human commentator -- whose job is to filter and summarize information -- will enter a new golden age, more focused upon sifting and structuring rather than simply tracking down the data. The more heavily studied a canonical work is, the more editors will compete with one another to produce the most authoritative such commentaries.
But, of course, most works are not so closely studied as the Iliad or Hamlet. In probably 80%-90% of the documents on-line, the number of links will not be a major problem. The scholar trying to work with materials somewhat off the beaten path -- where in fact much of the best scholarship is done and the foundations laid for new readings of the famous canonical works -- will probably be happy to get as many automatically generated links as possible and rely upon only the most rudimentary filters (e.g., "show me links into the opening of Plato's Republic from the past two years" or "display links from the following journals or from publications by the professors X, Y, and Z").
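The "rudimentary filters" quoted above amount to simple predicates over link metadata. The following sketch assumes each link carries a target passage, a source, a journal, an author and a year; the Link record, the sample data and the function filter_links are all hypothetical names introduced for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    target: str   # passage the link points into
    source: str   # citing publication
    journal: str
    author: str
    year: int


def filter_links(links, target, since=None, journals=None, authors=None):
    """Keep links into target that satisfy every filter that was given."""
    kept = []
    for link in links:
        if link.target != target:
            continue
        if since is not None and link.year < since:
            continue
        if journals is not None and link.journal not in journals:
            continue
        if authors is not None and link.author not in authors:
            continue
        kept.append(link)
    return kept


# Invented sample data; journals and authors are placeholders.
LINKS = [
    Link("Plato, Republic 327a", "article-1", "Journal A", "X", 1997),
    Link("Plato, Republic 327a", "article-2", "Journal B", "Y", 1990),
    Link("Homer, Iliad 1.1", "article-3", "Journal A", "Z", 1997),
]

recent = filter_links(LINKS, "Plato, Republic 327a", since=1996)
print([link.source for link in recent])
```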
Much of our current research now focuses upon the effect that such new interconnections between texts, commentaries, grammars, dictionaries, translations, and other reference tools have upon the experience of reading. The experience of textuality is different in this hypertextual environment -- especially when we can rapidly integrate thousands and thousands of links developed in print by earlier scholars -- but we are only beginning to understand the implications of this new functionality.
Training a New Generation of Humanists
Those of us with established careers and busy schedules are unlikely to redefine ourselves from the ground up or to carry the banner of revolutionary change. Platitudes about middle aged ossification aside, we generally sit on too many committees, yawn or yammer our way through too many faculty meetings, have too many outstanding commitments for publications, are asked to review too many articles, books, tenure/promotion cases, go to too many soccer games and violin lessons, to drop everything and really think about what VRML and the latest GIS system can do for our work. There are exceptions to this rule -- George Landow, for example, did not encounter hypertext until he was a full professor and has now emerged as a leading analyst of hypertext and literary theory -- but, by and large, we must look to the rising generation of humanists to transform our fields.
We have always viewed our work as threefold: aside from providing useful tools and research on how such tools can be built, we have used the work that we do to expose young humanists to the new technology. Some of our former collaborators came to us from college and moved on to graduate school; others came to us with doctorates virtually complete and moved to tenure track jobs or positions as curators; others have come to us mid-way through their graduate careers, eager to synthesize the possibilities of the new technology with their "traditional" disciplinary training (which, from my point of view, includes any print-based learning, whether classical philology or post-colonial theory). Others have effectively created new job profiles, integrating knowledge of the humanities with constellations of expertise (e.g., programming, digital imaging) -- it is not clear where they could go for a graduate training that would develop both their humanistic and technological sides at once. In each case, work on the Perseus Project has allowed these collaborators in some measure to transform themselves. Those who stay within the humanities constitute permanent human resources who will contribute for decades to come.
We therefore take on projects as much to hire and train new people as to do research or create new electronic resources. This has one major impact on the way that we structure our work: wherever possible, we do work "in-house," preferring to use our resources to train and support young humanists rather than supporting outside professional contractors. This strategy has been a tremendous success over the years. We have found that those who love the subject matter can master the technology much more effectively than technologists can understand our problems in the humanities. Not all of our collaborators focus on the same skills: some develop a love for the most demanding and rigorous software engineering; others may concentrate on the humanistic content more directly, but master a suite of demanding software applications; all, however, develop between a single pair of ears a cohesive vision of how the traditional goals of the humanities and the possibilities of the new technology interact and redefine each other.
In the 1998/1999 academic year, we expect to have three postdocs and two full-time graduate assistants participating in the project. We hope to expand the number of graduate students and postdocs working with us in the coming years, bringing together collaborators from throughout the humanities.
At the same time, we hope to develop the infrastructure to work with not only graduate students and faculty, but also our colleagues who work as librarians and who teach in secondary schools. If we are to develop a digital library that truly endures over time, we need to establish formal ties with the library community, since libraries are the logical long-term home for any digital library. If we are to develop libraries that reach beyond the university, we must work with and learn from those who teach outside of traditional higher education.
Expanding the Breadth and Depth of our Digital Library
On the one hand, we need to expand our coverage of Greek culture. Thanks to a grant from the Getty Grant Program, we are actively expanding our coverage of classical Greek sculpture. With support from the National Endowment for the Humanities, we are developing a new network of interlinked databases for the study of Greek language, including commentaries, grammars, lexica (both Greek and Latin), and other documents. A grant from the National Science Foundation has allowed us to begin work on Greek Science, adding the rest of Aristotle, Euclid's Elements and other authors to our digital library.
But even though we could have confined ourselves to Greek, we felt that it was important for us to expand into new areas as well. We never intended to develop a digital library for Greek alone. We always envisioned Greek as a starting point and as an initial case study. The best way to improve the model that we have for representing classical Greek culture is to test that model against other bodies of cultural materials and thus to generalize the problems that we face. After all, the overall goal is not to create a single library on one subject, but to help develop the protocols for a vast, cross-cultural virtual library spanning many cultures and periods.
At the same time, we have begun expanding in other directions as well. Our initial hypothesis was that the model we developed for our Greek library would, if properly designed, translate to other cultural domains. The obvious first step was, of course, Roman culture. A grant from the Teaching with Technology Program at NEH has allowed us to begin work on a "Roman Perseus," of which the first results are now available. In some ways, Roman Perseus fits neatly into the model developed for Greek Perseus -- Greek culture had tremendous influence on Rome and many of the same cultural categories shape both.
Nevertheless, this expansion allowed us to put some of our assumptions to the test: while much the same data structures work with Roman coins as with Greek, Latin and Greek are different languages. We had to adapt the morphological analysis system that we had developed for Greek to Latin. Latin morphology is a good deal simpler than that of Greek and we had hoped that Latin would, in practice, involve a subset of problems that we had already solved for Greek. In fact, this proved to be largely the case: the morphological analysis engine for Greek took at least six months of programming to develop; adapting this engine to Latin took less than forty hours. Thus, even the work that we invested in Greek language -- the most idiosyncratic and localized aspect of our digital library -- gave us a tremendous advantage with Latin. Of course, we enjoyed this benefit because Greek and Latin are both highly inflected languages with similarly organized morphologies -- moving to classical Chinese or Arabic would have proven a good deal more challenging.
There are three directions in which we can at this point expand our digital library beyond Greco-Roman culture.
1) Having developed a growing library on Greco-Roman culture and a set of tools for Greek and Latin, we now have an environment in which to publish later (e.g., Renaissance) materials that assume familiarity with Greco-Roman culture or are simply written in academic Latin (such as Newton's Principia Mathematica). We therefore decided to begin work on English Renaissance materials: the subject has wide appeal (thanks largely, but not exclusively, to the popularity of Shakespeare); the problems of early modern texts are challenging but similar to those of classical texts; the culture is different and the predominant language is English, but there are many Renaissance texts written in Latin and many direct links between Renaissance and classical (mainly Latin) literature.
In late 1996 we received initial support from the Tufts Provost Office and Arts & Sciences Research funds to begin developing our first English Renaissance Project: an electronic edition of the works of Christopher Marlowe. Since then we have begun work on an electronic edition of Shakespeare's Julius Caesar, with links to extensive commentaries (the Furness Variorum Julius Caesar and Kittredge's edition of the play), to the primary Renaissance source (North's Plutarch) and to classical materials about the historical Julius Caesar (e.g., Caesar's own writings, Cicero's letters on Caesar's rule and assassination).
The experience that we have gained in these initial projects has laid the foundation for additional work. Since we are comfortable with tagging large bodies of complex texts, we would like to create a badly needed database of Renaissance source materials. We also hope to foster the development of additional electronic editions for major literary texts, developing closer connections between our work and similar projects elsewhere.
2) We would like to see the development of a Perseus-style digital library for a culture separate from the Greco-Roman European cultural tradition. The representation of geographic, archaeological and art historical data poses many of the same issues, whatever the culture at hand. Identifying those issues where the problems of an archaic Greek archaeological site differed from, say, a Mayan site would be a valuable exercise, allowing us to understand better how to manage both Greek and Mayan.
Language, however, raises even more challenging issues, since Greek and Latin fundamentally differ from classical Chinese, Sumerian, classical Arabic, Mayan (insofar as we understand it) and even another Indo-European language such as Sanskrit. Nevertheless, our extended work with Greek and now with Latin has given us a much clearer idea of which elements in a digital library are language specific and which are not. UNICODE promises to standardize and thus ultimately to simplify the encoding of multiple languages at the character level, but it does not address higher level issues such as the morphology problem. We could now describe what was necessary in a Perseus style digital library for a non-European culture, complete with the standard Perseus links between inflected forms, morphological analyses and lexica. Even in languages such as Chinese, with little or no morphology, the links between words in a text (in this case, Chinese characters) and a lexicon would dramatically affect what readers could and could not do.
Contributing to a non-European cultural digital library would be an elegant and challenging task. The goal would be to divide generic programming and software infrastructure tasks, to identify heretofore unrecognized cultural dependencies built into data design and to refine our understanding of domain specific issues (such as the treatment of different languages).
3) We are very near the point where we would feel comfortable supporting the development of individual, initially somewhat isolated editions, walk-throughs, or databases in a variety of areas not directly contiguous to the work that we have done so far. The danger of creating docu-islands in a field without a systematically organized supporting digital library is that the earlier production may not connect readily with subsequent resources. The work that we have done so far has, however, given us concrete experience by which to evaluate the general issues involved in any humanities library.
Consider, for example, an electronic edition of Thucydides' History of the Peloponnesian War, published in a digital library with a thorough electronic Atlas of the ancient world. The reader should be able to generate a map of all those geographic locations mentioned in any arbitrary subset of the text (e.g., "generate a map displaying all the places mentioned in book 3 or in the first twenty chapters of book 3 of Thucydides' History"). For the text to support this function, however, the editor should disambiguate ambiguous toponyms -- i.e., indicate whether the "Salamis" in a given passage refers to the Salamis near Athens (where the battle of Salamis was fought) or the Salamis in far-off Cyprus. Likewise, anyone taking the trouble to create a new edition should surely disambiguate personal names (e.g., is the Cicero mentioned in a given passage the famous orator or his brother Quintus?). Given the amount of labor that a conscientious editor puts into a critical edition, such tagging, even if done by hand, should add relatively little to the total sum of work but can vastly enhance the ways in which the electronic edition will interact with other digital resources.
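The map request described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the markup is a simplified, TEI-like fragment with placeholder text (not the actual Perseus encoding or the text of Thucydides), and the `key` values, the `GAZETTEER` table and its rounded, approximate coordinates, and the function `places_in` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified TEI-like markup. Each <placeName> carries a key
# that the editor has added by hand to disambiguate the toponym.
SAMPLE = """
<text>
  <div type="book" n="3">
    <div type="chapter" n="1">... <placeName key="salamis-attica">Salamis</placeName> ...</div>
    <div type="chapter" n="2">... <placeName key="salamis-cyprus">Salamis</placeName> ...</div>
    <div type="chapter" n="25">... <placeName key="athens">Athens</placeName> ...</div>
  </div>
</text>
"""

# Hypothetical gazetteer: key -> (latitude, longitude), rounded approximations.
GAZETTEER = {
    "salamis-attica": (37.95, 23.50),
    "salamis-cyprus": (35.18, 33.90),
    "athens": (37.97, 23.72),
}

def places_in(xml_text, book, first_chapter, last_chapter):
    """Return the mappable places mentioned in a chapter range of one book."""
    root = ET.fromstring(xml_text)
    found = {}
    for book_div in root.iter("div"):
        if book_div.get("type") != "book" or book_div.get("n") != str(book):
            continue
        for chapter in book_div.findall("div[@type='chapter']"):
            if first_chapter <= int(chapter.get("n")) <= last_chapter:
                for place in chapter.iter("placeName"):
                    key = place.get("key")
                    if key in GAZETTEER:
                        found[key] = GAZETTEER[key]
    return found

# "All the places mentioned in the first twenty chapters of book 3."
points = places_in(SAMPLE, book=3, first_chapter=1, last_chapter=20)
for key, (lat, lon) in sorted(points.items()):
    print(f"{key}: {lat:.2f}N {lon:.2f}E")
```

Because the two mentions of "Salamis" carry different keys, they resolve to two different points; without the editor's disambiguation, a lookup on the bare string could not tell them apart.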
Whether or not an electronic atlas or biographical database or similar resources actually exist yet is relatively unimportant. A serious edition that takes three to five years to produce will have a "shelf-life" that extends over many years and even decades. Conscientious editors must think long and hard about designing their work in the most forward looking way. The question that any editor now faces is anticipating, insofar as possible, what the expectations of an audience will be ten or twenty years from now. Of course, technology is so fluid and the course of events is ultimately unpredictable, but unreflectively replicating the structure of print editions is obviously unacceptable.[3] An international team of collaborators spent years developing the Text Encoding Initiative guidelines, designed precisely to help those who wish to go beyond print, but these standards are both incomplete (they do not solve every problem) and too rich (an editor could spend an open-ended amount of time adding tags and structures that would probably not enhance the study of a particular text). The problem of new content is not so very different from that of transforming print documents into an electronic form: we need realistic trade-offs that will serve us for a reasonable length of time.
Some elements of a modern edition are clear: proper names should be tagged and disambiguated; variant readings should be labeled in such a way that a reader can generate and easily compare an arbitrary number of editions; information should, where possible, be structured so that readers of differing backgrounds can find what they need with a minimum of fuss. A great deal of linguistic tagging can, of course, take place, and in some scholarly editions, where editors will spend a fair amount of time pondering every single word, they should probably encode a great deal of linguistic data. The editor of a new Moby Dick, where the language is less problematic and the text is quite long, may find the benefits of such extensive tagging less compelling.
In general, electronic publication should not radically alter the amount of time that an editor spends. Rather, the editor should decide on some relationship between the amount of labor that the edition will absorb and the amount of time that will be spent tagging. Readers will develop realistic expectations (as they do now for editions of various kinds). The editor does not want to spend ten years producing a "three volume" commentary on a two thousand line play, but then fail to add tags that would radically enhance the value of the work that was done. We have different expectations for the three volume and one hundred page commentaries on a given work in print; we simply need to develop new expectations for their electronic equivalents.
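To make the point about variant readings concrete, here is a minimal Python sketch of how labeled variants let a reader generate and compare editions. The data structure, the witness sigla (A, B, C), the sample readings and the function names are all invented for illustration; a real edition would more plausibly use the critical apparatus markup of the TEI guidelines.

```python
# Each segment is either text shared by every witness or a dict that maps
# a witness siglum to its reading. Sigla and readings are invented.
SEGMENTS = [
    "the quick ",
    {"A": "brown", "B": "red"},
    " fox ",
    {"A": "jumps", "B": "jumps", "C": "leaps"},
]

def render(segments, witness, fallback="A"):
    """Build the text as one witness reads it, using the fallback
    witness wherever the chosen witness has no recorded reading."""
    parts = []
    for seg in segments:
        if isinstance(seg, str):
            parts.append(seg)
        else:
            parts.append(seg.get(witness, seg[fallback]))
    return "".join(parts)

def compare(segments, witnesses):
    """List the places where the chosen witnesses attest different readings."""
    differences = []
    for index, seg in enumerate(segments):
        if isinstance(seg, dict):
            readings = {w: seg[w] for w in witnesses if w in seg}
            if len(set(readings.values())) > 1:
                differences.append((index, readings))
    return differences

for siglum in ("A", "B", "C"):
    print(siglum, "->", render(SEGMENTS, siglum))
print(compare(SEGMENTS, ["A", "B"]))
```

The editor records each reading once; any number of "editions" (one per witness, or one per editorial choice) can then be generated on demand rather than typeset separately.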
One reason to develop the Perseus digital library on Greek culture was to allow us to see what an integrated digital library might look like. Now that we have created this environment, we can use it to help us prototype true electronic editions. Much of our work is therefore now turning from wholesale transfer of print into electronic form and towards instead the creation of new materials.
Exploring Publications Not Feasible in Print
Our work with electronic editions is, of course, only a special case of a more general problem: we need to develop exemplary new forms of publication that address the strengths and weaknesses of the electronic medium. For humanists, this is especially important: our resources are limited and we need to be conservative, developing reference tools and publications that can be used over a long period of time. We cannot assume that our research will be so out of date in five to ten years that our content can fall victim to technical obsolescence of a system or standard. We need to establish robust forms of publication as soon as possible so that we can be confident that our ideas will be useful over a period of decades.
Obviously, many disciplines will benefit enormously from the flexibility of electronic media. Archaeologists have never been able to publish the full range of materials that they have uncovered: the ability to publish detailed 3D models, VRML walk-throughs, databases of objects, detailed surveys, and simply thousands of color pictures (instead of hundreds or dozens of black-and-white illustrations) all open up major new possibilities. Sebastian Heath (once a programmer at the Perseus Project) has, for example, made progress in developing a model for publishing the results from the Pylos Regional Archaeological Project in Southern Greece, and many other efforts are in progress. Likewise, museums can provide virtual tours that allow users not only to explore their galleries (and thus enhance their experience in the museum itself) but also to view "virtual galleries" that showcase the many objects hidden in the back rooms or to view dynamically generated galleries that allow users to juxtapose the same objects for a variety of possible themes (e.g., objects relating to daily life, the iconography of Dionysus, athletics in Greek art etc.).
But even venerable document types that are well understood in print may require fundamental rethinking in the electronic environment. Consider, for example, the scholarly grammar. Print Greek grammars provide written descriptions of how a language works and they are an essential tool for any text-based scholarship. The student reads the grammar and converts the written specifications into an internalized format that the brain can then learn to apply. In an electronic environment, however, we can represent a substantial portion of the knowledge -- that specifying the complexities of Greek morphology -- in a format that an automated system can manipulate. If we represent the fact that -amen is the first person plural indicative active ending for first aorist verbs and we have a database of first aorist stems, then the system can recognize elusamen ("we freed"), epempsamen ("we sent") and similar regular formations. The grammatical model that classicists had developed for the workings of Greek morphology takes on a life of its own so that an electronic system can apply this knowledge without further human intervention. The results of this shift have already begun to make themselves felt in the study of Greek and Latin, and similar electronic knowledge will ultimately affect all disciplines that grapple with complex language systems. Thus, even the first generation linguistic tools that we have at our disposal make it clear to us that every authoritative grammar should be designed both for human readers and to drive as many rule-based systems as possible.
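The -amen example can be written out as a toy analyzer in Python. This is a sketch of the general idea only, not the Perseus morphological engine: the two tables are tiny hand-made samples in transliteration, the stems are stored with their augment already attached, and the function name `analyze` is hypothetical.

```python
# Endings of the first aorist indicative active, in transliteration.
ENDINGS = {
    "a": "1st singular",
    "as": "2nd singular",
    "e": "3rd singular",
    "amen": "1st plural",
    "ate": "2nd plural",
    "an": "3rd plural",
}

# A miniature "database" of augmented first aorist stems:
# stem -> (dictionary form, English gloss).
STEMS = {
    "elus": ("luo", "free"),
    "epemps": ("pempo", "send"),
}

def analyze(form):
    """Return every (lemma, gloss, tense/mood/voice, person/number)
    analysis that the two tables license for the given form."""
    analyses = []
    for ending, person_number in ENDINGS.items():
        if form.endswith(ending):
            stem = form[: -len(ending)]
            if stem in STEMS:
                lemma, gloss = STEMS[stem]
                analyses.append(
                    (lemma, gloss, "first aorist indicative active", person_number)
                )
    return analyses

for form in ("elusamen", "epempsamen", "elusa"):
    print(form, analyze(form))
```

Adding one stem to `STEMS` makes all six forms of that verb recognizable at once, which is the sense in which the grammatical model, once encoded, applies itself without further human intervention. A real system must also handle accents, contraction, irregular stems and ambiguous forms, none of which this sketch attempts.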
Language is, of course, only one element of publication. Electronic media obviously open up radically new modes of publication -- archaeologists can produce walk-throughs of sites, historians of science can publish simulations for the Ptolemaic vs. Copernican systems, art historians can produce multiple reconstructions of missing or damaged objects. For example, my colleague Neel Smith of Holy Cross College has converted Ptolemy's Geography into a database of toponyms and coordinates. He is now able to use a modern Geographic Information System (GIS) to visualize Ptolemy's model of geographic space, examining correlations between ethnicity and location or simply comparing the coordinates of places in Ptolemy's system with their real world locations.
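A comparison of this kind can be sketched in Python once the text has become a table of toponyms and coordinates. The records below are invented placeholders, not values from Ptolemy's Geography or from Neel Smith's database, and the function names are hypothetical. The sketch assumes only that Ptolemy counted longitude from a meridian well to the west of the modern one, so it first estimates a constant offset and then reports what is left over for each place.

```python
from statistics import mean

# Invented placeholder records, NOT data from Ptolemy's Geography:
# (name, Ptolemaic latitude, Ptolemaic longitude,
#  modern latitude, modern longitude), all in degrees.
RECORDS = [
    ("place A", 37.0, 52.0, 38.0, 24.0),
    ("place B", 36.0, 58.0, 36.5, 28.0),
    ("place C", 41.0, 36.0, 42.0, 12.0),
]

def longitude_offset(records):
    """Estimate the constant shift between the two longitude systems
    as the mean difference across all records."""
    return mean(p_lon - m_lon for _, _, p_lon, _, m_lon in records)

def residuals(records):
    """For each place, return (name, latitude error, longitude error
    remaining after the constant offset has been removed)."""
    offset = longitude_offset(records)
    rows = []
    for name, p_lat, p_lon, m_lat, m_lon in records:
        rows.append((name, p_lat - m_lat, (p_lon - offset) - m_lon))
    return rows

print(f"estimated longitude offset: {longitude_offset(RECORDS):.2f} degrees")
for name, lat_err, lon_err in residuals(RECORDS):
    print(f"{name}: latitude error {lat_err:+.2f}, longitude residual {lon_err:+.2f}")
```

Residuals of this sort are what a GIS would plot: where they cluster by region or by people, they point to the kind of correlation between ethnicity, location and error that the database makes visible.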
We are at present beginning a three-year project, funded by the Fund for the Improvement of Postsecondary Education (FIPSE), and co-directed by Ross Scaife of the University of Kentucky at Lexington. Our goal is to collaborate with our colleagues in classics and ultimately throughout the humanities to extend a variety of innovative projects already underway. Our larger purpose is to create new electronic publications that not only serve the traditional academic audience but also exploit the technology to bridge the gap between academic publications and the wider audience.
Institutionalizing Electronic Scholarship
It is not enough to create new materials or to participate in developing new kinds of publication. These new publications need to become institutionalized in such a way that they can be produced at high quality, distributed broadly and maintained over time. To this end, we are pursuing the following directions.
1) We are constructing an editorial process to foster the creation, and attest to the quality, of publications that take fundamental advantage of electronic media (e.g., databases, architectural models or reconstructions, simulations, electronic editions etc.). Our FIPSE support allows us to bring together a number of those currently developing serious Web based resources and to pay for a postdoc to help bring existing publications to a new level (e.g., converting Web sites to TEI conformant SGML documents that can generate HTML).
2) Tufts University has committed to developing a local Center for Humanities and Technology to integrate emerging technologies and humanistic research. This center will support lecture series, interdisciplinary courses for undergraduates and graduate students, faculty research initiatives, training for librarians and staff, and programs for high school teachers. While this center will be located at Tufts' Medford Campus five miles outside of Boston, it will build on the range of local consortia to which the university already belongs to reach out to students and faculty in neighboring institutions as well.
3) Perseus has a long history of close collaboration with museums in both Europe and North America, having provided more than 20,000 original slides of high quality to image archives at various institutions. We are currently working with the Department of Ancient Art at the Museum of Fine Arts to create extensive new coverage, photographic and documentary, of their holdings, with an emphasis on high quality digital imaging, QTVR walk-throughs of the galleries, and extensive links between objects and supporting data in our digital libraries. We plan to extend these collaborative projects to include subjects outside of Greco-Roman antiquity, both helping museums learn how to use the new technology and facilitating ever closer relationships between higher education and the museum world.
4) Publishers, especially North American University Presses, operate on thin margins and are cautious about the distribution of their materials. Many publishers are therefore uncomfortable with "free" Web access to their materials. We are developing a model whereby at least part of the materials available in our digital library will be available for a reasonable subscription so as to protect the interests of rights holders. Given the programming, photography, documentation, research and other tasks that go into digital library work and that, if they are to be done in a consistent, extensive and timely fashion, require professional support, a modest income stream would certainly help our efforts. On the other hand, any kind of charges must be weighed against our strategic mission as a whole. Any prices that we impose should be modest enough to present the smallest possible economic barrier. The optimal strategy seems to be a sliding scale in which major research libraries pay a substantially greater individual share while k-12 schools and public libraries subscribe for a token sum (that might, in aggregate, help us advance our work).
5) I stressed earlier the potential value of strategic relationships between broadcast media and academics. Each side would need to adapt to make such relationships effective -- certainly academics must rethink their work and the way they present themselves if they are to reach a wider audience. Building our initial digital library on Greek and then extending the infrastructure to accommodate other domains has absorbed most of our energy to date, but we are looking to begin developing key partnerships with colleagues in broadcast, tailoring or extending our existing materials to provide follow-up and background in depth for widely publicized materials. A logical project would be video programming with dense hyperlinks into one of our digital libraries: e.g., a biography of Cicero with views of his home town, links to his writings and to sources about him, etc.
6) Libraries occupy perhaps the most important niche in the long term ecology of information. Scholars produce materials. Publishers contribute editorial assistance, supervise production, publicize and (traditionally) distribute the finished product, but academic books often go out of print within a few years. We look to librarians for long term access to print materials and we need to look to them as well for the long term maintenance of our digital publications. Although libraries contain many CD ROMs, CD ROMs have been too idiosyncratic in structure and uncertain in their physical medium to become a firm pillar of library infrastructure. We need to produce documents in well defined, stable standards such as TEI-conformant SGML that librarians will be able to manage in large numbers over a long period of time. Scholars who need the documents that we produce now need to be able to find these documents in their "libraries" decades from now.

Conclusion
Ten years may be a long time in the brief history of digital libraries, but all work in digital libraries is still at a very early stage of development. For us at Perseus, we are only now beginning to have enough material to study the effects of a large, heterogeneous but coherent digital library. Nevertheless, we can already see benefits to our traditional audience (faculty and full time students) and to emerging new audiences for the humanities. Rapidly developing technology may challenge us to rethink what we can do on a daily basis, but new systems have begun to allow us to realize more perfectly many of the larger intellectual goals that we have pursued since writing emerged and scholarship began thousands of years ago. Our goal at Perseus, as we contemplate our second decade of work, is to contribute to the developments around us, both as a group and in collaboration with our colleagues throughout the world. With the rise of electronic media, we in the humanities enjoy exhilarating new opportunities not only to extend our own intellectual range but also to reach a wider and more diverse audience than has ever been feasible for us before.
Notes
[1]Perseus has received generous support from a wide range of sources including the Annenberg/CPB Projects, the National Endowment for the Humanities, the Fund for the Improvement of Postsecondary Education, the National Science Foundation, the Getty Grant Program, the National Endowment for the Arts, and the Mellon Foundation. Initially developed at Harvard University, Perseus followed its editor-in-chief, Gregory Crane, to Tufts University in 1993. Perseus is a fundamentally collaborative enterprise that has engaged the efforts of many humanists at a variety of institutions over the course of the past decade. Our major collaborators in work currently underway include Thomas Martin and Neel Smith of Holy Cross College, Ross Scaife of the University of Kentucky at Lexington, the Ancient Art Department of the Museum of Fine Arts, Boston, and Gary Marchionini of the University of Maryland at College Park.
[2]While a good deal of work in linguistics has focused on morphology in recent years, no system with which we were familiar ten years ago had an underlying model capable of analyzing Greek morphology with the precision that we required.
[3]Most work (including most of ours at Perseus) has focused on developing electronic formats at least comparable in functionality to those in print: see, for example, the MLA Guide for electronic editions.

Copyright © 1998 Gregory Crane

hdl:cnri.dlib/january98-crane
