D-Lib Featured Collection March 2006: The Perseus Digital Library

D-Lib Magazine
March 2006

Volume 12 Number 3

ISSN 1082-9873

The Perseus Digital Library

Contributed by
Gregory Crane
Tufts University

New content and services for 19th century American documents

The Perseus Digital Library is releasing a collection that introduces significant new content and brings new technologies and services to a mainstream digital library. All XML source texts of the collection are available under an open source license. All software is likewise designed as open source.

XML Tagging

All of the markup in the example shown in Figure 1 was automatically added. Buell and Corinth are correctly interpreted as George P. Buell and Corinth, MS. General John Pope, a dominant figure in this text, is never mentioned by his first name. The system thus mistakenly links Pope to "Curran Pope," who is named twice elsewhere in the document. The XML tagging records both the rule used ("most common") and the number of times that Curran Pope has been mentioned in the text (2).

Figure 1.
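
The "most common" rule described above can be illustrated with a minimal sketch. This is not the Perseus implementation: the function name, the input format (a flat list of full-name mentions) and the output fields are assumptions made for illustration only. The sketch resolves a bare surname to the full name with that surname that occurs most often in the same document, and records the rule and the supporting count, which is why a dominant figure who is never given a forename can be linked to the wrong person.

```python
from collections import Counter


def resolve_surname(surname, full_name_mentions):
    """Resolve a bare surname to the most frequent matching full name.

    `full_name_mentions` is a hypothetical list of full names already
    found in the document, one entry per mention.
    """
    wanted = surname.lower()
    counts = Counter(
        name for name in full_name_mentions
        if name.split() and name.split()[-1].lower() == wanted
    )
    if not counts:
        return None  # no full name with this surname appears in the document
    name, count = counts.most_common(1)[0]
    # Keep the rule and the evidence so a reader can judge the link.
    return {"surname": surname, "resolved": name,
            "rule": "most common", "count": count}


# John Pope is never named in full, so the rule falls back on Curran Pope.
mentions = ["Curran Pope", "George P. Buell", "Curran Pope"]
result = resolve_surname("Pope", mentions)
```

Recording the rule and the count alongside the link makes a weakly supported guess, such as one resting on only two mentions, visible to the reader.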

Sample Page View

Figure 2 shows a representative page from the 19th century collection as seen in the Perseus digital library interface. The right hand side contains a range of added services (designed to become Fedora disseminators) relevant to the text. In this case, we see a summary of people, places and dates on a particular page of text. The digital library system can chunk documents in various ways and could, in this case, provide a chapter at a time.

Figure 2.

Personal Name Searching

Searching for people with the last name Webster calls up the list shown in Figure 3. Since Daniel and Noah Webster are pre-eminent figures in their domains, Webster often appears in documents without a forename. We have chosen to defer automatically aligning possible name variants (e.g., Daniel Webster and D. Webster) since we were not satisfied with initial results. We expect to reactivate this feature in a more sophisticated form in the future.

Figure 3.
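
As a rough sketch of this kind of surname search (the function and the sample data are hypothetical, not the Perseus code), the following groups mentions by their exact written form and counts each form separately. In line with the decision described above, it deliberately does not merge variants such as Daniel Webster and D. Webster, and a bare Webster remains its own entry.

```python
from collections import Counter


def search_by_surname(surname, mentions):
    """List each distinct name form ending in `surname`, with its count.

    Forms are kept exactly as written; no alignment of variants is attempted.
    Results are ordered by descending count, then alphabetically.
    """
    wanted = surname.lower()
    counts = Counter(
        m for m in mentions
        if m.split() and m.split()[-1].lower() == wanted
    )
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


mentions = ["Daniel Webster", "Webster", "Noah Webster", "D. Webster",
            "Webster", "Daniel Webster", "Henry Clay"]
for form, count in search_by_surname("Webster", mentions):
    print(form, count)
```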

Frequency Searching

Almost all of the Richmonds in our collection denote Richmond, VA. The example shown in Figure 4 illustrates the results for Richmond, KY: the results include two figures. In the document with the most potential references to Richmond, KY, the system reports that there may be as many as 71 references but probably at least 16. The lower figure, which maximizes precision, reports phrases such as "Richmond, Kentucky", where the language makes it highly likely that we have the right Richmond. The higher figure includes all those Richmonds that could be Richmond, KY. Given the pervasive references to Richmond, VA, many of the 71 will be false positives.

Figure 4.
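
The two figures can be thought of as a lower and an upper bound. The sketch below is an illustration under simplifying assumptions, not the method Perseus uses: it treats the upper bound as every occurrence of the place name and the lower bound as only those occurrences followed by an explicit state qualifier. The function name and the list of qualifiers are hypothetical.

```python
import re


def reference_bounds(text, place="Richmond",
                     qualifiers=("Kentucky", "KY", "Ky.")):
    """Return (low, high) counts of possible references to a place.

    high: every occurrence of the bare place name (maximizes recall).
    low:  occurrences followed by an explicit qualifier, e.g.
          "Richmond, Kentucky" (maximizes precision).
    """
    name = re.escape(place)
    high = len(re.findall(r"\b" + name + r"\b", text))
    alternatives = "|".join(re.escape(q) for q in qualifiers)
    low_pattern = r"\b" + name + r",\s*(?:" + alternatives + r")(?!\w)"
    low = len(re.findall(low_pattern, text))
    return low, high


sample = ("The army left Richmond, Kentucky, at dawn. Reports from Richmond "
          "were confused. Richmond, Ky. was quiet; Richmond, Virginia, was not.")
low, high = reference_bounds(sample)  # low == 2, high == 4
```

In the sample, two of the four occurrences carry an explicit Kentucky qualifier, while the other two include one unqualified Richmond and one that is plainly Richmond, Virginia. This mirrors the gap between the figures of 16 and 71 reported above.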

Our named entity identification system, tuned for 19th century historical materials, has extracted from the 55 million word test collection 1.5 million personal names, 1 million place names, 600,000 dates, 500,000 organization names, and 12 million automatically generated annotations. While we have extracted dates and places in the Perseus Digital Library since 2000, the new system is optimized for the challenges posed by American English, which both contains semantic ambiguity (e.g., Washington can be a person or a place) and uses the same names over and over again (there were more than 150 places named Washington in mid-nineteenth-century America). This system lays the foundation for a general web service that tags named entities in documents and could process up to 1 billion words in a day. The current database of named entities from the initial test collection can be expanded to cover all pre-twentieth century American English publications. For the technical details and general performance of the system, see Crane and Jones.

A new release of the Perseus Digital Library System includes an initial interface whereby users can browse and search semantically:

Semantic searching: In the American collection, you can now search for Washington as a person or as a place. Likewise, you can search for a particular place. Thus, you can search only for [Richmond in KY], minimizing the far more numerous references to Richmond, VA.

Automatic metadata generation: Listing the most commonly cited people, places, dates and organizations provides a window into the content of a document. This automated technology provides a scalable approach that can be applied to millions of documents. See a list of people, places and dates that J. B. Jones, "A Rebel War Clerk's Diary," cited in September 1863.
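
Both services can be sketched over a list of tagged entities. The record layout and function names below are assumptions for illustration and do not reflect the actual Perseus data model. Each record holds the string as it appears in the text, its entity type, and the identity it was resolved to. Semantic search then reduces to filtering on type or identity, and automatic metadata generation to counting the most frequent identities of each type.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Entity(NamedTuple):
    surface: str   # the string as it appears in the text
    kind: str      # "person", "place", "date" or "organization"
    referent: str  # the resolved identity, e.g. "Richmond, KY"


def semantic_search(entities, surface, kind=None, referent=None):
    """Return mentions of `surface`, optionally restricted by type or identity."""
    return [
        e for e in entities
        if e.surface == surface
        and (kind is None or e.kind == kind)
        and (referent is None or e.referent == referent)
    ]


def summarize(entities, top_n=3):
    """List the `top_n` most frequently cited identities for each entity type."""
    by_kind = defaultdict(Counter)
    for e in entities:
        by_kind[e.kind][e.referent] += 1
    return {kind: counts.most_common(top_n) for kind, counts in by_kind.items()}


entities = [
    Entity("Washington", "person", "George Washington"),
    Entity("Washington", "place", "Washington, DC"),
    Entity("Richmond", "place", "Richmond, VA"),
    Entity("Richmond", "place", "Richmond, VA"),
    Entity("Richmond", "place", "Richmond, KY"),
]
people_named_washington = semantic_search(entities, "Washington", kind="person")
richmond_ky_only = semantic_search(entities, "Richmond", referent="Richmond, KY")
summary = summarize(entities)
```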

Collection Description

Sources on 19th century American culture and especially the US Civil War: This release includes a test collection of 55 million words, with a number of core works on the Civil War. Our purpose was to create a testbed whereby we could study the problems and opportunities presented by various genres of 19th century American English. Content includes:

thirty-seven volumes of the Southern Historical Society, sample volumes from the massive Official Records of the Union and Confederate Armies (Shiloh; Atlanta Campaign), complete sets of the Confederate Military History (e.g., South Carolina), Battles and Leaders of the Civil War and Photographic History of the Civil War, memoirs of Civil War participants (e.g., Grant, Porter, Early, Longstreet, Alexander, Beatty), and early histories of the conflict (e.g., the complete Rebellion Record including the diary and volumes of source documents and poetry, Swinton (1866); Pollard (1876); Greeley (1866); Comte de Paris (1876)).

local history, with an initial focus on the Boston area including town histories (Cambridge (Paige 1877), Medford (Brooks 1855), Arlington (Cutter 1880), Waltham (Nelson 1879)), multi-volume publications of local historical societies (30 volumes of the Medford Historical Society Papers, and all 8 volumes of the Somerville Historic Leaves), city directories (Cambridge 1857), guide books (Mount Auburn Cemetery, 1839), personal memoirs (Cambridge Sketches (Merrill 1896), Olde Cambridge (Higginson 1900) and Cambridge Sketches (Stearns 1905)), accounts of local monuments (Cambridge Civil War memorial), and miscellaneous publications (e.g., Boston events: a brief mention and the date of more than 5,000 events that transpired in Boston from 1630 to 1880 (Savage 1884)).

the overlap of local and Civil War history with an emphasis on Massachusetts: histories of individual regiments (e.g., Mass 19th (Adams 1899); Mass 54th (Emilio, 1894), 2nd Mass Battery of Light Artillery (Whitcomb 1912); Bennett (1886); NY 121st (Best 1921)), histories of Massachusetts in the Civil War (Schouler (1868); Schouler (1871); Higginson (1895-1896) vol. 1 Mass regiments, officers and men who died; preliminary narrative and vol. 2), and biographies of the fallen from particular communities (e.g., Harvard Memorial Biographies (Higginson 1866)).

multiple biographies of individual figures such as Charles Sumner (Lester 1874, Nason 1874; Pierce 1877 vols 1, 2, 3, 4), William Lloyd Garrison (Francis Jackson Garrison 1885-1889: vols 1, 2, 3, 4; Crosby (1905); Chapman (1921)), and Ulysses S. Grant (Crafts (1868), Badeau (1885); Badeau (1887); Wister (1901); Lyman (1922)). These are designed to illustrate problems of aligning disparate narratives about the same individuals.

the primary published works of Thomas Wentworth Higginson (e.g., Army Life in a Black Regiment, 1870, Women and Men, 1888, The New World and the New Book, 1891), which not only bring to light the work of a remarkable anti-slavery activist and intellectual but also allow us to study the problems of creating a virtual comprehensive edition of a well-published author.

the works of John Greenleaf Whittier, illustrating not only the problems of representing an existing print edition as part of a larger digital collection but also of applying a named entity identification system, initially optimized for historical works, to a literary corpus.

George Bancroft's 10 volume History of the United States (1859), which reflects the nation's image of itself at the dawn of the Civil War and illustrates the challenges of managing a history that covers several centuries.

surveys of American Literature representing the state of thought in the late 19th and early 20th century: Cambridge History of American Literature (vol. 1, 2, 3); Reader's History of American Literature (Higginson 1903); Short Studies of American Authors (Higginson 1880); Perry (1921).

reference works such as Dyer's Compendium of the Civil War (with coverage of battles, Union regimental histories, commands, and a machine-generated list of officers and their commands), Fox's Regimental Losses (1888), Knight's Mechanical Encyclopedia (1877), Rowell's American Newspaper Directory (1869), and Harper's Encyclopedia of United States History (1902), which provide contemporary information and document ideas of the period.

Acknowledgements

The foundations for this work were laid with support from the National Endowment for the Humanities and the National Science Foundation under the DLI-2 Initiative. Tufts University provided support for much of the data entry, and the open source policy championed by its president, Lawrence Bacow, has allowed us to make the XML source texts accessible. The Institute of Museum and Library Services and our collaborators at the University of Richmond in their Civil War Newspapers Project provided much of the support for developing the backend named entity analysis system and the digital library named entity searching and browsing environment. Gregory Crane scanned most of the American books, supervised their OCR and post-processing into TEI-conformant XML, and was primarily responsible for the named entity analysis system. Adrian Packel built the named entity searching and browsing environment. Lisa Cerrato, Alison Jones, David Mimno, and Gabe Weaver all made major contributions to the work.


doi:10.1045/march2006-featured.collection
