D-Lib Magazine
January 2006

Volume 12 Number 1

ISSN 1082-9873

Don't Leave the Data in the Dark

Issues in Digitizing Print Statistical Publications

 

Julie Linden
Government Documents & Information Center
Yale University
<julie.linden@yale.edu>

Ann Green
Digital Life Cycle Research and Consulting
<green.ann@gmail.com>

Introduction

Digitization has the potential to transform scholarly use of data found in print statistical publications. While presenting images of statistical tables in a digital library environment may be desirable, the full potential of such material can be realized only if the resulting digital objects are easy to search and manipulate and are accompanied by sufficient metadata to support extraction of numbers from tables and comparison of numbers across tables.

The Economic Growth Center Digital Library (EGCDL), funded by The Andrew W. Mellon Foundation, addressed these issues in a project that brought together the perspectives of digital libraries and data archives. In EGCDL, PDF reproductions of statistical abstracts co-exist with manipulable Excel files of tables from the abstracts. Rich descriptive metadata can be leveraged to provide discovery of and context for digital objects at varying levels of granularity – from a statistical series to a single number in a table cell. Thus derivative digital objects – the manipulable table or even a cell from it – can be traced back to the original source or a faithful digital reproduction of that source.

Challenges in transforming print statistics to digital objects

The Economic Growth Center Library Collection (EGCLC) at Yale University focuses on materials relating to statistics, economics and planning in over 100 developing countries. One of the most comprehensive collections of its kind in the United States, it provides a historical perspective to current research in globalization, urban studies and development policies. Faculty and students from a variety of disciplines, but especially economics and political science, use the EGCLC statistical data for their research and teaching on education, health, social expenditures, economic development and labor.

The collection, which has been painstakingly acquired over more than 40 years, has unrealized potential for supporting interdisciplinary and global research not only at Yale but also at other research institutions in the United States, in the countries from which the data originated, and in other countries throughout the world. While the EGC collection contains data useful to faculty and students doing comparative studies of demographic, social and economic characteristics of developing countries, it is underutilized primarily because of inadequate discovery tools. Many of the publications lack a table of contents and an index; often the only way to discern the full descriptive information about the content of tables within a particular statistical series is to examine the physical volumes.

Yale University's Social Science Research Services and Social Science Libraries and Information Services, with funding from The Andrew W. Mellon Foundation, built a prototype statistical digital library called the Economic Growth Center Digital Library (EGCDL) by digitizing selected material from the EGC print collection [1]. Our initial tasks were to identify what digital objects and accompanying metadata should be found in the EGCDL and to develop a structure that would allow direct access to specific statistical values within tables and the accompanying metadata.

Digitizing statistical publications and making them web-accessible as image files (or image + text PDF files) is a modest step toward improving access. Researchers no longer need to examine physical volumes at the library but can examine the images on the web. However, the problems imposed by the volumes' physical format persist in digital images. Researchers can locate specific statistics in image files only through visual inspection, which is as cumbersome as, if not more cumbersome than, using the original tables in print. In cases where text legibility is good enough to allow OCR processing, searchable text files can be created. Users may then be able to find relevant tables more quickly, but navigating search results can be difficult, particularly if results are scattered across disparate PDF files. Thus, images and text files offer only a limited solution to the digitization challenge. The U.S. Census Bureau has digitized several years of the Statistical Abstract of the United States [2], but the PDF files are difficult to use for the reasons described above, and libraries that hold those volumes in paper may actually direct researchers to the print versions for ease of use.

Problems also arise when selections of data are moved from image files into derivative products (reports, publications, digital compilations, etc.). One must manually type or awkwardly copy and paste the numbers into another application. This task is not only error-prone, but also susceptible to separating the data from their context. The researcher who finds it tedious to locate, copy, and key or paste the desired numbers into a spreadsheet may find documenting those numbers equally onerous, and may consequently fail to cite data sufficiently in secondary publications. Users run the risk of losing essential metadata (information about the limitations of the data, units of measure, data collection conditions, caveats, errors, etc.) that affects the interpretation and use of the numbers. For example, FRASER®, the Federal Reserve Archival System for Economic Research [3], presents historical statistical publications in PDF format. Some tables contain footnote reference marks, but the footnotes themselves are published in separate PDF files rather than in the same file as the table image, making the footnotes easy to miss.

Whether someone needs a single number or a handful of numbers, is making comparisons across geography or time, or is undertaking intensive analysis and modeling, "metadata provides the bridges between the producers of data and their users and conveys information that is essential for secondary analysts" [4]. Metadata frame the statistical information with context and support the interpretation of statistics. Metadata are needed to guide calculations and describe transformed data [5]. Consequently, the data and metadata need to be linked and travel together to the users' workspace and into derived digital objects: data files and databases, reports, publications, visualizations, maps, models, etc. Although the EGCDL project did not present a technical solution to the challenges of linking data and metadata, the project did develop digital objects and metadata that could support such a solution.

Numerous government and academic institutions face the challenges of moving printed statistical tables into digital collections and of minimizing the resultant separation of data from their context. For example, the U.S. Government Printing Office (GPO) "is working with the library community on a national digitization plan for converting the tangible resources held in depository libraries 'legacy materials' beginning with the Federalist Papers forward" [6]. This legacy collection contains a wealth of statistical material, and GPO has indicated interest in providing these materials in manipulable formats (i.e. not just PDF) [7]. In the United Kingdom, a JISC-funded project to digitize historical census reports plans to present images of the reports, searchable text, and "many tables...in machine-readable format" [8]. Our EGCDL project addressed issues that these and other statistical digitization projects will confront; our findings can help inform planning for statistical digitization and suggest further avenues for research.

EGCDL: digitizing Mexican state statistical abstracts

The EGCDL relies on best practices and emerging standards in digital library production, but goes beyond most digital libraries' focus on images of documents or structured renditions of physical documents to address issues of searching, retrieving, and documenting numeric values within tables that were not born digital.

The initial phase of the EGCDL project involved digitizing the annual statistical abstracts for all 31 states of Mexico from 1994 to 2000 [9]. This collection comprises 221 physical volumes for a total of 103,115 pages. The Mexican series, Anuarios Estadísticos de los Estados, contains a wealth of statistical data at the state and municipal levels, including population, industrial, service and commercial censuses, annual and quarterly economic indicators, trade, financial, and production statistics. In its totality, the series provides a very detailed statistical portrait of Mexico over several years.

Each individual volume provides annual statistics for a single state. A researcher could use several volumes for a single state over many years, building time series statistics for that state; or a researcher could use many volumes for a single year, thus composing a topic-oriented picture of many states at one point in time. Each volume is thus related to the others in the series by either time (year of publication) or geography (state).

Internally, each printed volume is divided into subject-based chapters; each chapter is comprised of several tables presenting statistics on that subject. A researcher needing a single number from a table in the series must drill down through this hierarchy – finding the correct volume in the series (by year and state), locating the correct chapter, determining the desired table within the chapter, and then isolating the exact cell within the table.
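
To make this navigation explicit in a digital setting, the drill-down can be expressed as a hierarchical address for each statistic. The following minimal sketch is our illustration, not part of EGCDL; the identifiers and cell coordinates are hypothetical. It simply records the path from series down to a single cell:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class CellAddress:
        series: str   # statistical series, e.g. the Anuarios
        state: str    # volume dimension: geography
        year: int     # volume dimension: time
        chapter: str  # subject-based chapter within the volume
        table: str    # table within the chapter
        row: int      # row within the table
        column: int   # column within the table

    # A hypothetical cell in a chapter of the Aguascalientes 2000 volume.
    cell = CellAddress("Anuarios Estadisticos de los Estados",
                       "Aguascalientes", 2000, chapter="03", table="3.11",
                       row=4, column=2)
    print(cell)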

EGCDL: multiple output types for multiple purposes

Archival: As noted above, mere images of statistical tables are not sufficient to fully open up these materials to research; nonetheless, a case can be made for including digital image equivalents of the original print documents in a digital statistical library, and in fact EGCDL produced both TIFF and PDF/A [10] versions of the Anuarios. Converting the documents to archival TIFFs, while not providing usable substitutes for reading or skimming a book, does provide preservation masters, which may need to be invoked in case of corruption or destruction of either the print originals or subsequent digital derivatives.

Representations of the printed volume with chapter level files: The EGCDL contains a PDF image + text file for every chapter of each statistical abstract volume, following the chapter structure in the table of contents. The 5,662 PDF files are described with a subset of the Dublin Core Metadata Element Set, which provides sufficient information to index and retrieve files by state, topic, and/or date. The Dublin Core records allow users not only to collate chapters by desired states, years, and subjects, but also to reconstitute an entire volume for a single state and year. Strategic file naming at production time allowed us to automate the creation of these Dublin Core records with programming scripts. PDF versions of the volumes can be searched for words or phrases (although searching accuracy is compromised because we did not contract for OCR correction) and provide online users with the sense of encountering the physical volumes' organization and layout.
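
As an illustration of how file naming can drive metadata automation, the sketch below derives a skeletal Dublin Core record from a chapter-level file name. The "State_Year_Chapter.pdf" pattern and the field values are assumptions made for illustration, not the actual EGCDL naming scheme or record structure:

    import os

    def dublin_core_from_filename(path):
        # Assumed naming pattern: State_Year_Chapter.pdf (illustrative only).
        state, year, chapter = os.path.splitext(os.path.basename(path))[0].split("_")
        return {
            "dc:title": f"Anuario Estadistico, {state}, {year}, chapter {chapter}",
            "dc:coverage": state,               # spatial coverage (state)
            "dc:date": year,                    # year of the statistics
            "dc:subject": f"Chapter {chapter}",
            "dc:format": "application/pdf",
        }

    print(dublin_core_from_filename("Aguascalientes_2000_03.pdf"))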

Data in formats compatible with statistical packages: The transformative potential of digitizing statistical materials was realized through the conversion of images of selected tables into individual Excel files. The EGCDL Mexico collection now contains 16,635 Excel spreadsheet files for the tables in the socio-economic and demographic chapters for the years 1994, 1996, 1998 and 2000 [11].

Metadata describing the content and structure of statistical tables: For the Excel files, we produced highly structured descriptive metadata records in the Data Documentation Initiative (DDI) metadata format [12]. The Excel tables include vendor-supplied tags to demarcate footnotes as well as the column and row headings; we wrote scripts to parse text from sections of the tables into the appropriate fields in the DDI records. As with the Dublin Core records, the DDI records were produced entirely through automated methods.
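
A minimal sketch of this kind of parsing is shown below. The tag markers and element names are stand-ins chosen for illustration (the vendor-supplied tags and the exact DDI fields used in EGCDL are not reproduced here); the script simply copies each tagged cell into a corresponding element of a DDI-style XML fragment:

    import xml.etree.ElementTree as ET
    from openpyxl import load_workbook

    # Hypothetical tag markers mapped to illustrative DDI 2.x element names.
    TAGS = {"<<TITLE>>": "titl", "<<COLHEAD>>": "labl", "<<FOOTNOTE>>": "notes"}

    def ddi_fragment_from_workbook(path):
        ws = load_workbook(path, read_only=True).active
        root = ET.Element("codeBook")              # DDI 2.x root element
        for row in ws.iter_rows(values_only=True):
            for value in row:
                if not isinstance(value, str):
                    continue
                for marker, element in TAGS.items():
                    if value.startswith(marker):
                        text = value[len(marker):].strip()
                        ET.SubElement(root, element).text = text
        return ET.tostring(root, encoding="unicode")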

Each DDI record includes a bibliographic citation for the table it describes, as well as information about the source from which the table was produced. As with the Dublin Core records, the DDI records can be used to locate related tables across the collection (e.g. tables from the same original source chapter or volume, tables from different volumes for a single state or year, etc.). EGCDL thus provides mechanisms for pulling together similar data across several different dimensions, something not easily done using the printed volumes. The DDI records go a step further than the Dublin Core records: because of the wealth of detail they contain, they allow researchers to bring together all tables containing specific keywords (from the table titles, column and row headers, and footnotes).

The resultant DDI records are rich with structured information about each table, allowing for the potential of mining the full text of the table in ways that allow scholars to determine the dimensions of the table (e.g. details about the rows and columns) and to locate content using text strings in specific sections (i.e. from the table titles, column and row headers, or footnotes) [13]. A single cell in an Excel table can thus be fully described by accompanying metadata.
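
For example, because the parsed text is held in distinct elements, a search can be restricted to particular sections of the records. The sketch below is ours, and the element names are illustrative of DDI 2.x rather than the exact paths used in the EGCDL records; it scans a directory of DDI files for a keyword appearing only in table titles or footnotes:

    import glob
    import xml.etree.ElementTree as ET

    def tables_mentioning(keyword, ddi_dir, sections=("titl", "notes")):
        hits = []
        for path in glob.glob(f"{ddi_dir}/**/*.xml", recursive=True):
            root = ET.parse(path).getroot()
            for elem in root.iter():
                tag = elem.tag.split("}")[-1]      # ignore any XML namespace
                if (tag in sections and elem.text
                        and keyword.lower() in elem.text.lower()):
                    hits.append(path)
                    break
        return hits

    # e.g. tables_mentioning("poblacion", "egcdl/ddi")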

Recommendations for digitizing statistical data

To ensure that data are not left in the dark, decisions must be made very early in a statistical digitization project about the objects that will be created as outputs of the digitization process: project planners must identify all the levels of objects to be digitized (as parts that are or can be isolated from their original context), described (in metadata), and presented (constructed, reconstructed, or deconstructed, whether in the digitization project's interface or in end-users' own interfaces).

Jantz and Giarlo's discussion of digital derivatives, while grounded in a digital preservation context [14], articulates concepts we confronted during the EGCDL project when defining the digital objects and the metadata that would support their optimal use. They write, "From a digital preservation perspective, perhaps the single most important design process is to define the architecture or 'model' of the digital object. Each format to be preserved (book, newspaper, journal, etc.) has an architecture appropriate for the unique characteristics of that format." This is no less true from the user interface or metadata perspective. Their example of a historical newspaper includes "presentation (or access) images, digital masters, and OCR-ed text" as components of the object architecture. In the case of digitized statistical publications, the architecture can include faithful digital representations of the print originals (e.g. TIFF or PDF) as well as manipulable numeric files (e.g. CSV or Excel).

Jantz and Giarlo also address the issue of transformations in the creation of digital derivatives: "During the life cycle of the object, there are transformations from the original print or non-digital object to the digital representation and subsequent transformations that can occur through editing and migration of the digital content." Indeed, the transformation of print statistical materials to manipulable numeric data files demonstrates the utility added to these digital objects. In transforming print statistical tables to digital objects, the challenge is to remain faithful to the original presentation of the table while transforming digital objects into an optimally useful form [15].

Decisions about what digital objects will be created from print originals and what transformations the digital objects must undergo will be influenced by several factors, including print quality, costs, and user needs. The final report of the EGCDL project offers models that illustrate options for digitizing in relation to these factors; we discuss the factors briefly here [16].

Print quality must be examined carefully because not all print statistical materials will be good candidates for OCR processing. In cases where legibility is marginal, entering the data manually may be most appropriate. In other cases the text may be of sufficient quality for scanning into image-only PDF files but not legible enough for OCR to produce searchable PDF files or spreadsheets. Statistical tables pose additional OCR challenges: not only the text and numbers, but also the layout of the table, must be correctly rendered. The complexity of the tables is therefore also a consideration. Complex hierarchical tables (those with nested categories of geography, for example) often require considerable manual editing to align the columns and rows properly. Cost comparisons and output quality can best be determined by testing a sample of tables.

Costs are always an important factor in this decision-making process, so areas where repurposing or efficiencies can be gained should be closely examined. For example, it may be desirable to provide a very detailed metadata record so that the lowest-level digital object (a table cell) can be described in its entire context – that is, its original table, the chapter and book in which that table was published, the statistical series to which that book belongs. This very detailed metadata record can be leveraged to provide descriptive and contextual metadata for the higher-level digital objects (the table, chapter, and book) as well.
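
One way to picture this reuse: a record written at the finest level of description already contains everything needed to describe the coarser objects, so the higher-level records can be projected from it rather than authored separately. The field names and values in the sketch below are illustrative, not the EGCDL schema:

    # Full-context record for the finest-grained object (a table cell).
    CELL_RECORD = {
        "series":       "Anuarios Estadisticos de los Estados",
        "book":         "Aguascalientes 2000",
        "chapter":      "03 Poblacion",
        "table":        "3.11",
        "row_label":    "(row heading)",
        "column_label": "(column heading)",
    }

    # Which fields describe each higher-level object.
    LEVELS = {
        "book":    ("series", "book"),
        "chapter": ("series", "book", "chapter"),
        "table":   ("series", "book", "chapter", "table"),
    }

    def record_for(level, cell_record=CELL_RECORD):
        return {key: cell_record[key] for key in LEVELS[level]}

    print(record_for("chapter"))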

The decisions about the digital objects will have implications for the digitization project workflow. As the original print object is transformed into one or more derivative objects (which themselves have the potential to be further transformed by end-users), what is the most efficient means of achieving those transformations? For those projects that outsource digitization to vendors, it is vital to have in-depth conversations with the vendor about the various digital output formats that are desired and to understand the process by which the print originals are converted to those output formats. It is also important to consider metadata at this stage and to discuss with the vendor ways in which the digital output files can be formatted or named to aid in the automation of metadata creation.

The end-user perspective on the digital object will inform this decision-making and build an understanding of implications for user interfaces. What levels of digital objects will the user want to access or extract? What accompanying metadata will the user need to understand and use the object? What is the best way to deliver that metadata in a seamless package with the digital object?

Conclusion

We have reached a critical point in statistical digitization as the number, scope, and scale of such projects continue to grow rapidly. These projects must invest in adequate metadata and object-oriented design at the point of digitization – otherwise, the data are in danger of losing their context. Many digital library projects accept minimum-level metadata standards for describing digital objects. For statistical data, the minimum level – metadata describing a statistical series – is insufficient, because the lowest-level digital object will not be adequately documented. Resources spent on digitizing will have been optimally spent only if the resulting data are optimally usable.

In this project, we have been able to add value to the digital renditions without unacceptably altering the original content or structure. For example, the PDF versions of the Anuarios Estadísticos de los Estados in the EGCDL are produced not as entire volumes in single files but as individual chapters. These are faster to download, and because each chapter is described with its own metadata record, it is easier to identify and directly access specific subject content. The Excel tables include special tagging, which changes neither the table values nor the structure of the tables. The specially tagged Excel tables provide users with data files that can be analyzed and subset, and they make it possible to automate the production of detailed metadata records. The metadata records allow users to search for similar tables across the entire collection and provide context by describing, and potentially linking to, the source of each digital object.

EGCDL takes a step toward addressing one of the central challenges of building statistical digital collections from print: how to make the lowest-level digital object (a particular cell in a statistical table) "intelligent," so that its content is described fully enough to be used accurately even when isolated from its original context. Data and metadata co-exist within the EGCDL interface, and the structured DDI records that EGCDL provides could be deployed within a specific application to "travel" with any number extracted from a table.
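
A minimal sketch of what such a deployment might look like follows; it is our illustration rather than an EGCDL component, and the field names and sample values are hypothetical. The point is simply that the extracted number is never passed around alone:

    from dataclasses import dataclass, field

    @dataclass
    class DocumentedValue:
        value: float                 # the extracted number
        units: str                   # units of measure from the table metadata
        citation: str                # bibliographic citation from the DDI record
        footnotes: list = field(default_factory=list)  # caveats that travel with the number

        def __str__(self):
            notes = "; ".join(self.footnotes) or "no footnotes"
            return f"{self.value} {self.units} ({self.citation}; {notes})"

    # Hypothetical example of a cell extracted together with its context.
    v = DocumentedValue(12345.0, "persons",
                        "INEGI, Anuario Estadistico de Aguascalientes 2000, table 3.11",
                        ["Preliminary figure"])
    print(v)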

The next step is to address the technical challenge of linking metadata and data within an application so that they travel together seamlessly from an interface to a user's desktop or derivative product. John McCarthy of Lawrence Berkeley National Laboratory, writing in 1985, described the optimal situation: "Meta-data need to be closely linked to each other and to the data they describe so that users can comprehend them as a coherent whole and programs can use them actively in conjunction with data and one another" [17]. His paper chronicles the lack of software that would adequately integrate data and metadata. Twenty years later, software that provides this kind of context for digital objects is still not adequately developed. Statistical digitization projects must press for the development of software that supports and uses data and metadata together and, most importantly, must not sacrifice the requirement of linking objects and metadata to the inadequacies of available digital library systems.

Acknowledgements

We are grateful to Chuck Humphrey, University of Alberta, and Jennifer Weintraub, Yale University, for their valuable comments on early drafts of this article.

Notes and References

[1] "Supporting Economic Development Research: A Collaborative Project to Create Access to Statistical Sources Not Born Digital" (June 2003-April 2005). Project web site, which includes the prototype statistical digital library and links to presentations given at the Digital Library Federation Fall Forum and the IASSIST Conference: <http://ssrs.yale.edu/egcdl>.

[2] U.S. Census Bureau, Statistical Abstracts, <http://www.census.gov/prod/www/abs/statab.html>.

[3] FRASER® Federal Reserve Archival System for Economic Research, <http://fraser.stlouisfed.org>.

[4] Jostein Ryssevik, The Data Documentation Initiative (DDI) metadata specification (n.d.), <http://www.icpsr.umich.edu/DDI/papers/ryssevik.pdf>.

[5] Data users need to know a dataset's context (methodology, sampling, etc.), source, funding and authoring body, purpose, validity, accuracy, version, orientation in space and time, provenance, relationships to related variables, and potential applications.

[6] U.S. Government Printing Office, FDsys Specification for Converted Content (Version 3.0) (June 2005), <http://www.gpoaccess.gov/legacy/FDsys_ccspecs.pdf>, 4.

[7] U.S. Government Printing Office, Council Discussion, Questions and Answers from the Fall Meeting 2004: Digitization of the Legacy Collection, <http://www.access.gpo.gov/su_docs/fdlp/pubs/proceedings/04cbt/digitization_legacy_collection.pdf>, 11.

[8] AHDS History, Online Historical Population Reports Project, <http://histpop.org/>, accessed October 26, 2005.

[9] Instituto Nacional de Estadística, Geografía e Informática (INEGI), Anuarios Estadísticos de los Estados, 1994-2000. INEGI provides Excel tables from this series, back to 2002, on its web site: <http://www.inegi.gob.mx/inegi/>, accessed November 30, 2005.

[10] We selected PDF/A format in anticipation of its suitability for long-term preservation. "Among federal archivists and records managers, PDF-A is viewed as one of two leading data format candidates for preserving future access to electronic records and documents....The proposed PDF-A standard specifies what should be stored in an archived file by prohibiting, for example, proprietary encryption schemes and embedded files such as executable scripts." Florence Olsen, "Archive-friendly PDF in the works," Federal Computer Week (March 15, 2004), <http://www.fcw.com/fcw/articles/2004/0315/news-pdf-03-15-04.asp>. The Library of Congress provides details about PDF/A, including evaluation of sustainability, quality, and functionality factors: Sustainability of Digital Formats: Planning for Library of Congress Collections: PDF/A, PDF for Long-term Preservation, <http://www.digitalpreservation.gov/formats/fdd/fdd000125.shtml>, accessed December 9, 2005.

[11] We limited spreadsheet production to this subset for budgetary reasons. Document Solutions, Inc. (DSI) used custom zoned-scanning software to process an image file for a particular page and determine the boundaries of the individual table or tables on the page. The text was processed with optical character recognition (OCR) software. DSI staff reviewed and corrected suspected OCR errors and manually corrected the layout. We also contracted for checksum macros, and the accuracy rate for alphabetic and numeric characters was certified to be at least 99%.

[12] DDI Version 2.1, <http://www.icpsr.umich.edu/DDI/users/dtd/index.html#version2.0>.

[13] A sample DDI record is available at: <http://ssrs.yale.edu/egcdl/xml/Aguascalientes/2000/Aguascalientes_2000_03_11.xml>. The same record, rendered with an XSL stylesheet, is available at: <http://webapp.icpsr.umich.edu/cocoon/DDI/SAMPLES/Aguascalientes_2000_03_011.xml>.

[14] Ronald Jantz and Michael J. Giarlo, "Digital Preservation: Architecture and Technology for Trusted Digital Repositories," D-Lib Magazine 11 no. 6 (June 2005), <doi:10.1045/june2005-jantz>.

[15] Digital Library Federation, "Benchmark for Faithful Digital Reproductions of Monographs and Serials," Version 1 (December 2002), <http://purl.oclc.org/DLF/benchrepro0212>, provides a definition and discussion of "faithful digital reproduction."

[16] Ann Green, Sandra K. Peterson, and Julie Linden, "Supporting Economic Development Research: A Collaborative Project to Create Access to Statistical Sources Not Born Digital" (April 27, 2005), <http://ssrs.yale.edu/egcdl/Yale_EGCDL_report_0505.pdf>, 10-16.

[17] John McCarthy, "Scientific Information = Data + Meta-data," photocopy of draft to be published in Database Management: Proceedings of a Workshop November 1-2, 1984, held at the U.S. Navy Postgraduate School, Monterey, California (Department of Statistics Technical Report, Stanford University, 1985), 16.
Copyright © 2006 Julie Linden and Ann Green

doi:10.1045/january2006-linden