D-Lib Magazine, October 1996
Digital collections will remain viable over time only if they meet baseline
standards of quality and functionality. This paper advocates a strategy
to select research materials based on their intellectual value, and to
define technical requirements for retrospective conversion to digital image
form based on their informational content. In a rapidly changing world,
the original document is the least changeable. Defining conversion requirements
according to curatorial judgments of meaningful document attributes may
be the surest guarantee of building digital collections with sufficient
richness to be useful over the long term. The authors argue that there are
compelling economic, access, and preservation reasons to support this approach,
and present a case study to illustrate these points.
The principal discussions associated with digital library development to date have focused on the promise of technology to help libraries respond to the challenges of the information explosion, spiraling storage costs, and increasing user demands. Economic models comparing the digital to the traditional library show that the digital library will become more cost-effective provided the following four assumptions prove true:[1]
- resource sharing among institutions;
- lower costs for maintaining and distributing information;
- timely access that meets user demands; and
- continuing value of the converted information.
These four assumptions -- resource sharing, lower maintenance and distribution
costs, meeting user demands for timely access, and continuing value of
information -- presume that electronic files will have relevant content and
meet baseline measures of functionality over time. Of course, even if all
digital collections were created to meet a common standard and level of
functionality, there is no guarantee that use would follow. Last year,
the Australian cultural community issued a statement of principles pertaining
to long-term access to "digital objects." One of their working
assumptions is that not all digitized information should be saved, and
that resources should be devoted to retaining digital materials "only
for as long as they are judged to have continuing value and significance."[2]
This statement may reflect a realistic approach and may serve as a good
yardstick against which to measure the acquisition of current and prospective
electronic resources. But it also signals a cautionary note regarding efforts
to retrospectively convert paper- and film-based research library materials
to digital image form.
There has been a great deal of recent activity -- most notably at the
Library of Congress -- to make nineteenth century materials accessible electronically
to students, scholars, and the public as quickly as possible. The Department
of Preservation and Conservation in the Cornell University Library has
been a leading participant in such efforts, having created over 1 million digital images in the past five years.
From our experience, we
have developed a set of working principles -- which we have freely characterized
as our "prejudices" in another forum [3]
-- that govern the conversion of research collections into digital image
form. Among these are the following beliefs.
We believe that digital conversion
efforts will be economically viable only if they focus on selecting and
creating electronic resources for long-term use. We believe that retrospective
sources should be selected carefully based on their intellectual content;
that digital surrogates should effectively capture that intellectual content;
and that access should be offered to those surrogates in a more timely,
usable, or cost-effective manner than is possible with the original source
documents.[4]
In essence, we believe that long-term utility should be defined by the
informational value and functionality of digital images, not limited
by technical decisions made at the point of conversion or anywhere else
along the digitization chain.
This paper defines and advocates a strategy of "full informational
capture" to ensure that digital objects rich enough to be useful over
time will be created in the most cost-effective manner. It argues for the retrospective conversion of
historically valuable paper- and film-based materials to digital image form. Currently, scanning is the most
cost-effective means to create digital files, and digital imaging is the only electronic format that can
accurately render the information, page layout, and presentation of source documents, including text,
graphics, and evidence of age and use. By producing digital images, one can create an authentic
representation of the original at minimal cost, and then derive the most useful version and format (e.g.,
marked-up text) for transmission and use.
Retrospective Conversion of Collections
to Digital Image Form
Converting collections into digital images begins with project planning, and
then ostensibly follows a linear progression of tasks: select/prepare,
convert/catalog, store/process, distribute, and maintain over time. It is
risky, however, to proceed with any step before fully considering the relationship
between conversion -- where quality, throughput, and cost are primary considerations
-- and access, where processibility, speed, and usability are desirable.
Informed project management recognizes the interrelationships among each
of the various processes, and appreciates that decisions made at the beginning
affect all subsequent steps. An excessive concern with user needs, current technological
capabilities, image quality, or project costs alone may compromise the
ultimate utility of digital collections. At the outset, therefore, those involved
in planning a conversion project should ask, "How good do the digital
images need to be to meet the full range of purposes they are intended
to serve?" Once the general objectives for quality and functionality
have been set, the following factors will collectively determine whether
or not these benchmarks will be met:
Michael Ester of Luna Imaging has suggested that the best means to ensure
an image collection's longevity is to develop a standard of archival quality
for scanned images that incorporates the functional range of an institution's
services and programs: "It should be possible to use the archival
image in any of the contexts in which [the source] would be used, for example,
for direct viewing, prints, posters, and publications, as well as to open
new electronic opportunities and outlets."[5]
If it is true that the value of digital images depends upon their utility,
then we must recognize that this is a relative value defined by
a given state of technology and immediate user needs. Pegging archival
quality to current institutional services and products may not be prudent,
considering the rapid pace of technological change. Geoffrey
Nunberg of Xerox PARC has cautioned that digital library design should
remain flexible "to accommodate the new practices and new document
types that are likely to emerge over the coming years." For these
reasons, he argues, "it would be a serious error to predicate the
design . . . on a single model of the work process or on the practices
of current users alone, or to presuppose a limited repertory of document
types and genres."[6]
It is difficult to conceive of the full range of image processing and
presentation applications that will be available in the future -- but it
is safe to say they will be considerable. Image formats, compression schemes,
network transmission, monitor and printer design, computing capacity, and
image processing capabilities, particularly for the automatic extraction
of metadata (OCR) and visual searching (QBIC), are all likely to improve dramatically
over the next decade. We can expect each of these developments to influence
user needs, and lead users to expect more of electronic information. As
long as functionality is measured against how well an image performs on
"today's technology," then we would expect the value of the digital
collections we are currently creating to decrease over time.
Fortunately, functionality is not solely determined by the attributes
of the hardware and software needed to make digital objects human-readable.
Image quality is equally important. We believe that a more realistic, and
safer, means for accommodating inevitable change is to create digital images
that are capable of conveying all significant information contained in
the original documents. This position brings us back to the source documents
themselves as the focal point for conversion decisions -- not current users'
needs, current service objectives, current technical capabilities, or current
visions of what the future may hold. By placing the document at the center
of digital imaging initiatives, we provide a touchstone against which to
measure all other variables. In the rapidly changing technological and
information environment, the original document is the least changeable
-- by defining requirements against it, we can hope to control the circumstances
under which digital imaging can satisfy current objectives and meet future
needs.
The "full informational capture" approach to digital conversion is designed to ensure high quality and functionality while minimizing costs. The objective is not to scan at the highest resolution and bit depth possible, but to match the conversion process to the informational content of the original -- no more, no less. At some point, for instance, continuing to increase resolution will not result in any appreciable gain in image quality, only a larger file size. The key is to identify, but not exceed, the point at which sufficient settings have been used to capture all significant information present in the source document. We advocate full-informational capture in the creation of digital images and sufficient indexing at the point of conversion as the surest guarantee for providing long-term viability.[7]
Begin with the source document
James Reilly, Director of the Image Permanence Institute, describes a strategy for scanning photographs that begins with "knowing and loving your documents." Before embarking on large-scale conversion, he recommends choosing a representative sample of photographs and, in consultation with those with curatorial responsibility, identifying key features that are critical to the documents' meaning. As one becomes a connoisseur of the originals, the task of defining the value of the digital surrogates becomes one of determining how well they reflect the key meaningful features. In other words, digital conversion requires both curatorial and technical competency in order to correlate subjective attributes of the source to the objective specifications that govern digital conversion (e.g., resolution, bit depth, enhancements, and compression).
Table 1 lists some of the document attributes that are essential in defining digital conversion requirements; each will have a direct impact on scanning costs. Where color is essential, for example, scanning might be 20 times more expensive than black and white capture.
Table 1. Selected Attributes of Source Documents to Assess for "Significance"[8]
(The table groups the attributes under two headings: bound and unbound printed materials, and photographs.)
Once key source attributes have been identified and translated to digital requirements, image quality should be confirmed by comparing the full-resolution digital images, on screen and in print, to the original documents under magnification[9] -- not by relying on some vaguely defined concept of what's "good enough," or on how well the digital files serve immediate needs. If the digital image is not faithful to the original, what is sufficient for today's purposes may well prove inadequate for tomorrow's. Consider, for example, that while photographs scanned at 72 dpi often look impressive on today's computer monitors, this resolution shortchanges image quality in a printed version, and will likely be inadequate to support emerging visual searching techniques.
Case study: the brittle book
To illustrate the full informational capture approach, let's consider
a book published in 1914 entitled Farm Management, by Andrew
Boss. This brittle monograph contains text, line art, and halftone reproductions
of photographs. Curatorial review established that text, including captions
and tables, and all illustrations were essential to the meaning of the
book. For this volume, the size of type ranged from 1.7 mm for the body
of the text to 0.9 mm for tables; photographs were reproduced as 150 line
screen halftones. Although many pages did not contain illustrations, captions,
or tables, a significant number did, and resolution requirements to meet
our quality objectives were pegged to those pages so as to avoid the labor
and expense of page-by-page review. We determined that 600 dpi bitonal
scanning would fully capture all text-based material, and that the halftones
could be captured at the same resolution provided that enhancement filters
were used.
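As a rough check on that choice, the following sketch (ours, not the authors') applies the Quality Index relation given above to the 0.9 mm table text, the smallest significant characters in the volume:

```python
# Illustrative sketch only: applies the Quality Index relation sketched above
# to the smallest significant characters in the Boss volume (0.9 mm).
# The 0.039 factor converts millimeters to inches; the divisor 3 is the
# bitonal form of the benchmark.

def bitonal_qi(dpi: int, char_height_mm: float) -> float:
    """Approximate Quality Index achieved at a given bitonal scanning resolution."""
    return dpi * 0.039 * char_height_mm / 3

for dpi in (300, 600):
    print(f"{dpi} dpi -> QI ~ {bitonal_qi(dpi, 0.9):.1f}")
# 300 dpi -> QI ~ 3.5  (marginal: legibility, but not fidelity)
# 600 dpi -> QI ~ 7.0  (approaching excellent: full capture of the 0.9 mm text)
```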
For comparison purposes, we scanned two representative pages -- one pure
text, and one containing text and a halftone -- at various resolutions to determine
the tradeoffs amongst file size, image quality, and functionality. We evaluated
image quality on screen and in print, created derivatives for network access,
and ran the images through two OCR programs to generate searchable text.
Table 2. Text page captured at 300 and 600 dpi, 1-bit

| resolution | uncompressed file size | compressed (Group 4) file size | visual assessment of quality | OCR result* |
| 300 dpi, 1-bit | 380 Kb | 31 Kb | legibility achieved | 33 errors |
| 600 dpi, 1-bit | 1.5 Mb | 61 Kb | fidelity achieved | 15 errors |

*We used two OCR programs for this case study (Xerox TextBridge 2.01 and Calera WordScan 3.1); the error count refers to word errors.
Table 2 compares the capture of the text page at 300 and 600 dpi,
resolutions
in common use for flatbed scanning today. The smallest significant characters
on the page measure 0.9 mm, and the numbers listed in two tables are 1.6
mm in height (see
Example 1). Note that although the file size for the
uncompressed 600 dpi image is four times greater than the 300 dpi version,
the 600 dpi Group 4 compressed file is only twice as large. There was no
observable difference in the scanning times on our Xerox XDOD scanner between
the two resolutions, and we would expect similar throughput rates from
600 and 300 dpi bitonal scanning.[10]
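The fourfold jump in uncompressed size follows directly from the arithmetic of bitonal capture, as the sketch below illustrates; the page dimensions used here are an assumption for illustration, not measurements of the Boss volume.

```python
# Illustrative sketch: uncompressed bitonal file size grows with the square of
# the scanning resolution. The page size (5 x 7.5 inches) is assumed for
# illustration only.

def bitonal_bytes(width_in: float, height_in: float, dpi: int) -> int:
    """Uncompressed size of a 1-bit image: one bit per pixel, eight pixels per byte."""
    return round(width_in * dpi) * round(height_in * dpi) // 8

for dpi in (300, 600):
    print(f"{dpi} dpi, 1-bit: ~{bitonal_bytes(5.0, 7.5, dpi) / 1024:.0f} KB uncompressed")
# 300 dpi: ~412 KB; 600 dpi: ~1648 KB -- a 4x increase before compression,
# while the Group 4 compressed files in Table 2 differ only by a factor of two.
```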
Rapid improvements in computing and storage, as well as significant developments
in scanning technology in recent years, have been closing the cost gap
between high-quality and medium-quality digital image capture.
Developments in image processing are also moving forward, but at a slower
rate. The OCR programs we used for our examples are optimized to process
300-400 dpi bitonal images, with type sizes ranging from 6-72 points. It
is striking that although the 300 dpi file produces a legible print (see the PostScript file in Example 1), the OCR error rate was more than double that
of the 600 dpi file. In the case of the 600 dpi image, we noted that the
majority of errors were generated on the 0.9 mm text, but not on the numeric
characters contained in the chart. We concluded, therefore, that the file
was rich enough to be processed, and that the errors were attributable to the
OCR programs' inability to "read" the font size.
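For readers who wish to repeat this kind of comparison today, the sketch below shows one way to count word errors against a proofread transcript. It uses pytesseract and difflib purely as present-day stand-ins for the TextBridge and WordScan programs cited here, and the file names and transcript are placeholders.

```python
# Illustrative present-day analogue of the OCR comparison. pytesseract and
# difflib stand in for the 1996 OCR programs; paths and the reference
# transcript are placeholders.
import difflib

import pytesseract
from PIL import Image


def word_errors(image_path: str, reference_text: str) -> int:
    """Count reference words that the OCR output fails to reproduce."""
    ocr_words = pytesseract.image_to_string(Image.open(image_path)).split()
    ref_words = reference_text.split()
    matcher = difflib.SequenceMatcher(None, ref_words, ocr_words)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return len(ref_words) - matched


# Example usage (hypothetical files):
# transcript = open("transcript.txt", encoding="utf-8").read()
# print(word_errors("text_page_300dpi.tif", transcript))
# print(word_errors("text_page_600dpi.tif", transcript))
```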
Visual distinctions between fidelity (full capture) to the original
and legibility (fully readable) can be subtle. The sample portions
selected for
Example 1 reveal the quality differences between the two scanning
resolutions. Note that in the 600 dpi version, character details have been
faithfully rendered. The italicized word "Bulletin" successfully
replicates the flawed "i" of the original; the crossbar in the
"e" in "carbohydrates" is complete; and the "s"
is broken as in the original. In contrast, the 300 dpi version provided
legible text, but not characters faithful to the original. Note
the pronounced jaggies of "i" in "Bulletin," the incomplete
rendering of the letter "e," and the filled in "s"
in "carbohydrates."
(Neither OCR program used could accommodate images with resolutions above 600 dpi, and only WordScan could process the grayscale image.)
For the mixed page, two scanning methods captured both the text and
the halftone with fidelity. The first, 600 dpi bitonal with a descreening
filter, resulted in a lossless compressed file of 127 Kb, while the 300
dpi 8-bit version with JPEG lossy compression produced a file almost five
times as large. Scanning times between the two varied considerably, with
grayscale capture taking four times longer. Depending on equipment used,
higher grayscale scanning speeds are possible, but we know of no vendor
offering competitive pricing for 300 dpi 8-bit and 600 dpi 1-bit image
capture.
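Part of the size difference is simply a matter of bits per unit of page area; before any compression, the two capture modes carry the following amounts of data (our arithmetic, not the article's):

```latex
% Uncompressed data per square inch of page, before compression:
\[
\text{600 dpi, 1-bit:}\quad \frac{600^2 \times 1}{8} = 45{,}000 \ \text{bytes/in}^2
\qquad
\text{300 dpi, 8-bit:}\quad \frac{300^2 \times 8}{8} = 90{,}000 \ \text{bytes/in}^2
\]
```

So the grayscale capture starts with only twice the data; the nearly fivefold gap between the delivered files reflects how much more efficiently Group 4 compresses clean bitonal text than lossy JPEG reduces gray data at the settings used.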
For this page, the body text measures 1.7 mm, and the caption text is
1.4 mm. We found that both the 600 dpi bitonal and the 300 dpi 8-bit versions
could produce single-pass OCR text files that would lend themselves to
full information retrieval applications. On the other hand, the 300 dpi
1-bit version resulted in an accuracy rate of 94.1%, slightly below the
95% rate identified by the National Agricultural Library and others as
the threshold for cost effectiveness when accuracy is required.[11]
The difference in visual quality among the digital files is most evident
in the halftone rendering on this page. (See
Example 2, including the PostScript
file.) Without descreening, resolution alone could not eliminate moiré
patterns or replicate the simulated range of tones present in the original
halftone. Had the curator determined that illustrations were not significant,
we could have saved approximately 19% in storage costs for these pages
by scanning at 600 dpi with no enhancement, and thus improved throughput.
We have found that sophisticated image enhancement capabilities must be
incorporated in bitonal scanners to capture halftones, line engravings,
and etchings.
We also created derivative files from several high-resolution images
in a manner analogous to the processes used in Cornell's
prototype digital library. These are intended to balance legibility,
completeness, and speed of delivery to the user.
Examples 2a-2c
show the impact on legibility for body text, caption, and halftone. Each image is
approximately 24 Kb and would be delivered to the user at the same speed.
With respect to image quality, the text in all three cases is comparable,
but the halftone derived from the richer 600 dpi bitonal scan is superior
to either the 300 dpi descreened image or the unenhanced 600 dpi version.[12]
As network bandwidths, compression processes, and monitor resolutions improve,
we would expect to create higher resolution images, which could also be
derived from the 600 dpi image. In the meantime, the quality of the 100
dpi derivatives has been judged by users to be sufficient to support on-screen browsing and to minimize print requirements.
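As an aside, a hedged sketch of how such a screen derivative might be produced today follows; Pillow stands in for the TIF2GIF utility mentioned in note [12], and the file names, target resolution, and resampling choice are illustrative assumptions.

```python
# Illustrative sketch: deriving a ~100 dpi screen version from a 600 dpi
# bitonal master. Pillow is used for illustration; it is not the utility
# referenced in note [12], and the file names are placeholders.
from PIL import Image


def make_screen_derivative(master_path: str, out_path: str,
                           master_dpi: int = 600, target_dpi: int = 100) -> None:
    """Downsample a high-resolution master to a small grayscale derivative."""
    master = Image.open(master_path).convert("L")   # gray levels smooth bitonal text when shrunk
    scale = target_dpi / master_dpi
    size = (max(1, round(master.width * scale)), max(1, round(master.height * scale)))
    master.resize(size, Image.Resampling.LANCZOS).save(out_path)


# make_screen_derivative("page_600dpi_master.tif", "page_100dpi.gif")
```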
There are compelling preservation, access, and economic reasons for
creating a digital master in which all significant information contained
in the source document is fully represented. The most obvious argument
for full-informational capture can be made in the name of preservation.
Under some circumstances, the digital image can serve as a replacement
for the original, as is the case with brittle books or in office backfile
conversion projects. In these cases, the digital image must fully represent
all significant information contained in the original as the image itself
becomes the source document and must satisfy all research, legal, and fiscal
requirements. If the digital image is to serve as a surrogate for the original
(which can then be stored away under proper environmental controls), the
image must be rich enough to reduce or eliminate users' needs to view the
original.
It may seem ironic, however, that an equally strong case can be made
for the access and economic advantages of full informational capture. As
Nancy Gwinn has recently stated in an article about funding digital conversion
projects, "Everything libraries do in the digital world will be more
visible to more people."[13]
If observers of digital library development are correct that use will (and
must) increase with enhanced access, then the strengths and weaknesses
of digital conversion will become readily apparent -- to librarians, users,
and funders. If access in the digital world emphasizes ease of use over
physical ownership, then present and future users must have confidence
that the information they receive is authentic. Their needs will not likely
be served by a static version of a digital image, and it is natural to
anticipate that many will soon prefer alternative formats such as PDFs
derived from the digital images. Digital masters faithful to originals
should be used to create these multiple images and formats because:
In an ideal world, we would be able to create digital masters from hard
copy sources without regard to cost, and to produce multiple derivatives
tailored to the individual user upon request. In the real world, of course,
costs must be taken into account when selecting, converting, and making
accessible digital collections. Those looking for immediate cost recovery
in choosing digital over hard copy or analog conversion will be disappointed
to learn that preliminary findings indicate that these costs can be staggering.
Outside funding may be available to initiate such efforts, but the investments
for continuing systematic conversion of collections to develop a critical
mass of retrospective digitized material, electronic access requirements,
and long-term maintenance of digital libraries will fall to institutions
both individually and collectively. These costs will be sustainable only
if the benefits accrued are measurable, considerable, and lasting.
Table 4 compares the average time per volume spent to reformat brittle
books via photocopy, microfilm, and digital imaging from paper or microfilm.[14]
The total costs for any of these reformatting processes are comparable.
What is notable are the shifts in time spent in activities before, during,
and after actual conversion. Only in the case of scanning microfilm do
we observe that the actual conversion times fall below half of the total.
In this case, many of the pre-conversion activities were subsumed in the
process of creating the original microfilm. Perhaps more significantly,
we see an increasing percentage of time spent in post-conversion activities
associated with digital reformatting efforts. These increases reflect time
spent in creating digital objects from individual image files, providing
a logical order and sequence, as well as minimal access points. These steps
are analogous to the binding of photocopy pages and to the processing of
microfilm images onto reels of film.
Table 4. Distribution of reformatting time before, during, and after conversion

| medium for reformatting | pre-conversion | conversion | post-conversion |
| Photocopy | 17.0% | 74.7% | 8.3% |
| Microfilm (RLG median times) | 28.1% | 58.9% | 13.0% |
| Digital images (CLASS) | 23.3% | 56.1% | 20.6% |
| Digital images (COM Project) | 23.1% | 57.2% | 19.7% |
| Digital images (Project Open Book) | 16.3% | 32.1% | 51.6% |
Table 4 does not translate times to costs, but the times are indicative
of the range of labor costs associated with reformatting projects. The
actual costs of retrospective conversion will vary widely, according to
the condition, formats, content, and volume of the original collections;
the choice of scanning technologies; scanning in-house versus outsourcing;
the level of metadata needed to provide basic access; and the range of
searching/processing functions to be supported following conversion. Despite
these differences, it seems clear that the costs of creating high-quality
images that will endure over time will be less than the costs associated
with creating lower-quality images that fail to meet long-term needs. As
Michael Lesk has noted, "since the primary scanning cost is not the
machine, but the labor of handling the paper, using a better quality scan
. . . is not likely to increase a total project cost very much."[15]
Should the paper handling capabilities of overhead and flatbed scanners
improve, we would expect to see high-resolution conversion times decrease
further, thereby allowing managers to shift funds to accommodate the greater
percentage of post-conversion activities.
Although the costs associated with reformatting and basic access will
be high, these digital conversion efforts will pay off when the digital
collections serve many users, and institutions develop mechanisms for sharing
responsibility for distribution and archiving. Given these considerations,
creating images which reflect full informational capture may prove to be
the best means to ensure flexibility to meet user needs, and to preserve
historically valuable materials in an increasingly digital world.
[1] See, for example, Michael Lesk, Substituting
Images for Books: The Economics for Libraries, January 1996, and Task Force on Archiving of Digital Information, Preserving
Digital Information (final report), commissioned by the Commission
on Preservation and Access and The Research Libraries Group, Inc., June
1996. In the Yale cost model, projected cost recovery rates in the digital
library assume that demand will increase by 33% because access to electronic
information is "easier" and more timely (Task Force). [return
to fn 1]
[2] Draft
Statement of Principles on the Preservation of and Long-Term Access to
Australian Digital Objects, 1995.
[3] Anne R. Kenney and Stephen Chapman, Digital
Imaging for Libraries and Archives, Ithaca, NY: Cornell University
Library, June 1996, iii.
[4] Reformatting via digital or analog techniques
presumes that the informational content of a document can somehow be captured
and presented in another format. Obviously for items of intrinsic or artifactual
value, a copy can only serve as a surrogate, not as a replacement.
[5] Michael Ester, "Issues in the Use of Electronic
Images for Scholarship in the Arts and the Humanities," Networking
in the Humanities, ed. Stephanie Kenna and Seamus Ross, London: Bowker
Saur, 1995, p. 114.
[6] Geoffrey Nunberg, "The Digital Library: A
Xerox PARC Perspective," Source
Book on Digital Libraries, Version 1.0, ed. Edward A. Fox, p. 82.
[7] Cornell has been developing a benchmarking approach
to determining resolution requirements for the creation and presentation
of digital images. This approach is documented in Kenney and Chapman, Tutorial:
Digital Resolution Requirements for Replacing Text-Based Material: Methods
for Benchmarking Image Quality, Washington, DC: Commission on Preservation
and Access, April 1995, and also in Digital Imaging for Libraries and
Archives.
[8] For more information on assessing significant
information in printed materials, see Anne R. Kenney, with Michael A. Friedman
and Sue A. Poucher, Preserving Archival Material Through Digital
Technology: A Cooperative
Demonstration Project, Cornell University Library, 1993, Appendix VI;
for assessment of photographs, see James Reilly and Franziska Frey, Recommendations
for the Evaluation of Digital Images Produced from Photographic, Microphotographic,
and Various Paper Formats, report presented to the National Digital
Library Program, June 1996, and Franziska Frey, "Image Quality Issues
for the Digitization of Photographic Collections," IS&T Reporter,
Vol. 11, No. 2, June 1996.
[9] See, "Verifying predicted quality," Digital
Imaging for Libraries and Archives, pp. 28-31. [return
to fn9]
[10] Responses to Cornell and University of Michigan
RFPs for imaging services over the past several years indicate that the
cost differential between 600 and 300 dpi scanning is narrowing rapidly.
[11] In their Text Digitization Project, the National
Agricultural Library determined that using OCR programs to generate searchable
text for their image database required minimum accuracy rates of 95% to
be considered cost-effective (conversation with Judi Zidar, NAL). Kevin
Drum of Kofax, a leading provider of conversion services, argues that "if
you cannot get 95% accuracy, the cost of manually correcting OCR errors
outweighs the benefit." See Tim Meadows, "OCR Falls Into Place," Imaging Business, Vol. 2, No. 9, September 1996, p. 40.
[12] Randy Frank, one of the leading figures in the
TULIP and JSTOR journal projects, acknowledges that a primary benefit of
capturing journal pages at 600 dpi instead of 300 dpi is the quality of
the resulting derivative converted on-the-fly using the University of Michigan
TIF2GIF conversion utility. Conversation with Anne R. Kenney, March 1996.
[13] Nancy E. Gwinn, "Mapping the Intersection
of Selection and Funding," Selecting Library and Archive Collections
for Digital Reformatting: Proceedings from an RLG Symposium, Mountain View:
The Research Libraries Group, Inc., August 1996, p. 58.
[14] Figures used to calculate percentages of time
for reformatting activities were obtained from the following sources: for
preservation photocopy and the CLASS project, see Anne R. Kenney and Lynne
K. Personius, The
Cornell/Xerox/Commission on Preservation and Access Joint Study in Digital
Preservation, Commission on Preservation and Access, 1992, pp. 41-43;
for microfilm, see Patricia A. McClung, "Costs Associated with Preservation
Microfilming: Results of the Research Libraries Group Study," Library
Resources & Technical Services, October/December 1986, p. 369;
for the Cornell Digital-to-COM Project, project files from time studies;
and for Project Open Book, see Paul Conway, Yale
University Library's Project Open Book: Preliminary Research Findings,
D-Lib Magazine, February 1996. Because actual times were not available
for title identification, retrieval, circulation, searching, and curatorial
review (common pre-conversion activities), these were drawn from the RLG
study (McClung, Table 1) for comparative purposes. Yale and Cornell will
issue a joint report later this year detailing the times and costs associated
with the full range of activities in the hybrid approach of creating preservation
quality microfilm and digital masters. Note that this table does not include
times for queuing, cataloging, or any activities related to delivering
information to the user, only times associated with activities from selection
to storage of archival masters that have been minimally ordered and structured.
[15] Michael Lesk, Substituting
Images for Books: The Economics for Libraries.
Editorial corrections at the request of the authors, November 1, 1996.
hdl:cnri.dlib/october96-chapman