Clips & Pointers

D-Lib Magazine
January 2005

Volume 11 Number 1

ISSN 1082-9873

In Brief


The Gamera Software Development Kit for Document Image Analysis

Contributed by:
Michael Droettboom
Scholarly Publishing Specialist
The Johns Hopkins University
Baltimore, Maryland, USA
<mdboom@jhu.edu>

As storage and imaging technologies become less expensive, more digitized document images are making their way into digital libraries and elsewhere. In order to support content-based retrieval on these collections, the digital images must first be converted to an electronic representation of the content. For many modern and straightforward text documents, existing optical character recognition (OCR) technology is adequate. For example, the recently announced Google Print project uses OCR to make printed materials available through the Google search engine. However, for many antiquarian, non-printed or even non-textual documents, custom recognition systems must be built in order to automatically extract their content.

Gamera is a software development kit that reduces the time it takes to develop such custom tools. It provides building blocks in image processing and machine learning and allows them to be plugged together in new ways to solve a problem. It also provides a common framework for other developers to create and share new building blocks that solve specific problems. Gamera's other advantage is its rich interface that allows researchers to see the results of their experiments immediately. Gamera is open source software developed at The Johns Hopkins University with the assistance of a wider community of developers.

Gamera has been the basis for systems that interpret documents as diverse as common music notation, medieval French manuscripts, The Statistical Accounts of Scotland (1799), lute tablature, Navajo language documents and even multiple choice test forms. It has also been used as a platform for pure document image analysis research.

More information about Gamera is available at <http://gamera.sourceforge.net/>.


XML Techniques for the Representation and Interchange of Thesaurus Data

Contributed by:
Ceri Binding
University of Glamorgan, School of Computing
Pontypridd, South Wales, United Kingdom
<cbinding@glam.ac.uk>

During the course of recent work, the Hypermedia Research Unit at the University of Glamorgan has produced some lightweight and flexible XML schemas describing formats suitable for the representation, storage and interchange of thesaurus data. The schemas created are relatively simple but model powerful conceptual data structures. The core standard thesaurus relationships (hierarchical, equivalence and associative) are modelled using tag names based on existing thesaurus standards.
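As a rough illustration of how the three core relationship types might map onto a simple structure, here is a minimal Python sketch. The element name `term`, the `name` attribute, and the use of the conventional abbreviations BT (broader term), NT (narrower term), RT (related term) and UF (used for) as tag names are assumptions made for this example; they are not taken from the FACET schemas themselves.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class Term:
    """One thesaurus term with the three core relationship types."""
    name: str
    broader: List[str] = field(default_factory=list)    # hierarchical (BT)
    narrower: List[str] = field(default_factory=list)   # hierarchical (NT)
    related: List[str] = field(default_factory=list)    # associative (RT)
    used_for: List[str] = field(default_factory=list)   # equivalence (UF)


def term_to_xml(term: Term) -> ET.Element:
    """Serialize a Term to a hypothetical <term> element (assumed layout)."""
    elem = ET.Element("term", {"name": term.name})
    for tag, values in (("BT", term.broader), ("NT", term.narrower),
                        ("RT", term.related), ("UF", term.used_for)):
        for value in values:
            ET.SubElement(elem, tag).text = value
    return elem


# Illustrative data only, not drawn from any real thesaurus.
chair = Term("chairs", broader=["furniture"], narrower=["armchairs"],
             related=["stools"], used_for=["seats"])
xml_text = ET.tostring(term_to_xml(chair), encoding="unicode")
print(xml_text)
```

Keeping each relationship as a flat list of term names keeps the structure lightweight; a real schema would likely add identifiers, scope notes and language attributes.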

Various schema files, demonstration data files and usage examples have been created, and these may be viewed online, downloaded, adapted or extended as necessary to suit your own work. Although initially developed for in-house use, the schemas are offered under an open source license to encourage further development by interested parties. The work is described in more detail at <http://www.comp.glam.ac.uk/~FACET/formats/>.

XSLT examples are included showing how transformation may be used to present alternative views of the data contained within documents conforming to the schemas.

Direct XPath querying is shown using a series of small practical example expressions, and an online thesaurus search and browse interface demonstrates these techniques to good effect. The examples are not presented as a universal approach to searching and browsing online knowledge organisation systems, but as demonstrations of some useful techniques that may be employed to create responsive interfaces when dealing with relatively small datasets.
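The kind of small XPath expression described above can be sketched in Python with the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath. The sample document, its element names (`thesaurus`, `term`, `BT`, `NT`, `RT`, `UF`) and the helper functions are hypothetical and are not taken from the FACET schemas or examples.

```python
import xml.etree.ElementTree as ET

# Hypothetical thesaurus fragment; the layout is an assumption for illustration.
SAMPLE = """
<thesaurus>
  <term name="furniture">
    <NT>chairs</NT>
    <NT>tables</NT>
  </term>
  <term name="chairs">
    <BT>furniture</BT>
    <NT>armchairs</NT>
    <RT>stools</RT>
    <UF>seats</UF>
  </term>
  <term name="tables">
    <BT>furniture</BT>
  </term>
</thesaurus>
"""

root = ET.fromstring(SAMPLE)


def narrower_terms(name):
    """Return the NT entries listed under the named term."""
    return [nt.text for nt in root.findall(f"./term[@name='{name}']/NT")]


def terms_with_broader(name):
    """Return names of terms that list `name` as a broader term (BT)."""
    return [t.get("name") for t in root.findall(f"./term[BT='{name}']")]


def preferred_term_for(entry):
    """Resolve a non-preferred entry (UF) to its preferred term, if any."""
    match = root.find(f"./term[UF='{entry}']")
    return match.get("name") if match is not None else None


print(narrower_terms("furniture"))
print(terms_with_broader("furniture"))
print(preferred_term_for("seats"))
```

Because the whole document is held in memory and each query walks the tree, this approach suits the relatively small datasets mentioned above. The helpers interpolate the search value directly into the expression, so values containing quote characters would need escaping in real use.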

The development of suitable common representation / interchange formats for thesaurus data will play a key role in ensuring wider usage and interoperability. Some important issues initially arising as a result of this work include data ownership (licensing and intellectual copyright), thesaurus versioning, and measures to ensure data integrity during transformation to other formats. Moves are underway to revise existing thesaurus standards, and it is envisaged that these will specify recommended formats for exchanging thesaurus data.

Information on the schemas and their use can be found at <http://www.comp.glam.ac.uk/~FACET/formats/>. See also <http://www.comp.glam.ac.uk/~FACET/> for background on the broader FACET project investigating semantic expansion and thesaurus based retrieval. A case study of the project was featured recently in Thematic Issue 6 of the EC project DigiCult, a technology watch for cultural and scientific heritage <http://www.digicult.info/pages/Themiss.php>.

The UTOPIA Project

Contributed by:
Mark McFarland
Associate Director for Digital Initiatives
The University of Texas Libraries
Austin, Texas, USA
<m.mcfarland@austin.utexas.edu>

Two years ago the president of the University of Texas at Austin (UT) announced a campus-wide initiative to systematically digitize and make available to the public via the World Wide Web the intellectual and cultural treasures of the institution. On March 6, 2004, the campus went live with a site called "UTOPIA". Initially, UTOPIA is focusing on providing K-12 teachers and students with content that draws on UT's expertise, meets specific needs for curriculum in public schools, and is aligned with the Texas Essential Knowledge and Skills. Texas schools now have widespread connectivity, and much UT content has already been developed, organized, and curated for the K-12 population, but there is a continuing need for high-quality, well-presented material to be easily accessible online, free to teachers and students. Nevertheless, there is a great deal of content on the site already that will be of interest to a general audience.

We are currently preparing for a DSpace implementation and discussing how best to engage faculty in it. The primary goal of this effort is to greatly accelerate ongoing efforts to digitize and make available UT-owned resources held in its libraries, museums, collections, laboratories, research units and departments. This new effort has three important components:

  • To digitize and provide access to materials from our collections, museums, laboratories and classrooms to the public free of charge
  • To establish a connection with the K-12 community by developing content and services that support the classroom teacher
  • To serve as a digital archive for material produced by faculty members

The UTOPIA project has created the impetus for new collaborations among scholars and librarians and campus technologists. As librarians work more closely with faculty members on specific digital projects, opportunities to help shape digital research products in ways that enhance their long-term viability and availability are emerging. With a mandate from the campus administration to work directly with faculty to select, produce and maintain content over time, the UTOPIA project has become an important means of initiating conversations about providing an institutional repository (IR) infrastructure that will help sustain the University's digital assets through time.

The UTOPIA project and website (http://utopia.utexas.edu/) have become a useful way to initiate conversations with faculty members and graduate researchers about the long-term availability of their current work. So, while the UTOPIA site is not itself an IR, it serves to showcase pieces of larger works and collections that will be included in the IR.

In the News

Excerpts from Recent Press Releases and Announcements

University of Illinois Offers Advanced Degree and Fellowships in Digital Librarianship

January 14, 2005, announcement by Molly Dolan: "Urbana-Champaign, Illinois - Beginning in the 2005-2006 school year, the Graduate School of Library and Information Science (GSLIS) at the University of Illinois at Urbana-Champaign will offer a structured Certificate of Advanced Study (CAS) in Digital Libraries. Five one-year, non-renewable fellowships will also be available to CAS and MS degree students wishing to focus on digital libraries. The program aims to give students a thorough and technically focused background in digital libraries that will enable them to serve as designers, decision-makers, and creators of digital collections."

"Students may choose to enroll in the CAS program either on campus at Urbana-Champaign or at a distance via GSLIS's LEEP online education option. The core courses for the program will be offered via LEEP, while elective courses may be completed via LEEP or on campus, as offered. By making use of the LEEP option, GSLIS will be able to offer classes taught by distinguished practitioners from other institutions in the field of digital librarianship."

"The CAS degree is a program of advanced course work intended for those who hold a master's degree in library and information science or a related field. Librarians, information scientists, and others in information management can enroll in the program to refresh and update their skills and gain greater specialization in digital librarianship and related issues. To earn the degree, students will be required to complete 40 hours of course work, including 8 hours focusing on an individual project related to digital libraries."

Information about applying to the program can be found at <http://www.lis.uiuc.edu/gslis/degrees/cas_dl.html>.


IBM Statement of Non-Assertion of Named Patents Against OSS

January 11, 2005 - Excerpt from IBM statement: "IBM is committed to promoting innovation for the benefit of our customers and for the overall growth and advancement of the information technology field. IBM takes many actions to promote innovation. Today, we are announcing a new innovation initiative. We are pledging the free use of 500 of our U.S. patents, as well as all counterparts of these patents issued in other countries, in the development, distribution, and use of open source software. We believe that the open source community has been at the forefront of innovation and we are taking this action to encourage additional innovation for open platforms."

For more information, please see the full announcement at <http://www.ibm.com/ibm/licensing/patents/pledgedpatents.pdf>.


Library of Congress announces new digital collection

January 11, 2005 - "The Library of Congress's Rare Book & Special Collections Division is pleased to announce the release of a new digital collection, The Kraus Collection of Sir Francis Drake, available on the Library's Global Gateway Web site at: <http://international.loc.gov/intldl/drakehtml/>."

"Sir Francis Drake, English explorer and naval strategist, circumnavigated the globe from 1577-1580. During these travels, Drake visited the Caribbean and the Pacific, claiming a portion of California for Queen Elizabeth and waging battles on the Spanish. His voyages revealed significant new geographical data about the New World and added greatly to Queen Elizabeth's treasury."

"This online presentation of The Kraus Collection of Sir Francis Drake joins other world history collections available on the Library of Congress's Global Gateway Web site: <http://international.loc.gov/intldl/intldlhome.html>. The Kraus Collection of Sir Francis Drake may be found under the heading: 'Individual Digital Collections.'"

For more information, please contact <http://www.loc.gov/help/contact-international.html>.


Nature Publishing Group announces change to self-archiving policy

January 10, 2005 - "Nature Publishing Group: As of January 2005, authors of original research papers published by Nature Publishing Group (NPG) will be encouraged to submit the author's version of the accepted, peer-reviewed manuscript to their relevant funding body's archive, for release six months after publication. In addition, authors will also be encouraged to archive their version of the manuscript in their institution's repositories (as well as on their personal web sites), also six months after the original publication."

"This policy has been developed to extend the reach of scientific communications, and to meet the needs of authors and the evolving policies of funding agencies that may wish to archive the research they fund. It is also designed to protect the integrity and authenticity of the scientific record, with the published version clearly identified as the definitive version of the article...."

"...NPG recognizes the balance of rights held by publishers, authors, their institutions and their funders (Zwolle Principles, 2002), and has been a progressive and active participant in the recent debates about access to the literature (see http://www.nature.com/nature/focus/accessdebate/). In 2002, NPG was one of the first publishers to allow authors to post their contributions on their personal web sites, by requesting an exclusive license-to-publish, rather than requiring authors to transfer copyright. We see this most recent development as another step forward in the evolution of scientific communication on the Internet."

For more information, please contact David Hoole, Nature Publishing Group, The Macmillan Building, 4 Crinan Street, London, N1 9XW, UK, Phone: +44 (0)20 7843 4727.


OCLC and Antarctica Systems, Inc. to test library users' search preferences

January 10, 2005 - Announced in OCLC ABSTRACTS, (Vol. 8, No. 2) "OCLC is launching a pilot to evaluate library users' experiences with searching and display of search results using a visual interface developed by Antarctica Systems, Inc. The pilot will run from January through April 2005 and will be implemented on a database of electronic books that will be available to all users of the OCLC Base Package and the OCLC Collection on the OCLC FirstSearch service."

"Antarctica Systems, Inc. will use its VisualNet data visualization software to create a visual interface to the electronic books database. When users select the electronic books database on FirstSearch, they will be given the option to use the visual interface for searching and viewing results. OCLC will conduct a user survey to gauge feedback during this pilot and will also collect usage statistics that will be evaluated for future applications."

For more information, please see OCLC ABSTRACTS at <http://www5.oclc.org/downloads/design/abstracts/01102005/index.htm>.


Libraries for the Blind Launch Digital Audio Book Service

January 5, 2005 - "State libraries for the blind in Colorado, Delaware, Illinois, New Hampshire, and Oregon, along with the National Library Service for the Blind and Physically Handicapped (NLS), part of the Library of Congress, have partnered to launch an innovative digital audio book service for visually impaired users."

"Unabridged (http://www.unabridged.info/) enables blind patrons to check out and download digital spoken word audio books directly to their computers. The digital audio books can then be played back on a PC, transferred to a portable playback device, or burned onto CDs."

"...The first year of the program will serve as the pilot phase, with a limited number of users in each participating state."

For more information, please see the full press release at <http://www.unabridged.info/pressrelease20050105.htm>.


Digital Reference Standard In Trial Use

January 5, 2005 - (From NISO Newsline, January 2005) "NISO's Draft Networked Reference Standard, which defines a method and structure for data exchange in digital reference services, is available freely for trial use (http://www.loc.gov/standards/netref) through April 5, 2005. NISO Committee AZ developed the Question/Answer Transaction Protocol, known informally as NetRef, to support exchanges between library patrons and reference sources."

"Digital reference services constitute a new but rapidly growing extension of the traditional reference service offered to library patrons. While the service may be delivered via real-time chat or asynchronous e-mail, the essential characteristic of the service is the ability of the patron to submit questions and to receive answers via electronic means. The standard responds to strong interest in the information community in evolving localized network reference services into more fully interconnected, collaborative reference services."

For more information, please see NISO Newsline at <http://www.niso.org/news/newsline/NISONewsline-Jan2005.html>.


Digital Library of Mathematics

December 18, 2004, announcement from Dr. M. Krishnamurthy: "The Indian Statistical Institute Library (http://library.isibang.ac.in:8080/dspace) is pleased to announce it has installed the DSPACE digital library for Mathematics and Statistics to provide an open platform and make access to digital information."

"The Indian Statistical Institute Bangalore center is one among three important centers of ISI. The Indian Statistical Bangalore Center was established in the year 1976 to meet the academic and research interests of students, scholars, teachers and others."


University of Southampton to provide free access to academic research

December 15, 2004 - "The University of Southampton is to make all its academic and scientific research output freely available."

"A decision by the University to provide core funding for its Institutional Repository establishes it as a central part of its research infrastructure, marking a new era for Open Access to academic research in the UK."

"Until now, the databases used by universities to collect and disseminate their research output have been funded on an experimental basis by JISC (the Joint Information Systems Committee). The University of Southampton is the first in the UK to announce that it is transitioning its repository from the status of an experiment to an integral part of the research infrastructure of the institution."

For more information, please see the full press release at <http://www.ecs.soton.ac.uk/news/667>.


Online library seeks volunteers for the frontline

December 13, 2004 - "The first phase of the new People's Network Service is launched to library professionals today. The People's Network Online Enquiry Service will deliver a real-time information service to the public by providing 'live' access to library and information professionals across the internet."

"Developed by the Museums, Libraries and Archives Council (MLA), the service is being delivered collaboratively by public library staff across England. Currently 29 library authorities in England are piloting the service. MLA is now inviting library colleagues across the country to trial the new service by asking sample questions. MLA hopes that by the end of 2005 all 149 library authorities in England will contribute to the service."

"This new phase of the People's Network will give the public an additional route to the kind of high-quality information and reference advice only available from the library profession. The service will be free, and they will be able to ask questions in 'real-time', dealing directly with a member of library staff online."

For more information, please see the full press release at <http://www.mla.gov.uk/news/press_article.asp?articleid=758>.

Copyright 2005 © Corporation for National Research Initiatives



doi:10.1045/january2005-inbrief