OCLC Internet Cataloging Project Colloquium
Position Paper
by Diane Vizine-Goetz
OCLC Office of Research and Special Projects
Classification experts and librarians have long recognized the potential of library classification schemes for improving subject access to information. In a 1983 article, Svenonius describes several uses for classification in online retrieval systems, including the following: (1) to improve precision or recall, (2) to provide context for search terms, (3) to enable browsing, and (4) to serve as a mechanism for switching between languages. In the Dewey Decimal Classification (DDC) Online Project (Markey and Demeyer 1986), Markey demonstrated the first implementation of a library classification scheme for end-user subject access, browsing, and display. Although many online catalogs provide call number browsing, few employ classification in the manner described by Svenonius or explored by Markey, whose experimental online catalog made innovative use of the DDC by enabling users to search and browse online classification data. Only recently, some ten years after Markey's pioneering research, is online classification data once again being seriously viewed as a tool for providing advanced browsing and retrieval capabilities in online systems.
One factor that has contributed to the slow adoption of classification as a retrieval tool is that the DDC and the Library of Congress Classification (LCC) have only recently been converted into machine-readable form. The computerization of the DDC began with the production of DDC 19 (1979) from computer-based photocomposition tapes. This development and the Markey study prompted Forest Press, in 1984, to commission Inforonics to develop an online editorial support system (ESS) for the Dewey Classification. See Finni and Paulson (1987) for a description of the development of the Dewey ESS. The resulting system and database were used to produce DDC 20 (1989), the first classification to be produced using an online editorial support system.
A different path is being taken in the conversion of the LCC to machine-readable form. Recognizing the benefits of online classification data for maintenance and distribution of LCC, the Library of Congress began developing the USMARC Format for Classification Data in 1987. The format was given provisional approval in 1990, and shortly afterward the Library of Congress began converting the forty-six LCC schedules. The LCC database is expected to contain over 450,000 classification records when complete. See Guenther (1992) for a summary of the development and implementation of the USMARC classification format.
Electronic versions of the DDC and LCC make it possible to realize the potential of library classification to improve subject retrieval; however, much of the renewed interest in classification as an organizing and retrieval device for information resources has been sparked by the growth in usage of the Internet and World Wide Web (WWW).
Several WWW sites give users the ability to perform word or phrase searches to retrieve items of interest. Two popular sites, Yahoo and Infoseek, also allow users to navigate through a series of subject categories to discover potentially relevant documents. Although Yahoo and Infoseek use essentially the same input (WWW documents and Internet newsgroup files) as the basis for their subject structures, the resulting categories displayed to users are quite different. The broad subject "Education" is found at the top level of both Yahoo and Infoseek; however, the next level under "Education" reveals a very different organization of the topic in each system. In Yahoo (see appendix A) [http://www.yahoo.com] over 30 sub-categories are available for browsing education-related topics, while Infoseek (see appendix B) [http://guide.infoseek.com] presents a leaner outline.
Library classification schemes have long provided a similar organizing tool for library materials. The subject categories found in the DDC and LCC are based largely on the topics expressed in monographic material in traditional book format. For printed books, the Dewey Summaries [http://www.oclc.org/fp/] and LC Classification Outline are the library community's functional equivalent to the subject categories of Yahoo and Infoseek. In fact, several noncommercial WWW sites are using DDC and LCC to provide subject access to Web-accessible documents. Some examples are:
Patrick's Subject Catalog
[http://www.slac.stanford.edu/~clancey/dewey.html]
The UK Web Library - Searchable Classified Catalogue of UK Web sites
[http://www.scit.wlv.ac.uk/wwlib/newclass.html]
CyberDewey: A guide to Internet resources organized using Dewey Decimal Classification codes
[http://ivory.lm.com/~mundie/DDHC/DDH.html]
Morton Grove Public Library Webrary
[http://www.nslsilus.org/mgkhome/orrs/webrary.html]
CyberStacks(sm)
[http://www.public.iastate.edu/~CYBERSTACKS/homepage.html]
At a time when both Internet-based classification schemes and traditional library classification systems are being used to provide access to Internet resources, it is appropriate to review the major characteristics of DDC and LCC and to assess whether the electronic versions of these schemes can be successfully extended to the Internet.
Chan, Comaromi, and Satija remind us that the purpose of the Dewey Decimal Classification is to arrange a general collection of materials--"[the DDC] aims to classify books and other material on all subjects in all languages in every kind of library ... ." Similarly, LCC is designed to provide order for a general collection, the collection of the Library of Congress. Although based on the collection of a single library, the LC Classification has been successfully adopted by a majority of U.S. academic and research libraries.
To determine how well library classification systems compare to Internet classifications in terms of general topic coverage, categories 1-10 and 35-45 of Yahoo's 50 most popular categories were compared to DDC and LCC. The results are shown in table 1. All but four Yahoo categories (7, 36, 41, and 45) mapped to explicit DDC or LCC numbers or ranges. Although DDC and LCC both contain provisions for subdivision by geographical area within topics and a geographical breakdown for historical works, no direct mapping could be made for categories 36 and 45, which are essentially geographic areas subdivided by topic. For category 7 (Magazines), all three schemes provide a topical breakdown. Category 41 (Humor, Jokes, and Fun) was the most dispersed when translation to DDC or LCC was attempted. In Dewey, humorous material can be classed by the specific literature or literary form, with specific subjects, etc. A similar situation exists in LCC. The mappings of the other categories indicate that DDC and LCC have sufficiently wide topic coverage for classifying Internet resources. This result is not surprising given that DDC and LCC numbers have been successfully assigned to well over 1.5 million items by the Library of Congress alone, resulting in more than 340,000 unique LCC classes and 280,000 unique DDC classes.
Table 1
Yahoo                         DDC                           LCC
1. Entertainment              Performing arts (791-792)     Performing arts (PN) and
                              and by subject                by subject
2. Computers and Internet     Computers; Internet           Computer Science (QA76+)
                              (004-006)                     & Telecommunication (TK 5105)
3. News                       News media; Broadcast         Newspapers (AN),
                              media (070.1+; 302.23+)       Journalism & Broadcast
                                                            news (PN4699-5648)
4. Recreation                 Recreation (793-799)          Recreation. Leisure (GV)
5. Business and Economy       Economics (330-390)           Economics (H-HJ)
6. Society and Culture        Religion (200), Social        Religion (BL-BX),
                              groups (305) & Culture        Sociology (HM), The
                              and institutions (306)        family. Marriage. Women
                                                            (HQ), Social and Public
                                                            welfare (HV)
7. Entertainment: Magazines   General periodicals (050)     General periodicals (AP)
                              and by subject                and by subject
8. Entertainment: Movies      Motion pictures (791.43)      Motion pictures
   and Films                                                (PN1995.5)
9. Education                  Education (370)               Education (L)
10. Arts                      The Arts (700-799)            Fine Arts (N) and by topic
35. News: International       International news            Newspapers (AN) and by
                              (070.4332)                    place, event
36. Regional: Countries       No direct mapping;            No direct mapping;
                              geographical; treatment       geographical; treatment
                              by subject or historical      by subject or historical
                              treatment by geographical     treatment by geographical
                              area                          area
37. Arts: Photography         Photography (770)             Photography (TR1-1050)
38. Computers and Internet:   Multimedia systems            Computer Science (QA76+)
    Multimedia                (006.6)                       and by subject
39. Entertainment: People     Performers (Entertainers)     Fine Arts: Performing
                              (791.092)                     arts (NX1-820)
40. Society and Culture:      Social Sciences: Customs:     Social Sciences: The
    Relationships: Dating     Life cycle: Dating            Family. Marriage. Woman:
                              (306.7+; 392.6; 646.7+)       Dating (HQ801-801.83)
41. Entertainment: Humor,     No direct mapping; by         No direct mapping; by
    Jokes, and Fun            literary form, subject,       literary form, subject,
                              etc.                          etc.
42. Business and Economy:     Finance and investments       Social Sciences: Finance
    Markets and Investments   (332.6)                       (HG)
43. Social Science            Social Sciences (300-399)     Social Sciences (H-HX) &
                              & History (900-999)           History (D-DL, DS, DT, E-F)
44. Entertainment:            Television (791.45)           Drama: Television:
    Television: Shows                                       Broadcasts (PN1992.8)
* 45. Regional: U.S. States   No direct mapping;            No direct mapping;
                              geographical; treatment       geographical; treatment
                              by subject or historical      by subject or historical
                              treatment by geographical     treatment by geographical
                              area                          area
* On February 9, 1996 sites 1-45 of the top 50 sites were given at http://www.yahoo.com/text/popular.html.
A more detailed comparison of Yahoo and DDC was performed to further examine the suitability of library classification for providing access to Internet resources. The Education high-level outline on Yahoo (Fig. 1) was juxtaposed with portions of the Dewey Edition 21 Education outline (Fig. 2) to determine how the two systems differ in scope and coverage. Caption headings in Fig. 2 have been edited for brevity. Of the 39 subcategories under Education on Yahoo, 27 mapped to one or more classes in the DDC education schedules. Category 21, "K-12," was the most dispersed, mapping to 4 different DDC caption headings. Of the 27 topic areas, most mapped to DDC classes 1 to 3 levels deep, and only 4 (those marked with an asterisk) were 5 levels down in the DDC hierarchy. The categories "Conferences," "Companies," "Databases," "Journals," "Magazines," "News," and "Products" are represented by standard subdivisions in Dewey and are not shown in Fig. 2, but could be listed under the general caption heading for education in Dewey or under the specific aspects of education covered by the item. The categories "Courses" and "Programs," which can map to many places in the DDC education schedule (e.g., school lunch programs, multicultural education programs, work-study programs, etc.), were also omitted from Fig. 2 but counted as matching categories. Only three of the categories mapped to DDC classes outside the DDC education schedule: "Lectures," "Libraries," and "Interest groups." This analysis indicates that DDC possesses sufficient depth of coverage in its schedules and tables to be considered a viable tool for accessing Internet resources.
Fig. 1. Yahoo Education High-Level outline
Fig. 2. Dewey Edition 21 Education Outline
Williamson points out that:
Hierarchical relationships are the essence of all classification. Enumerative classifications systems provide a systematic arrangement of subjects according to set of principles based on an accepted philosophy of the organization of knowledge, on patterns established on the basis of literary warrant, and frequently, on a combination of both. However, classified order is not self-evident. Some method or device is required to preserve the relationships among classes, subclasses, topics and subtopics. In some classification systems, for example DDC, these relationships are preserved and may be manipulated through the hierarchical notation. LCC does not fit this pattern. Its notation preserves order but does not reflect hierarchy. ... some other means must be found to preserve those relationships.
In DDC, the sequence of subjects from general to specific is indicated by the number of digits that form the DDC number. For example, when the DDC number 663.223 for the topic "making of red wine" is shown in the context of its Dewey hierarchy, it can be seen that "White Wine," "Red Wine," and "Sparkling Wine" are at the same hierarchical level. The DDC number 663.22, corresponding to the heading "Specific kinds of grape wine," is one digit shorter than the numbers used for particular kinds of wine and is therefore broader than, or superordinate to, those longer numbers. Indentation is also used to indicate hierarchy. Through both notation and indentation, this example shows that each topic except for the main class 600 Technology is subordinate to and part of all the broader classes above it.
600 Technology (Applied sciences)
  660 Chemical engineering and related technologies
    663 Beverage technology
      663.2 Wine and wine making
        663.22 Specific kinds of grape wine
          663.222 White wine
          663.223 Red wine
          663.224 Sparkling wine
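The way the notation carries this hierarchy can be sketched in a few lines of Python. This is only an illustration: the function names are invented for the sketch, and it assumes that every shorter notation names a broader class, which holds for the wine example above but has exceptions elsewhere in the schedules.

```python
def significant_digits(number: str) -> str:
    """Drop the decimal point and the zero padding of a three-digit class."""
    digits = number.replace(".", "")
    if len(digits) <= 3:
        # 600 -> "6", 660 -> "66": trailing zeros are padding, not meaning.
        digits = digits.rstrip("0") or "0"
    return digits


def format_ddc(digits: str) -> str:
    """Turn a bare digit string back into DDC notation ('66322' -> '663.22')."""
    padded = digits.ljust(3, "0")
    return padded if len(padded) == 3 else f"{padded[:3]}.{padded[3:]}"


def broader_classes(number: str) -> list[str]:
    """List the broader classes implied by the notation, most general first."""
    digits = significant_digits(number)
    result: list[str] = []
    for length in range(1, len(digits)):
        candidate = format_ddc(digits[:length])
        if candidate not in result:
            result.append(candidate)
    return result


def are_coordinate(a: str, b: str) -> bool:
    """True when two different numbers share the same immediate broader class."""
    da, db = significant_digits(a), significant_digits(b)
    return da != db and len(da) == len(db) and da[:-1] == db[:-1]


if __name__ == "__main__":
    print(broader_classes("663.223"))
    print(are_coordinate("663.222", "663.224"))
```

Truncating one digit at a time reproduces the chain 600, 660, 663, 663.2, 663.22 shown in the display, and two numbers of equal length that differ only in the last digit are treated as coordinate classes.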
In both the LC Classification and in Yahoo's category trees, hierarchy is indicated by the indentation of category or class labels. To illustrate, consider the following class numbers and headings listed in the LC Classification QA schedule:
QA76.33 Computer Camps
QA76.38 Hybrid Computers
QA76.4 Analog Computers
QA76.5 Digital Computers
Given only the notation (class numbers) and captions (headings), it is unclear what relationship exists among the ordered classes. When these classes are placed in the context of the LCC hierarchy, in a display similar to what is found in the printed schedules, the indentation clearly indicates that Hybrid, Analog, and Digital computers are at the same level of hierarchy and that QA76.33 is a subcategory under Study and Teaching and not a type of hybrid computer.
QA71-QA90 Instruments and machines
  QA75-QA76.95 Calculating machines
    QA76 Electronic computers. Computer Science
      QA76.27 Study and Teaching
        QA76.33 Computer Camps
      QA76.38 Hybrid Computers
      QA76.4 Analog Computers
      QA76.5 Digital Computers
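Because the LCC notation does not express hierarchy, a program has to recover the parent of each class from the indentation of the display. The Python below is a rough sketch of that idea; the outline text, the two-space indentation, and the function name are assumptions made for the example rather than features of any LCC data format.

```python
from typing import Optional

# The LCC classes discussed above, with indentation standing in for hierarchy.
OUTLINE = """\
QA71-QA90 Instruments and machines
  QA75-QA76.95 Calculating machines
    QA76 Electronic computers. Computer Science
      QA76.27 Study and Teaching
        QA76.33 Computer Camps
      QA76.38 Hybrid Computers
      QA76.4 Analog Computers
      QA76.5 Digital Computers
"""


def parse_outline(text: str) -> dict[str, Optional[str]]:
    """Map each class number to the class number of its parent (None at the top)."""
    parents: dict[str, Optional[str]] = {}
    stack: list[tuple[int, str]] = []  # (indent width, class number) of open ancestors
    for line in text.splitlines():
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip(" "))
        notation = line.split()[0]
        # Close every class that is not an ancestor of the current line.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        parents[notation] = stack[-1][1] if stack else None
        stack.append((indent, notation))
    return parents


if __name__ == "__main__":
    for number, parent in parse_outline(OUTLINE).items():
        print(number, "->", parent)
```

With the parent of each class stored explicitly, a browser can tell that QA76.33 belongs under QA76.27 even though nothing in the class numbers themselves says so.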
Yahoo subcategory trees also use indentation to indicate hierarchy. For example, the following hierarchy is found under "Computers and Internet":
Computers and Internet
  Internet
    Entertainment
      Interesting Devices Connected to the Net
        Spy Cameras
          Indoor Cameras
          Outdoor Cameras
          Pets@
            Aquariums
The preceding examples demonstrate that both Internet classification schemes and library classification schemes provide hierarchical structures capable of supporting topic browsing. Library schemes would seem to have some advantage over Internet-based schemes because they are accompanied by notations that facilitate the manipulation of class relationships. Recall that DDC's notation can be used to navigate broader, narrower, and coordinate relationships among classes, while LCC's can be used to arrange related topics in order. Yahoo's hierarchy structure requires encoding to take advantage of relationships among classes.
Links to other subject-oriented schemes
Library classification schemes are generally considered to be retrospective: classes are added or revised only after sufficient literary warrant is demonstrated, and classes are removed with even greater caution. For these reasons, much greater attention needs to be given to employing the implicit and explicit links between library classification systems and other subject-oriented schemes. For example, Electronic Dewey, the electronic version of DDC20, includes a statistical mapping from the OCLC Online Union Catalog of up to five of the most frequently used LCSH to each Dewey number. This Electronic Dewey feature, which has been well received by users, provides additional indexing terms to lead users to appropriate topic areas in Dewey. In addition to statistical mappings, the Electronic DDC21 database will include many DDC/LCSH links that have been reviewed editorially. Links similar to those made in Electronic Dewey can be made for LCC and LCSH by processing bibliographic records containing fields for both. For LCSH and LCC, explicit links are also available in LC Subject Authority records that contain LC classification number fields. In an analysis of the LC Subject Authority file, Vizine-Goetz and Markey found that about 43% of topical subject heading records (MARC tag 150) contain LC classification number fields. Science and technology classes account for almost half (47.72%) of the LC class numbers.
In addition to providing supplemental vocabulary for topics already represented in class schedules, linking DDC, for example, with other subject thesauri provides a mechanism for allowing new topics to be represented in the classification even if each is not supplied with its own number. For example, the LC subject heading Microsoft Network (Online service), listed among the "Subject Headings of Current Interest" in CSB, No. 70 (Fall 1995), can be linked to (among others) DDC number 025.04 "Automated information storage and retrieval systems" and to DDC number 004.678 "Internet." This subject heading, however, has been assigned to only four LC MARC records with DDC number 025.04 and therefore may not be among the top 5 LCSH statistically mapped to this number. The ability to map current terminology into DDC and LCC is particularly important if library classification is to be used to provide access to Internet resources.
Classification experts have long recognized the potential of DDC to serve as a mechanism for switching between languages. With the recent publication of DDC in French and Spanish, and with a Russian translation scheduled for publication in December 1997, it may now be possible to realize this capability.
Table 2 shows DDC captions in English, French, and Spanish for three DDC classes on the topic microcomputers. Captions and relative index terms in translation databases could be used to provide a multilingual subject browser to a database of Internet-accessible resources that have been assigned DDC numbers, such as OCLC's NetFirst database.
Table 2. DDC Captions
DDC Class Number  English                     Spanish                       French
004.1             General works on specific   Obras generales sobre tipos   Ouvrages généraux sur les
                  types of computers          específicos de computadores   différents types d'ordinateurs
004.16            Digital microcomputers      Microcomputadores digitales   Micro-ordinateurs
004.165           Specific digital            Microcomputadores digitales   Micro-ordinateurs particuliers
                  microcomputers              específicos
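Since the class number is the same in every translation, it can act as the switching key between languages. The Python below sketches a lookup built from the three classes in table 2; the dictionary layout, language codes, and function names are assumptions made for the example and do not describe the NetFirst database or any translation database.

```python
# Captions from table 2, keyed by DDC number and then by language code.
CAPTIONS = {
    "004.1": {
        "en": "General works on specific types of computers",
        "es": "Obras generales sobre tipos específicos de computadores",
        "fr": "Ouvrages généraux sur les différents types d'ordinateurs",
    },
    "004.16": {
        "en": "Digital microcomputers",
        "es": "Microcomputadores digitales",
        "fr": "Micro-ordinateurs",
    },
    "004.165": {
        "en": "Specific digital microcomputers",
        "es": "Microcomputadores digitales específicos",
        "fr": "Micro-ordinateurs particuliers",
    },
}


def find_classes(term: str, lang: str) -> list[str]:
    """Return the DDC numbers whose caption in `lang` contains the search term."""
    needle = term.casefold()
    return sorted(
        number
        for number, captions in CAPTIONS.items()
        if needle in captions[lang].casefold()
    )


def switch_language(term: str, source: str, target: str) -> dict[str, str]:
    """Search captions in one language and return the matches in another."""
    return {number: CAPTIONS[number][target] for number in find_classes(term, source)}


if __name__ == "__main__":
    print(switch_language("microcomputadores", "es", "en"))
```

A Spanish-language search for "microcomputadores" finds classes 004.16 and 004.165, and the same numbers then retrieve the English or French captions, or any resources that have been assigned those numbers.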
This paper examines several characteristics of DDC and LCC classification schemes that make them suitable for providing subject access to Internet resources. To review, DDC and LCC are
Despite these favorable properties, additional improvements are needed if online classification data is to be used as a major tool for providing online subject access to traditional collections as well as to Internet-accessible resources. The following improvements are recommended:
Recommendations 1 and 2 are not new. Over ten years ago, Karen Markey advocated that similar improvements be made to the DDC to facilitate its use in online catalogs. Arnold Wajenberg (1983) proposed a scheme for encoding DDC numbers to enhance automated subject retrieval. Fortunately, this time there appears to be both the interest and the resources to make needed enhancements to online classification data. For example, a project has been established to transform captions in the 1000 DDC summaries (the first three digits in Dewey) into end-user language [DDC ALA Midwinter Conference Report, January 1996; http://www.oclc.org/oclc/fp/news/9602ala.htm]. The recast summaries will be used in the prototype of a Dewey-based subject browser for the NetFirst database of Internet-accessible resources. While this is a good first step, it will be necessary to look well beyond the first three levels of Dewey to captions at lower levels of the DDC hierarchy, since many of the DDC numbers assigned to NetFirst records extend four or more digits past the decimal point. These efforts will advance recommendations 1 and 6.
In the context of DDC, work on recommendations 3 and 4 is also underway. Dewey editorial staff and OCLC research staff are collaborating on projects to enhance the electronic version of the Dewey editorial database with selected LC subject headings from the Weekly Lists and headings with high postings in the OCLC Online Union Catalog database. Editorial staff will also add coding to the Editorial Support System database to indicate links between Dewey Relative Index terms and LCSH and Sears headings. For LCC, the LC Cataloging Policy and Support Office is reviewing the index structure of the LCC schedules and is consulting with classification expert Lois Chan on the design of a combined index to LCC. It is very likely that this work could lead to future efforts to form better links between LCC and LCSH.
The projects described above indicate a commitment by the owners and maintainers of DDC and LCC to improve these systems for automated subject retrieval. If Internet resource catalogers display a similar commitment to assigning class numbers to the bibliographic records they create, online classification data can form an important bridge between library methods for organizing materials and Internet-based techniques for accessing electronic collections. Furthermore, DDC- and LCC-based interfaces will provide users with a common interface to traditional and electronic libraries.
Chan, Lois Mai, John P. Comaromi and Mohinder P. Satija. 1994. Dewey Decimal Classification: a practical guide. Albany, N.Y.: Forest Press. p. 6.
Finni, John J., and Peter J. Paulson. 1987. "The Dewey Decimal Classification enters the computer age: developing the DDC database(TM) and Editorial Support System." International Cataloguing 16(4):46-48 (October/December 1987).
Guenther, Rebecca S. 1992. "The Development and Implementation of the USMARC Format for Classification Data." Information Technology and Libraries 11(2):120-131 (June 1992).
Markey, Karen, and Anh N. Demeyer. 1986. Dewey Decimal Classification Online Project: Evaluation of a Library Schedule and Index Integrated into the Subject Searching Capabilities of an Online Catalog. Dublin, Ohio: OCLC Online Computer Library Center, Inc., Office of Research.
Svenonius, Elaine. 1983. "Use of classification in online retrieval." Library Resources and Technical Services 27(1):76-80 (Jan./Mar. 1983).
Vizine-Goetz, Diane, and Karen Markey. 1989. "Characteristics of Subject Heading Records in the Machine-Readable Library of Congress Subject Headings." Information Technology and Libraries 8(2):203-209 (June 1989).
Wajenberg, Arnold S. 1983. "MARC Coding of DDC for Subject Retrieval." Information Technology and Libraries 2(3):246-251 (September 1983).
Williamson, Nancy J. 1995. The Library of Congress Classification: a content analysis of the schedules in preparation for their conversion into machine-readable form. Washington, D. C.: Library of Congress, Cataloging Distribution Service. p. 17.
The author is grateful for the valuable comments of Joan S. Mitchell, Editor, Dewey Decimal Classification, and for assistance from Barbara A. Brownell, Technical Processing Specialist, OCLC.