OCLC Internet Cataloging Project Colloquium
Position Paper

Using Library Classification Schemes for Internet Resources

by Diane Vizine-Goetz
OCLC Office of Research and Special Projects


Classification experts and librarians have long recognized the potential of library classification schemes for improving subject access to information. In a 1983 article, Svenonius describes several uses for classification in online retrieval systems, including the following: (1) to improve precision or recall, (2) to provide context for search terms, (3) to enable browsing, and (4) to serve as a mechanism for switching between languages. In the Dewey Decimal Classification (DDC) Online Project (Markey and Demeyer 1986), Markey demonstrated the first implementation of a library classification scheme for end-user subject access, browsing, and display. Although many online catalogs provide call number browsing, few employ classification in the manner described by Svenonius or explored by Markey, whose experimental online catalog used the DDC to let users search and browse online classification data. Only recently, some ten years after Markey's pioneering research, is online classification data once again being seriously viewed as a tool for providing advanced browsing and retrieval capabilities in online systems.

Online Classification Data

One factor that has contributed to the slow adoption of classification as a retrieval tool is that the DDC and the Library of Congress Classification (LCC) have only recently been converted into machine-readable form. The computerization of the DDC began with the production of DDC 19 (1979) from computer-based photocomposition tapes. This development and the Markey study prompted Forest Press, in 1984, to commission Inforonics to develop an online editorial support system (ESS) for the Dewey Classification. See Finni and Paulson (1987) for a description of the development of the Dewey ESS. The resulting system and database were used to produce DDC 20 (1989), the first classification to be produced using an online editorial support system.

A different path is being taken in the conversion of the LCC to machine-readable form. Recognizing the benefits of online classification data for maintenance and distribution of LCC, the Library of Congress began developing the USMARC Format for Classification Data in 1987. The format was given provisional approval in 1990, and shortly afterward the Library of Congress began converting the forty-six LCC schedules. The LCC database is expected to contain over 450,000 classification records when complete. See Guenther (1992) for a summary of the development and implementation of the USMARC classification format.

Classification and the Internet

Electronic versions of the DDC and LCC make it possible to realize the potential of library classification to improve subject retrieval; however, much of the renewed interest in classification as an organizing and retrieval device for information resources has been sparked by the growth in usage of the Internet and World Wide Web (WWW).

Several WWW sites give users the ability to perform word or phrase searches to retrieve items of interest. Two popular sites, Yahoo and Infoseek, also allow users to navigate through a series of subject categories to discover potentially relevant documents. Although Yahoo and Infoseek use essentially the same input (WWW documents and Internet newsgroup files) as the basis for their subject structures, the resulting categories displayed to users are quite different. The broad subject "Education" is found at the top level of both Yahoo and Infoseek; however, the next level under "Education" reveals a very different organization of the topic in each system. In Yahoo (see appendix A) [http://www.yahoo.com] over 30 sub-categories are available for browsing education-related topics, while Infoseek (see appendix B) [http://guide.infoseek.com] presents a leaner outline.

Library classification schemes have long provided a similar organizing tool for library materials. The subject categories found in the DDC and LCC are based largely on the topics expressed in monographic material in traditional book format. For printed books, the Dewey Summaries [http://www.oclc.org/fp/] and LC Classification Outline are the library community's functional equivalent to the subject categories of Yahoo and Infoseek. In fact, several noncommercial WWW sites are using DDC and LCC to provide subject access to Web-accessible documents. Some examples are:

DDC

Patrick's Subject Catalog
[http://www.slac.stanford.edu/~clancey/dewey.html]

The UK Web Library - Searchable Classified Catalogue of UK Web sites
[http://www.scit.wlv.ac.uk/wwlib/newclass.html]

CyberDewey: A guide to Internet resources organized using Dewey Decimal Classification codes
[http://ivory.lm.com/~mundie/DDHC/DDH.html]

Morton Grove Public Library Webrary
[http://www.nslsilus.org/mgkhome/orrs/webrary.html]

LCC

CyberStacks(sm) [http://www.public.iastate.edu/~CYBERSTACKS/homepage.html]

At a time when both Internet-based classification schemes and traditional library classification systems are being used to provide access to Internet resources, it is appropriate to review the major characteristics of DDC and LCC and to assess whether the electronic versions of these schemes can be successfully extended to the Internet.

Major Characteristics of DDC and LCC

DDC and LCC Are General Classification Systems

Chan, Comaromi, and Satija remind us that the purpose of the Dewey Decimal Classification is to arrange a general collection of materials--"[the DDC] aims to classify books and other material on all subjects in all languages in every kind of library ... ." Similarly, LCC is designed to provide order for a general collection, the collection of the Library of Congress. Although based on the collection of a single library, the LC Classification has been successfully adopted by a majority of U.S. academic and research libraries.

To determine how well library classification systems compare to Internet classifications in terms of general topic coverage, categories 1-10 and 35-45 of Yahoo's 50 most popular categories were compared to DDC and LCC. The results are shown in table 1. All but four Yahoo categories (7, 36, 41, and 45) mapped to explicit DDC or LCC numbers or ranges. Although DDC and LCC both contain provisions for subdivision by geographical area within topics and a geographical breakdown for historical works, no direct mapping could be made for categories 36 and 45, which are essentially geographic areas subdivided by topic. For category 7 (Magazines) all three schemes provide a topical breakdown. Category 41 (Humor, Jokes, and Fun) was the most dispersed when translation to DDC or LCC was attempted. In Dewey, humorous material can be classed by the specific literature or literary form, with specific subjects, etc. A similar situation exists in LCC. The mappings of the other categories indicate that DDC and LCC have sufficiently wide topic coverage for classifying Internet resources. This result is not surprising given that DDC and LCC numbers have been successfully assigned to well over 1.5 million items by the Library of Congress alone, resulting in more than 340,000 unique LCC classes and 280,000 unique DDC classes.

Table 1


  Yahoo                     DDC                        LCC                        

1.  Entertainment           Performing arts (791-792)  Performing arts (PN) and
                            and by subject             by subject                 

2.  Computers and Internet  Computers; Internet        Computer Science; (QA76+)  
                            (004-006)                  & Telecommunication (TK    
                                                       5105)                      

3.  News                    News media; Broadcast      Newspapers (AN),           
                            media (070.1+; 302.23+)    Journalism & Broadcast     
                                                       news (PN4699-5648)         

4.  Recreation              Recreation (793-799)       Recreation. Leisure (GV)   

5.  Business and Economy    Economics (330-390)        Economics (H-HJ)           

6.  Society and Culture     Religion (200), Social     Religion (BL-BX)           
                            groups (305) & Culture     Sociology (HM), The        
                            and institutions (306)     family. Marriage. Women    
                                                       (HQ), Social and Public    
                                                       welfare (HV)               

7.  Entertainment:          General periodicals (050)  General periodicals (AP)   
    Magazines               and by subject             and by subject             

8.  Entertainment: Movies   Motion pictures (791.43)   Motion pictures            
    and Films                                          (PN1995.5)                 

9.  Education               Education (370)            Education (L)              

10. Arts                    The Arts (700-799)         Fine Arts (N) and by
                                                       topic                      

35. News: International     International news         Newspapers (AN) and by
                            (070.4332)                 place, event               

36. Regional: Countries     No direct mapping;         No direct mapping;         
                            geographical; treatment    geographical; treatment    
                            by subject or historical   by subject or historical   
                            treatment by geographical  treatment by geographical  
                            area                       area                       

37. Arts: Photography       Photography (770)          Photography (TR1-1050)

38. Computers and           Multimedia systems         Computer Science (QA76+)   
    Internet: Multimedia    (006.6)                    and by subject             

39. Entertainment: People   Performers (Entertainers)  Fine Arts: Performing      
                            (791.092)                  arts (NX1-820)             

40. Society and Culture:    Social Sciences: Customs:  Social Sciences: The       
    Relationships: Dating   Life cycle: Dating         Family.  Marriage. Woman:  
                            (306.7+; 392.6; 646.7+)    Dating (HQ801-801.83)      

41. Entertainment:          No direct mapping; by      No direct mapping; by      
    Humor, Jokes, and Fun   literary form, subject,    literary form, subject,    
                            etc.                       etc.                       

42. Business and            Finance and investments    Social Sciences: Finance   
    Economy: Markets and    (332.6)                    (HG)                       
    Investments                                                                      

43. Social Science          Social Sciences (300-399)  Social Sciences (H-HX) &
                            & History (900-999)        History (D-DL, DS, DT,
                                                       E-F)

44. Entertainment:          Television (791.45)        Drama: Television:
    Television: Shows                                  Broadcasts (PN1992.8)

45. Regional: U.S.          No direct mapping;         No direct mapping;
    States *                geographical; treatment    geographical; treatment
                            by subject or historical   by subject or historical
                            treatment by geographical  treatment by geographical
                            area                       area

* On February 9, 1996 sites 1-45 of the top 50 sites were given at http://www.yahoo.com/text/popular.html.

A more detailed comparison of Yahoo and DDC was performed to further examine the suitability of library classification for providing access to Internet resources. The Education high-level outline on Yahoo (Fig. 1) was juxtaposed with portions of the Dewey Edition 21 Education outline (Fig. 2) to determine how the two systems differ in scope and coverage. Caption headings in Fig. 2 have been edited for brevity. Of the 39 subcategories under Education on Yahoo, 27 mapped to one or more classes in the DDC education schedules. Category 21, "K-12," was the most dispersed, mapping to 4 different DDC caption headings. Of the 27 topic areas, most mapped to DDC classes 1 to 3 levels deep, and only 4 (those marked with an asterisk) were 5 levels down in the DDC hierarchy. The categories "Conferences," "Companies," "Databases," "Journals," "Magazines," "News," and "Products" are represented by standard subdivisions in Dewey and are not shown in Fig. 2, but they could be listed under the general caption heading for education in Dewey or under the specific aspect of education covered by the item. The categories "Courses" and "Programs," which can map to many places in the DDC education schedule (e.g., school lunch programs, multicultural education programs, work-study programs), were also omitted from Fig. 2 but counted as matching categories. Only three of the categories mapped to DDC classes outside the DDC education schedule: "Lectures," "Libraries," and "Interest groups." This analysis indicates that DDC possesses sufficient depth of coverage in its schedules and tables to be considered a viable tool for accessing Internet resources.

Fig. 1. Yahoo Education High-Level outline

  • Education (General) (# 12.)
    o Education for specific objectives
      - Vocational schools (# 38.)
    o Educational research; related topics
      - Teacher training, institutes (# 17.)
  • Schools and their activities; special education
    o Specific kinds of schools
      - Alternative education (# 2.)
    o Teachers and teaching (# 35.)
      - Community-school relations
        # Parent-school relations
          * Parent-teacher associations (# 30.)
    o School administration; administration of student academic activities
      - Student aid and cooperative education (# 11.)
        # Scholarships, fellowships and grants (# 14.)
      - Examinations and tests; placement (# 9.)
    o Methods of instruction and study
      - Instructional technology (# 18.)
        # Audiovisual
          * Television (# 36.)
        # Computers (# 29.)
    o Student guidance and counsel
      - Educational and vocational guidance (# 15.)
    o School discipline and related activities
      - Student government (# 30.)
    o Special education (# 34.)
  • Elementary education
    o Specific levels of elementary education
      - Preschool education
        # Kindergarten (# 21.)
      - Specific levels of elementary school (# 21.)
    o Computers and science
      - Science and technology (# 26.)
        # Environmental studies (# 10.)
    o Language arts and literacy (# 25.)
      - Foreign languages and bilingual education (# 22.)
    o Math (# 26.)
    o Other studies
      - Social studies (# 33.)
  • Secondary education
    o Secondary schools and programs
      - Specific levels of secondary education
        # Junior high schools; Middle schools (# 21.)
        # Senior high schools (# 21.)
  • Adult education (# 1.)
    o For specific objectives (Work training) (# 39.)
  • Higher education (# 16.)
    o Colleges and universities (# 37.)
    o Organization and activities
      - Specific levels of higher education
        # Undergraduate colleges
          * Junior and two-year community colleges (# 4.)
      - Administration of student academic activities
        # College admissions
          * College entrance requirements (# 3.)
  • Public policy issues in education (# 13.)
    Fig. 2. Dewey Edition 21 Education Outline
    


    DDC and LCC Have a Hierarchical Structure

    Williamson points out that:

    Hierarchical relationships are the essence of all classification. Enumerative classifications systems provide a systematic arrangement of subjects according to set of principles based on an accepted philosophy of the organization of knowledge, on patterns established on the basis of literary warrant, and frequently, on a combination of both. However, classified order is not self-evident. Some method or device is required to preserve the relationships among classes, subclasses, topics and subtopics. In some classification systems, for example DDC, these relationships are preserved and may be manipulated through the hierarchical notation. LCC does not fit this pattern. Its notation preserves order but does not reflect hierarchy. ... some other means must be found to preserve those relationships.

    In DDC, the sequence of subjects from general to specific is indicated by the number of digits that form the DDC number. For example, when the DDC number 663.223 for the topic "making of red wine" is shown in the context of its Dewey hierarchy, it can be seen that "White wine," "Red wine," and "Sparkling wine" are at the same hierarchical level. The DDC number 663.22, corresponding to the heading "Specific kinds of grape wine," is one digit shorter than the numbers used for the individual kinds of wine and is therefore broader than (superordinate to) those longer numbers. Indentation is also used to indicate hierarchy. Through both notation and indentation, this example shows that each topic except for the main class 600 Technology is subordinate to and part of all the broader classes above it.

    600       Technology (Applied sciences)
    660         Chemical engineering and related technologies
    663           Beverage technology
    663.2           Wine and wine making
    663.22            Specific kinds of grape wine
    663.222             White wine
    663.223             Red wine
    663.224             Sparkling wine
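
The way DDC notation carries hierarchy can be illustrated with a minimal Python sketch. It assumes the simplified rule described above, namely that every additional digit marks one more level of specificity; the published schedules contain places where notation and hierarchy do not line up this neatly, so this is an illustration and not a faithful model of DDC. The function names are hypothetical.

```python
def _significant_digits(number: str) -> str:
    """Return the digits of a DDC number without the decimal point or filler zeros."""
    digits = number.replace(".", "")
    if len(digits) < 3 or not digits.isdigit():
        raise ValueError(f"not a DDC number: {number!r}")
    if len(digits) == 3:
        # In a three-digit number, trailing zeros are filler (600, 660).
        digits = digits.rstrip("0") or "0"
    return digits


def _format(prefix: str) -> str:
    """Pad a digit prefix to three digits and restore the decimal point."""
    padded = prefix.ljust(3, "0")
    return padded if len(padded) == 3 else f"{padded[:3]}.{padded[3:]}"


def ddc_ancestors(number: str) -> list[str]:
    """List the broader classes of a DDC number, most general first."""
    digits = _significant_digits(number)
    candidates = (_format(digits[:i]) for i in range(1, len(digits)))
    # dict.fromkeys drops repeats (e.g. 600 for 601) while keeping order.
    return [c for c in dict.fromkeys(candidates) if c != number]


def are_coordinate(a: str, b: str) -> bool:
    """Two distinct classes are treated as coordinate if they share a parent."""
    return a != b and ddc_ancestors(a)[-1:] == ddc_ancestors(b)[-1:]


if __name__ == "__main__":
    print(ddc_ancestors("663.223"))
    print(are_coordinate("663.222", "663.224"))
```

For 663.223 the sketch walks back through 663.22, 663.2, 663, 660, and 600, the same chain shown in the display above, using nothing but the number itself.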
    

    In both the LC Classification and in Yahoo's category trees, hierarchy is indicated by the indentation of category or class labels. To illustrate, consider the following class numbers and headings listed in the LC Classification QA schedule:

    QA76.33   Computer Camps
    QA76.38   Hybrid Computers
    QA76.4   Analog Computers
    QA76.5   Digital Computers
    

    Given only the notation (class numbers) and captions (headings), it is unclear what relationship exists among the ordered classes. When these classes are placed in the context of the LCC hierarchical structure, in a display similar to what is found in the printed schedules, the indentation clearly indicates that Hybrid, Analog, and Digital computers are at the same level of hierarchy and that QA76.33 is a subcategory under Study and Teaching and not a type of hybrid computer.

    QA71-QA90       Instruments and machines
    QA75-QA76.95      Calculating machines
    QA76                Electronic computers. Computer Science
    QA76.27               Study and Teaching
    QA76.33                 Computer Camps
    QA76.38               Hybrid Computers
    QA76.4                Analog Computers
    QA76.5                Digital Computers
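
Because LCC notation preserves order but not hierarchy, a machine-readable version has to store the parent of each class explicitly. The Python sketch below shows one way this might look. The parent links are an assumption read off the description of the display above (Computer Camps under Study and Teaching; Hybrid, Analog, and Digital Computers taken to sit directly under QA76), and the names `LCC_OUTLINE`, `depth`, and `render_outline` are hypothetical.

```python
# (caption, parent) for each class, listed in schedule order. The parent
# links are assumed from the indented display; they cannot be derived
# from the class numbers themselves.
LCC_OUTLINE = {
    "QA71-QA90": ("Instruments and machines", None),
    "QA75-QA76.95": ("Calculating machines", "QA71-QA90"),
    "QA76": ("Electronic computers. Computer Science", "QA75-QA76.95"),
    "QA76.27": ("Study and Teaching", "QA76"),
    "QA76.33": ("Computer Camps", "QA76.27"),
    "QA76.38": ("Hybrid Computers", "QA76"),
    "QA76.4": ("Analog Computers", "QA76"),
    "QA76.5": ("Digital Computers", "QA76"),
}


def depth(class_number: str) -> int:
    """Count how many broader classes lie above a class."""
    level = 0
    parent = LCC_OUTLINE[class_number][1]
    while parent is not None:
        level += 1
        parent = LCC_OUTLINE[parent][1]
    return level


def render_outline() -> list[str]:
    """Rebuild the indented display from the stored parent links."""
    return [
        f"{number:<14}{'  ' * depth(number)}{caption}"
        for number, (caption, _parent) in LCC_OUTLINE.items()
    ]


if __name__ == "__main__":
    print("\n".join(render_outline()))
```

The indentation is computed entirely from the stored links: QA76.33 and QA76.38 are adjacent in notation, yet they end up at different levels.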
    

    Yahoo subcategory trees also use indentation to indicate hierarchy. For example, the following hierarchy is found under "Computers and Internet":

    Computers and Internet
      Internet
        Entertainment
          Interesting Devices Connected to the Net
            Spy Cameras
              Indoor Cameras
              Outdoor Cameras
              Pets@
                Aquariums
    

    The preceding examples demonstrate that both Internet classification schemes and library classification schemes provide hierarchical structures capable of supporting topic browsing. Library schemes would seem to have some advantage over Internet-based schemes because they are accompanied by notations that facilitate the manipulation of class relationships. Recall that DDC's notation can be used to navigate broader, narrower, and coordinate relationships among classes, while LCC's can be used to arrange related topics in order. Yahoo's hierarchy structure requires encoding to take advantage of relationships among classes.

    Links to Other Subject-Oriented Schemes

    Library classification schemes are generally considered to be retrospective: classes are added or revised only after sufficient literary warrant is demonstrated, and classes are removed with even greater caution. For these reasons, much greater attention needs to be given to employing the implicit and explicit links between library classification systems and other subject-oriented schemes. For example, Electronic Dewey, the electronic version of DDC 20, includes a statistical mapping from the OCLC Online Union Catalog of up to five of the most frequently used LCSH to each Dewey number. This Electronic Dewey feature, which has been well received by users, provides additional indexing terms to lead users to appropriate topic areas in Dewey. In addition to statistical mappings, the Electronic DDC 21 database will include many DDC/LCSH links that have been reviewed editorially. Links similar to those made in Electronic Dewey can be made for LCC and LCSH by processing bibliographic records that contain fields for both. For LCSH and LCC, explicit links are also available in LC Subject Authority records that contain LC classification number fields. In an analysis of the LC Subject Authority file, Vizine-Goetz and Markey found that about 43% of topical subject heading records (MARC tag 150) contain LC classification number fields. Science and technology classes account for almost half (47.72%) of the LC class numbers.

    In addition to providing supplemental vocabulary for topics already represented in class schedules, linking DDC, for example, with other subject thesauri provides a mechanism for allowing new topics to be represented in the classification even if each is not supplied with its own number. For example, the LC subject heading Microsoft Network (Online service), listed among the "Subject Headings of Current Interest" in CSB, No. 70 (Fall 1995), can be linked to (among others) DDC number 025.04 "Automated information storage and retrieval systems" and to DDC number 004.678 "Internet." This subject heading, however, has been assigned to only four LC MARC records with DDC number 025.04 and therefore may not be among the top 5 LCSH statistically mapped to this number. The ability to map current terminology into DDC and LCC is particularly important if library classification is to be used to provide access to Internet resources.
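
A minimal Python sketch of the two kinds of links discussed above follows: a statistical mapping that keeps the most frequently co-occurring headings for each DDC number, and a supplementary list of editorially assigned links for current terminology. It is not the procedure used for Electronic Dewey, whose details are not given here. The function names are hypothetical, and the sample counts are invented for illustration, apart from the four records for Microsoft Network (Online service) mentioned in the text.

```python
from collections import Counter, defaultdict


def top_headings(records, limit=5):
    """Map each DDC number to its `limit` most frequently co-occurring headings."""
    counts = defaultdict(Counter)
    for ddc_number, headings in records:
        # Count a heading once per record, however often it is repeated there.
        counts[ddc_number].update(set(headings))
    return {
        ddc: [h for h, _ in sorted(c.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]]
        for ddc, c in counts.items()
    }


def index_terms(ddc_number, statistical, editorial):
    """Statistical headings first, then editorial links not already present."""
    terms = list(statistical.get(ddc_number, []))
    terms += [h for h in editorial.get(ddc_number, []) if h not in terms]
    return terms


# Invented record counts for one DDC number; only the count of four for
# the Microsoft Network heading comes from the text.
INVENTED_COUNTS = {
    "Information storage and retrieval systems": 40,
    "Online bibliographic searching": 25,
    "Database searching": 18,
    "Information retrieval": 12,
    "Internet (Computer network)": 9,
    "Microsoft Network (Online service)": 4,
}
RECORDS = [
    ("025.04", [heading])
    for heading, n in INVENTED_COUNTS.items()
    for _ in range(n)
]
EDITORIAL_LINKS = {"025.04": ["Microsoft Network (Online service)"]}

if __name__ == "__main__":
    statistical = top_headings(RECORDS)
    print(statistical["025.04"])
    print(index_terms("025.04", statistical, EDITORIAL_LINKS))
```

With these invented counts the new heading falls outside the top five and reaches the index only through the editorial link, which is the situation the paragraph above describes.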

    Links to Editions in Other Languages

    Classification experts have long recognized the potential of DDC to serve as a mechanism for switching between languages. With the recent publication of DDC in French and Spanish, and with a Russian translation scheduled for publication in December 1997, it may now be possible to realize this capability.

    Table 2 shows DDC captions in English, French, and Spanish for three DDC classes on the topic microcomputers. Captions and relative index terms in translation databases could be used to provide a multilingual subject browser to a database of Internet-accessible resources that have been assigned DDC numbers, such as OCLC's NetFirst database.

    Table 2. DDC Captions

    DDC Class  English                  Spanish                  French
    Number

    004.1      General works on         Obras generales sobre    Ouvrages généraux sur
               specific types of        tipos específicos de     les différents types
               computers                computadores             d'ordinateurs

    004.16     Digital microcomputers   Microcomputadores        Micro-ordinateurs
                                        digitales

    004.165    Specific digital         Microcomputadores        Micro-ordinateurs
               microcomputers           digitales específicos    particuliers
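
The language-switching idea can be sketched in a few lines of Python: because a caption in any edition is tied to the same class number, the number can act as a pivot between languages. The captions below are those of Table 2; the dictionary layout, the language codes, and the function names are assumptions made for illustration.

```python
# Captions from Table 2, keyed by DDC number and then by language code.
CAPTIONS = {
    "004.1": {
        "en": "General works on specific types of computers",
        "es": "Obras generales sobre tipos específicos de computadores",
        "fr": "Ouvrages généraux sur les différents types d'ordinateurs",
    },
    "004.16": {
        "en": "Digital microcomputers",
        "es": "Microcomputadores digitales",
        "fr": "Micro-ordinateurs",
    },
    "004.165": {
        "en": "Specific digital microcomputers",
        "es": "Microcomputadores digitales específicos",
        "fr": "Micro-ordinateurs particuliers",
    },
}


def find_class(caption, language):
    """Return the DDC number whose caption in `language` matches, else None."""
    wanted = caption.casefold()
    for number, by_language in CAPTIONS.items():
        if by_language[language].casefold() == wanted:
            return number
    return None


def switch_language(caption, source, target):
    """Translate a caption by pivoting through its DDC class number."""
    number = find_class(caption, source)
    return None if number is None else CAPTIONS[number][target]


if __name__ == "__main__":
    print(find_class("Micro-ordinateurs", "fr"))
    print(switch_language("Micro-ordinateurs", "fr", "es"))
```

A browser built this way would retrieve resources by the class number found for the user's caption, so records classed from any language edition would be reachable from any other.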


    Library Classification or Internet-based Schemes?

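    This paper examines several characteristics of the DDC and LCC classification schemes that make them suitable for providing subject access to Internet resources. To review, DDC and LCC are general classification systems with hierarchical structures, and they can be linked to other subject-oriented schemes and, in the case of DDC, to editions in other languages.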

    Despite these favorable properties, additional improvements are needed if online classification data is to be used as a major tool for providing online subject access to traditional collections as well as to Internet-accessible resources. The following improvements are recommended:

    1. Evaluate DDC and LCC captions for expressiveness and currency
    2. Decompose and code class number components to identify the specific subject and aspects represented
    3. Continue to add new terminology as index terms even if each is not supplied with its own number
    4. Expand links to other controlled vocabularies
    5. Expand definitions of literary warrant to include Internet resources
    6. Build demonstration systems

    Recommendations 1 and 2 are not new. Over ten years ago, Karen Markey advocated that similar improvements be made to the DDC to facilitate its use in online catalogs. Arnold Wajenberg (1983) proposed a scheme for encoding DDC numbers to enhance automated subject retrieval. Fortunately, this time there appear to be both the interest and the resources to make the needed enhancements to online classification data. For example, a project has been established to transform captions in the 1000 DDC summaries (the first three digits in Dewey) into end-user language [DDC ALA Midwinter Conference Report, January 1996; http://www.oclc.org/oclc/fp/news/9602ala.htm]. The recast summaries will be used in the prototype of a Dewey-based subject browser for the NetFirst database of Internet-accessible resources. While this is a good first step, it will be necessary to look well beyond the first three levels of Dewey to captions at lower levels of the DDC hierarchy, since many of the DDC numbers assigned to NetFirst records extend four or more digits past the decimal point. These efforts will advance recommendations 1 and 6.

    In the context of DDC, work on recommendations 3 and 4 is also underway. Dewey editorial staff and OCLC research staff are collaborating on projects to enhance the electronic version of the Dewey editorial database with selected LC subject headings from the Weekly Lists and with headings that have high postings in the OCLC Online Union Catalog database. Editorial staff will also add coding to the Editorial Support System database to indicate links between Dewey Relative Index terms and LCSH and Sears headings. For LCC, the LC Cataloging Policy and Support Office is reviewing the index structure of the LCC schedules and is consulting with classification expert Lois Chan on the design of a combined index to LCC. This work may well lead to future efforts to form better links between LCC and LCSH.

    The projects described above indicate a commitment by the owners and maintainers of DDC and LCC to improve these systems for automated subject retrieval. If Internet resource catalogers display a similar commitment to assigning class numbers to the bibliographic records they create, online classification data can form an important bridge between library methods for organizing materials and Internet-based techniques for accessing electronic collections. Furthermore, DDC- and LCC-based interfaces will provide users with a common interface to traditional and electronic libraries.

    References

    Chan, Lois Mai, John P. Comaromi and Mohinder P. Satija. 1994. Dewey Decimal Classification: a practical guide. Albany, N.Y.: Forest Press. p. 6.

    Finni, John J., and Peter J. Paulson. 1987. "The Dewey Decimal Classification enters the computer age: developing the DDC database(TM) and Editorial Support System." International Cataloguing 16(4):46-48 (October/December 1987).

    Guenther, Rebecca S. 1992. "The Development and Implementation of the USMARC Format for Classification Data." Information Technology and Libraries 11 (2):120-131 (June 1992).

    Markey, Karen, and Anh N. Demeyer. 1986. Dewey Decimal Classification Online Project: Evaluation of a Library Schedule and Index Integrated into the Subject Searching Capabilities of an Online Catalog. Dublin, Ohio: OCLC Online Computer Library Center, Inc., Office of Research.

    Svenonius, Elaine. 1983. "Use of classification in online retrieval." Library Resources and Technical Services 27(1):76-80 (Jan./Mar. 1983).

    Vizine-Goetz, Diane, and Karen Markey. 1989. "Characteristics of Subject Heading Records in the Machine-Readable Library of Congress Subject Headings." Information Technology and Libraries 8(2):203-209 (June 1989).

    Wajenberg, Arnold S. 1983. "MARC Coding of DDC for Subject Retrieval." Information Technology and Libraries 2(3): 246-251 (September 1983).

    Williamson, Nancy J. 1995. The Library of Congress Classification: a content analysis of the schedules in preparation for their conversion into machine-readable form. Washington, D.C.: Library of Congress, Cataloging Distribution Service. p. 17.

    Acknowledgments

    The author is grateful for the valuable comments of Joan S. Mitchell, Editor, Dewey Decimal Classification, and for the assistance of Barbara A. Brownell, Technical Processing Specialist, OCLC.

    Appendix

    A. Yahoo Education Topics

    yahoo.gif

    B. Infoseek Education Topics

    infoseek.gif

