D-Lib Magazine
July/August 2014
Volume 20, Number 7/8
Managing Ambiguity In VIAF
Thomas B. Hickey and Jenny A. Toves
Online Computer Library Center, Inc.
{hickey, tovesj}@oclc.org
doi:10.1045/july2014-hickey
Abstract
The Virtual International Authority File (VIAF) is built from tens of millions of names represented in more than 130 million authority and bibliographic records expressed in multiple languages, scripts and formats. VIAF does not replace the source authority data, but creates something new built upon the relations mined from it. A common use of VIAF is in the creation of new 'local' authority records for authors based on information already in VIAF about the entity. VIAF can also be used as an authority file in its own right; for instance, OCLC is now using VIAF as part of its identification of works and expressions. In a series of automated steps these names are linked and combined into VIAF clusters. Ambiguity occurs at several stages in VIAF, from the initial matching to cluster creation. VIAF's approach to managing this gives us a great deal of flexibility to deal with additions, deletions and changes to the underlying authority data. VIAF's approach to clustering has several rather novel aspects. The clustering itself proceeds in multiple stages in what could be called progressive refinement. It uses fairly loose matching to bring in candidates and then gradually brings them into the finished clusters using the information that can be gleaned from the rough groupings to make more informed decisions than could be made a priori. Another aspect is that all the information from all the records is used during the clustering. This results in a more fluid view of identity than hand-built authority files provide, while giving VIAF the ability to react to refinements in the clustering algorithms and new data on a regular basis. Finally, the sheer scale of VIAF provides opportunities the library community has not previously had to analyze and use authority data in machine processing. The problems and approaches used by VIAF may have implications in the use of linked data for other information services.
Background
Authority files (WikiMedia) (Calhoun, 1998) provide identifiers and often a standard for names, titles, series and concepts. The authority records themselves can be quite extensive, codifying and documenting the headings. In general, each authority record has:
- An identifier (typically a number)
- One or more strings showing how the name (or concept) is displayed
- Alternate forms
- Links to related headings
- Information to help differentiate names, e.g. dates
- Explanations of how the heading was derived, as appropriate
VIAF deals with 'names', primarily names of persons and corporations, but it also includes names of FRBR (IFLA Study Group on the Functional Requirements for Bibliographic Records, 1997) works, expressions, and jurisdictional geographic names. Of the terms libraries control, only concepts (topical subjects) are currently outside VIAF's scope.
As of April 2014, VIAF is built from 38 million name records from 36 institutions, along with 104 million bibliographic records associated with those names. Updates to VIAF are processed each month, completely redoing the matching and clustering process. The processing occurs on a 300+ core Hadoop (Apache Hadoop Project) cluster and takes about 12 hours of cluster compute time to complete.
The clusters visible in the VIAF interface (http://viaf.org) are built in several stages:
- Harvest and ingest new authority and bibliographic records
- Associate each authority record with its corresponding bibliographic records
- Identify links within files
- Create processed authority records
- Do pair-wise matching of the records between authority files
- Look for duplicates within files
- Pull together groups of records that are linked by the pair-wise matching
- Divide the groups into coherent clusters
- Merge resulting groups
- Assign VIAF Identifiers to the clusters
- Create links between clusters as needed (e.g. between pseudonyms)
- Maintain information about merged, split and deleted clusters
VIAF uses 'cool URIs' (World Wide Web Consortium, 2007) to identify the entity (e.g. http://viaf.org/viaf/77390479) and its description (e.g. http://viaf.org/viaf/77390479/; note the trailing slash). This distinction of having one URI for the entity and another that points at a Web resource about the entity follows standard Linked Data (World Wide Web Consortium) practice. Representations are available in HTML (for Web browsing), MARC-21, UNIMARC and RDF. A user interface, linked data access and a RESTful API are all integrated and available at http://viaf.org. The VIAF system is run by OCLC with the participation of data providers that make up the VIAF Council. OCLC processes the data and hosts the VIAF service. Implemented by OCLC, VIAF has grown with the attention and close collaboration of the participants from the start (Bennett, 2007).
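The two URI forms can be illustrated with a minimal sketch. The helper names are hypothetical; only the base URL and the example identifier come from the text above.

```python
VIAF_BASE = "http://viaf.org/viaf"  # base taken from the examples in the text


def entity_uri(viaf_id: str) -> str:
    """URI that identifies the entity itself (no trailing slash)."""
    return f"{VIAF_BASE}/{viaf_id}"


def description_uri(viaf_id: str) -> str:
    """URI of the Web resource describing the entity (trailing slash)."""
    return entity_uri(viaf_id) + "/"


print(entity_uri("77390479"))       # http://viaf.org/viaf/77390479
print(description_uri("77390479"))  # http://viaf.org/viaf/77390479/
```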
Within OCLC Research, VIAF has been informed by the Networking Names project led by Karen Smith-Yoshimura (Smith-Yoshimura, 2009).
The Problem
Even within a single authority file of names, there are almost always some ambiguities, for example two name records that unintentionally refer to the same entity. Another ambiguity which often goes unnoticed in a single authority record is the mixing of two entities into one. This can happen when two entities with similar names are not differentiated, or when someone catalogs items produced by one entity using the authority record for another. When considering the cataloging process that uses the authority records, this is not surprising since it can be difficult to distinguish individuals with similar names. Because VIAF uses titles and other information pulled from bibliographic records to help in the matching process, this sort of ambiguity can lead to confusion in subsequent matching.
Another common pattern encountered in matching entities from multiple authority files is differing ideas about pseudonyms. For example, one file may distinguish between 'Mark Twain' and 'Samuel Clemens', while another may treat them as the same entity. There can also be disagreement between files about the type of name, for example whether an entity is a corporation or a person, or whether it is a corporation or jurisdictional name.
A similar problem occurs with U.S. presidents. Thomas Jefferson actually has three records in the LC/NACO file: as an individual, as governor of Virginia, and then as president. Other files may or may not follow suit. VIAF's philosophy in this is to honor the most specific entities. For Thomas Jefferson this means creating three different VIAF clusters reflecting the three aspects of his life (person, governor, president) and then trying to bring each source authority record into the most appropriate cluster. This same pattern can be seen in kings, popes and other related personae (see note 1). Sometimes these different forms are treated as different entities, other times they are just considered variant forms for the same entity.
Some problems are just not apparent when matching pairs of records. For instance:
- Personal name A with dates 1910-1980
- Might match name B with dates 1910-
- Which in turn matches name C with dates 1910-1970
So, although B links to both A and C, names A and C should (probably) not be in the same VIAF cluster.
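The A, B, C example can be checked with a small sketch. The compatibility rule here is an assumption made for illustration (two date pairs conflict only when both supply a value for the same slot and the values differ); the text does not spell out VIAF's actual date comparison.

```python
from itertools import combinations
from typing import Optional, Tuple

Dates = Tuple[Optional[int], Optional[int]]  # (birth year, death year); None = unknown


def dates_compatible(a: Dates, b: Dates) -> bool:
    """Conflict only if both records give a value for the same slot and the values differ."""
    return all(x is None or y is None or x == y for x, y in zip(a, b))


records = {
    "A": (1910, 1980),
    "B": (1910, None),
    "C": (1910, 1970),
}

# Prints one line per pair: A/B and B/C are compatible, A/C is not.
for left, right in combinations(records, 2):
    print(left, right, dates_compatible(records[left], records[right]))
```

Pair-wise compatibility is therefore not transitive, which is why the conflict only becomes visible once all three records are looked at together.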
As mentioned above, VIAF deals with more than three dozen authority files with 38 million records. We make about 34 million links between records and create 26 million clusters. While most of those 'clusters' consist of a single source record, 5 million clusters have more than one source record. Currently 3.5 million proto-clusters have conflicts across files like those described above. This number of conflicts has grown as we have gradually relaxed the initial matching criteria. This relaxation of the criteria passes more information to the disambiguation stages enabling better progressive refinement of groups into the final clusters.
Since VIAF is international in scope, clustering and disambiguation need to be relatively language and script independent. For differing scripts especially this is a bit of a balancing act, since to human eyes similarity in names is often an excellent indicator of whether they represent the same entity, but with mixed scripts names can look very different. Another issue is the effect additional records can have on clusters. Since the clusters are periodically recalculated based on the pair-wise links, a new record could be brought into the potential new cluster and conflict with a name already in it. In this case one of the conflicts will be moved out of the cluster, but that move may force the movement of other records it is linked to. In this case, we do not want the new record (which could include wrong information) to make the existing cluster worse.
Our philosophy for VIAF is to only make links we are quite sure of. With spouses co-authoring papers, fathers and their children writing in the same field and the inevitable coincidences, this is not always easy. Our goal is that, for any two source records in a cluster, there is less than a 1% chance that they describe two different entities; that is, the link is correct more than 99% of the time. A recent check of 300 random pairings within VIAF clusters failed to find any incorrect pairs. Reaching this level of certainty means that some records will not be pulled into clusters because of lack of information, rather than any indication that they do not match. A typical problem is a personal authority record with no date or title information. As VIAF has grown larger we have become more confident that a unique name, even if related information is missing, has a high chance of being a useful match. See the discussion of pair-wise matching in section 5 below.
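As a rough illustration of what a clean sample of 300 pairs can support (this arithmetic is not part of the original analysis, and it assumes the pairs were drawn independently and at random): if the true error rate is p, the chance of seeing no errors in n pairs is (1 - p)^n, so the largest p still consistent with zero errors at a given confidence solves (1 - p)^n = 1 - confidence.

```python
def upper_bound_zero_errors(sample_size: int, confidence: float = 0.95) -> float:
    """Largest error rate p for which zero errors in sample_size independent
    draws would still occur with probability 1 - confidence."""
    return 1.0 - (1.0 - confidence) ** (1.0 / sample_size)


bound = upper_bound_zero_errors(300)
print(f"95% upper bound on the error rate: {bound:.2%}")
```

Under these assumptions the bound comes out just under 1%, so a sample of this size is about the smallest that is consistent with the stated goal.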
Processing Stage Details
1. Harvest and ingest new authority and bibliographic records
VIAF harvests authority records from the institutions producing those records, mainly by OAI-PMH or FTP. Immediately after harvesting some basic cleanup work is done and the files are archived. After that the files are translated into the internal MARC format used by VIAF and added to an HBase table. Bibliographic records are handled in a similar way, although some of those are harvested from OCLC's WorldCat bibliographic catalog. While this sort of harvesting is widely done, even here ambiguities can arise. For instance the same record can be issued with different identifiers, or an identifier reused for a second entity. Deletes are often a problem, and in practice periodic refreshing of the files needs to be done to minimize differences that can accumulate.
2. Associate each authority record with its corresponding bibliographic records
This stage varies depending on how the association is made between authority records and names in bibliographic records. Many files depend on unique strings for each name or type of name to make the connection; others have numeric identifiers that have to match.
At this point we assume that if the name in a bibliographic record matches an authority record, the two describe the same entity. For various reasons, that is not always the case. A common case happens when an author publishes under a name that is in the authority file, but was established for someone else. For proper linking a different form or identifier should be used, but this determination can be time consuming and difficult and can result in a misattribution even when diligently pursued.
A further complication at this stage involves differentiated authority records vs. undifferentiated records. An undifferentiated record may not even be expected to refer to a single person (in some files it may, in fact, sometimes be unambiguous, in other files it will be known to be ambiguous). We handle these differently in later stages. In either case, if we cannot associate the name in the bibliographic record with a single authority record, we do not consider it a match.
3. Identify links within files
There are many links implicit between authority records within a single authority file that are useful if made explicit. A common example of this is a record for a pseudonym that has a cross reference to another record representing a related entity. When these cross references are symmetric (each references the other) we identify them and add the corresponding record identifiers. Other examples include identifying parallel records for the same entity and making links between uniform titles for controlled works (IFLA Study Group on the Functional Requirements for Bibliographic Records, 1997), expressions and authors.
Since many of these intra-file links are based on strings rather than identifiers, the linkage can be ambiguous, for example a cross reference that matches multiple records. In general, we only make the links when they are unambiguous.
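A minimal sketch of turning symmetric, string-based cross references into explicit identifier links. The record layout (a heading plus a list of 'see also' strings) and the identifiers are hypothetical simplifications, not VIAF's internal format.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical minimal record: identifier -> (heading, "see also" heading strings)
Record = Tuple[str, List[str]]


def symmetric_links(records: Dict[str, Record]) -> Set[Tuple[str, str]]:
    """Identifier pairs whose string cross references point at each other unambiguously."""
    by_heading: Dict[str, List[str]] = defaultdict(list)
    for rec_id, (heading, _) in records.items():
        by_heading[heading].append(rec_id)

    def resolve(rec_id: str) -> Set[str]:
        """Identifiers a record's cross references resolve to, skipping ambiguous strings."""
        targets = set()
        for ref in records[rec_id][1]:
            matches = by_heading.get(ref, [])
            if len(matches) == 1:  # a string matching several records is ambiguous
                targets.add(matches[0])
        return targets

    links = set()
    for rec_id in records:
        for target in resolve(rec_id):
            if target != rec_id and rec_id in resolve(target):
                links.add(tuple(sorted((rec_id, target))))
    return links


records = {
    "n1": ("Twain, Mark", ["Clemens, Samuel"]),
    "n2": ("Clemens, Samuel", ["Twain, Mark"]),
    "n3": ("Smith, John", ["Smith, J."]),
    "n4": ("Smith, J.", []),  # does not point back, so no link is made
}
print(symmetric_links(records))  # {('n1', 'n2')}
```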
4. Create processed authority records
Once the bibliographic and authority records have been associated, each is mined for information useful in matching names. This includes variant forms of names, dates associated with the names, dates associated with publications, titles, publishers, co-authors, ISBNs and any other standard identifiers. This information is then merged with the original authority record, producing what we call a 'processed' record. Since the resulting record can become very large (many megabytes in some cases), we have established limits on how much of the information from bibliographic records is imported.
To a certain extent, each of the files needs its own set of rules to extract the data. In particular, routines need to be specialized for date patterns and for recognizing title information in authority records (which is often in a free-text note field). We currently support MARC-21 (Library of Congress Network Development and MARC Standards Office), UNIMARC (IFLA UNIMARC Strategic Programme), MADS (Library of Congress) and some specialized XML formats for input.
Errors can, of course, enter at this stage. Since dates are often pulled from free text notes, sometimes erroneous dates are identified. Occasionally a field in a record will be mistagged, with the result that the forename and surname are reversed, or the dates merged into the name rather than staying separate. This stage makes a number of assertions about names and has to be done carefully since later stages depend on these assertions.
5. Do pair-wise matching of the records between authority files
This stage makes assertions about matches between records in different authority files (names within a single authority file are not compared here). Later stages will test and adjust those assertions if they prove to be ambiguous.
We use surnames as our primary way of bringing personal name records together for matching. Once brought together we evaluate whether the names are compatible (e.g. similar, have dates that do not conflict). All the processing is done with normalized (Hickey, Toves, & O'Neill, 2006) Unicode. Right now we do not make any attempt to transliterate scripts to look for matches, depending on cross references in the files to bring together different scripts and forms of the names. As VIAF has grown we see more 'network effects' (Wikimedia) where the addition of an alternative name form in one file can bring together records in several other files.
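Bringing candidates together by surname is a form of blocking, sketched below. The normalization is a crude stand-in (accent stripping and case folding) and not the NACO rules cited above; the record layout and identifiers are hypothetical.

```python
import unicodedata
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterator, List, Tuple

Record = Tuple[str, str, str]  # (authority file, record id, heading "Surname, Forenames")


def surname_key(heading: str) -> str:
    """Stand-in normalization: surname only, accents stripped, case-folded."""
    surname = heading.split(",", 1)[0]
    decomposed = unicodedata.normalize("NFKD", surname)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold().strip()


def candidate_pairs(records: List[Record]) -> Iterator[Tuple[Record, Record]]:
    """Yield pairs that share a surname key and come from different authority files."""
    blocks: Dict[str, List[Record]] = defaultdict(list)
    for record in records:
        blocks[surname_key(record[2])].append(record)
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if a[0] != b[0]:  # names within a single file are not compared here
                yield a, b


records = [
    ("LC", "n1", "Hickey, Thomas B."),
    ("DNB", "d1", "Hickey, T. B."),
    ("LC", "n2", "Hickey, Thom"),
    ("BNF", "b1", "Müller, Hans"),
    ("DNB", "d2", "Muller, Hans"),
]
pairs = list(candidate_pairs(records))
for a, b in pairs:
    print(a[1], "<->", b[1])
```

Each candidate pair would then go on to the compatibility evaluation (similar names, non-conflicting dates) described above.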
The links between records currently come in more than two dozen types and carry along what prompted the link, e.g. what title resulted in a title link. The link types are ranked and this ranking is used in some of the subsequent phases when multiple links are compared. They range from a 'forced link' (an explicit reference to another authority record), the most reliable, to 'exact name', the least certain. In general, VIAF needs multiple pieces of information to agree before creating links between names. Beyond forced links, the most reliable links are based on name/title similarity and, for people, matching birth and death dates. In cases where the resultant match does not look ambiguous, we allow single date matching, e.g. just a birth date to bring two records together. Corporate names require different matching criteria, as do geographic names and uniform titles.
As mentioned earlier we have recently recognized that the existence of sparse records (where the only information was the name) was an impediment to the use of VIAF as they add a large number of singleton clusters (clusters composed of a single source authority record) of little utility. We now do matching across all the VIAF clusters and if there is only one cluster that such a name is compatible with, we add it. The VIAF interface marks these names as 'sparse'.
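The 'only one compatible cluster' rule for sparse names might be sketched as follows. The compatibility test is passed in as a function because the text does not define it; the one used in the example (case-insensitive equality) is purely a placeholder.

```python
from typing import Callable, Dict, List, Optional


def place_sparse_name(
    name: str,
    clusters: Dict[str, List[str]],
    compatible: Callable[[str, List[str]], bool],
) -> Optional[str]:
    """Return the id of the only compatible cluster, or None when zero or several qualify."""
    candidates = [cid for cid, names in clusters.items() if compatible(name, names)]
    return candidates[0] if len(candidates) == 1 else None


def same_name(name: str, cluster_names: List[str]) -> bool:
    """Placeholder test: case-insensitive equality with any name in the cluster."""
    return name.casefold() in {n.casefold() for n in cluster_names}


clusters = {
    "c1": ["Hickey, Thomas B.", "Hickey, T. B."],
    "c2": ["Smith, John"],
    "c3": ["Smith, John"],
}
print(place_sparse_name("hickey, thomas b.", clusters, same_name))  # c1
print(place_sparse_name("Smith, John", clusters, same_name))        # None
```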
Even the best of matching systems need some sort of manual override. For VIAF we have created an additional authority file (called xA) which can be used to control matching in VIAF. For instance, if we find that VIAF is mixing records by 'Homer' with a 'Pseudo Homer', we can create two xA records, one representing 'Homer' linked to the proper 'Homer' source records in contributor files, and another for 'Pseudo Homer'. The records in xA are harvested each month and treated much like other source files to VIAF. The added information will be encountered in the clustering phase and should result in two better differentiated clusters. A more common condition is that VIAF has missed a match because of missing or incorrect information. In that case adding an xA record with pointers to the two clusters that need to be merged will bring them together. xA uses MADS for its internal format as we felt that its representation of authority information was a good fit for input to VIAF and simple enough to make creating an interface to it manageable. Currently there are fewer than 300 entries in xA overriding the automatic clustering, although we expect this to gradually grow.
Much of the ambiguity VIAF needs to deal with enters at this stage. One of the most important aspects of VIAF is that it associates entities with many different forms of their name. This means we have to have the flexibility to realize that T. B. Hickey may well be the same person as Thomas B. Hickey, Thom Hickey or even the name written in other scripts when equivalents are found in cross references. This flexibility is another opening for incorrect assertions to creep in. In practice almost all such ambiguities are apparent when looking across all the match assertions for a record, and are dealt with in subsequent stages.
6. Look for duplicates within files
Nearly every file has a certain level of duplication in it, and at this stage we look for duplicate names within files. This deduplication depends entirely on the preferred forms of names, looking for matches on the name itself as well as auxiliary information in the heading, such as dates and qualifications that appear to indicate the same person. When found, the records are merged for subsequent processing, retaining any alternative forms and links they may have.
Many of the authority files contributed to VIAF are themselves a merge of other files. While the contributors do their best to eliminate redundancy, there is always a certain residual amount.
7. Pull together groups of records that are linked by the pair-wise matching
At this point we have individual records, or actually what we call 'nodes' that contain just the information needed for clustering (a node may be derived from and represent multiple source records). Much of this and the subsequent processing is treated as a graph-processing problem, with the links between records forming the edges of a graph made up of these nodes.
Our first task is to pull together all the nodes into connected groups, that is to divide all the nodes into disjoint groups where there are no linkages between them.
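Dividing nodes into connected groups is the standard connected-components problem. The following is a single-machine sketch using an iterative depth-first traversal; it makes no claim about how the production Hadoop implementation is organized.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def connected_groups(nodes: Iterable[str], links: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Partition nodes into disjoint groups with no links between different groups."""
    neighbours: Dict[str, Set[str]] = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen: Set[str] = set()
    groups: List[Set[str]] = []
    for start in nodes:
        if start in seen:
            continue
        group: Set[str] = set()
        stack = [start]
        while stack:  # iterative depth-first traversal
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(group)
    return groups


nodes = ["A", "B", "C", "D", "E"]
links = [("A", "B"), ("B", "C"), ("D", "E")]
print(connected_groups(nodes, links))  # two groups: A, B, C and D, E
```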
The largest of these connected groups can have hundreds of nodes. Keeping this number down to a reasonable number of nodes involves a number of heuristics. One of the major ones is that if a record is linked to multiple records in another authority file, only the links with the highest strength are kept. In VIAF this breaks hundreds of thousands of links, pruning many links between records, at the expense of breaking links that conceivably would be useful assertions about identity later in the processing. A typical case would be where two records from one file match a single record in another file. One match is based on a personal name plus a work title, but the other match is based on the name plus two dates. In that case the double-date match would be preferred and the title match ignored.
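The pruning heuristic can be sketched as below, reproducing the typical case just described. The numeric strengths and the link-type labels are assumptions for illustration; the text says only that link types are ranked and that a double-date match outranks a title match here. The record identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple

Node = Tuple[str, str]  # (authority file, record identifier)


class Link(NamedTuple):
    a: Node
    b: Node
    kind: str


# Hypothetical strengths; higher means more reliable.
STRENGTH = {"name+two dates": 2, "name+title": 1}


def prune(links: List[Link]) -> List[Link]:
    """Keep a link only if it is among the strongest from each endpoint into the other's file."""
    best: Dict[Tuple[Node, str], int] = defaultdict(int)
    for link in links:
        strength = STRENGTH[link.kind]
        for node, other in ((link.a, link.b), (link.b, link.a)):
            key = (node, other[0])
            best[key] = max(best[key], strength)
    return [
        link for link in links
        if STRENGTH[link.kind] == best[(link.a, link.b[0])]
        and STRENGTH[link.kind] == best[(link.b, link.a[0])]
    ]


links = [
    Link(("LC", "n1"), ("DNB", "d1"), "name+title"),
    Link(("LC", "n2"), ("DNB", "d1"), "name+two dates"),
]
print(prune(links))  # only the double-date link from n2 survives
```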
8. Divide the groups into coherent clusters
Along with the initial matching, this is the section of VIAF that has been rewritten the most. At one time the approach was to take the connected groups, look for types of links causing the most ambiguity and prune them until the connected group was broken into smaller clusters that were no longer ambiguous. Our most recent approach to finding coherent clusters is to reverse this and build up clusters based on links that look the most reliable. In addition to making sure those reliable links are kept, the procedure seems to be more understandable.
For each of the connected groups we go through the following process, even if the group does not look ambiguous (no multiple records from a single authority file, no conflict among dates). Usually such groups are fine, but occasionally there are records pulled in through a chain of pair-wise matches that should not be in the same VIAF cluster.
In each connected group we:
- Look for maximal complete subgraphs of 3 or more nodes
  - Often called cliques, these are groups of records each of which is linked to all the other nodes
- Nodes not in one of these cliques form their own individual subgraphs
- Merge the 'best' pair of subgraphs together based on the following criteria
  - Strength of the best link between the pair
  - Number of links between the pair
  - A metric based on
    - Strength of the match
    - Title closeness
    - Node type (corporate, personal, etc.)
    - Name closeness
    - Whether the nodes are personal names or not
- This process continues until there are no more subgraphs which can be merged
- The resulting subgraphs are then used to create the VIAF clusters
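The build-up procedure can be sketched as follows. This is a simplified illustration rather than the production algorithm: cliques are found with a basic Bron-Kerbosch enumeration, overlapping cliques are resolved largest-first (an assumption), and the 'best' pair is chosen only by strongest link and then link count, leaving out the title, name and node-type metric. The compatibility test is supplied by the caller.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, List, Set

Graph = Dict[str, Set[str]]            # node -> nodes it is linked to (symmetric)
Strengths = Dict[FrozenSet[str], int]  # frozenset({a, b}) -> strength, higher is better


def maximal_cliques(graph: Graph) -> List[Set[str]]:
    """Enumerate maximal complete subgraphs (basic Bron-Kerbosch, no pivoting)."""
    found: List[Set[str]] = []

    def expand(r: Set[str], p: Set[str], x: Set[str]) -> None:
        if not p and not x:
            found.append(r)
            return
        for v in sorted(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return found


def initial_subgraphs(graph: Graph) -> List[Set[str]]:
    """Seed with cliques of 3+ nodes (largest first, skipping overlaps); other nodes stand alone."""
    subgraphs: List[Set[str]] = []
    used: Set[str] = set()
    for clique in sorted(maximal_cliques(graph), key=lambda c: (-len(c), sorted(c))):
        if len(clique) >= 3 and not clique & used:
            subgraphs.append(set(clique))
            used |= clique
    subgraphs.extend({n} for n in sorted(graph) if n not in used)
    return subgraphs


def build_clusters(
    graph: Graph,
    strengths: Strengths,
    compatible: Callable[[Set[str], Set[str]], bool],
) -> List[Set[str]]:
    """Repeatedly merge the best compatible pair of subgraphs until none can be merged."""
    subgraphs = initial_subgraphs(graph)
    while True:
        best_key, best_pair = None, None
        for i, j in combinations(range(len(subgraphs)), 2):
            between = [strengths[frozenset((a, b))]
                       for a in subgraphs[i] for b in subgraphs[j] if b in graph[a]]
            if not between or not compatible(subgraphs[i], subgraphs[j]):
                continue
            key = (max(between), len(between))  # strongest link, then number of links
            if best_key is None or key > best_key:
                best_key, best_pair = key, (i, j)
        if best_pair is None:
            return subgraphs
        i, j = best_pair           # j > i, so removing j leaves index i valid
        merged = subgraphs.pop(j)
        subgraphs[i] |= merged


# Hypothetical data: three personal names chained by pair-wise links.
DATES = {"A": (1910, 1980), "B": (1910, None), "C": (1910, 1970)}


def no_date_conflict(left: Set[str], right: Set[str]) -> bool:
    return all(x is None or y is None or x == y
               for a in left for b in right
               for x, y in zip(DATES[a], DATES[b]))


graph = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
strengths = {frozenset({"A", "B"}): 2, frozenset({"B", "C"}): 1}
print(build_clusters(graph, strengths, no_date_conflict))
```

With this data A and B are merged first because their link is strongest, after which C cannot join because its death date conflicts with A's, so the connected group yields two clusters.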
Some groups will be separated into dozens of VIAF clusters.
The techniques used in this step are really a series of heuristics tuned for the sort of difficulties we see in VIAF. Even slight variations in the order they are applied or small changes to metrics measuring closeness will result in changes in some of the final clusters. Since the records are brought together in asynchronous processes we have to be careful to apply the tests against the records in a predictable manner to ensure consistent results from one run to another.
9. Merge resulting groups
During this calculation of the 'best' pair of subgraphs a number of criteria are looked at to avoid clustering incompatible nodes:
- Date conflicts within clusters
- Similar names that are incompatible, e.g. for an individual and his 'spirit' or 'ghost'
- Names that are cross references to each other (even though they may share titles, dates, etc.)
- Names that differ only in a number (e.g. John V vs. John VI)
- Names from the same authority file
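Three of these checks (same authority file, date conflicts, names differing only in a number) lend themselves to a short sketch. The node layout is hypothetical, and the numeral detection is deliberately crude: it would also treat a bare upper-case initial such as 'C' or 'M' as a Roman numeral.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

NUMBER_TOKEN = re.compile(r"[IVXLCDM]+|\d+")  # crude: Roman or Arabic numerals


@dataclass(frozen=True)
class Node:
    source_file: str
    name: str
    birth: Optional[int] = None
    death: Optional[int] = None


def split_numbers(name: str) -> Tuple[List[str], List[str]]:
    """Separate a name's tokens into (words, numerals)."""
    words: List[str] = []
    numbers: List[str] = []
    for token in name.split():
        (numbers if NUMBER_TOKEN.fullmatch(token) else words).append(token)
    return words, numbers


def conflict(a: Node, b: Node) -> Optional[str]:
    """Return the reason two nodes should not share a cluster, or None if no check fires."""
    if a.source_file == b.source_file:
        return "same authority file"
    for x, y in ((a.birth, b.birth), (a.death, b.death)):
        if x is not None and y is not None and x != y:
            return "date conflict"
    words_a, nums_a = split_numbers(a.name)
    words_b, nums_b = split_numbers(b.name)
    if words_a == words_b and nums_a != nums_b:
        return "names differ only in a number"
    return None


print(conflict(Node("LC", "John V"), Node("DNB", "John VI")))
print(conflict(Node("LC", "Smith, John", 1910, 1980), Node("DNB", "Smith, John", 1910, 1970)))
print(conflict(Node("LC", "Smith, John", 1910), Node("DNB", "Smith, John", 1910, 1970)))
```

The three calls print the number conflict, the date conflict and `None` respectively. The same-file check would need an exception for records recognized as duplicates within one file.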
In addition, the connected groups offer the possibility of recognizing additional names within an authority file that should be brought together in a single cluster. One example of this is when we see pairs of records that share many of the same links and are not otherwise incompatible. We treat these as duplicate records and allow both to be in the same VIAF cluster.
Unfortunately at this stage we often find closely related groups that should be merged. Typically this is because at least one of the authority files has records in both groups, forcing them into separate groups. To overcome this we look for pairs of groups in which most of the nodes are related to nodes in both groups. We then merge the groups, occasionally ejecting a 'problem' node into its own group.
10. Assign VIAF identifiers to the clusters
A VIAF cluster is derived from its processed member records. As described in the previous sections, the clusters are recalculated each processing cycle. While the vast majority of the clusters are stable, there are many that split or merge based on changes in records or VIAF's algorithms. We recognize that the clusters are fluid and that no individual member record can be permanently assigned to a single VIAF identifier; however, VIAF identifiers need to be persistent to allow linking. Cluster identifiers can be added, abandoned and resurrected from previously abandoned identifiers. The algorithm used for cluster assignment minimizes the number of source records that move from one identifier to another.
New clusters are created when member records are added to the incoming files or when an existing cluster is split. The first choice for an identifier for a split cluster is to resurrect an abandoned identifier, but that is not always possible, so splits can be assigned new identifiers. Member records coming into VIAF for the first time that do not get assigned to an existing cluster are assigned new identifiers. Cluster identifiers are abandoned when all members of the cluster are deleted or when clusters are merged. A merge representing several existing cluster identifiers will keep one of the cluster identifiers and abandon the others, trying to minimize changes in the association of member records and VIAF identifiers. Abandoned identifiers are tracked along with the identifiers of the member records that were in the cluster before it was abandoned so that an abandoned identifier can be resurrected if a cluster in search of an identifier has members that belonged to a previously abandoned cluster identifier.
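One way to approximate this behaviour is a greedy assignment by overlap, sketched below. It is an assumption that a greedy pass is good enough; the text states only the goal of minimizing record movement. The `history` mapping stands for every known identifier, current or abandoned, with the member records it last held, and all identifiers shown are hypothetical.

```python
from itertools import count
from typing import Dict, Iterator, List, Set


def assign_identifiers(
    clusters: List[Set[str]],
    history: Dict[str, Set[str]],
    fresh_ids: Iterator[str],
) -> Dict[str, Set[str]]:
    """Give each cluster an identifier, reusing the known one that shares the most records."""
    overlaps = sorted(
        ((len(members & cluster), viaf_id, idx)
         for viaf_id, members in history.items()
         for idx, cluster in enumerate(clusters)
         if members & cluster),
        key=lambda t: (-t[0], t[1], t[2]),
    )
    assigned: Dict[int, str] = {}
    taken: Set[str] = set()
    for _, viaf_id, idx in overlaps:  # largest overlaps claim identifiers first
        if idx not in assigned and viaf_id not in taken:
            assigned[idx] = viaf_id
            taken.add(viaf_id)
    return {
        (assigned[idx] if idx in assigned else next(fresh_ids)): cluster
        for idx, cluster in enumerate(clusters)
    }


history = {
    "100": {"lc1", "dnb1", "bnf1"},  # current cluster that is about to split
    "200": {"lc2"},
    "300": {"bnf9"},                 # identifier abandoned in an earlier run
}
clusters = [{"lc1", "dnb1"}, {"bnf1", "bnf9"}, {"lc2", "sudoc5"}, {"nkc7"}]
fresh = (f"new-{n}" for n in count(1))
for viaf_id, members in assign_identifiers(clusters, history, fresh).items():
    print(viaf_id, sorted(members))
```

In this example the larger half of the split keeps "100", the other half resurrects the abandoned "300" because it contains a record that once belonged to it, and the record seen for the first time gets a new identifier.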
11. Create links between clusters as needed (e.g., between pseudonyms)
Once the clusters have been created and identifiers assigned to them, it is then possible to make links between clusters. Currently links are made between clusters for pseudonyms and between author, work and expression entities. In the future we hope to link more entities such as coauthors and publishers.
12. Maintain information about merged, split and deleted clusters
We want to make VIAF identifiers as reliable as possible, so they should always return a response from the Web site even after they have been abandoned, preferably a redirect to a cluster describing the same entity referenced by the original link. To this end we collect information about abandoned identifiers in a database so that, if a current cluster exists that contains records from the abandoned cluster, the identifier will redirect to that current cluster, which probably describes the same entity. If no current cluster can be located, then the original records have probably been deleted. In that case a saved copy of the abandoned cluster record can be presented to the user. In the worst case, nothing can be located and an appropriate HTTP response code and message are returned.
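The fallback order for an abandoned identifier might look like the sketch below. The data structures, identifiers and outcome labels are hypothetical, and preferring the live cluster with the largest overlap is an assumption; the text requires only that the target contain records from the abandoned cluster.

```python
from typing import Dict, Optional, Set, Tuple


def resolve(
    viaf_id: str,
    current: Dict[str, Set[str]],
    abandoned: Dict[str, Set[str]],
    archive: Dict[str, str],
) -> Tuple[str, Optional[str]]:
    """Return (outcome, payload) for a requested identifier."""
    if viaf_id in current:
        return "found", viaf_id
    old_members = abandoned.get(viaf_id, set())
    # Prefer the live cluster that holds the most records of the abandoned one.
    best = max(current, key=lambda cid: len(current[cid] & old_members), default=None)
    if best is not None and current[best] & old_members:
        return "redirect", best
    if viaf_id in archive:
        return "archived copy", archive[viaf_id]
    return "not found", None


current = {"100": {"lc1", "dnb1"}}
abandoned = {"200": {"dnb1"}, "300": {"bnf9"}}
archive = {"300": "<saved cluster record>"}

print(resolve("200", current, abandoned, archive))  # ('redirect', '100')
print(resolve("300", current, abandoned, archive))  # ('archived copy', '<saved cluster record>')
print(resolve("999", current, abandoned, archive))  # ('not found', None)
```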
VIAF also has a URI pattern that accepts the identifier for an authority record and redirects to the cluster the record is in (e.g. http://viaf.org/viaf/sourceID/LC|n79089957).
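Building such a URI is a matter of string formatting, as in this small sketch. The helper name is hypothetical and the pattern is copied from the example above; some HTTP clients may want the `|` percent-encoded as `%7C`.

```python
def source_record_uri(source: str, record_id: str) -> str:
    """Build the redirecting URI for a source authority record, following the pattern in the text."""
    return f"http://viaf.org/viaf/sourceID/{source}|{record_id}"


print(source_record_uri("LC", "n79089957"))  # http://viaf.org/viaf/sourceID/LC|n79089957
```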
Implications for Linked Data
The authority files that VIAF is built from can be considered early examples of linked data, in that they create controlled entities and relationships between them. We both expose VIAF as linked data, and use it as such internally at OCLC, and we believe the issues we encounter merging authority files have implications for the use of linked data more generally. The source files to VIAF include some of the most carefully curated files of names available. In addition, the bibliographic records using the files are professionally created and often reviewed and corrected by many libraries. In spite of the substantial work put into the creation and maintenance of them, the files still have ambiguities. We are able to successfully resolve many of these issues within VIAF, but only with a mixture of both domain independent (e.g. looking at closely coupled records) and domain dependent techniques (e.g. rules developed especially for names).
Extrapolating from our experience we expect that naïve uses of linked data will run into many of the same problems, resulting in very strange inferences. This can be especially difficult across domains (we see a rather manageable slice of this problem dealing with relationships between different types of names), and we suspect that resolving these issues will be very difficult without deep domain knowledge.
Fluid ID Assignment
This whole process is repeated monthly. A consequence of this is that authority records are free to move from one cluster to another based on changes to the matching software or new information, such as additional titles, dates or even coauthors. We see this as a substantial benefit; for example, as we see problems in clusters we can improve the clustering algorithm and immediately improve VIAF's clusters. Without that capability minor changes can build up, resulting in much larger changes being needed in the future. It is not without problems, however. If a given authority record really wants to be in two clusters, the slightest change in data can move it from one to the other (and back again). We experimented with making the assignments more 'sticky', but did not like the results. We would change the clustering software and not be able to see the differences we expected because the records were 'stuck' in their previous cluster. How to balance the desire for stability with that of reacting to new information is an area still under study.
As noted above, VIAF remembers previous assignments and can redirect to the appropriate cluster as IDs change.
Summary
While quite compact, the process described here requires several thousand lines of code (mostly Python with some Java and XSLT), so there are many unmentioned details, but this describes the major issues we address in VIAF.
As can be seen from the process, the resulting VIAF record is a construct based on the underlying source records. Other than our ability to redact data from the database, there is no option to directly edit a VIAF record. Changes to the VIAF record are accomplished by changes to its source records, and in this way the VIAF clusters are 'virtual' authority records.
We expect VIAF to be an important part of the expanding cloud of linked data, but also expect that integration of various sources of linked data will involve substantial intellectual and software effort.
Notes
1 "Under other cataloging rules, however, authors may be viewed in certain circumstances as establishing more than one bibliographic identity, and in that case a specific instance of the bibliographic entity person may correspond to a persona adopted by an individual rather than to the individual per se" in (IFLA Working Group on Functional Requirements and Numbering of Authority Records (FRANAR), 2008, pp. 4-5)
Bibliography
[1] Apache Hadoop Project. (n.d.). Hadoop.
[2] Bennett, R. C.-D. (2007). VIAF (Virtual International Authority File): Linking the Deutsche Nationalbibliothek and Library of Congress Name Authority Files. International Cataloguing and Bibliographic Control: Quarterly Bulletin of the IFLA UBCIM Programme, 36 (1), 12-18.
[3] Calhoun, K. (1998, 06 22-23). A Bird's Eye View of Authority Control in Cataloging. Proceedings of the Taxonomic Authority Files Workshop, Washington, DC, June 22-23, 1998.
[4] Hickey, T. B., Toves, J., & O'Neill, E. T. (2006). NACO Normalization: A Detailed Examination of the Authority File Comparison Rules. Library Resources and Technical Services, 50 (3), 166-172.
[5] IFLA Study Group on the Functional Requirements for Bibliographic Records. (1997, 09). Functional Requirements for Bibliographic Records, Final Report.
[6] IFLA UNIMARC Strategic Programme. (n.d.). about-unimarc.
[7] IFLA Working Group on Functional Requirements and Numbering of Authority Records (FRANAR). (2008, 12). FRAD. As amended and corrected through July 2013.
[8] Library of Congress. (n.d.). MADS Metadata Authority Description Schema.
[9] Library of Congress Network Development and MARC Standards Office. (n.d.). MARC Standards.
[10] Smith-Yoshimura, K. (2009). Networking Names. Dublin: OCLC Programs and Research.
[11] WikiMedia. (n.d.). Authority Control.
[12] Wikimedia. (n.d.). Network effect.
[13] World Wide Web Consortium. (2007, December 17). Cool URIs for the Semantic Web.
[14] World Wide Web Consortium. (n.d.). Semantic Web Data.
About the Authors
Thomas Hickey is Chief Scientist at OCLC where he helped found OCLC Research. Current interests include metadata creation and editing systems, authority control, parallel systems for bibliographic processing, and information retrieval and display. In addition to implementing VIAF, his group looks into exploring Web access to metadata, identification of FRBR works and expressions in WorldCat, the algorithmic creation of authorities, and the characterization of collections. He has an undergraduate degree in Physics and a Ph.D. in Library and Information Science.
Jenny A. Toves is a Software Architect in OCLC Research. In addition to VIAF, her current interests include the large scale FRBR clustering of bibliographic records, which is central to OCLC's linked data program. One aspect of this work is the algorithmic creation of authority records by mining the bibliographic data. She has an undergraduate degree in Computer Information Systems and a M.S. in Computer Science.