D-Lib Magazine
Kat Hagedorn and Joshua Santelli
Introduction

This report is a follow-up to the McCown et al. article published in IEEE Internet Computing two years ago, in which the researchers investigated the percentage of URLs from OAI records found in the Google, Yahoo and MSN search indexes [1]. We were interested in whether Google in particular had increased the number of OAI-based resources in its search index. To this end, we used a slightly different methodology, drawing on the OAIster [2] metadata corpus to see what percentage of the corpus was found in the Google search index alone. OAIster harvests and aggregates OAI metadata with links to digital resources; records without links to digital objects are removed during our transformation and indexing process.

Methodology

On June 6, 2008, a snapshot was taken of the harvested content in OAIster. The snapshot contained 978 repositories comprising 16,276,756 records. Each repository was placed into one of four groups based on the number of Dublin Core records indexed in OAIster. Group A is made up of repositories with 100 or fewer records; Group B has 101 to 1,000 records; Group C has 1,001 to 10,000 records; and Group D has more than 10,000 records. (See Table 1.)
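The size thresholds above, together with the per-group sampling rates this section reports (all of Group A, 10% of Group B, 5% of Group C and 1% of Group D, drawn per repository), can be sketched in Python as follows. This is an illustrative sketch only, not the authors' code: the names `size_group` and `sample_urls`, the in-memory representation of a repository as a list of per-record URL lists, and the decision to round the sample size up are all assumptions.

```python
import random

# Upper bounds on repository size (records indexed in OAIster) for Groups A-C.
# Anything larger falls into Group D.
GROUP_BOUNDS = [(100, "A"), (1_000, "B"), (10_000, "C")]

# Percentage of each repository's records that is sampled, per group.
SAMPLE_PERCENT = {"A": 100, "B": 10, "C": 5, "D": 1}


def size_group(record_count):
    """Return the group letter for a repository with `record_count` records."""
    for upper_bound, group in GROUP_BOUNDS:
        if record_count <= upper_bound:
            return group
    return "D"


def sample_urls(records, rng):
    """Sample one URL per selected record from a single repository.

    `records` is a list of non-empty URL lists, one list per record
    (a hypothetical representation chosen for this sketch).
    """
    group = size_group(len(records))
    # Assumption: round up (integer ceiling division) so that a small
    # repository still contributes at least one record.
    sample_size = (len(records) * SAMPLE_PERCENT[group] + 99) // 100
    chosen_records = rng.sample(records, sample_size)
    # For records with several URLs, pick a single URL at random.
    return [rng.choice(urls) for urls in chosen_records]


if __name__ == "__main__":
    rng = random.Random(2008)  # fixed seed so the sketch is repeatable
    repo = [[f"http://example.org/item/{i}"] for i in range(250)]
    print(size_group(len(repo)), len(sample_urls(repo, rng)))
```

Integer percentages are used instead of floating-point fractions so that the rounding is exact for every repository size.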
Table 1. Randomly sampled records for each group.

Since OAIster only indexes records with URLs, each record has at least one URL. For records containing more than one URL, a single URL was selected at random from within the record. Because Group A was very small (6,582 records), we selected all of its records to run against the Google search index. From Group B, we randomly selected 10% of the records from each repository; from Group C, 5% from each repository; and from Group D, 1% from each repository. With this method, we selected and tested a total of 147,305 URLs. The sample size was chosen to maintain at least a 95% confidence level (±1% margin of error).

This method differs from that of McCown et al. They grouped the records using a different method, and they randomly selected 1,000 records from each group. They also searched MSN, Yahoo and Google, while we searched for the records only in the Google search index.

To determine whether a record was indexed by Google, we made an "info" request (e.g., info:http://oaister.org/) for each sampled URL against the Google Research API using the University Research Program for Google Search [3]. The API returned either zero or one result. If a result was returned, we marked that record as "found"; if no result was returned, we marked that record as "not found".

Results and Caveats

Of the sampled records, 44.35% were found in Google. (See Table 2.)
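As a rough plausibility check on the sample size and on the 44.35% figure, the sketch below applies the textbook normal approximation for a proportion. It assumes simple random sampling, which the report does not claim: the actual sample was drawn per repository at different rates, so the true uncertainty will differ. The function names are illustrative, and the report does not state how the authors computed their own figures.

```python
import math
from statistics import NormalDist


def z_score(confidence):
    """Two-sided critical value of the standard normal distribution."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)


def required_sample_size(margin, confidence=0.95, p=0.5):
    """Smallest n giving the requested margin of error.

    p=0.5 is the worst case (largest variance) for a proportion.
    """
    z = z_score(confidence)
    return math.ceil(z * z * p * (1 - p) / margin ** 2)


def margin_of_error(p, n, confidence=0.95):
    """Half-width of the confidence interval for an observed proportion p."""
    return z_score(confidence) * math.sqrt(p * (1 - p) / n)


if __name__ == "__main__":
    # Sample size needed for +/-1% at 95% confidence, worst case.
    print(required_sample_size(0.01))
    # Margin of error on 44.35% found, with 147,305 URLs tested.
    print(f"{margin_of_error(0.4435, 147_305):.4%}")
```

Under these assumptions, about 9,604 URLs would suffice for ±1% at 95% confidence, and the margin on the 44.35% result with 147,305 tested URLs is roughly ±0.25 percentage points. The sampling design, rather than the raw sample size, is therefore the more likely source of error.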
Table 2. Records found and not found in the Google search index.

We spot-checked the sampling by choosing a few repositories and requesting all of their URLs in the Google search index. We chose one repository from each of Groups B, C and D. (Group A was already fully represented.) We chose these particular repositories because their sampled records split roughly evenly between "found" and "not found", and we wanted to test the assumption that the same split would hold for all the records in those repositories. We found that our assumption was correct. (See Tables 3 and 4.)
Table 3. Original requests for three repositories to the Google search index.
Table 4. Requests for all records in the three repositories to the Google search index.

We are aware that URLs in OAI records can be constructed differently from the URLs that Google accesses for its index in its normal course of operations. For instance, a record from the Project Euclid repository accessed via OAI has the URL http://projecteuclid.org/euclid.bams/1183524923, with the title "Every planar graph with nine points has a nonplanar complement". If you perform an info request for this resource in the Google search index, the article is not found [4]. If you look for the URL in Google with the addition of a "/handle" element (see Note 1), http://projecteuclid.org/handle/euclid.bams/1183524923, the article is found [5]. Both types of URLs resolve correctly on the Project Euclid site. We are not able to determine how widespread this case is across repositories.

There is the potential that running all the records for the small repositories skewed the results by over-representing them. Alternatively, choosing 1% of the records in Group D repositories could also have skewed the results by including too many records from a single, large repository.

Conclusions

Google's indexing does not seem to have retrieved more of the hidden web since the publication of the McCown et al. article in 2006. We would venture to conclude that Google has not endeavoured to increase its support for, and access to, OAI materials. Even taking into account the caveats, we would also conclude that aggregations of OAI records are at least as valuable for user research purposes as they were two years ago.

From our own experience, we know that providing the OAIster records in bulk to Google proved problematic for them, and eventually they requested only the OAIster URLs instead of the complete metadata. We are not, at this point, certain that Google is using these URLs (crawling them) for addition to its search index.
It is also interesting to note that Google has recently dropped support of OAI for website indexing [6]. Given the resulting numbers from our investigation, it seems that Google needs to do much more to gather hidden resources, not less. (Granted, the OAI for Sitemaps feature may not have been an appropriate approach for Google.)

We are very interested in others' evaluation of our data crunching. We would also like to encourage other OAI aggregators to run their metadata against the Google index, to prove or disprove our conclusions. Our source code and raw data are available upon request.

Acknowledgements

This research draws on data provided by the University Research Program for Google Search, a service provided by Google to promote a greater common understanding of the web.

Bibliography

[1] McCown, F., Liu, X., Nelson, M. L., and Zubair, M. "Search engine coverage of the OAI-PMH corpus." IEEE Internet Computing 10:2 (March/April 2006) pp. 66-73. <http://doi.ieeecomputersociety.org/10.1109/MIC.2006.41>.

[2] OAIster website. Accessed June 20, 2008. <http://www.oaister.org/>.

[3] University Research Program for Google Search website. Accessed June 19, 2008. <http://research.google.com/university/search/>.

[4] Info request for "http://projecteuclid.org/euclid.bams/1183524923" in Google search index. Accessed June 19, 2008. <http://www.google.com/search?q=info:http://projecteuclid.org/euclid.bams/1183524923>.

[5] Info request for "http://projecteuclid.org/handle/euclid.bams/1183524923" in Google search index. Accessed June 19, 2008. <http://www.google.com/search?q=info:http://projecteuclid.org/handle/euclid.bams/1183524923>.

[6] Mueller, J. "Retiring support for OAI-PMH in Sitemaps." Google Webmaster Central Blog (April 23, 2008). <http://googlewebmastercentral.blogspot.com/2008/04/retiring-support-for-oai-pmh-in.html>.

Note

1. The use of the term "handle" here does not refer to the Handle System®.

Copyright © 2008 Kat Hagedorn and Joshua Santelli
doi:10.1045/july2008-hagedorn