D-Lib Magazine, November/December 2014

GROTOAP2: The Methodology of Creating a Large Ground Truth Dataset of Scientific Articles
Dominika Tkaczyk

Abstract

Scientific literature analysis improves knowledge propagation and plays a key role in understanding and assessing scholarly communication in the scientific world. In recent years many tools and services for analysing the content of scientific articles have been developed. One of the most important tasks in this research area is understanding the roles of different parts of a document. It is impossible to build effective solutions for problems related to document fragment classification, or to evaluate their performance, without a reliable test set that contains both input documents and the expected results of classification. In this paper we present GROTOAP2, a large dataset of ground truth files containing labelled fragments of scientific articles in PDF format, useful for training and evaluating solutions for document content analysis. GROTOAP2 was successfully used for training CERMINE, our system for extracting metadata and content from scientific articles. The dataset is based on articles from the PubMed Central Open Access Subset. GROTOAP2 is published under an Open Access license. The semi-automatic method used to construct GROTOAP2 is scalable and can be adjusted for building large datasets from other data sources. The article presents the content of GROTOAP2, describes the entire creation process, and reports the evaluation methodology and results.

Keywords: Document Content Analysis, Zone Classification, System Evaluation, Ground Truth Dataset

1. Introduction

The analysis of scientific literature accelerates the spread of ideas and knowledge and plays a key role in many areas of research, such as understanding scholarly communication, assessing scientific results, identifying important research centers or finding interesting unexplored research possibilities.
Scientific literature analysis supports a number of tasks related to metadata and information extraction, organizing scientific data, providing intelligent search tools, finding similar and related documents, building citation networks, and many more. One of the most important tasks in this research area is understanding the roles of different fragments of documents, which is often referred to as document zone or block classification. An efficient zone classification solution needs to be carefully evaluated, which requires a reliable test set containing multiple examples of input documents and the expected results of classification. Even narrowed to analysing scientific publications only, document zone classification is still a challenging problem, mainly due to the vast diversity of document layouts and styles. For example, a random subset of 125,000 documents from PubMed Central contains scientific publications from nearly 500 different publishers, many of which use original layouts and styles in their articles. Therefore it is nearly impossible to develop a high-quality, generic zone classification solution without a large volume of ground truth data based on a diverse document set. The role played by a given document fragment can be deduced not only from its text content, but also from the way the text is displayed to the readers. As a result, a document zone classifier can benefit greatly from using not only the text content of objects, but also geometric features, such as object dimensions, positions on the page and in the document, formatting, object neighbourhood, distances between objects, etc. A dataset useful for training such a classifier must therefore preserve the information related to object size, position and distance. In this paper we present GROTOAP2 (GROund Truth for Open Access Publications), a large and diverse test set built upon a group of scientific articles from the PubMed Central Open Access Subset.
GROTOAP2 contains 13,210 ground truth files representing the content of publications in the form of a hierarchical structure. The structure stores the full text of a publication's source PDF file, preserves the geometric features of the text, and also contains zone labels. The corresponding PDF files are not directly included in the dataset, but can be easily obtained online using the provided scripts. GROTOAP2 is distributed under the CC-BY license in the Open Access model and is available for download online. The method used to create GROTOAP2 is semi-automatic and requires a short manual phase, but does not include manual correction of every document, and therefore can scale easily to produce larger sets. The method can also be adapted to produce similar datasets from other data sources. The GROTOAP2 test set is very useful for adapting, training and evaluating the performance of document analysis solutions, such as zone classification. The test set was built as a part of the implementation of CERMINE [8], a comprehensive open source system for extracting metadata and parsed bibliographic references from scientific articles in born-digital form. GROTOAP2 has been successfully used for training and performance evaluation of CERMINE's extraction process and its two zone classifiers. The whole system is available online. In the rest of the paper we present the content of GROTOAP2, compare our solution to existing ones and discuss its advantages and drawbacks. We also describe in detail the semi-automatic process of creating the dataset, and report the evaluation methodology and results.

2. Previous Work

Existing test sets containing ground truth data useful for zone classification are usually based on scanned document images rather than born-digital documents. For example, UW-III contains various document images along with structure-related ground truth information. Unfortunately, UW-III is not free and is difficult to purchase.
MARG is a dataset containing scanned pages from biomedical journals. Its main problem is that it contains only the first pages of publications and only a small subset of zones, and as a result its usability for performance evaluation of page segmentation and zone classification is very limited. The PRImA [1] dataset is also based on document images of various types and layouts, not only scientific papers. Other datasets built upon scanned document images of various layouts are: the MediaTeam Oulu Document Database [5], the UvA dataset and Tobacco800 [3]. The Open Access Subset of PubMed Central contains around 500,000 life sciences publications in PDF format, as well as their metadata in associated NLM files. NLM files contain a rich set of document metadata (title, authors, affiliations, abstract, journal name, etc.), the full text (sections, headers and paragraphs, tables and equations, etc.), and also the document's references with their metadata. This makes PMC a valuable set for evaluating document analysis algorithms. Unfortunately, the provided metadata contains only labelled text content and lacks information about the way the text is displayed in the PDF files. GROTOAP2 is a successor of GROTOAP [7], a very similar, but much smaller, semi-automatically created test set. GROTOAP was created with the use of zone classifiers provided by CERMINE. First, a small set of documents was labelled manually and used to train the classifiers. Then, a larger set was labelled automatically by the retrained classifiers, and finally the results were corrected by human experts. Since GROTOAP's creation process required a manual correction of every document by an expert, the resulting test set is relatively small, and every attempt to expand it is time-consuming and expensive. Due to its small size and lack of diversity, algorithms trained on GROTOAP did not generalize well enough and performed worse on diverse sets. In GROTOAP2 all these drawbacks were removed.
The dataset is based on born-digital PDF documents from PMC and preserves all the geometric features of objects. In contrast to GROTOAP, the method used to construct GROTOAP2 is scalable and efficient, which allowed for constructing a much larger and more diverse dataset. Table 1 compares the basic parameters of GROTOAP and GROTOAP2 and shows the difference in size and diversity.
Table 1: The comparison of the parameters of the GROTOAP and GROTOAP2 datasets. The table shows the numbers of different publishers included in both datasets, as well as the numbers of documents, pages, zones and zone labels.

3. GROTOAP2 Dataset

GROTOAP2 is based on documents from the PubMed Central Open Access Subset.
The dataset contains 13,210 documents with 119,334 pages and 1,640,973 zones in total, which gives an average of 9.03 pages per document and 13.75 zones per page. GROTOAP2 is a diverse dataset and contains documents from 208 different publishers and 1,170 different journals. Table 2 and Figure 1 show the most popular journals and the most popular publishers included in the dataset.

Figure 1: The most popular publishers included in the GROTOAP2 test set. The abbreviations stand for: BMC, BioMed Central; PLOS, Public Library of Science; OUP, Oxford University Press; RU Press, The Rockefeller University Press; Hindawi, Hindawi Publishing Corporation; IUCr, International Union of Crystallography; Springer, Springer Verlag; FRF, Frontiers Research Foundation; MDPI, Molecular Diversity Preservation International; DMP, Dove Medical Press.
Table 2: The most popular journals in the GROTOAP2 test set. The table shows how many documents from each journal are included in the dataset, as both the exact number and the fraction of the dataset.

3.1 Ground truth structure

The main part of GROTOAP2 is a set of ground truth files built from scholarly articles in PDF format. A ground truth file contains a hierarchical structure that holds the content of an article while preserving the information about how the elements are displayed in the corresponding PDF file. The hierarchical structure represents an article as a list of pages, each page contains a list of zones, each zone contains a list of lines, each line contains a list of words, and finally each word contains a list of characters. Each structure element is described by its text content, position on the page and dimensions. The structure stored in a ground truth file also contains the natural reading order for all structure elements. Additionally, labels describing the role in the document are assigned to zones. The smallest elements in the structure are individual characters. A word is a continuous sequence of characters placed in one line with no spaces between them. Punctuation marks and typographical symbols can be separate words or parts of adjacent words, depending on the presence of spaces. Hyphenated words that are divided into two lines appear in the structure as two separate words that belong to different lines. A line is a sequence of words that forms a consistent fragment of the document's text. Words placed geometrically in the same line of the page, but belonging to neighbouring columns, do not belong to the same line. A zone is a consistent fragment of the document's text, geometrically separated from surrounding fragments and not divided into paragraphs or columns. All bounding boxes are rectangles with edges parallel to the page's edges. A bounding box is defined by two points: the upper-left corner and the lower-right corner of the rectangle.
The coordinates are given in typographic points (1 typographic point equals 1/72 of an inch). The origin of the coordinate system is the upper-left corner of the page. Every zone is labelled with one of 22 labels.
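The hierarchy, coordinate conventions and zone labels described above can be sketched as plain data classes. This is an illustrative model only; the class and field names are hypothetical and do not come from CERMINE or the TrueViz schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class BoundingBox:
    # Coordinates in typographic points (1 pt = 1/72 inch);
    # the origin is the upper-left corner of the page.
    x1: float  # upper-left corner x
    y1: float  # upper-left corner y
    x2: float  # lower-right corner x
    y2: float  # lower-right corner y

    def width(self) -> float:
        return self.x2 - self.x1

    def height(self) -> float:
        return self.y2 - self.y1

@dataclass
class Word:
    text: str

@dataclass
class Line:
    words: List[Word] = field(default_factory=list)

@dataclass
class Zone:
    label: str = "unknown"  # one of the 22 zone labels, or "unknown"
    lines: List[Line] = field(default_factory=list)

@dataclass
class Page:
    zones: List[Zone] = field(default_factory=list)
```

In the actual dataset every level down to individual characters carries its own bounding box; the sketch attaches one only to illustrate the two-corner convention.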
A zone that contains only the title of a section, such as "Abstract", "Acknowledgments" or "References", is labelled with the same label as the section itself. Figure 2 shows the fractions of documents from the dataset that contain a given label.

Figure 2: The diagram lists the labels in the GROTOAP2 dataset, and for each label shows the fraction of documents in the dataset that contain zones with that label.

3.2 Ground truth file format

We use the TrueViz format [2] for storing ground truth files. TrueViz is an XML-based format that can store the geometric and logical structure of the document, including pages, zones, lines, words and characters, their content and bounding boxes, as well as zone labels and the order of the elements. The listing below shows a fragment of an example ground truth file from GROTOAP2; repeated fragments and fragments that are not filled have been omitted. The XML-based format makes ground truth files easily readable by machines, but also makes them grow to an enormous size, much greater than the related PDFs. To limit the size of the dataset, only the ground truth files were included. The corresponding PDF files can be easily downloaded using the provided URL list or directly using a bash script.

4. The Method of Building the GROTOAP2 Dataset

The GROTOAP2 dataset was created semi-automatically from PubMed Central resources (Figure 3).
Figure 3: The process of creating the GROTOAP2 dataset. First, PDF files from PMC were processed in order to extract characters, words, lines and zones. Then, the text content was matched against the annotated NLM files from PMC, which resulted in zone labelling. Finally, some of the most often repeated errors were removed by simple rules.

4.1 Ground truth generation

PDF and NLM files downloaded from the PubMed Central Open Access Subset were used to generate the geometric hierarchical structures of the publications' content, which were stored in ground truth files using the TrueViz format [2]. In the first phase of the ground truth generation process we used automatic tools provided by CERMINE [8]. First, the characters were extracted from the PDF files. Then, the characters were grouped into words, words into lines, and finally lines into zones. After that, reading order analysis was performed, so that the elements at each hierarchy level are stored in the order reflecting how people read manuscripts. In the second phase the text content of each previously extracted zone was matched against the labelled fragments extracted from the corresponding NLM files. We used the Smith-Waterman sequence alignment algorithm [6] to measure the similarity of two text strings. For every zone, the string with the highest similarity score above a certain threshold was chosen from all the strings extracted from the NLM file. The label of the chosen string was then used to assign a functional label to the zone. If this approach failed, the process tried to use an "accumulated" distance, which makes it possible to assign a label to zones that together form one NLM entry, but were segmented into several parts. If none of the similarity scores exceeded the threshold value, the zone was labelled as unknown. After processing the entire page, an additional attempt was made to assign a label to every unknown zone based on the labels of the neighbouring zones.
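The matching step described above can be sketched as follows. This is a simplified illustration: the scoring parameters, the normalization by the zone's maximum possible score, and the threshold value are assumptions, not CERMINE's actual settings:

```python
def smith_waterman_score(a: str, b: str, match=2, mismatch=-1, gap=-1) -> int:
    """Smith-Waterman local alignment score [6] between two strings."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            s = max(0,
                    prev[j - 1] + (match if ca == cb else mismatch),
                    prev[j] + gap,       # gap in b
                    cur[j - 1] + gap)    # gap in a
            cur.append(s)
            best = max(best, s)
        prev = cur
    return best

def label_zone(zone_text, nlm_entries, threshold=0.9, match=2):
    """Pick the label of the most similar NLM entry, or 'unknown'.

    nlm_entries: list of (label, text) pairs extracted from the NLM file.
    Scores are normalized by the maximum possible score for the zone text,
    so 1.0 means a perfect local match of the whole zone.
    """
    best_label, best_sim = "unknown", 0.0
    max_possible = match * max(len(zone_text), 1)
    for label, text in nlm_entries:
        sim = smith_waterman_score(zone_text, text, match=match) / max_possible
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label if best_sim >= threshold else "unknown"
```

The "accumulated" distance fallback and the neighbour-based relabelling are not shown; they operate on top of the same per-zone scores.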
4.2 Filtering files

The data in NLM files vary greatly in quality, from perfectly labelled down to containing no valuable information. Poor quality NLMs result in sparsely labelled zones in the generated TrueViz files, as the labelling process has no data to compare the zone text content to. Hence, it was necessary to select the documents whose zones were labelled to a satisfactory degree. Figure 4 shows a histogram of documents by the percentage of zones labelled with concrete classes. Many documents (43%) have more than 90% of their zones labelled, and only those documents were selected for further processing steps.

Figure 4: Histogram of documents in the dataset having a given percentage of zones with an assigned class value.

We also wanted to be sure that the layout distributions in the entire processed set and the selected subset are similar. The layout distribution in a certain document set can be approximated by the publisher or journal distribution. If poor quality metadata was associated with particular publishers or journals, choosing only highly covered documents could result in eliminating particular layouts, which was to be avoided. We calculated the similarity of the publisher distributions of two sets A and B using the following formula:

    sim(A, B) = Σp∈P min(dA(p), dB(p))

where P is the set of all publishers in the dataset and dA(p) and dB(p) are the percentage shares of a given publisher p in sets A and B, respectively. The formula yields 1.0 for identical distributions, and 0.0 in the case of two sets which do not share any publishers. The same formula can be used to calculate the similarity with respect to the journal distribution. In our case the similarity of the publisher distributions of the entire processed set and the subset of documents with at least 90% labelled zones is 0.78, and the similarity of the journal distributions is 0.70, thus the distributions are indeed similar.
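The distribution similarity used above sums, over all publishers, the smaller of the two percentage shares, which gives 1.0 for identical distributions and 0.0 for sets sharing no publishers. A minimal sketch:

```python
from collections import Counter

def distribution_similarity(set_a, set_b):
    """Similarity of publisher (or journal) distributions of two document sets.

    set_a, set_b: lists of publisher names, one entry per document.
    Returns 1.0 for identical distributions and 0.0 for sets that
    share no publishers.
    """
    da, db = Counter(set_a), Counter(set_b)
    total_a, total_b = sum(da.values()), sum(db.values())
    publishers = set(da) | set(db)  # Counter returns 0 for missing keys
    return sum(min(da[p] / total_a, db[p] / total_b) for p in publishers)
```

The same function applies unchanged to journal distributions by passing journal names instead of publisher names.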
4.3 Ground truth improvement

After filtering the files we randomly chose a sample of 50 documents, which were subjected to a manual inspection by a human expert. The sample was large enough to show the common problems occurring in the dataset, and small enough to be manually analyzed within a reasonable time. The inspection revealed a number of repeated errors made by the automatic annotation process. At this point we decided to develop a few simple heuristic-based rules, which would significantly reduce the error rate in the final dataset. Some examples of the rules are:
The inspection also revealed segmentation problems in a small fraction of pages. The most common issue was incorrectly constructed zones, for example when the segmentation process put every line in a separate zone. Those errors were also corrected automatically, by joining zones that are close vertically and have the same label. From the dataset improved by the heuristic-based rules we randomly chose 13,210 documents for the final set.

4.4 Application to other datasets

The process can be reused to construct a dataset containing labelled zones from a different set of documents. It requires the source document files in PDF format and any form of annotated textual data that can be matched against the sources' content. In the reused ground truth generation process the automatic matching step can be used out of the box, while the manual inspection step has to be repeated, and possibly a different set of rules has to be developed and applied. Fortunately, the process does not require manual correction of every document, and scales well even to very large collections of data. The proposed method cannot be applied if there is no additional annotated data available. In such cases the dataset can be built in a non-scalable way using a machine learning-based classifier, similarly to the way the first version of GROTOAP was created.

5. Evaluation

The process of creating GROTOAP2 did not include manual inspection of every document by a human expert, which allowed us to create a large dataset, but also caused the following problems:
The GROTOAP2 dataset was evaluated in order to estimate how accurate the labelling in the ground truth files is. Two kinds of evaluation were performed: a direct one, which included manual evaluation by a human expert, and an indirect one, which included evaluating the performance of the CERMINE system trained on the GROTOAP2 dataset.

5.1 Manual Evaluation

For the direct manual evaluation we chose a random sample of 50 documents (different from the sample used to construct the rules). We evaluated two document sets: the files obtained before applying the heuristic-based rules and the same documents from the final dataset. The groups contain 6,228 and 5,813 zones in total, respectively (the difference is related to the zone merging step, which reduces the overall number of zones). In both groups the errors were corrected by a human expert, and the original files were compared to the corrected ones, which gave the precision and recall values of the annotation process for each zone label at the two stages of the process. The overall accuracy of the annotation process increased from 0.78 to 0.93 after applying the heuristic rules. More details about the results of the evaluation can be found in Table 3.
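The per-label precision and recall used in this comparison can be computed by pairing each automatically assigned label with its expert-corrected counterpart. A minimal sketch, assuming the zone-by-zone alignment between the two files is already given:

```python
from collections import Counter

def precision_recall(auto_labels, correct_labels):
    """Per-label precision and recall of the annotation process.

    auto_labels: labels produced by the automatic annotation, one per zone.
    correct_labels: the expert-corrected labels for the same zones.
    Returns {label: (precision, recall)}.
    """
    tp = Counter()  # zones where the automatic label matched the correct one
    for a, c in zip(auto_labels, correct_labels):
        if a == c:
            tp[a] += 1
    assigned = Counter(auto_labels)    # how often each label was assigned
    expected = Counter(correct_labels) # how often each label should appear
    return {
        label: (tp[label] / assigned[label] if assigned[label] else 0.0,
                tp[label] / expected[label] if expected[label] else 0.0)
        for label in set(assigned) | set(expected)
    }
```

The overall accuracy reported above is simply the fraction of zones whose automatic label matched the corrected one.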
Table 3: The results of the manual evaluation of GROTOAP2. The table shows the precision, recall and F1 values for every zone label for the annotation process without and with the heuristic-based rules. The correct labels for the zones were provided by a human expert. No values appear for labels that were not present in the automatically annotated dataset.

5.2 CERMINE-based Evaluation

For the indirect evaluation we randomly chose a sample of 1,000 documents from GROTOAP2 and used them to train CERMINE [8], our system for extracting metadata and content from scientific publications. CERMINE is able to process documents in PDF format and extracts: the document's metadata (including title, authors, emails, affiliations, abstract, keywords, journal name, volume, issue, pages range and year), parsed bibliographic references, and the structure of the document's sections, section titles and paragraphs. CERMINE is based on a modular workflow composed of a number of steps. Most implementations utilize supervised and unsupervised machine-learning techniques, which increases the maintainability of the system, as well as its ability to adapt to new document layouts. The parts of CERMINE that have the strongest impact on the extraction results are the page segmenter and the two zone classifiers. The page segmenter recognizes words, lines and zones in the document using the Docstrum algorithm [4]. The initial zone classifier labels zones with one of four general categories: metadata, references, body and other. The metadata zone classifier assigns specific metadata classes to metadata zones. Both zone classifiers are based on Support Vector Machines. More details can be found in [8]. We evaluated the overall metadata extraction performance of CERMINE with the retrained zone classifiers on 500 PDF documents randomly chosen from PMC. The details of the evaluation methodology can be found in [8]. The results are shown in Table 4.
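The idea behind feature-based zone classification can be illustrated with a dependency-free nearest-centroid stand-in. CERMINE's actual classifiers are SVMs, and the features below are illustrative guesses, not CERMINE's real feature set:

```python
import math

def zone_features(zone):
    """A few illustrative geometric/textual features of a zone.
    zone is a dict with 'box' = (x1, y1, x2, y2) in typographic
    points and 'text' = the zone's text content."""
    x1, y1, x2, y2 = zone["box"]
    return [
        y1,                        # vertical position on the page
        x2 - x1,                   # zone width
        y2 - y1,                   # zone height
        float(len(zone["text"])),  # amount of text
    ]

class NearestCentroidZoneClassifier:
    """Stand-in for an SVM zone classifier: assigns the category
    whose mean feature vector is closest to the zone's features."""

    def fit(self, zones, labels):
        sums, counts = {}, {}
        for zone, label in zip(zones, labels):
            feats = zone_features(zone)
            acc = sums.setdefault(label, [0.0] * len(feats))
            for i, value in enumerate(feats):
                acc[i] += value
            counts[label] = counts.get(label, 0) + 1
        self.centroids = {label: [v / counts[label] for v in acc]
                          for label, acc in sums.items()}
        return self

    def predict(self, zone):
        feats = zone_features(zone)
        return min(self.centroids,
                   key=lambda label: math.dist(feats, self.centroids[label]))
```

Even this crude sketch shows why geometric features matter: metadata zones tend to sit near the top of the first page and carry little text, while body zones are larger and placed lower.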
The retrained system obtained a good mean F1 score of 79.34%, which demonstrates the usefulness of the GROTOAP2 dataset.
Table 4: The results of the evaluation of the CERMINE system trained on the GROTOAP2 dataset. The table shows the precision, recall and F1 values for the extraction of various metadata types, as well as the mean precision, recall and F1 values.

We also compared the overall performance of the CERMINE system trained on the entire GROTOAP dataset and on 1,000 randomly chosen documents from the GROTOAP2 dataset in two versions: before and after applying the correction rules. The system achieved an average F1 score of 62.41% when trained on GROTOAP, 75.38% when trained on GROTOAP2 before applying the rules, and 79.34% when trained on the final version of GROTOAP2. The results are shown in Table 5.
Table 5: The comparison of the performance of CERMINE trained on GROTOAP, on GROTOAP2 before applying the improvement rules, and on the final GROTOAP2. The table shows the mean precision, recall and F1 values calculated as an average of the values for the individual metadata classes.

6. Conclusions and Future Work

We presented GROTOAP2, a test set useful for training and evaluating solutions for content analysis tasks such as zone classification. We described in detail the content of the test set and the automatic process of creating it, and we also discussed its advantages and drawbacks. GROTOAP2 contains 13,210 scientific publications in the form of a hierarchical geometric structure. The structure preserves the entire text content of the corresponding PDF documents, the geometric features of all the objects, the reading order of the elements, and the labels denoting the role of text fragments in the document. The method used to create GROTOAP2 is semi-automatic, but does not require manual correction of every document. As a result the method is highly scalable and makes it possible to create large datasets, but the resulting set may contain labelling errors. The manual evaluation of GROTOAP2 showed that the labelling in the dataset is 93% accurate. Despite the errors, and thanks to the large volume of GROTOAP2, the dataset is still very useful, which we showed by evaluating the performance of the CERMINE system trained on various versions of the dataset. The main features distinguishing GROTOAP2 from earlier efforts are:
Our future plans include:
Acknowledgements

This work has been partially supported by the European Commission as part of the FP7 project OpenAIREplus (grant no. 283595).

References

[1] A. Antonacopoulos, D. Bridson, C. Papadopoulos, and S. Pletschacher. A Realistic Dataset for Performance Evaluation of Document Layout Analysis. 2009 10th International Conference on Document Analysis and Recognition, pages 296-300, 2009. http://doi.org/10.1109/ICDAR.2009.271

[2] C. H. Lee and T. Kanungo. The architecture of TrueViz: a groundTRUth/metadata editing and VIsualiZing ToolKit. Pattern Recognition, 15, 2002. http://doi.org/10.1016/S0031-3203(02)00101-2

[3] D. Lewis, G. Agam, S. Argamon, O. Frieder, D. Grossman, and J. Heard. Building a Test Collection for Complex Document Information Processing. In Proc. 29th Annual Int. ACM SIGIR Conference, pages 665-666, 2006. http://doi.org/10.1145/1148170.1148307

[4] L. O'Gorman. The document spectrum for page layout analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(11):1162-1173, 1993. http://doi.org/10.1109/34.244677

[5] J. Sauvola and H. Kauniskangas. MediaTeam Document Database II, a CD-ROM collection of document images, University of Oulu, Finland, 1999.

[6] T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195-197, 1981. http://doi.org/10.1016/0022-2836(81)90087-5

[7] D. Tkaczyk, A. Czeczko, K. Rusek, L. Bolikowski, and R. Bogacewicz. GROTOAP: ground truth for open access publications. In 12th ACM/IEEE-CS Joint Conference on Digital Libraries, pages 381-382, 2012. http://doi.org/10.1145/2232817.2232901

[8] D. Tkaczyk, P. Szostek, P. J. Dendek, M. Fedoryszak, and L. Bolikowski. CERMINE - automatic extraction of metadata and references from scientific literature. In Proceedings of the 11th IAPR International Workshop on Document Analysis Systems, pages 217-221, 2014. http://doi.org/10.1109/DAS.2014.63

About the Authors