D-Lib Magazine, November/December 2014

Towards a Marketplace for the Scientific Community: Accessing Knowledge from the Computer Science Domain
Mark Kröll, Stefan Klampfl and Roman Kern

Abstract

As scientific output is constantly growing, keeping track of it is becoming more and more important, not only for researchers but also for other scientific stakeholders such as funding agencies or research companies. Each stakeholder values different types of information. A funding agency, for instance, might be primarily interested in the number of publications funded by its grants. Information extraction approaches, however, tend to be rather researcher-centric, indicated, for example, by the types of named entities they recognize. In this paper we account for additional perspectives by proposing an ontological description of one scientific domain, the computer science domain. We accordingly annotated a set of 22 computer science papers by hand and make this data set publicly available. In addition, we have started to apply methods to automatically extract instances and report preliminary results. Taking various stakeholders' interests into account as well as automating the mining process are prerequisites for our vision of a "Marketplace for the Scientific Community", where stakeholders can exchange not only information but also search concepts or annotated data.

Keywords: Scientific Stakeholders, Marketplace, Computer Science Domain, Scientific Publications, Knowledge Acquisition

1. Introduction

The scientific community produces output by publishing its activities and achievements, for instance, in journal articles. As this scientific output is constantly growing, it is getting more and more difficult to keep track of it, not only for researchers but also for other scientific stakeholders such as funding agencies, research companies or science journalists. Each stakeholder values different types of information. Research assistants such as BioRAT [2] or FACTA [9] have so far been rather researcher-centric, i.e. they scan scientific publications for interesting facts in their respective domain. In contrast, a funding agency might be interested in grant information in combination with the research team, while a research company might be more interested in which algorithm works best for a specific data set. We account for these different perspectives by ontologically describing one scientific domain, the computer science domain. Guided by questions such as "What is considered valuable information from a stakeholder's viewpoint?", we introduce (i) stakeholder-specific categories such as "Funding Information" or "License" and (ii) more general categories such as "Research Team" or "Performance". We then add linkage information between these categories to describe factual knowledge in the form of triples. Linking these facts together into information blocks creates value, for example, by briefly summarizing scientific content. Information blocks can vary from stakeholder to stakeholder: a research company might be interested in the information block {"Task", "Algorithm", "Software", "License"}, while a graduate student might be interested in {"Scientific Event", "Task", "Corpus", "Algorithm", "Performance", "Numeric Value"}. We consider these categories and their combinations useful and a prerequisite for providing an interactive marketplace where scientific stakeholders can exchange not only information but also manually annotated data of high quality. According to our ontological description we manually annotated a data set containing 22 computer science publications.
Annotating these scientific publications served two purposes. The first was to refine and update our ontological description by examining real-world data and thus to get a better understanding of which instances to expect and how to extract them. The second was to create a small data set for testing and evaluating learnt models that automate instance as well as relation extraction. In a first step we applied simple extraction methods such as regular expressions and gazetteer-based approaches to five selected categories. Finally, we report our observations with respect to category characteristics and discuss the achieved results as well as future steps.

2. Ontological Description of the Computer Science Domain

Scientific publications constitute an extremely valuable body of knowledge whose explication aids scientific processes including state-of-the-art research, research comparison or re-use. Explicating scientific knowledge includes extracting facts from large amounts of data [7]. However, efforts to automatically extract knowledge from scientific domains have so far been rather researcher-centric. Medical entity recognition [1] focuses on classes such as "Disease", "Symptom" or "Drug". In bioinformatics [10], the focus is on identifying biological entities, for example, instances of "Protein", "DNA" or "Cell Line", and extracting the relations between these entities as facts or events. Departing from the mere content level, Liakata et al. [4] introduced a different approach by focusing on the discourse structure to characterize the knowledge conveyed within the text. For this purpose, the authors identified 11 core scientific concepts including "Motivation", "Result" or "Conclusion". Ravenscroft et al. [6] present the Partridge system, which automatically categorizes articles according to their types such as "Review" or "Case Study". Teufel and Moens [8] use rhetorical elements of a scientific article such as "Aim" or "Contrast" to generate summaries. In a similar attempt, we ontologically describe the computer science domain with respect to various stakeholders by taking into account their motivations for scanning scientific publications. Scientific stakeholders we considered during the design process include researchers, graduate students, funding agencies, research companies and science journalists.
To account for these perspectives we introduce 15 categories, which are described in Table 1.

Table 1: Categories of the Computer Science Domain.
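The categories in Table 1 combine into the stakeholder-specific information blocks mentioned in the introduction. The following minimal sketch (in Python; the stakeholder keys, the filtering helper and the sample annotations are purely illustrative and not part of the proposed ontology) shows how such blocks could be represented as plain sets of category labels:

```python
# Illustrative only: information blocks as sets of category labels.
# The two blocks below are the examples quoted in the introduction;
# the helper function and the sample annotations are hypothetical.

INFORMATION_BLOCKS = {
    "research company": {"Task", "Algorithm", "Software", "License"},
    "graduate student": {"Scientific Event", "Task", "Corpus",
                         "Algorithm", "Performance", "Numeric Value"},
}

def filter_for_stakeholder(annotations, stakeholder):
    """Keep only annotated instances whose category belongs to the
    stakeholder's information block."""
    block = INFORMATION_BLOCKS[stakeholder]
    return [a for a in annotations if a["category"] in block]

if __name__ == "__main__":
    sample = [
        {"category": "Algorithm", "text": "conditional random fields"},
        {"category": "License", "text": "GPL"},
        {"category": "Research Team", "text": "NLP group"},
    ]
    # Keeps only the "Algorithm" and "License" annotations.
    print(filter_for_stakeholder(sample, "research company"))
```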
In Figure 1 we present an ontological description of the computer science domain to explicate knowledge and thereby make it accessible and usable. The domain model contains categories as well as linkage information. Relation categories include "Achieve", "AppliedTo", "Develop", "Generate", "Get", "Have", "RelevantFor", "Require", "TrainedOn", "Use" and "WorkIn". Relations between categories are required to describe factual knowledge, which can also aid the process of generating summaries of scientific content [5].

Figure 1: Ontological Description of the Computer Science Domain.

Figure 2 illustrates an annotation example with reference to our ontological description. The example shows a potential population of the information block {"Scientific Event", "Task", "Corpus", "Algorithm", "Performance", "Numeric Value"} mentioned in the introduction. This kind of factual knowledge might be interesting, for instance, for graduate students compiling the state of the art. To generate real value, the extraction process needs to be automated so that larger amounts of scientific publications can be handled; the proposed domain model in combination with the manually annotated data set represents a first step in this direction.

Figure 2: Annotation Example using the proposed Ontological Description of the Computer Science Domain. We used the BRAT annotation tool in this example.

Finally, we point out additional benefits of an ontological description, including (i) support for reasoning systems by using constraints and (ii) guidance for automated acquisition processes. We do not claim that the ontological description is complete, yet we firmly believe that the proposed categories and their relationships capture knowledge which is beneficial for various scientific stakeholders.

3. Data Set

To automatically extract the textual content from scientific publications, we developed a processing pipeline [3] that applies unsupervised machine learning techniques in combination with heuristics to detect the logical structure of a PDF document. For annotation purposes we used the BRAT rapid annotation tool, which provided us with an easy-to-use annotation interface. The annotation process was preceded by a setup phase in which we examined a couple of papers to create a common understanding of the categories. After that, two of the authors manually annotated 22 computer science papers (11 papers each) from various subdomains including information extraction, computer graphics and hardware architecture to broaden the instance variety. In total the data set contains 5,353 annotations: 4,773 category annotations and 580 relation annotations. Table 2 provides statistics about the unique instances per category.

Table 2: Number of (unique) instances per category.
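For illustration, an entity and relation annotation in BRAT's standoff format (the .ann files) could look as follows. The character offsets and covered texts are hypothetical and not taken from the data set; the entity and relation categories are those of the proposed domain model:

```
T1	Algorithm 112 141	conditional random field model
T2	Task 156 180	named entity recognition
T3	Corpus 195 209	CoNLL-2003 data
R1	AppliedTo Arg1:T1 Arg2:T2
R2	TrainedOn Arg1:T1 Arg2:T3
```

Each T line carries a category, the character offsets into the corresponding .txt file and the covered text; each R line links two entity annotations via one of the relation categories from Figure 1.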
The zip file containing the data set can be downloaded here. Besides the .txt files and the respective .ann files, the zip container includes an annotation.conf file for the BRAT viewer as well as information about the annotated publications.

4. Automated Approach

The categories introduced in Table 1 exhibit different characteristics which can be taken into account when selecting automated (learning) methods. Some of these categories, such as "Corpus", "Performance" or "Format", resemble named entity categories in the classical sense, such as "Person" or "Location". Instance additions are the exception rather than the rule, making these categories nearly closed. This closeness property makes them accessible to simple gazetteer-based approaches. Categories such as "Numeric Value" or "Funding Information" encompass characteristic instances, for instance, instances containing digits or occurring at certain positions or in certain sections of a publication. Such characteristics can be exploited by regular expressions. Categories such as "Algorithm" or "Task" contain a broad variety of instances. For example, instances of the "Task" category include noun phrases such as "document analysis" as well as verb phrases such as "acquiring common sense knowledge", which makes automated recognition challenging. In the following section we apply regular expressions and gazetteer-based approaches to our data set to automatically extract instances of the following categories: "Algorithm", "Performance", "Numeric Value", "Corpus" and "Funding Information".

4.1 Preliminary Results

For the categories "Algorithm", "Corpus" and "Performance" we applied a gazetteer-based approach and crawled respective lists online. Wikipedia, for instance, offers lists of algorithms used in the computer science domain. We manually reviewed the acquired category instances and discarded (i) incorrect entries, (ii) overly general entries, for instance, "scoring algorithm", and (iii) ambiguous entries such as "birch", which can refer to either the clustering algorithm or the tree. Our gazetteer lists contain the following numbers of entries: 1193 for the category "Algorithm", 105 for the category "Corpus" and 85 for the category "Performance". To pre-process the textual content, we applied (i) tokenization, (ii) sentence splitting, (iii) normalization & stemming and (iv) part-of-speech tagging. The gazetteer-based annotator operates on stems and matches the longest sequence only, so that a gazetteer entry "support vector machine" would match the plural "support vector machines" but not the phrase "support vector". After examining instance characteristics of the categories "Numeric Value" and "Funding Information", we opted for a regular-expression based approach. Regular expressions to extract "Numeric Value" instances were designed to include instances such as "60 million", "50 MHz" and "2.5 orders of magnitude", while neglecting numbers which are part of references or of instances such as "Figure 1". Funding information such as grants is almost always provided at the end of a paper and exhibits a clear structure, i.e. typically a noun phrase containing digits, indicated by trigger phrases such as "<supported | funded | financed><by>". Recognition results for these five selected NE categories are shown in Table 3.

Table 3: Precision, Recall and F-Scores for selected categories.
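To make these two simple approaches more concrete, the following Python sketch combines (a) a gazetteer-based annotator that operates on stems and prefers the longest match with (b) regular expressions in the spirit of the "Numeric Value" and funding-trigger patterns described above. The tiny gazetteer, the naive stemmer and the concrete patterns are illustrative assumptions, not the implementation used for the reported results:

```python
import re

# (a) Gazetteer matching on stems, longest match first.
# Illustrative entries only; the real lists contain 1193 / 105 / 85 entries.
GAZETTEER = {
    "Algorithm": ["support vector machine", "conditional random field"],
    "Corpus": ["conll-2003"],
    "Performance": ["f-score", "precision", "recall"],
}

def stem(token):
    # Naive stand-in for a real stemmer (e.g. Porter): strip a plural "s".
    token = token.lower()
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def gazetteer_annotate(tokens):
    """Return (start, end, category) token spans; the longest stem sequence wins."""
    stems = [stem(t) for t in tokens]
    entries = [(tuple(stem(t) for t in entry.split()), category)
               for category, lst in GAZETTEER.items() for entry in lst]
    entries.sort(key=lambda e: len(e[0]), reverse=True)  # longest entries first
    spans, i = [], 0
    while i < len(stems):
        for entry, category in entries:
            if tuple(stems[i:i + len(entry)]) == entry:
                spans.append((i, i + len(entry), category))
                i += len(entry) - 1
                break
        i += 1
    return spans

# (b) Regular expressions for "Numeric Value" and funding triggers.
NUMERIC_VALUE = re.compile(
    r"\b\d+(?:\.\d+)?\s*(?:million|billion|MHz|GHz|orders of magnitude)\b")
FUNDING_TRIGGER = re.compile(
    r"\b(?:supported|funded|financed)\s+by\b[^\n]*", re.IGNORECASE)

if __name__ == "__main__":
    tokens = "We trained support vector machines and report the F-score".split()
    print(gazetteer_annotate(tokens))  # [(2, 5, 'Algorithm'), (8, 9, 'Performance')]
    print(NUMERIC_VALUE.findall("The chip runs at 50 MHz, 2.5 orders of magnitude faster."))
    print(FUNDING_TRIGGER.findall("This work was funded by the EU FP7 (grant no. 296150)."))
```

In a full pipeline, the tokenization and stemming would come from the pre-processing steps listed above, and token spans would be mapped back to character offsets for BRAT.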
To calculate precision, recall and subsequently the F1-score, we counted correctly, partially correctly and incorrectly matched annotations as well as the number of missing annotations (not in the evaluation set) and spurious annotations (not in the ground truth). Precision is calculated as (correct + ½ partially correct) / (correct + partially correct + incorrect + spurious). Recall is calculated as (correct + ½ partially correct) / (correct + partially correct + missing).

4.2 Discussion

The results in Table 3 corroborate our intuition that the introduced categories exhibit different characteristics and therefore benefit from different extraction approaches. While rather closed or well-defined categories such as "Performance" and "Funding Information" respond well to simpler approaches, more open categories such as "Algorithm" show the need for more sophisticated approaches. The results can serve as a baseline for learning approaches such as maximum entropy models or conditional random fields, which are widely used in the field of named entity recognition. However, 22 annotated papers do not provide enough data for training these models. We thus intend to compile larger training sets by (i) using manually annotated instances to enlarge our gazetteer lists and (ii) using automatically annotated instances as candidates for training the models.

5. Conclusion

In this paper we create awareness of different stakeholder perspectives when processing and accessing scientific data. Taking stakeholders' interests into account is a prerequisite for an interactive marketplace which serves the scientific community in accessing an extremely valuable body of knowledge. For that purpose we ontologically describe one scientific domain, the computer science domain, and accordingly annotated 22 publications. In a first step we apply simple methods such as regular expressions and gazetteer-based approaches to automatically extract instances of five selected categories. Future steps include (i) compiling larger sets of training data, (ii) applying extraction algorithms which take, for instance, sequence information into account, and (iii) detecting relations between the categories. Last but not least, this paper contributes to the vision of establishing a commercially oriented ecosystem, i.e. a marketplace for scientific stakeholders to interact. This interaction, in our opinion, generates not only high-quality knowledge but also commercial value for all participants.

Acknowledgements

We thank the reviewers for their constructive comments. The presented work was developed within the CODE project funded by the EU FP7 (grant no. 296150). The Know-Center is funded within the Austrian COMET Program (Competence Centers for Excellent Technologies) under the auspices of the Austrian Federal Ministry of Transport, Innovation and Technology, the Austrian Federal Ministry of Economy, Family and Youth and by the State of Styria. COMET is managed by the Austrian Research Promotion Agency FFG.

References

[1] Abacha, A. and Zweigenbaum, P. 2011. Medical entity recognition: a comparison of semantic and statistical methods. BioNLP 2011 Workshop. Association for Computational Linguistics.

[2] Corney, D., Buxton, B., Langdon, W. and Jones, D. 2004. BioRAT: extracting biological information from full-length papers. Bioinformatics, 20. http://doi.org/10.1093/bioinformatics/bth386
[3] Klampfl, S. and Kern, R. 2013. An Unsupervised Machine Learning Approach to Body Text and Table of Contents Extraction from Digital Scientific Articles. Research and Advanced Technology for Digital Libraries.

[4] Liakata, M., Saha, S., Dobnik, S., Batchelor, C. and Rebholz-Schuhmann, D. 2012. Automatic recognition of conceptualization zones in scientific articles and two life science applications. Bioinformatics 28 (7).

[5] Liakata, M., Dobnik, S., Saha, S., Batchelor, C. and Rebholz-Schuhmann, D. 2013. A discourse-driven content model for summarising scientific articles evaluated in a complex question answering task. Conference on Empirical Methods in Natural Language Processing.

[6] Ravenscroft, J., Liakata, M. and Clare, A. 2013. Partridge: An Effective System for the Automatic Classification of the Types of Academic Papers. AI-2013: The Thirty-third SGAI International Conference. http://doi.org/10.1007/978-3-319-02621-3_26

[7] Seifert, C., Granitzer, M., Höfler, P., Mutlu, B., Sabol, V., Schlegel, K., Bayerl, S., Stegmaier, F., Zwicklbauer, S. and Kern, R. 2013. Crowdsourcing fact extraction from scientific literature. Workshop on Human-Computer Interaction and Knowledge Discovery.

[8] Teufel, S. and Moens, M. 2002. Summarizing scientific articles: experiments with relevance and rhetorical status. Computational Linguistics 28 (4).

[9] Tsuruoka, Y., Tsujii, J. and Ananiadou, S. 2008. FACTA: a text search engine for finding associated biomedical concepts. Bioinformatics 24 (21). http://doi.org/10.1093/bioinformatics/btn469

[10] Zweigenbaum, P., Demner-Fushman, D., Yu, H. and Cohen, K. 2007. Frontiers of biomedical text mining: current progress. Briefings in Bioinformatics, 8 (5). http://doi.org/10.1093/bib/bbm045

About the Authors