D-Lib Magazine, November/December 2014

Discovering and Visualizing Interdisciplinary Content Classes in Scientific Publications
Theodoros Giannakopoulos, Ioannis Foufoulas, Eleftherios Stamatogiannakis, Harry Dimitropoulos, Natalia Manola and Yannis Ioannidis

Abstract

Text visualization is an important task for scientific corpora, since it provides a way of representing these corpora in terms of content, leading to a reinforcement of human cognition compared to abstract, unstructured text. In this paper, we focus on visualizing funding-specific scientific corpora in a supervised context and on discovering interclass similarities that indicate the existence of interdisciplinary research. This is achieved by training a supervised classification and visualization model based on the arXiv classification system. In addition, a funding mining submodule identifies documents belonging to particular funding schemes, in order to generate corpora of scientific publications that share a common funding scheme (e.g. FP7-ICT). These categorized sets of documents are fed as input to the visualization model in order to generate content representations and to discover highly correlated content classes. This procedure provides a high-level monitoring capability, which is important for research funders and governments so that they can quickly respond to new developments and trends.

Keywords: Text Mining, Text Visualization, Funding Mining, Classification, Dimensionality Reduction, Scientific Corpora

1. Introduction

One of the most important tasks in data mining applications, especially when large datasets are involved, is visualizing textual content. In general, this is a way of representing what particular document corpora refer to, in terms of content, which is obviously of major importance for the wide range of textual information available on the Web. In the particular case of scientific documents, the respective text analytics and visualization tasks [2, 7, 6, 3] are especially interesting due to their content richness and diversity. Modern digital libraries that store scientific documents are not limited to simple searching and storing functionalities, but also try to provide text mining facilities that result in either semantic enrichment of the scientific content or high-level content statistics and visualizations. Such infrastructures help research funders, scientists, researchers and policy makers to understand content changes and trends.

In this paper we present a method for discovering interdisciplinary research efforts through a supervised text classification and visualization technique. In particular, we adopt a well-defined classification system and scientific dataset, namely the arXiv library. This is a rather simple categorization covering a wide range of about 130 content classes from Computer Science, Biology, Physics, etc. Our purpose is to visualize the content of text corpora and to extract and visualize strong correlations between different content classes (e.g. between computer vision and biology). This is achieved through a supervised classification/visualization procedure, using the arXiv dataset as training data and classification scheme. Subsequently, a simple funding mining module is used to extract sets of publications that share a common funding scheme. This set is used as input to the classification-visualization module, which generates content representations and discovers interclass content similarities. The proposed system has been implemented under the OpenAIREplus EU project titled "2nd-Generation Open Access Infrastructure for Research in Europe" (283595).
This EU project aims to build an information infrastructure of scientific publication and data repositories and to implement the EC's open access policies through services that effectively connect publications to scientific data and funding information. This is achieved through text mining techniques for the automatic classification, indexing and clustering of scientific corpora, funding information mining, content visualization and link mining. In the context of such an infrastructure, visualization applications of this kind can be a rather useful tool for research administrators, especially when such techniques are connected to funding schemes. In particular, they can assist research administrators in the process of strategy or policy making, since they provide a visual description of the content distribution for particular funding schemes.

2. Method

2.1 Supervised Classification and Visualization

The core of the automatic text classification and visualization functionality has already been described in [4], so please refer to that publication for further details. Here, we give a brief description of this supervised procedure. In general, the purpose of the classification-visualization module is twofold: (a) to classify each unknown document to a set of predefined content classes, providing soft (probabilistic) outputs, and (b) to map these classes to fixed points in a 2-D space, so that a document or corpus can be represented visually through its class memberships.
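The full joint model is described in [4]. As a rough, hypothetical illustration of the classification side only, the sketch below trains a probabilistic text classifier on labeled abstracts (here, arXiv abstracts and their 2nd-level class labels, as discussed next) and returns soft class-membership estimates for an unseen document. TF-IDF with a multinomial Naive Bayes classifier [8] and scikit-learn are assumptions of this sketch, not necessarily the components used in the actual system.

```python
# Hypothetical sketch: soft classification of abstracts into arXiv classes.
# TF-IDF + multinomial Naive Bayes [8] is a stand-in for the model of [4].
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def train_classifier(abstracts, labels):
    """abstracts: list of training abstracts; labels: their arXiv 2nd-level classes."""
    model = make_pipeline(
        TfidfVectorizer(stop_words="english", max_features=50000),
        MultinomialNB(),
    )
    model.fit(abstracts, labels)
    return model

def classify_soft(model, document_text):
    """Return {class_label: probability} for a single unseen document."""
    probs = model.predict_proba([document_text])[0]
    return dict(zip(model.classes_, probs))
```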
Both the classification and visualization steps are supervised; therefore, a predefined taxonomy is needed. In this work we have chosen the arXiv classification scheme. arXiv is a large archive of electronic preprints of scientific publications, covering a wide range of fields. Its categorization is rather simple: 2 levels of hierarchy with a total of almost 130 class labels at the second level. These 2nd-level labels are organized in 7 general categories (let us call them superclasses from this point on), for example "Computer Science / Data Structures and Algorithms". We have simply used the 2nd-level labels, without taking into consideration the hierarchical structure of the taxonomy. In total, around half a million abstracts have been used for training both the classification and visualization functionalities.

2.2 Inter-class similarity and normalization

The aforementioned joint classification-visualization process is used to classify an unknown document into a set of classes, each of which has been statically mapped to a 2-D point, so that similar classes correspond to nearby points in the 2-D space. For each document, the module provides soft outputs, i.e., class probability estimations, and represents each of the classes in the 2-D space. So, in order to extract corpus-level text visualizations, we simply apply this procedure to a set of documents: this results in a set of output class probabilities, and the whole corpus is represented by triplets consisting of the average 2-D class coordinates and the respective aggregated class probability.

In addition, in order to discover interclass similarities, for each document we store the k highest class probabilities, where k is a user-defined parameter. These are accumulated in an interclass similarity matrix CSM, each element CSM(i,j) of which corresponds to the aggregated sum of the class probabilities P(i) and P(j) for the cases where classes i and j co-appear in the same document. This aggregated interclass similarity matrix is then normalized based on the initial class visualization coordinates that were extracted in a supervised manner during the training phase. In particular, each interclass similarity element is normalized by a weight that follows an exponential distribution of the distance between the two corresponding class points, where D(i,j) is the Euclidean distance (in the 2-D space) between the class representations, as extracted during the training of the visualization model. The purpose of this process is to emphasize interclass similarities that correspond to a priori (i.e. based on the training data) dissimilar content classes.

An example of this normalization process is shown in Figure 1, where two graphs are presented: (a) the original interclass similarity graph (based on the aggregation process described above) and (b) the normalized interclass similarity graph based on the a priori class positions in the 2-D visualization space. The initial similarity graph mostly contains groups of nodes (content classes) from the same arXiv superclass, whereas in the normalized graph similarities between distant classes are promoted. These graphs are presented here to indicate the change that results from the normalization process. However, we have selected chord diagrams as the final representation method, as they provide an effective way to illustrate directed or undirected relationships among a group of entities (in our case, content classes).
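A minimal sketch of the aggregation and normalization steps described above is given below. It assumes that per-document class probabilities and the a priori 2-D class coordinates from the training phase are already available; the exponential weighting used here, 1 - exp(-D(i,j)/lambda), is one plausible choice consistent with the description above, not necessarily the exact formulation of the implemented system, and the parameters k and lambda are assumed names.

```python
import numpy as np

def interclass_similarity(doc_probs, class_coords, k=3, lam=1.0):
    """doc_probs: (n_docs, n_classes) soft class probabilities per document.
       class_coords: (n_classes, 2) a priori 2-D class positions from training.
       Returns the aggregated CSM and its distance-normalized version."""
    n_classes = doc_probs.shape[1]
    csm = np.zeros((n_classes, n_classes))
    for p in doc_probs:
        top = np.argsort(p)[-k:]             # keep the k highest class probabilities
        for a in top:
            for b in top:
                if a != b:
                    csm[a, b] += p[a] + p[b]  # aggregate co-appearing class probabilities
    # Euclidean distances D(i,j) between the a priori class points in the 2-D space
    d = np.linalg.norm(class_coords[:, None, :] - class_coords[None, :, :], axis=-1)
    # Emphasize similarities between classes that are a priori far apart
    # (assumed exponential weighting; the exact formula may differ)
    csm_norm = csm * (1.0 - np.exp(-d / lam))
    return csm, csm_norm
```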
Figure 1: Graphs that visualize the normalization process performed on the interclass similarities.

2.3 Fund mining

The content classification, visualization and inter-class similarity extraction are performed on a set of documents. As stated before, in this paper we focus on funding-specific scientific publications. This is the purpose of the fund mining module: to detect and recognize funding schemes for every input text. The initial aim was to identify EU FP7-funded documents; however, this was recently extended to also include Wellcome Trust projects. The funding mining module first applies some pre-processing (stop word removal, etc.) to the text of each publication, then scans the text to extract possible matches using patterns, and checks these matches against the currently known lists of project grant agreement numbers and/or acronyms for various funding bodies. Contextual information is also used in order to provide a confidence value for each match and to filter out false positives. A minimal sketch of this matching step is given after Table 1 below. The funding mining submodule has been extensively evaluated against a large number of publication corpora from library resources (arXiv, PLoS, etc.), with a resulting accuracy of over 99%. More information on algorithmic details and implementation issues can be found in [4].

3. Dataset

We have adopted the arXiv dataset and classification scheme for training the supervised models. For demonstrating both the content visualization and funding mining modules, we have used (a) an independent subset of the arXiv data archive and (b) a subset of PubMed. The funding mining module has extracted in total more than 12K fund-classified scientific documents, using three funding schemes, namely the PEOPLE, ICT and ERC funding categories. Detailed statistics of these datasets are shown in Table 1.

Table 1: Publication distribution per data provider and funding scheme
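Returning to the fund mining module of Section 2.3, the following sketch illustrates only the pattern-matching step. The regular expression, the grant list and the function names are hypothetical; the actual module additionally relies on richer pre-processing and contextual confidence scoring, as described in [4].

```python
import re

# Hypothetical pattern and grant list; the real module uses richer patterns,
# pre-processing and contextual confidence scoring (see [4]).
FP7_PATTERN = re.compile(r"(?:grant agreement|GA|project)\D{0,20}(\d{6})", re.IGNORECASE)
KNOWN_FP7_GRANTS = {"283595"}  # e.g. OpenAIREplus

def find_funding(text):
    """Return the set of known grant numbers mentioned in the publication text."""
    candidates = FP7_PATTERN.findall(text)
    return {c for c in candidates if c in KNOWN_FP7_GRANTS}

# Example usage:
# find_funding("... supported by the EC under grant agreement no. 283595 ...")
# -> {'283595'}
```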
4. Results

In this section, we present some of the visualizations and inter-class similarity statistics. For the sake of brevity, in this paper we focus only on a single funding scheme, namely the FP7-ICT funding category. We have chosen this particular funding scheme because of the plurality of its subject areas and its large number of funded projects (and therefore of respective publications). Furthermore, the research areas of this scheme are well represented by the adopted supervised taxonomy (arXiv). For comparison purposes, we also present the corresponding visualization results for two additional datasets from particular domain-specific research areas: the EUSIPCO conference publications (years 2011 - 2013) and the Cognitive Computation corpus.
Figures 2, 3 and 4 present the chord diagrams for the ICT, EUSIPCO and Cognitive Computation corpora, respectively. The content classes are organized in a circular layout and the links between them correspond to (normalized) inter-class similarities. Different colors correspond to different arXiv superclasses (e.g. Computer Science). A complete set of (clearer and interactive) visualizations can be found here. The high interdisciplinarity of the Cognitive Computation dataset is obvious (as expected) from the existence of many inter-class links. On the other hand, the EUSIPCO dataset leads to a less interdisciplinary representation.

Figure 2: Chord diagram for the ICT dataset

Figure 3: Chord diagram for the EUSIPCO dataset

Figure 4: Chord diagram for the Cognitive Computation dataset

Tables 2 and 3 present the most dominant pairs of similar arXiv classes in the corresponding datasets (ICT and EUSIPCO). These are the normalized inter-class similarities as described in Section 2. It can be seen that for the ICT funding scheme there are obvious interdisciplinary links between several arXiv classes (e.g. robotics and neurons & cognition). For the EUSIPCO dataset, most of the major inter-class similarities appear between classes of the same arXiv superclass, with an exception between computer vision and medical physics. A complete list of the arXiv subject classes and superclass descriptions can be found in [1].

Table 2: Most important normalized inter-class similarities for the ICT funding scheme
Table 3: Most important normalized inter-class similarities for the EUSIPCO conference publications (years 2011 - 2013)
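Tables 2 and 3 rank class pairs by their normalized similarity. A minimal sketch of how such a ranking could be derived from the normalized matrix of Section 2.2 (assuming the class labels and the matrix computed in the earlier sketch are available; the function name is hypothetical) is:

```python
import numpy as np

def dominant_pairs(csm_norm, class_labels, top_n=10):
    """Return the top_n class pairs with the highest normalized similarity."""
    n = csm_norm.shape[0]
    pairs = [(csm_norm[i, j], class_labels[i], class_labels[j])
             for i in range(n) for j in range(i + 1, n)]
    pairs.sort(reverse=True)  # strongest inter-class similarities first
    return pairs[:top_n]
```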
5. Conclusions

We have presented a supervised method for visualization and interclass similarity extraction for scientific corpora. The method has been trained on the arXiv dataset and classification system, while funding schemes have been extracted from more than 12K publications in order to generate the corpora used in the core mining procedures. Interclass similarity representations and corresponding measures have been presented for the ICT funding scheme, while for the sake of comparison we have also illustrated content representations from a domain-specific conference.

6. Acknowledgements

The research leading to these results has received funding from the EU's FP7 under grant agreement no. RI-283595 (OpenAIREplus).

References

[1] arXiv. Cornell University Library article archive.

[2] K. W. Boyack, B. N. Wylie, and G. S. Davidson. Domain visualization using VxInsight® for science and technology management. Journal of the American Society for Information Science and Technology, 53(9):764-774, 2002. http://doi.org/10.1002/asi.10066

[3] S. Fukuda, H. Nanba, and T. Takezawa. Extraction and visualization of technical trend information from research papers and patents. D-Lib Magazine, 18(7/8), 2012. http://doi.org/10.1045/july2012-fukuda

[4] T. Giannakopoulos, E. Stamatogiannakis, I. Foufoulas, H. Dimitropoulos, N. Manola, and Y. Ioannidis. Content visualization of scientific corpora using an extensible relational database implementation. In Theory and Practice of Digital Libraries: TPDL 2013 Selected Workshops, pages 101-112. Springer, 2014.

[5] T. Kohonen. Self-Organizing Maps (3rd ed.). Springer-Verlag New York, Inc., 2001.

[6] F. Osborne and E. Motta. Exploring research trends with Rexplore. D-Lib Magazine, 19(9):4, 2013. http://doi.org/10.1045/september2013-osborne

[7] R. M. Patton, C. G. Stahl, J. B. Hines, T. E. Potok, and J. C. Wells. Multi-year content analysis of user facility related publications. D-Lib Magazine, 19(9):5, 2013. http://doi.org/10.1045/september2013-patton

[8] I. Rish. An empirical study of the Naive Bayes classifier. In IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, volume 3, pages 41-46, 2001.