D-Lib Magazine, November/December 2014

AMI-Diagram: Mining Facts from Images
Peter Murray-Rust and Richard Smith-Unna

Abstract

AMI-Diagram is a tool for mining facts from diagrams and converting the graphics primitives into XML. Targets include X-Y plots, barcharts, chemical structure diagrams and phylogenetic trees. Part of the ContentMine framework for automatically extracting science from the published literature, AMI can ingest born-digital diagrams either as latent vectors, pixel diagrams or scanned documents. For high-quality, high-resolution diagrams the process is automatic; command-line parameters can be used for noisy or complex diagrams. This article provides an overview of the tool, which is currently being deployed to alpha-testers, especially in chemistry and phylogenetics.

Keywords: AMI-Diagram, Converting Graphics Primitives, XML

1. Introduction

At least 10 million diagrams are published in the scientific literature each year, and many of them represent factual information. AMI-Diagram is a flexible tool which can mine facts from diagrams and convert the graphics primitives into XML. The targets include X-Y plots, barcharts, chemical structure diagrams and phylogenetic trees. AMI can ingest born-digital diagrams either as latent vectors (converted from PostScript), pixel diagrams (PNGs and JPEGs) or scanned documents. For high-quality, high-resolution diagrams the process is automatic; command-line parameters can be used for noisy or complex diagrams. AMI is part of the ContentMine framework for automatically extracting science from the published literature.

2. Background

Over 1 million scientific articles are published yearly, along with a similar number of theses and items of grey literature [1]. Many contain diagrams, such as graphs or domain-specific objects, representing factual information, and often the diagram is the primary way of communicating that information (e.g. molecular structure diagrams). Almost all diagrams are now born digital (i.e. the output is written directly from a program to a file). The originating programs include generic plotting packages (GNUPlot, R, Excel), specialist editors such as JChemPaint or ChemDraw for molecules, or software writing output directly from instruments (e.g. spectra). The plots are usually high resolution, either scalable vectors plus text (such as SVG or PostScript derivatives) or large pixel maps, often between 1 million and 20 million pixels. Since most scientific data is never published (estimates are often >80% loss [2]), extraction of data from images can be a vital source of semantic data. Traditional, labour-intensive approaches include pencil and ruler, or cutting out peaks and weighing the paper, and these are still, unfortunately, used today. Authors are reluctant to deposit data publicly; the TreeBASE database of phylogenetic trees contains only 4% of published trees.

3. Overview

Converting a semantic object to vector or pixel graphics loses most of the information. However, in some domains it is possible to combine computer vision techniques with machine learning or rules/heuristics to recover the likely generating object. Moreover, ambiguity can often be resolved by lookup against public semantic data (e.g. dbpedia.org) or by recomputing the object. We have therefore developed image and vector processing technology which can reconstruct semantic data from a wide range of diagrams. Users may start with PDF documents, PNG or JPEG diagrams, or other sources of vectors (Word or PowerPoint EMF, PostScript, etc.). AMI is a work in progress being deployed to alpha-testers, especially in chemistry and phylogenetics. The overall process is:
There is often an advantage in knowing the style of a journal or generating program. Collaboration is very useful here, and the AMI framework is developed so that users can add plugins (AMI uses the Visitor pattern). A Visitor can be tailored to a specific journal or domain of science.

4. Interpreting PDFs

PDFs are made up of three streams: characters with code points or their glyphs; paths (lines and curves); and pixel images. We use PDFBox from Apache, which provides these, but most STM publishers do not use Unicode fonts, and it is formally impossible to identify many characters. We use a per-journal lookup which is constructed by expert classification. There is often some difficulty in identifying the pixel images, and they may be layered with character codes or paths. We translate characters and paths to SVG, which is an excellent intermediate format. We generally keep the images as PNGs, as the SVG representation is verbose.

5. Interpreting pixel maps

We have tried many methods, including Hough line transforms, erosion (e.g. BoofCV) and histogram equalisation. The following are the problems and approaches that we have found most appropriate for modern scientific articles. We warn that articles published before ca. 2000 may have poor typography with less systematic presentation, and this makes it harder to create simple heuristics.
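AMI's pixel pipeline is considerably more involved than can be shown here, but the first step in almost any such pipeline — separating "ink" pixels from background — can be illustrated with a simple global threshold. The sketch below uses only the Java standard library and a hypothetical fixed threshold of 128; it is an illustration of the general technique, not AMI's own code, which must also cope with noise, skew and uneven lighting.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

/** Illustrative sketch: binarise a pixel diagram with a fixed global threshold. */
public class BinariseSketch {

    public static BufferedImage binarise(BufferedImage in, int threshold) {
        BufferedImage out = new BufferedImage(
                in.getWidth(), in.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < in.getHeight(); y++) {
            for (int x = 0; x < in.getWidth(); x++) {
                int rgb = in.getRGB(x, y);
                // Average the red, green and blue channels to get a grey value.
                int grey = (((rgb >> 16) & 0xFF) + ((rgb >> 8) & 0xFF) + (rgb & 0xFF)) / 3;
                // Dark pixels are treated as "ink", light pixels as background.
                out.setRGB(x, y, grey < threshold ? 0xFF000000 : 0xFFFFFFFF);
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        BufferedImage diagram = ImageIO.read(new File(args[0]));
        ImageIO.write(binarise(diagram, 128), "png", new File(args[1]));
    }
}
```

A fixed threshold works well for clean, born-digital PNGs; for photographs such as Figure A below, an adaptive (locally computed) threshold is generally needed.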
6. Reconstructing objects

Many segmented objects are suitable for domain-specific interpretation. For example:
Figure A: A photograph of a chemical molecule on a poster taken with a mobile phone camera, showing varying backgrounds, line broadening and skewing. In spite of this, it is automatically and accurately converted to a semantic molecule with formula C18H18N3O in under a second.
Figure B: This circular tree is separated from its characters and correctly converted to the semantic Newick representation (image from http://doi.org/10.1371/journal.pone.0036933, figure 1).
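For readers unfamiliar with the format, a Newick string encodes a tree's topology and branch lengths as nested, parenthesised lists. The fragment below is a hypothetical example showing the general shape of such output; the taxon names and branch lengths are illustrative and not taken from the figure.

```
((Taxon_A:0.12,Taxon_B:0.09):0.05,(Taxon_C:0.20,Taxon_D:0.18):0.07);
```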
Figure C: This X-Y plot is correctly decomposed into the X and Y coordinates of the data points and the sizes of the corresponding error bars (image from http://doi.org/10.1371/journal.pone.0095565, figure 2). It is possible to create a CSV file directly from this.

7. Current status

AMI is open source under the Apache 2 license and written in pure Java. It can be deployed as a command-line tool, including recursion over directories and ingestion of web streams. It ingests PDF, SVG, XML, HTML and image formats and usually takes <1 second per image (some documents include tens of such images). It has a plugin architecture using the Visitor pattern, so that domain experts can create their own image analysers without having to also write the more basic tools described here. We are actively seeking collaborators.

Acknowledgements

Peter Murray-Rust thanks The Shuttleworth Foundation for a Fellowship and Grant, and Ross Mounce thanks BBSRC for support for the PLUTo project.

References

[1] Chalmers, I. and Glasziou, P. 2009. Avoidable waste in the production and reporting of research evidence. The Lancet 374, 9683 (Jul. 2009), 86-89.

[2] Glasziou, P. 2014. The role of open access in reducing waste in medical research. PLoS Med 11, 5 (May 2014), e1001651.

About the Authors