D-Lib Magazine
January/February 2016
Volume 22, Number 1/2
The Benchmarking Forum at IPRES 2015
Christoph Becker
University of Toronto and Vienna University of Technology
christoph.becker@utoronto.ca
Krešimir Ðuretec
Vienna University of Technology
duretec@ifs.tuwien.ac.at
Artur Kulmukhametov
Vienna University of Technology
artur.kulmukhametov@tuwien.ac.at
Andreas Rauber
Vienna University of Technology
rauber@ifs.tuwien.ac.at
DOI: 10.1045/january2016-becker
Abstract
The Benchmarking Forum at the International Conference on Digital Preservation (iPres) 2015, held in Chapel Hill, North Carolina, November 2 - 6, 2015, brought together digital preservation researchers, community organizations and practitioners to discuss opportunities and challenges in adopting software benchmarking as a systematic tool for evaluating digital preservation tools. Outcomes included an initial set of benchmark specifications for targeted scenarios elaborated jointly during the workshop, and a set of concrete collaborative actions that will take place in 2016.
1 Collaborative benchmarking in digital preservation
Applied research and technology development efforts in Digital Preservation (DP) have invested substantial resources into the development and maintenance of tools to support key DP processes. These include file format identification, migration and emulation tools, quality assurance mechanisms, automated annotation, and digital forensics, to name but a few. The need to evaluate these tools systematically has become more pressing as they are increasingly being deployed in operational digital archives and repository systems.
Systematic evaluation enables the community of researchers, solution providers and content holders to share, aggregate and analyze evidence in an organized way [1]. In the context of software systems, this makes rigorous experimentation a particularly relevant mode of inquiry.
Fields such as Information Retrieval (IR) and Software Engineering (SE) have adopted benchmarking, a specific mode of systematic experimentation, as a core component of their research agenda. In these communities, a well-defined process exists through which members create and share evidence about specific products in an organized, rigorous way, in order to assess and compare these products according to accepted measures of success. A benchmark is "a set of tests used to compare the performance of alternative tools or techniques" [3]. This means that different kinds of tools require different tests. A key obstacle to the feasibility of such benchmarks has been the lack of well-annotated data sets available to facilitate comparative testing [4][5].
While the exact definition and structure of a benchmark varies across these disciplines, the underlying purpose is similar. A software benchmark is a systematic, repeatable method of comparing software tools reliably for a particular purpose. In digital preservation, this generally requires five main components [2]: a motivating comparison that specifies the purpose of the evaluation and comparison; the specific function to be performed; a dataset on which it should be performed; ground truth, i.e. accurate and provably correct answers; and the performance measures to be collected. Note that performance here is not limited to speed, but denotes any success measure to be compared.
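To make this structure concrete, the following minimal sketch (in Python, with hypothetical field names not drawn from [2]) shows how such a five-part specification could be represented and instantiated:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical sketch of the five benchmark components described above.
# Field names are illustrative, not taken from the BenchmarkDP specification [2].
@dataclass
class BenchmarkSpec:
    motivating_comparison: str             # why the comparison is being made
    function: str                          # the tool function under test
    dataset: List[str]                     # identifiers of the test objects
    ground_truth: Dict[str, object]        # provably correct answer per test object
    performance_measures: List[Callable]   # scoring functions applied to a tool run

# Illustrative instantiation for a migration scenario.
spec = BenchmarkSpec(
    motivating_comparison="Compare the correctness of raw photograph migration tools.",
    function="migration of raw photographs to DNG",
    dataset=["photo_001.nef", "photo_002.cr2"],
    ground_truth={"photo_001.nef": "reference rendering", "photo_002.cr2": "reference rendering"},
    performance_measures=[],               # to be filled with concrete scoring functions
)
```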
Increasing consensus is emerging on the types of tasks that need to be supported by DP tools. Our research aims to establish systematic evaluation and sharing of evidence about software tools as a technique in digital preservation. To be effective, this must be a community-driven process. To this end, we organized a workshop at IPRES 2015 to bring together stakeholders in digital preservation to discuss the needs and opportunities of benchmarking in the field, and to define and prioritize initial benchmarks. The workshop was organized as part of the project BenchmarkDP.
2 Objectives and participants
The following questions guided the workshop:
- Which kinds of tools would benefit most from the kind of systematic evaluation supported by benchmarking?
- What are relevant sample tasks for testing tools and effective data sets to support this testing?
- Which opportunities arise for collaborative efforts to facilitate benchmarking and share the resulting evidence?
- Which concrete actions can we take to establish benchmarking as a useful method in the field?
The participants of the workshop covered key roles and perspectives on this topic:
- Practitioners use DP systems and solutions to preserve digital material. The need to choose tools carefully in organizational decision making is addressed in preservation planning, and the systematic analysis of such decisions has yielded research priorities for automation [6]. The direct involvement of practitioners in benchmark definitions ensures that priorities for benchmarking initiatives focus on those kinds of tools that are most in need of systematic comparison. This allows practitioners to influence the definition of benchmarks to reflect their decision needs, terminology, and practice.
- Developers create and improve software tools. Involvement in benchmarking allows these tools to undergo rigorous testing and comparison and enables a well-supported demonstration of improvement over time. Robust data sets and comparison metrics make testing more effective and its results more visible. This provides an opportunity to promote specific solutions, and a deeper understanding of the quality aspects and features that matter most helps developers allocate limited resources.
- Researchers are driven by a variety of interests, focusing on interesting questions rather than answers. For some, benchmarking provides a rigorous mechanism to corroborate technological innovation. Others are interested in the social, collaborative aspects of benchmarking, and in methodological questions of rigorous experimentation and evaluation.
- For the community at large, a common ground and joint roadmap for systematic evaluation and comparison of software tools is important for the growth of an emerging discipline. Member organizations such as the Digital Preservation Coalition and the Open Preservation Foundation (OPF) are a natural meeting place and platform to advance this collaborative effort.
The workshop attracted a motivated group of 15 participants from these categories. It was designed to be highly interactive. After a set of short presentations and discussions, we jointly collected and prioritized specific benchmarking scenarios. The majority of the workshop focused on elaborating two scenarios and developing a robust understanding and definition of the motivating comparison, success criteria and requirements for the data set. In a final plenary discussion session, the participants identified opportunities for collaboration, outlined concrete steps to take, and agreed on a number of follow-up actions.
3 Benchmarking theory and practice
After a short welcome and round of introductions, the workshop started with presentations that set the stage. Krešimir Ðuretec summarized the full paper "Benchmarks for Digital Preservation Tools" presented at IPRES [2] and thus provided key insights on the concepts and techniques of software benchmarking as outlined above.
Andreas Rauber gave an overview of benchmarking initiatives, with examples and lessons learned from settings such as music information retrieval (MIREX), multimedia (LifeCLEF) and text retrieval (TREC).
Common themes arose that were particularly relevant to the workshop:
- Benchmarking poses specific challenges to data quality. In particular, the availability of data and the evaluation of its quality can be difficult. Music IR faced legal obstacles that prevented the release of open data for copyright reasons. Robust ground truth annotation, a crucial ingredient, is often challenging to create, and providing it through human evaluation is expensive; yet for many tasks, human evaluation (e.g. via the Evalutron 6000) is essential [7].
- Benchmarking benefits from central platforms. Managing the data sets centrally and reliably turned out to be a key factor and an essential prerequisite for reproducibility.
- Benchmarks may proliferate once the community gets started. After the initial establishment of benchmarking as a method and platform, the number of different tasks that are proposed and evaluated can grow drastically. Tasks can emerge through a community process of proposal and voting.
- Performance measures can be contentious. It is not easy to agree on clear and well-specified criteria for success.
- Timelines and joint actions need to be established. Benchmarking requires effective coordination of contributions from a wide range of community members. This means that clarity on timelines, expectations and commitments is needed.
A discussion ensued comparing IR, SE, and DP. In IR, the participants of benchmarking campaigns are research teams from industry and academia, and there is considerable competition and continuous improvement of algorithms. In contrast, many tools in DP are still used off-the-shelf. The DP community thus occupies a middle ground between IR and SE, which is reflected in the specification of the benchmark components [2]. Legal barriers have also obstructed the wide availability of data sets in DP [4][5].
A discussion on the collaborative aspects of benchmarking provided the lead-in to the following session.
4 Roles and opportunities
In this session, informal presentations from participants described the perspective and interest of their organizations in benchmarking.
Carl Wilson, Technical Lead of OPF, emphasized that the organization's key focus is to support people in producing tools for DP. Hence, OPF can play a role in benchmarking in several ways. First, OPF is able to host data sets. An existing data set, the OPF format corpus, has grown through a bottom-up collection process based on community contributions. Second, OPF can host a testing platform: it possesses the expertise and infrastructure to do so and sees this as an excellent opportunity.
A member of the veraPDF consortium, Carl also spoke as a tool developer: veraPDF is developing a PDF compliance validation tool. As part of this process, commercial validators are tested and compared against each other. The veraPDF consortium, led by OPF, is creating a ground truth test corpus for PDF/A validation. The starting point in this case is the specification itself, and the test corpus is developed in collaboration with the ISO committee. The corpus is licensed under a Creative Commons license and approved by the PDF/A ISO committee.
Bengt Neiss, IT architect at the National Library of Sweden, provided an overview of the Preforma project and the consortium's interest in benchmarking. Preforma aims to give memory institutions full control over the conformance testing of files to be ingested into archives, with a focus on three types of content: text, video, and audio [8]. The project consortium is interested in obtaining accurate measurements about the tools it procured.
5 Benchmarks to consider
Three presentations proposed starting points for the breakout sessions. These benchmark specifications were at draft stage at the time of the forum, with varying degrees of detail, but were clearly scoped and intended to be developed further into full specifications.
- Artur Kulmukhametov proposed the photo migration benchmark. It measures functional correctness to enable a ranking of migration tools on a dataset of raw photographs. The function to benchmark is the migration of photographs from proprietary raw formats to the Adobe Digital Negative (DNG) format. Selecting the best tool for migrating raw photographs is a practical problem for professionals and institutions, and the motivation for this benchmark is the need to compare the correctness of the migrations that software tools perform on the photograph dataset. To define when a migration is objectively correct, a set of performance measures has been proposed [9]. There is no robust data set at this point and no robust ground truth, but criteria for compiling a data set exist.
- Krešimir Ðuretec presented a document property extraction benchmark, designed to facilitate comparison of characterization tools for electronic document formats. The motivation for this benchmark is to enable comparison of the coverage and correctness of characterization tools for electronic documents, with a focus on the MS Word 2007 file format. The function thus is characterization, or feature extraction, and the performance measures combine the coverage of the properties present in a data set with the accuracy of the extracted values (a minimal scoring sketch follows this list). The benchmark is precisely specified in terms of success metrics, and narrowly focused in terms of the suggested data set. Ground truth annotation is crucial, and approaches to generating data that facilitates such ground truth are under development [5].
- Bengt Neiss presented one of the three Preforma benchmarks. For each major scenario in the project, one benchmark is being created following the structure described above [2]; given the project's focus on format conformance, the three benchmarks are structurally identical. In the workshop we focused on one example scenario, PDF/A. One motivation for defining this benchmark is to facilitate comparison between different releases of the software tools developed within the Preforma project. The tools are expected to perform four functions [8]:
- verify conformance of files to the specifications of a particular set of standards,
- verify compliance of files with institutional acceptance criteria,
- report deviations in human-readable and machine-readable form, and
- fix certain basic errors in the metadata of the files that are evaluated.
The data set needed for these functions is under development and is envisioned to draw on three sources:
- a set of files that is declared to be the reference representation,
- synthetic files with particular conformance problems, and
- real 'live' files with unknown properties.
Open issues discussed in the breakout sessions included performance measures, the composition of the data set, and possible approaches to compiling it.
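To illustrate how the coverage and accuracy measures of the document property extraction benchmark could be combined per document, the following minimal sketch (in Python) scores a tool's output against a ground-truth annotation. The property names and the exact-match comparison are invented for illustration; the actual measures and matching rules are defined in the benchmark specification [5].

```python
from typing import Dict

def coverage_and_accuracy(ground_truth: Dict[str, str],
                          extracted: Dict[str, str]) -> Dict[str, float]:
    """Score one document: coverage is the share of ground-truth properties the
    tool reports at all; accuracy is the share of reported values that are correct."""
    reported = [p for p in ground_truth if p in extracted]
    coverage = len(reported) / len(ground_truth) if ground_truth else 0.0
    correct = sum(1 for p in reported if extracted[p] == ground_truth[p])
    accuracy = correct / len(reported) if reported else 0.0
    return {"coverage": coverage, "accuracy": accuracy}

# Invented example: three annotated properties, two reported, one value correct.
truth = {"page_count": "3", "author": "Jane Doe", "embedded_fonts": "2"}
tool_output = {"page_count": "3", "author": "J. Doe"}
print(coverage_and_accuracy(truth, tool_output))  # coverage ~0.67, accuracy 0.5
```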
Next, the group collaboratively assembled possible scenarios for benchmarking. The following table summarizes the motivating comparisons that were collected, by content type.
Content Type | Motivating Comparison
Any | Compare the reliability and stability of tools.
Any | Compare the accuracy and coverage of format identification tools.
Photographs | Compare the correctness of raw photograph migration tools.
Electronic Documents | Compare the correctness of document property extraction tools.
Software Packages | Compare the ability of emulators to emulate specific platforms. Suggested measures include stability, coverage, and reliability.
Interactive Legacy Objects | Compare rendering involving emulation.
Audio | Compare the correctness of audio migration quality assurance processes.
Text / Video / Images | Compare the correctness of format compliance checkers.
An interesting discussion revolved around the need to evaluate emulators as well as rendering tools. While these are sometimes seen as one evaluation scenario, two aspects need to be distinguished. Evaluating rendering is a critical and challenging need [10][11]; evaluating the technical capabilities of emulators is similarly challenging, but different in focus. Evaluation of rendering is independent of the technical stack: the same approach is needed to evaluate migration and emulation [12].
6 Scenarios
Two scenarios were chosen from the table for elaboration during the remainder of the day. Starting from the initial draft vision, two breakout groups iteratively elaborated the key components of two benchmarks, identified open questions and opportunities, and consolidated the discussions into a form that the benchmark champion could take forward after the workshop.
6.1 The Preforma conformance checking scenario, with a focus on PDF
Motivating comparison | Compare format conformance checkers according to their accuracy in identifying invalid files and the specific violations of the format specification they report.
Function | Format conformance validation includes two aspects: classifying files as valid or invalid, and producing a list of structured error messages for each file that describes any violations of the format specification. The properties to be checked are not specified at this time and are work in progress in the Preforma project. This benchmark focuses on the classification into compliant and non-compliant files.
Data set (requirements) | The data sets used in this benchmark should consist of files in the following formats: PDF/A-1 [14], PDF/A-2 [15], PDF/A-3 [16], and PDF 1.7 [17]. Tentative criteria include representativeness of real content, coverage of special cases and likely errors, completeness (which remains undefined), and annotation quality.
Ground truth | Two labels are needed per file: one indicating whether the file is conformant, and a second containing a set of structured error messages, each with a reference to the relevant clause of the format specification, a description, and a location within the file. How to generate this ground truth is in part an open challenge: one needs classification to produce the ground truth for classification.
Performance measures | For the classification task, the benchmark would deliver values for true positives, true negatives, false positives and false negatives. To avoid oversimplified rankings, composite scores may be avoided in the early stages.
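For the classification task, the raw counts could be derived from ground-truth labels and a checker's verdicts along the following lines. This is an illustrative sketch only; in particular, treating non-compliant files as the positive class is an assumption made here, not part of the draft specification.

```python
from typing import Dict

def confusion_counts(ground_truth: Dict[str, bool],
                     tool_verdicts: Dict[str, bool]) -> Dict[str, int]:
    """Count true/false positives/negatives over a labelled file set.
    True means 'non-compliant', which is treated as the positive class here."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for file_id, is_noncompliant in ground_truth.items():
        flagged = tool_verdicts.get(file_id, False)
        if is_noncompliant and flagged:
            counts["TP"] += 1
        elif is_noncompliant and not flagged:
            counts["FN"] += 1
        elif flagged:
            counts["FP"] += 1
        else:
            counts["TN"] += 1
    return counts

# Invented example with three labelled files and one checker's verdicts.
truth = {"a.pdf": True, "b.pdf": False, "c.pdf": True}
verdicts = {"a.pdf": True, "b.pdf": True, "c.pdf": False}
print(confusion_counts(truth, verdicts))  # {'TP': 1, 'TN': 0, 'FP': 1, 'FN': 1}
```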
A set of open issues and follow-up actions was defined that the Preforma and BenchmarkDP project teams will take forward.
6.2 The photo migration benchmark
The discussion built directly on the scenario presented earlier, but focused primarily on the data set. Criteria include representativeness, completeness (defined as coverage of originating camera models), and coverage of special cases and likely sources of error (defined as the presence of specific distributions of features and metadata tags). Generating files artificially for this case seems infeasible. Instead, the group focused on collection strategies, including potential crowd-sourcing, for compiling a robust data set. Several practitioners in the workshop know of potential data in their organizations and will reach out to identify opportunities for sharing.
Next steps include building a list of camera models to be covered by the first benchmark; developing a set of properties associated with boundary test cases and likely errors; and using content profiling tools for purposeful sampling [13]. Data quality may be low initially and improve over time, so it is of paramount importance to develop a structured method for measuring data quality.
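As a rough illustration of purposeful sampling during data set compilation, the sketch below selects a bounded number of files per camera model from a flattened content profile. The record fields (path, camera_model) and the sampling policy are assumptions for illustration; content profiling tools such as those described in [13] produce far richer output.

```python
import random
from collections import defaultdict
from typing import Dict, List

def sample_per_camera(profile: List[Dict[str, str]],
                      per_model: int = 5,
                      seed: int = 42) -> List[str]:
    """Purposeful sampling sketch: pick up to `per_model` files for each camera
    model found in a flattened content profile (records with path and camera_model)."""
    by_model: Dict[str, List[str]] = defaultdict(list)
    for record in profile:
        by_model[record["camera_model"]].append(record["path"])
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample: List[str] = []
    for model, paths in sorted(by_model.items()):
        rng.shuffle(paths)
        sample.extend(paths[:per_model])
    return sample

# Invented example profile with two camera models.
profile = [
    {"path": "img_001.nef", "camera_model": "Nikon D700"},
    {"path": "img_002.nef", "camera_model": "Nikon D700"},
    {"path": "img_003.cr2", "camera_model": "Canon 5D"},
]
print(sample_per_camera(profile, per_model=1))
```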
7 Outcomes and next steps
Despite being a comparatively small workshop, the event succeeded in taking key steps and achieved its objectives. Participants jointly identified concrete benchmarking scenarios and champions for each; agreed on specific follow-up actions to drive these forward; and created an initial roadmap for coordinating benchmarking efforts, identifying key roles and responsibilities, i.e. who will do what.
Several specific roles and contributions were identified:
- The Open Preservation Foundation plans to host an open test platform that facilitates automated execution of comparison tests and provides continuity for hosting the data sets.
- BenchmarkDP and Preforma are developing a Memorandum of Understanding for close collaboration between the projects to facilitate robust, objective comparison of tools.
- The Preforma project team will continue to collaborate with BenchmarkDP on the specification of benchmarks, which will be published and distributed widely.
- BenchmarkDP will coordinate efforts to specify and share, in standardized structured form, draft benchmarks under development and completed benchmarks.
- Future events at conferences of the digital preservation, digital curation and digital libraries communities are envisioned.
Acknowledgements
Part of this work was supported by the Vienna Science and Technology Fund (WWTF) through the project BenchmarkDP (ICT12-046).
References
[1] National Digital Stewardship Alliance. 2015 National Agenda for Digital Stewardship. 2015.
[2] Ðuretec K, Kulmukhametov A, Rauber A, Becker C. Benchmarks for Digital Preservation Tools. In: Proceedings of the 12th International Conference on Digital Preservation (IPRES), Chapel Hill, NC, USA, 2015.
[3] Sim SE, Easterbrook S, Holt RC. Using Benchmarking to Advance Research: A Challenge to Software Engineering. In: Proceedings of the 25th International Conference on Software Engineering, Washington, DC, USA: IEEE Computer Society, 2003. p. 74-83.
[4] Neumayer R, Kulovits H, Thaller H, Nicchiarelli E, Day M, Hofmann H, Ross S. On the need for benchmark corpora in digital preservation. In: Proceedings of the 2nd DELOS Conference on Digital Libraries, 2007.
[5] Becker C, Ðuretec K. Free Benchmark Corpora for Preservation Experiments: Using Model-driven Engineering to Generate Data Sets. In: Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, New York, NY, USA, ACM, 2013.
[6] Becker C, Faria L, Ðuretec K. Scalable decision support for digital preservation. OCLC Systems & Services, 30(4):249-284, 2014.
[7] Gruzd AA, Downie JS, Jones MC, Lee JH. Evalutron 6000: Collecting Music Relevance Judgments. In: Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries, New York, NY, USA, ACM, 2007. http://doi.org/10.1145/1255175.1255307
[8] Lemmens B. PREFORMA Challenge Brief. PREFORMA Consortium, 2014.
[9] Bauer S, Becker C. Automated Preservation: The Case of Digital Raw Photographs. In: Digital Libraries: For Cultural Heritage, Knowledge Dissemination, and Future Creation, Springer Berlin Heidelberg, p. 39-49, 2011. http://doi.org/10.1007/978-3-642-24826-9_9
[10] Cochrane E. Rendering Matters: Report on the results of research into digital object rendering. Archives New Zealand, 2012.
[11] Becker C. Quality Assurance in Document Conversion: A HIT? In: BooksOnline Workshop at the 20th ACM Conference on Information and Knowledge Management (CIKM), Glasgow, 2011.
[12] Guttenbrunner M, Rauber A. Evaluating Emulation and Migration: Birds of a Feather? In: The Outreach of Digital Libraries: A Globalized Resource Network, Springer Berlin Heidelberg, p. 158-167, 2012. http://doi.org/10.1007/978-3-642-34752-8_22
[13] Kulmukhametov A, Becker C. Content Profiling for Preservation: Improving Scale, Depth and Quality. In: The Emergence of Digital Libraries: Research and Practices, Springer International Publishing, p. 1-11, 2014. http://doi.org/10.1007/978-3-319-12823-8_1
[14] ISO. Document management - Electronic document file format for long-term preservation - Part 1: Use of PDF 1.4 (PDF/A-1). ISO/TC 171/SC 2, ISO 19005-1:2005, 2005.
[15] ISO. Document management - Electronic document file format for long-term preservation - Part 2: Use of ISO 32000-1 (PDF/A-2). ISO/TC 171/SC 2, ISO 19005-2:2011, 2011.
[16] ISO. Document management - Electronic document file format for long-term preservation - Part 3: Use of ISO 32000-1 with support for embedded files (PDF/A-3). ISO/TC 171/SC 2, ISO 19005-3:2012, 2012.
[17] ISO. Document management - Portable document format - Part 1: PDF 1.7. ISO/TC 171/SC 2, ISO 32000-1:2008, 2008.
About the Authors
Christoph Becker is an Assistant Professor at the University of Toronto, where he leads the Digital Curation Institute; and a Senior Scientist at the Software and Information Engineering Group at Vienna University of Technology in Austria. He was involved in the European research projects DELOS, PLANETS, DPE, and SHAMAN. He led the sub-project Scalable Planning and Watch of the SCAPE project and is Principal Investigator of BenchmarkDP. His research focuses on digital libraries; digital curation and digital preservation; and sustainability in software engineering, requirements engineering, and information systems design.
Krešimir Ðuretec is a Project Assistant at the Department of Software Technology and Interactive Systems (IFS) at the Vienna University of Technology, where he is currently pursuing his PhD. He previously graduated with an MSc and BSc in Computer Science from the University of Zagreb in 2011 and 2009, respectively. He worked as sub-project lead of the SCAPE Planning and Watch sub-project. In the project BenchmarkDP, his main focus is on benchmarking digital preservation tools and automatic test dataset generation using model-driven engineering principles.
Artur Kulmukhametov is Project Assistant and PhD student at the Software and Information Engineering Group, Vienna University of Technology. He was involved in the European research project SCAPE. His current focus in the project BenchmarkDP is the systematic evaluation of software tools for digital preservation, content profiling and quality assurance of migration processes.
Andreas Rauber is Associate Professor at the Department of Software Technology and Interactive Systems (IFS) at the Vienna University of Technology (TU Wien). He is furthermore president of AARIT, the Austrian Association for Research in IT. His research interests cover the broad scope of digital libraries and information spaces, including specifically text and music information retrieval and organization, information visualization, as well as data analysis, neural computation and digital preservation.