D-Lib Magazine, May/June 2016
Linking Publications and Data: Challenges, Trends, and Opportunities
Matthew S. Mayernik

Abstract

Many interrelationships exist between research articles, data, software, and other resources used to produce scientific findings. A number of challenges, however, impede efforts to implement, standardize, and institutionalize cross-links between scholarly resources. This report outlines findings from a workshop titled "Data & Publication Linking" held January 5, 2016 in Washington, D.C., funded by the U.S. National Science Foundation's (NSF) Open Access & Open Data initiative and the NSF's EarthCube initiative. The workshop convened a discussion on the challenges and opportunities for cross-linking data and publication repositories. It brought together nineteen researchers and stakeholders from a range of sectors including data repositories, scholarly publishers, academic libraries, and scholarly research service providers. In this report, we present a diversity of perspectives and initiatives that can inform community-based solutions to scholarly resource cross-linking challenges.

1 Introduction

The open availability and wide accessibility of scholarly articles, data sets, and other digital resources are becoming the norm for 21st century research. Growing numbers of repositories of scientific resources enable researchers to discover, understand, and build upon previous work at greater scales than was previously possible. Ensuring the open availability of articles, data, software, and other resources requires considerable effort and investment of time and money. Resources must be collected and documented properly, access services must be maintained, supported, and improved over time, and copyright and resource ownership issues must be navigated. The benefits of open research resources, however, outweigh these challenges.
Opening research resources to wider audiences and applications leverages public investments in research and education to increase the pace of scientific discovery, enable the identification of research errors and fraud, communicate the results of science to wider audiences, and facilitate more inclusive educational opportunities (Boulton et al., 2012; Uhlir et al., 2015; Woelfle, Olliaro, & Todd, 2011). As an indicator of these benefits, numerous studies have shown that making research articles available in open access repositories leads to higher citation counts for those articles (see Hitchcock 2013 for an overview of these studies). Some studies also show that making the data underlying research articles available for secondary use leads to higher citation counts for the original articles (Piwowar & Vision, 2013). This paper discusses a significant gap in the scholarly research infrastructure, namely, the lack of structured and robust ways to connect related resources across organizations, repositories, and systems. Many interrelationships exist between research articles, data, software, and other resources used to produce scientific findings. Repositories for these resources, however, typically only support one particular kind of resource, or at most will support a couple of resource types, such as data and software. This has led to the siloing of information in a vast number of repositories. Producers and users of scientific resources would benefit if repositories with different specializations and user communities could interoperate at a technical and process level to provide a tightly connected web of scholarly resources (Baker & Yarmey, 2009; Smit, 2011; Van de Sompel & Nelson, 2015). This paper outlines multiple efforts to establish the connections between scholarly resources on the web. This report derives from the workshop titled "Data & Publication Linking" held January 5, 2016 in Washington, D.C. Funded by the U.S. 
National Science Foundation's (NSF) Open Access & Open Data initiative and the NSF's EarthCube initiative, the workshop convened a discussion on the challenges and opportunities for cross-linking data and publication repositories. It brought together a group of nineteen researchers and stakeholders from a range of sectors, including data repositories, scholarly publishers, academic libraries, and scholarly research service providers. The goal was to discuss the ways that scholarly resource creators and providers can work together to identify, characterize, exchange, and maintain links between related resources. The workshop dealt mostly with ideas for how resource managers and providers can improve their processes for dealing with such resource linkages. The workshop agenda, participants, and other information can be found here. In our report, we present this diversity of perspectives and initiatives, with the goal of informing community-based solutions to scholarly resource cross-linking challenges.

2 Background

The core innovation of the web is to enable people to link documents together in a globally distributed fashion. This linking feature is core to the business models of the most successful web-based companies at present, e.g. Google, Yahoo, Facebook, and Twitter. Research organizations, publishers, and other scholarly institutions have likewise attempted to integrate linking into scholarly communication infrastructures in a variety of ways. These attempts have faced a number of challenges, in particular, the flexibility of web linking (anybody can link to anything else), the uni-directional nature of web linking, and the dynamic nature of the web environment. Links included in references in scholarly articles, for example, are known to deteriorate rapidly (Lawrence et al., 2001; Klein et al., 2014).
The difficulty of establishing, maintaining, and sustaining relationships among web-based resources has been a point of discussion since the digital libraries initiatives of the 1990s (Borgman et al., 1996). The development of the Digital Object Identifier (DOI) system, OpenURL, and other key initiatives of the past few decades were focused on solving one or more aspects of these linking challenges (Paskin, 2000, 2010; Van de Sompel & Beit-Arie, 2001). Many organizations are now working toward increasing the linking and citation of a broad range of scientific resources using persistent identifiers, typically, though not exclusively, using DOIs (Mayernik, 2013; Brase, Sens, & Lautenschlager, 2015; Klump, Huber, & Diepenbroek, 2015). Additionally, there are organizations, such as DataCite, whose purpose is to facilitate the identification and citation of data and to support member organizations assigning persistent identifiers through the development of standards. Principles of such citations and recommendations for achieving machine accessibility of the cited resources are becoming more robust, particularly for citations to data sets (CODATA-ICSTI Task Group, 2013; Starr et al., 2015). Citations, however, are a one-way relationship, e.g. a paper linking outward, and represent just one possible relationship that may exist between resources in the scholarly research ecosystem. Relationships exist in a "value chain" among the many inputs and outputs of scholarly research (Borgman, 2007; Van de Sompel et al., 2006). The value and meaning of scholarly resources often lies in their relationships with other resources (Pepe et al., 2010). With better mechanisms for integrating such links into scholarly communication and discovery services, researchers could discover and learn more about relevant data sets, software, papers, etc., through the relationships those resources have with each other. 
In addition, managers of repositories could use these relationships to a) improve their discovery and access services, and b) assess the use and impact of the resources they provide.

The "Data & Publication Linking" workshop sought to identify approaches for creating, managing, and exchanging links between related scholarly resources. The workshop goals were to present current initiatives focused on linking data and publications, discuss solutions to key cross-linking challenges, and identify near-term activities that could be taken on by relevant stakeholders. The next sections outline key themes and challenges relevant to data and publication linking that emerged from the workshop discussions.

3 Themes

Digital resource cross-linking encompasses a range of multi-faceted issues. This section outlines a few salient themes related to current work on cross-linking topics. Each theme also presents a number of challenges that impede efforts to implement, standardize, and institutionalize data and publication cross-linking.

3.1 The Need for Defining Boundaries and Purposes

Approaches to cross-linking digital objects must address this basic definitional question: "what are the objects being linked?" Defining boundaries around digital objects can be difficult, and is highly dependent on contextual factors related to the digital objects' origins, management, and usage. Published articles typically do not change over time, but may be available in multiple versions, such as preprints, final accepted manuscripts, and versions of record, and may be hosted by multiple providers. Likewise, data sets and software are often combined into composites, pulled apart into subsets, or are highly dynamic, changing on a regular basis. Different definitions of "data sets", for example, emphasize a variety of characteristics, e.g.
that objects are part of a particular grouping, contain representations of a particular content type, are considered to be related by the creators, or can be used for a particular purpose (Renear, Sacchi, & Wickett, 2010). These indistinct and dynamic boundaries around digital resources complicate the process of creating cross-links.

The second definitional issue to be addressed is to clarify the purpose of generating, exchanging, and maintaining links between related resources. In a simple example, recommendations around data citation often conflate 1) a researcher citing their own generated data resources, and 2) a researcher citing external sources of data. A researcher's motivations, practical effort involved, and immediate benefits may be very different for these two scenarios. Beyond data citation, the variety of purposes for linking related resources is even broader, including discoverability, provenance tracking, and usage metrics gathering, among others. Different purposes may require different technical and social means for achieving robust linking mechanisms.

Key ongoing challenges related to these definitional issues include:
3.2 Relationship Semantics

Standard HTML links are uni-directional and provide no explanation of the relationship that exists between the linking and linked resources. Defining the meaning of the link relationships is therefore a key motivation for enabling cross-links between scholarly resources. By contextualizing the linking relationship, resource providers can present richer metadata to their users, enabling better resource discovery, access and use, and a richer understanding of the research ecosystem in which the resources exist. Typed relationship statements would also allow more nuanced study of scholarly research networks, showing the context around the connections between research inputs and outputs.

Many approaches for designating relationship semantics have been developed. The DataCite Metadata Schema defines a set of metadata properties for documenting resources that have been issued DOIs via DataCite. The schema, which is openly available and can be implemented in other environments, includes a controlled vocabulary of "relationship types" that can be used to designate relationships between resources that have been assigned DOIs (DataCite, 2016). This typology lists 25 possible relationships, such as "IsCitedBy," "IsSupplementTo," "HasPart," and "IsDerivedFrom."

Adding context and meaning to web-based links is the main motivation for much of the work related to the Semantic Web. Numerous ontologies have been developed to represent scholarly work and products. The Scholarly Publishing and Referencing Ontologies (SPAR), for example, model a large range of entities and relationships within the scholarly communication sphere (Peroni, 2014). One SPAR ontology, the Citation Typing Ontology (CiTO), focuses on citation relationships, with the goal of representing the reasons for citations with fine granularity (Peroni & Shotton, 2012).
CiTO includes a large number of terms, including relationships such as "cito:extends," "cito:usesMethodIn," and "cito:supports." The W3C Provenance ontology (PROV, Groth & Moreau, 2013) provides a conceptual model for representing entities, agents, activities, and the provenance relationships that exist between them. The PROV ontology has gained a wide range of use in many applications. A number of challenges impede efforts to formalize the semantics of scholarly resource relationships. Existing relationship schemas are defined at varying levels of specificity. The CiTO ontology, for example, provides a high degree of specificity in denoting relationship types. The PROV ontology, on the other hand, provides a generic approach for structuring relationship statements. In defining and using such schemas, trade-offs exist between expressivity and flexibility. Too much specificity may impede wide adoption of a particular schema, while highly flexible schemas may impede interoperability between different schema implementations. The use of the same relationship schema in two different systems does not guarantee coherence between the resulting relationship declarations. Organizations will use their own semantics within a schema, or might apply a set of controlled relationship type values in different ways. Finally, if semantics for scholarly resource relationships are defined with too much specificity, they may not be relevant or useful to parties outside of scholarly communication systems, thereby siloing scholarly resources away from more general applications. Key questions related to these semantics issues include:
3.3 Linking Technologies

Despite the interlinking capabilities of the web, cross-linking digital objects in a sufficiently rich manner to provide understanding of the linking relationships is a significant challenge. For example, scholarly references rarely provide enough information for readers to understand what within a cited paper is actually serving as the basis for a scientific assertion or finding without significant effort. Internet technology, coupled with appropriate standards, has great potential for automating the establishment of cross-linked digital resources. The "linked data" approach, pioneered and promoted by Tim Berners-Lee and colleagues, is an attempt to formalize technical procedures for linking web resources broadly (Bizer, Heath, & Berners-Lee, 2009). A number of existing projects are developing and offering production-level services for tackling the challenges involved in linking scholarly resources, such as data sets and published papers. Some projects are pushing these efforts further into the derivation history or provenance of the underlying data and evidence.
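
As a rough illustration of what a typed, machine-readable link might look like, the following Python sketch stores links as subject-relation-object statements and derives the reverse statement for each one, so that a link asserted from an article to a data set can also be found from the data set side. It uses only the standard library. The relation names are a small hand-picked subset of the DataCite relation types; the DOIs, the `Link` type, and the function names are hypothetical placeholders chosen for illustration and do not describe any of the projects discussed in this report.

```python
from typing import NamedTuple


class Link(NamedTuple):
    """One typed statement: subject --relation--> obj."""
    subject: str
    relation: str
    obj: str


# A small subset of DataCite relation type pairs (forward -> inverse).
# The choice of subset is an assumption made for this sketch.
INVERSES = {
    "Cites": "IsCitedBy",
    "IsSupplementTo": "IsSupplementedBy",
    "HasPart": "IsPartOf",
    "IsDerivedFrom": "IsSourceOf",
}
# Make the table symmetric so either direction can be looked up.
INVERSES.update({inverse: forward for forward, inverse in INVERSES.items()})


def with_inverses(links):
    """Return the given links plus the reverse statement for each known relation."""
    result = set(links)
    for link in links:
        inverse = INVERSES.get(link.relation)
        if inverse is not None:
            result.add(Link(link.obj, inverse, link.subject))
    return result


def related_to(links, identifier):
    """Return sorted (relation, target) pairs stated about one identifier."""
    return sorted((l.relation, l.obj) for l in links if l.subject == identifier)


# Hypothetical identifiers, for illustration only.
asserted = [
    Link("doi:10.1234/article", "Cites", "doi:10.1234/dataset"),
    Link("doi:10.1234/subset", "IsDerivedFrom", "doi:10.1234/dataset"),
]
graph = with_inverses(asserted)
for relation, target in related_to(graph, "doi:10.1234/dataset"):
    print(relation, target)
```

Deriving the inverse statements is one simple way to work around the uni-directional nature of web links: the data repository in this example never asserted anything itself, yet a lookup on the data set identifier still surfaces the article and the derived subset.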
Technical approaches to linking scholarly resources may achieve coarse-grained interoperability by engaging the basic architecture of the web through HATEOAS (Hypertext As the Engine Of Application State), a constraint on the REST application architecture (Van de Sompel & Nelson, 2015). REST/HATEOAS allows links to be added to HTTP headers. Since all linking on the web is done via HTTP, this capability is available to any resource provider.

More fine-grained linking interoperability is largely being pursued via technologies that produce relationship graphs. OAI-ORE provides a linked data-based way of building maps of connections among digital resources. The RMap project, built on top of OAI-ORE, developed a framework for capturing and preserving the links between "distributed scholarly compound objects" (DiSCOs). The RMap approach was applied to capture graphs of relationships among academic articles, data sets, and other scholarly artifacts. The benefit of graph-based systems is that they can directly capture the complex networks of relationships that may exist between scholarly resources. The challenge of using graph-based approaches to capture and manage such relationships is that visually browsing through graphs of relationships can be difficult. It is often hard to clearly display the connections between resources to human users in readily understandable ways. Computational systems that leverage graph-based networks must also account for potential scaling challenges as graphs grow in size.

Key ongoing challenges related to these linking technology issues include:
3.4 Engaging Stakeholders, Building Partnerships

The stakeholders involved in digital resource cross-linking initiatives encompass a range of individuals and organizations, including academic publishers, resource repositories, libraries, and the creators of the publications, data, software, and other resources to be linked. Some roles and responsibilities are unique to particular stakeholders, while some cross-cut multiple stakeholders. A significant issue to be addressed is the relative role of resource curators and creators. Curators are better positioned to deal with metadata structures and linking technologies. Resource creators, however, have direct knowledge of the resources at hand, and are typically better positioned to denote the relationships that exist between the resources they have created and the resources they used to create them. At present, it is likely necessary to rely on resource creators to identify and declare such relationships, if they are to be captured with any consistency. Both parties need to work together, with curators sharing expertise on the linking technologies and relationship semantics schemas that are most appropriate in a given situation.

Partnerships between stakeholders can be critical in getting cross-linking established as a routine practice. The Dryad Digital Repository, for example, hosts scientific and medical data sets associated with publications. Dryad partners with over 70 journals to enable researchers to deposit data related to journal articles. Dryad has an established workflow for the deposition and exchange of metadata between the journals and the data repository (Vision, 2010). Dryad provides an important example for how stakeholders can interact, exchange information, and structure metadata in order to meet immediate operational functions, as well as support long-term curation of resources (Krause et al., 2015). Dryad illustrates how partnerships must cut across public and private enterprises.
Another example that illustrates the importance of multi-sector partnerships is the Coalition for Data Publication in the Earth and Space Sciences (COPDESS). Launched in early 2015, COPDESS is facilitating communication and joint initiatives between academic publishers and data repositories (Hanson, Lehnert, & Cutcher-Gershenfeld, 2015). Forty organizations (as of February 2016) have signed on to the COPDESS "statement of commitment", which lays out a number of key tasks and responsibilities related to managing data and promoting open science. Key ongoing challenges related to these stakeholder and partnership issues include:
3.5 Scale

Many of the issues described in the previous themes face challenges of scale. Making decisions about how to identify digital resources, how to link them, and how to exchange links are tractable tasks in isolated settings. Simple one-to-one data set to article linking between two partners, e.g. file-based data sets linked to static documents, might be easily achieved using simple metadata feeds. Scaling such linking to encompass the wide range of stakeholders, tools, and resource types at play in current scholarly settings requires different approaches. Whatever the technical approach, the engagement of stakeholders and building of partnerships is critical to overcoming scale challenges, as noted in the prior section.

Scaling issues also relate to the timeline for adoption, implementation, and sedimentation of particular linking approaches. The best standards or technologies (assuming "best" could be defined by some criteria) do not always win out, and multiple competing standards or technologies may coexist for extended periods of time (Kling & McKim, 2000). Recent data citation initiatives, for example, have made notable progress in formalizing approaches for linking from articles to data, but changing the day-to-day practices of researchers who are asked to create such citations will take much longer (Mayo, Hull, & Vision, 2016). Data citations will likely continue to increase in visibility because researchers value citations more than other metrics of scholarly activity (Kratz & Strasser, 2015). Data citations, however, provide limited information about the context of the link. They are useful for the discovery of links, but less so for understanding why the link was made (Mayernik, 2013).

Maintenance and updating are also critical scaling challenges, adding a temporal dimension to many issues discussed above.
Statements about relationships between resources may need to be updated over time, as new relationships are identified for existing resources, or to fix mistakes. In addition, all web-based relationship schemas will face the same link-rot issues that exist throughout the web. It is therefore necessary to ensure that relationship links declared at a certain point in time are monitored and maintained. Key ongoing challenges related to these scaling issues include:
4 Recommendations

The combination of the current grassroots open science movement with the many top-down agency and publisher mandates and requirements is helping to bring some coherence to practices in different communities. But with a number of initiatives currently working on resource cross-linking, and the wide range of issues identified in this article still to be addressed, more coordination is necessary if any common approach (or set of approaches) is to be developed. Continued discussion is needed to help the communities involved to coalesce around conventions for what links to identify, exchange, and manage over time. A number of recommendations emerged out of the discussions that took place during the Data and Publication Linking Workshop.

The first clear recommendation is to support the ongoing progress being made by a number of existing groups. Coalitions such as COPDESS are foundational for investigations into how relationships between publications and data should be articulated and exchanged. DataCite has likewise formed an extensive community, along with a suite of tools, metadata structures, and partnerships.

The second recommendation is to promote cross-linking within new and existing scholarly communication initiatives whenever possible. Data repositories, publishers, and institutional repositories should make resources programmatically accessible to facilitate automated detection of citations to data and other scholarly resources by services, such as the DLI Service, that are interested in aggregating relationship information. Protocols such as NISO/OAI ResourceSync provide robust mechanisms to expose resources for programmatic access. Resource providers might also support other methods to identify and expose link information, such as annotation and curation via the Open Annotation specification, OAI-ORE aggregations, or HTTP links via the REST/HATEOAS approach described above.
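
To make the idea of HTTP-level links slightly more concrete, the sketch below shows one way a provider might serialize typed links into the value of an HTTP `Link` header, the general mechanism defined by the Web Linking specification (RFC 8288), and how a consumer might read them back. It is a minimal illustration under stated assumptions rather than a complete implementation: the URLs are placeholders, the function names are hypothetical, the relation names `related` and `describedby` were chosen only as examples of registered link relation types, and the parser handles only the simple `<uri>; rel="..."` form produced here, not the full header grammar.

```python
import re


def format_link_header(links):
    """Build a Link header value from (target_uri, rel) pairs."""
    return ", ".join(f'<{uri}>; rel="{rel}"' for uri, rel in links)


# Matches only the simple form emitted by format_link_header above;
# real headers may carry extra parameters that this sketch ignores.
_LINK_PATTERN = re.compile(r'<([^>]*)>\s*;\s*rel="([^"]*)"')


def parse_link_header(value):
    """Recover (target_uri, rel) pairs from a simple Link header value."""
    return _LINK_PATTERN.findall(value)


# Placeholder URLs: an article landing page pointing at a related
# data set and at a metadata record that describes the article.
links = [
    ("https://example.org/data/42", "related"),
    ("https://example.org/meta/article-1.xml", "describedby"),
]
header_value = format_link_header(links)
print("Link:", header_value)
print(parse_link_header(header_value))
```

Because the links travel in the response headers rather than in the document body, an aggregator could in principle collect them with a lightweight request and without parsing each provider's page layout.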
Publication registries, such as CHORUS, which monitors the public-access compliance status of research publications based on funded research, and SHARE, which aggregates information about research and scholarly activities, should use data models that are flexible enough to accommodate declarations of relationships between scholarly resources provided by repositories (or content providers more generally). Aggregators will then have the opportunity to build services on top of the relationship declarations. The third recommendation is to emphasize the need to get links and relationships asserted and distributed, even if the optimal approaches to doing so have not yet solidified. Focusing too much effort on building agreement and consensus toward a relationship vocabulary, for example, may be counterproductive if it results in a lengthy process with unclear goals. Even if initial attempts at generating links and relationships are non-standardized and do not result in long-term solutions, tools will be built and tested, and lessons documented from such efforts can inform the next steps by the wider community (Unsworth, 1997). Negative results can be useful, if they narrow the scope of possible cross-linking systems, or raise the visibility of the possible merits of cross-linking applications. Building a store of trusted relationship assertions is likely more productive than expending considerable time and energy up front to establish controlled vocabularies and standards for documenting relationships. Standards are often interpreted and deployed in different ways, regardless of the comprehensiveness of the standards development process or the standards' documentation. Getting links asserted and available for aggregation will enable the community to see the relative merits of different cross-linking approaches. 
Data-driven approaches to determine what relationships are being expressed might subsequently be used to assess the coherence of the relationship designations, and their alignment with community relationship typologies and standards.

5 Conclusion

Researchers need freedom to determine the resources with which they would like to work, and what linkages, conceptual and methodological, they need to make between resources in doing their research. Research work environments evolve, and workflows from past projects may be difficult to resurrect once left behind. To accommodate this inherent need for research flexibility, linking mechanisms need to get built into the scholarly research process and become embedded within scholarly communication infrastructures and institutions (Assante et al., 2015). Significant progress is being made on a number of resource cross-linking challenges, and a range of tools are starting to be developed. When annotated with sufficient context, relationships between research resources can help to support resource discovery, access, re-use, and preservation within and across institutions, disciplines, and technologies.

Acknowledgements

The "Data & Publication Linking Workshop" was funded by US National Science Foundation award #1449668, "EAGER: Repository Cross-Linking for Open Archiving and Sharing of Scientific Data and Articles", PI Matthew Mayernik, Co-PI Don Middleton, University Corporation for Atmospheric Research (UCAR) / National Center for Atmospheric Research (NCAR).

Workshop participants:
Workshop organizing team:
References
About the Authors

Matthew S. Mayernik, Ph.D., is a Project Scientist and Research Data Services Specialist within the Library in the National Center for Atmospheric Research (NCAR) / University Corporation for Atmospheric Research (UCAR), located in Boulder, CO.

Jennifer Phillips, Ph.D., is the Manager of Library Services in the National Center for Atmospheric Research (NCAR) / University Corporation for Atmospheric Research (UCAR), located in Boulder, CO.

Eric Nienhouse is the Group Head of the Software Applications & Gateway Engineering division within the Computational & Information Systems Lab in the National Center for Atmospheric Research (NCAR), located in Boulder, CO.