D-Lib Magazine

March/April 2016
Volume 22, Number 3/4


Humanities Data in the Library: Integrity, Form, Access

Thomas Padilla
Michigan State University

DOI: 10.1045/march2016-padilla



Digitally inflected Humanities scholarship and pedagogy are on the rise. Librarians are engaging this activity in part through a range of digital scholarship initiatives. While these engagements bear value, efforts to reshape library collections in light of demand remain nascent. This paper advances principles derived from practice to inform development of collections that can better support data-driven research and pedagogy, examines existing practice in this area for strengths and weaknesses, and extends to consider possible futures.


1 Introduction

Commitments to Digital Humanities, Digital History, Digital Art History, and Digital Liberal Arts are on the rise.1, 2, 3, 4, 5 These commitments can be witnessed in federal agency and foundation activity, university and college level curriculum development, evolving positions on tenure and promotion, dedicated journals, and the hiring of faculty and staff geared toward enhancing utilization of and critical reflection on computational methods and tools within and across a wide array of disciplinary spaces.6, 7 Librarians have sought to engage these commitments through development of digital scholarship centers, recombination of services, creation of new positions, and implementation of user studies.8 While these engagements bear value, efforts to reshape library collections in light of demand remain nascent, diffuse, and unevenly distributed. Where traditional library objects like books, images, and audio clips begin to be explored as data, new considerations of integrity, form, and access come to the fore. Integrity refers to the documentation practices that ensure data are amenable to critical evaluation. Form refers to the formats and data structures that contain data users need to engage in a common set of activities. Access refers to technologies used to make data available for use. In order to inform community steps toward developing Humanities data collections, the following work advances principles derived from practice that are designed to foster the creation of data that better supports digitally inflected Humanities scholarship and pedagogy. After these principles are advanced, a wider field of Humanities data collection models is considered for strengths and weaknesses along the axes of integrity, form, and access. The work closes with a consideration of Humanities data futures that spans questions of discoverability, terms and conditions limiting access, and the possibility of a Humanities data reuse paradigm.


2 Humanities Data Integrity

In preparing Humanities data collections, it is instructive to consider documentation practices applied by researchers to the data that are generated from them. Consider Ted Underwood's work on page-by-page genre predictions for 854,476 English-language volumes held by the HathiTrust Digital Library.9, 10 Underwood goes into great detail describing the source of his data, who funded the work, what algorithmic methods were used to generate the data, data structures, file naming conventions, decisions to subset the data, and links to software used to generate the data. Taken as a whole these documentation practices make the data usable by a wide audience with intentions that simultaneously span applied computational work as well as critique of that work through a theoretical lens. In order for both angles of inquiry to occur documentation practices must cohere to make data critically addressable. By critically addressable we refer to the ability of data documentation to afford individuals the ability to evaluate both the technical and social forces that shape the data. A researcher should be able to understand why certain data were included and excluded, why certain transformations were made, who made those transformations, and at the same time a researcher should have access to the code and tools that were used to effect those transformations. Where gaps in the data are native to the vagaries of data production and capture, as is the case with web archives, these nuances must be effectively communicated.11, 12 Part to whole relationships between a digital collection and a larger un-digitized collection must be indicated.13 Periodic and/or routine updates to the data must be signaled.14 This notion of critical addressability is vital across the full spectrum of the research process, from the researcher seeking to select, evaluate, and extend arguments based on a Humanities data collection, to the peer reviewer seeking to understand the composition of a collection, the processes that have been applied to it, and how those factors might affect arguments predicated upon the collection. In order to safeguard the integrity, and thus the critical addressability of Humanities data collections, librarians can be guided by three conceptually complementary positions advanced by Miriam Posner, Victoria Stodden, and Roopika Risam.

In a series of blog posts and talks Miriam Posner has advocated for an approach to making Digital Humanities projects more intelligible by asking a seemingly simple but often complex question: "How did they make that?"15 The "How did they make that?" approach entails breaking a Digital Humanities project apart along three levels of analysis: sources, processed, and presentation. Sources relate to the data driving a project, processed refers to processes enacted by the researcher upon the data, and presentation refers to the methods and tools used to present the processed data. While operating in domains outside of the Humanities, Victoria Stodden's work on reproducibility in scientific computing is complementary to Posner's framework.16, 17 Stodden's argument for clear documentation and unimpeded access to code and data driving a research claim makes the "How did they make that?" approach a tenable proposition by helping to ensure that data and code are present for evaluation. Roopika Risam's work on closing an oft-asserted gap between computation and cultural critique reminds us that documentation in the Humanities is about more than reproducibility.18 Documentation of data and process connotes researcher ability to understand the provenance of labor, logics of inclusion and exclusion in data, and the rationale that favors one methodological choice and/or mode of transformation over another.

The considerations that Posner, Stodden, and Risam introduce are not explicitly developed to influence the manner in which libraries prepare and document Humanities data collections. However, their perspectives can certainly help us to develop Humanities data collections that are readily usable, interpretable, and critically addressable. With Posner, Stodden, and Risam, we approach articulation of an approximate rubric for evaluating the readiness of Humanities data collections to support digitally inflected scholarship:

  • Posner: to what extent is information about Humanities data collection provenance, processing, and method of presentation available to the user?
  • Stodden: to what extent are data and the code that generates data available to the user?
  • Risam: to what extent are the motivations driving all of the above available to the user?

Committing to these principles in Humanities data collection development practice marks a distinct direction for libraries. On the service side libraries typically seek to occlude subjective choices driving collection preparation and organization in the interest of presenting an objective and neutral ordering of objects. Yet libraries are never neutral.19, 20 Our systems and practices of organization must be made more transparent.21 Thinking critically on the contours of our collections, we must consider where it makes sense to add the seams back. While universality has long been a core tenet of libraries, presentation of this affect simultaneously erases individual labor committed by actual people with actual opinions and renders collections less readily addressable to critical inquiry.


3 Humanities Data Form

With the concept of Humanities data integrity as a suite of data documentation practices established we can move on to consider a process for determining what form Humanities data objects themselves should take to better support research and pedagogy at a functional level. Generally speaking, digitized objects in libraries are relatively consistent in form, having the benefit of multiple decades of digitization standardization. Born digital objects are typically more heterogeneous. Collectively, these Humanities data are instantiated in file formats. Data organization within these formats (e.g. structured vs. unstructured) is contingent in part on format and in part on design of the data creator. When preparing a Humanities data collection the goal of the librarian is to decide, at a functional level, what data form will be most readily usable for target user communities. Depending on the institution this community could be bound by discipline, by local need (e.g. campus), by purpose (e.g. research vs. pedagogy), or by a broadly defined set of users. Generally speaking, some degree of collection transformation will be required in order to better support users that want to interact with collections computationally.

Librarians can approximate data form requirements by reverse engineering curriculum, web-based projects, research presented at conferences, and scholarly articles produced by and/or relevant to target user communities. For any given disciplinary activity expressed in the prior zones of engagement there is a relatively common set of tools and methods used. Across these tools and methods there are common data format requirements. Affordances of data residing within these formats vary. Review of tool and method requirements leads to the ability to identify core formats in addition to formats with generative potential. Formats are "core" when they are fit for use in an unaltered state. Formats are "generative" when they have the quality of ready transformability toward a usable state. By extending these considerations across a wider range of pedagogical and research based outputs salient to target user communities, a librarian can begin to develop a strategy for collection transformation that produces more readily functional data.

For an example of reverse engineering toward identification of core and generative formats, consider Johanna Drucker and David Kim's DH 101 course site.22 The text analysis module provides a tutorial on Voyant and the data visualization module provides a tutorial on Cytoscape.23, 24 Delving into the documentation for Voyant reveals that it accepts data in the following formats: TXT, HTML, PDF, RTF, and DOC. Cytoscape accepts data in the following formats: SIF, NNF, GML, XGMML, SBML, BioPAX, PSI-MI, delimited text, and XLS. With respect to Voyant all formats aside from TXT are concessions to making it easier for users to get data into the text analysis environment. We know this because the structured data accorded to file formats other than TXT are for the most part not leveraged post ingest. Therefore, we can settle upon TXT as the core format for Voyant. Cytoscape stands distinct insofar as each format conveys varied functionality post ingest. In this case determination of a suitable format is predicated more on generative potential than on immediate fit for use. We home in on the format with the most generative potential, again, through a process of reverse engineering. Networks of characters in novels are commonly represented in the Digital Humanities. These data typically take the form of a graph. Examination of tools used to create graph data quickly surfaces the NetworkX Python software package. NetworkX can be readily used to convert structured text data into graph data. Assuming availability of a novel stored in a plain text file, a user could readily prepare that data for use with NetworkX by structuring the plain text data using a tool like Stanford Named Entity Recognizer to provide machine readable tags that could be used by NetworkX to build the graph data required for exploration via Cytoscape. From this relatively small sample, a librarian could begin to infer that plain text data are both core and generative for users interested in text analysis and data visualization.
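
The generative path from plain text to graph data can be illustrated with a minimal sketch. It assumes that a named entity recognizer has already wrapped character names in inline tags of the form <PERSON>...</PERSON>, treats two characters as connected whenever they share a paragraph, and writes the result as tab-delimited SIF, one of the formats Cytoscape accepts. To keep the example self-contained, the Python standard library stands in for NetworkX here; the function names, the "co-occurs" relation label, and the sample text are all hypothetical.

```python
import itertools
import re
from collections import Counter

# Assumed tag shape: names wrapped as <PERSON>...</PERSON> by an NER tool.
PERSON_TAG = re.compile(r"<PERSON>(.*?)</PERSON>")


def cooccurrence_edges(tagged_text):
    """Count how often each pair of tagged names shares a paragraph."""
    edges = Counter()
    for paragraph in tagged_text.split("\n\n"):
        # Deduplicate and sort so each pair has one stable orientation.
        names = sorted(set(PERSON_TAG.findall(paragraph)))
        for pair in itertools.combinations(names, 2):
            edges[pair] += 1
    return edges


def to_sif(edges):
    """Serialize edges as tab-delimited SIF lines: source, relation, target."""
    return "\n".join(f"{a}\tco-occurs\t{b}" for (a, b) in sorted(edges))


# Hypothetical two-paragraph sample standing in for a tagged novel.
SAMPLE = (
    "<PERSON>Elizabeth</PERSON> spoke with <PERSON>Darcy</PERSON>.\n\n"
    "<PERSON>Darcy</PERSON> wrote to <PERSON>Elizabeth</PERSON> and "
    "<PERSON>Bingley</PERSON>."
)

edges = cooccurrence_edges(SAMPLE)
sif_text = to_sif(edges)
print(sif_text)
```

SIF records only that an edge exists, so the paragraph counts held in `edges` are lost on export. A user who needs weighted edges would likely load the same pairs into a graph library and export a richer format such as GML.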

As librarians shift their consideration across user communities the notion of what is core and generative may shift. This shift is a consequence of varying levels of skill and desired goals. For example, while data held in a series of XML files underlying a TEI project are generally less readily usable to an Introduction to Digital Humanities class, they could be considered a core format for more advanced users. With respect to variation in desired goals of a user community, it could be tempting to assume that the majority of users want to work with plain text derivatives of collection objects. This assumption could lead to extraction of and sole provision of plain text data derived from high quality PDF and TIFF page images. Yet focus on provision of plain text data at the expense of providing page images fails to consider the possibility of their potential core and/or generative value relative to other types of computational questions. For example, researchers increasingly make use of page images to visualize margin space, line indentation, ornamentation, and text density, and to explore automatic detection of poetic content in historical newspapers and automatic identification of images.25, 26, 27


4 Humanities Data Access

Addressing Humanities data integrity and form must be coupled with a reconsideration of the technical solutions designed to provide access to those data. Libraries have historically privileged development of technical solutions that are geared toward emulating aspects of analog object interactions, a decidedly non data oriented approach. Page-turners, image zooming, and design biases toward single item use and single item download capabilities inhibit the ability of researchers to work computationally with collections at scale. These interfaces are essentially unusable for researchers that want to text mine, visualize, and/or creatively recombine more than a handful of objects into a work of their own making. In essence, what is required is a way of design thinking that considers what it takes to enable collection wide interactions. As a case in point consider the example of researchers Micki Kaufman and Doug Reside. Kaufman engaged in a computationally driven historical study of the meeting memoranda (memcons) and teleconference transcripts (telcons) held in the Digital National Security Archives' (DNSA) Kissinger Collection. Kaufman's set of documents totaled approximately 17,500 memcons and telcons detailing Kissinger's correspondence during the period 1969-1977.28 Like most historians Kaufman was interested in exploring aspects of Kissinger's foreign policy and personal motivations. The only difference between this research project and another was the scale at which the questions were asked and the methods and tools employed to ask them. On multiple levels, the DNSA was wholly unprepared to support this research. Consider for a moment how profoundly unprepared most traditional digital collection interfaces are to support research of this kind.

In a similar vein but on a slightly different trajectory, Doug Reside was seeking to understand how digital composition practices may have influenced Stephen Sondheim. Librarians in the Music Division at the Library of Congress pointed Reside to the Jonathan Larson papers, a recent born digital acquisition that came to the library on a series of floppy disks. Reside was able to work with library staff and the Larson estate to gain access to the storage media that held the data, to engage in some digital forensic work to access the data, and ultimately transfer data to his own workspace. Reside's reflections on this process present vital signposts for thinking about Humanities data access:

I think sometimes we who work in libraries and archives practice our role as guardians of material more fiercely than we practice our role as a collaborator in research ... It's ... important to note that once I had migrated the data to the servers, it was up to me, as a researcher, to make sense of it. I think often we worry too much about doing research for our readers. Over the last decade or so we've come to understand that "more product, less process" is a better approach for paper collections, but I still hear a lot of fretting about how we will process and serve born digital collections if we, as library staff, don't know how to access or emulate the files ourselves. My feeling is that our role is simply to give the researchers what they need and get out of the way.29

Challenges to supporting researchers like Kaufman and Reside are both technical and social. On the one hand most access systems have not been designed to support them. On the other hand professional standards of care around collection description and access hold potential to be counterproductive. As Reside mentions, archivists have sought to balance care and access, in part, by committing to the more product less process model (MPLP).30 As we move forward with Humanities data provision in libraries it will be important to pay equal attention to developing technical solutions as well as professional dispositions that support this type of collection work.


4.1 Existing Models

Library Humanities data collection models vary widely. In order to support more concerted effort in this space, it is necessary to examine models that were explicitly designed to support computational engagement with collections as well as collections and access mechanisms that were not clearly designed for this purpose but nonetheless hold the potential to inform model development. A broad view of cultural heritage institution activity in this space allows us to form a picture of practice that is complementary to Humanities data provision goals. This effort reveals the following high level Humanities data collection model characteristics.

Data are typically made available via three primary locations:

  • Content Steward Website: e.g. simple webpages that have static hierarchical structures
  • Content Steward Repository: e.g. repository or digital collections software
  • Community Repository: e.g. non content steward owned repository or digital collections software

Data are typically made accessible from these locations as:

  • Compressed collections: e.g. data are accessible via download as ZIP files
  • Static collections: e.g. data have fixed structure and are accessible via tools like wget
  • Databased collections: e.g. data are accessible via application programming interface (API)

Data typically comprise some combination of the following content:

  • Descriptive metadata: item and collection level description
  • Objects: text, images, sound, moving images, etc.
  • Code: programming instructions that produced the data
  • Documentation: readme files, DTDs, etc.

The form of data varies but they share some consistency across content type. Integrity of collections and corresponding documentation are not consistent across providers.


4.2 Content Steward Website

Michigan State University Libraries (MSUL) and the University of North Carolina Chapel Hill Libraries (UNC) approach Humanities data provision in a similar manner. Both institutions make data available from their website, independent of repository software acting in an intermediary role. Both institutions share commonality in providing access to compressed collections composed of similar combinations of data. With its DocSouth Data collections, derived from the Documenting the American South Collections, UNC provides access to a series of ZIP files. Each ZIP file corresponds to an individual collection. DocSouth Data collections contain a set of TEI encoded XML files, an identical set of plain text files with markup stripped out, a table of contents file, a readme file, and an XSL file used to create derivatives within the collection. MSUL provides access to a wider mix of Humanities data that are derived from Special Collections materials, data purchased from vendors, data licensed from vendors, and data negotiated from corporate entities. With respect to Humanities data derived from Special Collections, MSUL provides access to a series of files via a library webpage. Typically content based objects are placed in ZIP files, while collected metadata records, readme files, DTDs, and title lists are provided separately. Each collection has a dedicated webpage that functions as a readme, seen prior to downloading data. Dedicated readme-like webpages describe the digital collection, provide a preferred citation, digital collection creation background, a data summary that encompasses data format, file naming conventions, data size, and additional sections that document data quality and acknowledge individual staff effort that went into creation of the Humanities data collection as well as the source data collection.
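
A user or librarian receiving a compressed collection of this kind might first check what it contains. The following is a minimal sketch, assuming a ZIP file whose members can be sorted into content types by file extension. The extension mapping, the category labels, and the sample file names are hypothetical and do not reflect the actual layout of any DocSouth Data or MSUL collection.

```python
import io
import zipfile
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to content category.
CATEGORIES = {
    ".xml": "encoded text",
    ".txt": "plain text",
    ".xsl": "code",
    ".md": "documentation",
}


def inventory(zip_bytes):
    """Count archive members per content category, skipping directories."""
    counts = Counter()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue
            suffix = PurePosixPath(name).suffix.lower()
            counts[CATEGORIES.get(suffix, "other")] += 1
    return counts


def build_sample():
    """Build a small in-memory ZIP that stands in for a downloaded collection."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("collection/README.md", "About this collection")
        archive.writestr("collection/xml/vol1.xml", "<TEI/>")
        archive.writestr("collection/txt/vol1.txt", "plain text")
        archive.writestr("collection/tei2txt.xsl", "<xsl:stylesheet/>")
        archive.writestr("collection/toc.csv", "file,title")
    return buffer.getvalue()


summary = inventory(build_sample())
# Flag content types that bear on integrity but are absent from the archive.
missing = [c for c in ("documentation", "code") if summary[c] == 0]
print(dict(summary), missing)
```

An empty `missing` list indicates only that documentation and code files are present, not that they are adequate. Judging whether they make a collection critically addressable remains a matter of reading them.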

The University of Pennsylvania Libraries' (UPenn) OPENN stands distinct from UNC and MSUL with respect to expanding the number of technical methods for accessing a Humanities data collection. UPenn provides data via four methods: clicking on links to files via the OPENN website, anonymous FTP, Rsync, and wget. The FTP and Rsync methods are presented as tools for downloading data in bulk, with a slight edge given to Rsync. The static structure of the collection makes it easy for a researcher to utilize a tool like wget to selectively download items from the collection at scale. Documentation for OPENN data spans licensing, metadata, intended user communities, collection background, image standards and specifications, imaging and processing equipment, and sponsorship that drove collection development.

The primary strength of the content steward website approach is that it gives users ready access to collection data at scale, either through single-click download of entire collections or through simple tools like wget, FTP, and Rsync. While an API could enable more granularly expansive access to the collections, sole commitment to an API as an access method risks inhibiting access for users who are either unfamiliar with APIs generally or simply fatigued from having to learn yet another API that differs slightly or substantively from those they already know. Weaknesses of the content steward website approach include the inability to leverage metadata accorded to Humanities data collections, minimal integration with larger collections, and the lack of an application programming interface (API) for users who want to create a subset of a data collection predicated on multiple parameters.


4.3 Content Steward Repository

Some institutions make use of their repository or digital collections software to provide access to Humanities data. The University of British Columbia's (UBC) Open Collections and the University of Pennsylvania's (UPenn) Magazine of Early American Datasets are prime examples of this approach. Open Collections makes data from DSpace and CONTENTdm installations accessible via API. UBC provides clear API documentation as well as an in browser API query builder to get users started. The API is intended to help users, "run powerful queries, perform advanced analysis, and build custom views, apps, and widgets with full access to the Open Collections' metadata and transcripts." UPenn's Magazine of Early American Datasets utilizes an installation of the bepress product Digital Commons to make data available. Data are user submitted rather than wholly UPenn sourced collections. Each collection is discoverable via search or browse from the Digital Commons interface. Default download options typically point to a single file. In order to get access to codebooks and associated data a user clicks on files in an 'additional files' section. Metadata describing the item in question is not made available as an additional file.

Utilization of the content steward repository approach has the advantage of leveraging metadata accorded to Humanities data collections in a meaningful way, integration with other types of content and collections, integration with preservation solutions, and provision of application programming interfaces to enable granularly expansive access to data. Weaknesses of this approach depend on the combination of access mechanisms. Provision of an API to the exclusion of easier to use methods for beginners runs the risk of alienating users. The API is also potentially a barrier for more advanced users who have few parameters to their data needs where simple collection wide access is sought.31 Where the size of a given collection inhibits the ability to enable single click collection download, cultural heritage organizations may do well to explore the viability of making their data available via Academic Torrents.


4.4 Community Repository

Some institutions contribute their data to a repository that they do not own. Examples of this approach include Indiana University Bloomington's and the Tate Modern Gallery's use of GitHub as well as a multitude of institutions making use of HathiTrust and the Digital Public Library of America. Indiana University Bloomington uses GitHub to make TEI collections available for, "... easier harvesting and re-purposing ... [so that] content can ... be analyzed, parsed, and remixed outside of the context of its native interface for broader impact and exposure."32 Collection composition spans metadata as well as objects. The Tate Modern Gallery makes collection metadata available for about 70,000 artworks. It is not immediately clear what the purpose of this effort is, yet from the readme associated with the collection it appears that the Tate looks positively on creative remixing, visualization, and analysis of its collection metadata.33, 34

The Digital Public Library of America (DPLA) and HathiTrust Research Center (HTRC) stand distinct from the prior example in the sense that they are explicitly focused on gathering digital collections materials from cultural heritage organizations and operate under a nonprofit model. DPLA currently focuses on aggregation of collections metadata. Thus, use of the DPLA API provides access to metadata at scale, but not direct access to the digital objects that they refer to. Provision of a metadata collection focused API is intended to, "encourage the independent development of applications, tools, and resources that make use of data contained in the DPLA platform in new and innovative ways, from anywhere, at any time." To date an interesting array of applications have been developed using the DPLA API that help users navigate aggregated collections by color, visualize term frequency over time, and visualize content license type distribution across content in DPLA.35, 36, 37 DPLA provides thorough documentation for their API, a statement on API design philosophy, as well as a number of code samples that make use of the API. For the non API inclined, DPLA provides bulk access to metadata collections as gzipped JSON files via their tools for developers.
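
Bulk metadata of this kind can be explored without touching an API. The sketch below tallies rights statements across a gzipped JSON file, loosely echoing the license distribution visualizations mentioned above. It assumes one JSON record per line and a rights value nested under a "sourceResource" key; both the layout and the field names are assumptions to be checked against DPLA's own documentation, and the sample records are invented for illustration.

```python
import gzip
import io
import json
from collections import Counter


def rights_distribution(stream):
    """Tally rights statements in a gzipped file of JSON records, one per line."""
    counts = Counter()
    with gzip.open(stream, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if not line.strip():
                continue
            record = json.loads(line)
            # Assumed field path; records without it are counted as "unstated".
            rights = record.get("sourceResource", {}).get("rights", "unstated")
            if isinstance(rights, list):
                rights = "; ".join(rights)
            counts[rights] += 1
    return counts


def build_sample():
    """Create an in-memory gzipped sample standing in for a bulk download."""
    records = [
        {"id": "a", "sourceResource": {"title": "Letter", "rights": "Public domain"}},
        {"id": "b", "sourceResource": {"title": "Map", "rights": "Public domain"}},
        {"id": "c", "sourceResource": {"title": "Photograph"}},
    ]
    payload = "\n".join(json.dumps(r) for r in records).encode("utf-8")
    return io.BytesIO(gzip.compress(payload))


counts = rights_distribution(build_sample())
print(counts.most_common())
```

Reading line by line keeps memory use flat, which matters when a metadata file holds millions of records. For a real download, an open file object would be passed in place of the in-memory sample.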

The HathiTrust Research Center (HTRC) offers an API to access data, the HTRC Portal interface to execute computational jobs against data using HTRC compute resources, and a number of datasets that are constituted by extracted features of works held by HathiTrust. At some point in 2016 HTRC also aims to provide the ability to analyze in-copyright works via the HTRC Data Capsule. The HTRC Data API provides access to zip files that contain plain text volumes, pages of volumes, and associated XML metadata records in the METS format. The HTRC Portal interface lets users build "worksets" that resolve to objects like literary texts, which can in turn be submitted to HTRC for processing under a predetermined set of algorithmic approaches. After these processes run, users gain access to the data generated from analysis. The primary extracted features dataset spans 4.8 million volumes and consists of volume features as well as page level features like number of tokens on a page, line count, and languages identified on a page.38 These data are accessible via Rsync as compressed TAR files that contain data stored in the JSON format. In addition to this large scale extracted features dataset, users have the option to generate an extracted features dataset from a custom workset. Finally, the HTRC Data Capsule will provide a secure environment for analyzing in-copyright works under a non consumptive paradigm.39 In this framing users are not afforded the possibility of accessing full text resources on their own device. Rather, HTRC mediates computational requests and returns output generated by those requests to users.

The strengths of the community repository approach are many. Community repositories expose disparate collections to a wider audience than source repositories. In doing so they encourage wider use. Organizations that lead community repositories often have greater ability to advocate for community positions on copyright and licensing that better support research and creative works. HathiTrust has been active in this area through pursuit of the non consumptive paradigm and Authors Guild v. HathiTrust. DPLA has been active in this area through international harmonization of rights statements.40 The weaknesses of this model are better understood as opportunities. Given broad reach and broad care over data collections, these repositories bear greater impact than any single institution affiliated repository. The decisions that managers of these repositories make regarding development of collections and technical features to meet target user community need are consequential for a broad spectrum of potential users. While the DPLA is wonderful on the computational side for developers it is an open question how well it is suited for computationally inflected research and pedagogy. On the flip side, HTRC is wonderful on the computational side for research, yet it is an open question how well suited it is for developers. It would be unfair to expect any one community repository to serve all potential users, but given their size and influence, a greater responsibility is borne given the range of possibilities their work engenders.


5 Humanities Data Futures

As the library community dedicates more effort to developing collections and platforms that support digitally inflected research and pedagogy, it will become increasingly important to consider challenges and opportunities inherent in the development of more robust solutions. It must be acknowledged that present Humanities data collection development is diffuse, sometimes focusing on meeting the needs of a broadly (inter)disciplinary and (inter)professional community like the Digital Humanities, and other times having a narrower scope. In order to develop more resonant collections and platforms, needs assessment and other forms of user research must be executed in a more systematic and sustained manner. A small number of targeted studies show initial promise in this area, and the increased activity of the Digital Library Federation Assessment Interest Group is encouraging.41, 42

In the few places where Humanities data collections exist they are siloed and difficult to discover outside of their institutional bounds. It is likely the case that there will never be a one ring to rule them all for Humanities data discovery. Effort in this area could very well trend toward a series of specialized spaces not dissimilar to how repositories have arisen in other areas of inquiry. The repositories maintained by the Inter-university Consortium for Political and Social Research (ICPSR) and the World Historical Dataverse provide cogent examples. In lieu of consolidated effort in this space, librarians should of course still consider local solutions while keeping an eye on future interoperability. One possible approach to doing both could reside in developing technical solutions that integrate with open source efforts like Fedora Commons, Hydra in a Box, and ArchivesSpace.

In the present study, focus has been placed on Humanities data collections derived from library-owned collections rather than collections licensed from vendors. Librarians should have greater control over collections the library owns, making for an easier process of transformation that should in turn provide a local precedent that helps frame a productive discussion with vendors, especially with respect to data form and access. The integrity conversation will likely be more challenging. Terms and conditions associated with many vendor products come into direct collision with principles of data reuse and research transparency. For example, terms and conditions show their age particularly around the notion of sharing "snippets" of text, where a possible research question is predicated on many thousands of documents and potentially millions of words. Under such a restriction, how might research in this vein, predicated on resources with these terms, be properly evaluated by a peer community that may or may not have access to the data through their institution? Where vendors don't actually own content, they are in a difficult yet potentially promising position to broker negotiations between content owners and library licensees to make works from a diverse set of sources more usable at scale. On the nonprofit side of things, Portico has expressed interest in thinking about how it might help the library community develop services and tools to support scholarship that relies on text and data mining.

Present Humanities data provision is predicated on a push paradigm. In a push paradigm, Humanities data collection development is focused on enabling data access. The push paradigm does not consider how to incentivize pulling Humanities data back into the collection once it has been used for a given application. A push-pull paradigm for Humanities data collection development aligns technical infrastructure and data preparation practices with a growing emphasis on the value of data reuse, research reproducibility, and transparency.43 Operating under a push-pull paradigm, a library could make a data collection available. A faculty member teaching a Digital Humanities course could subset it and transform it into a graph data format in order to teach a network analysis class. Following the class, the faculty member could place their data back into the repository in such a way that the provenance between the source data collection and the network data is established. Ideally, the link between the source data collection and the derivative data produced by the faculty member would be represented as a point of metadata that would increase the discoverability of both the source data and the derivative data for audiences seeking collections that can readily support digitally inflected research and pedagogy. The library benefits in this scenario by gaining a firmer handle on how often collections are being used and for what purposes. A broader community of users benefits from ready access to source data collections and data derived from them. Maintaining a public connection between source data and derived data has a corollary benefit of giving users a sense of collection possibility. Creators of data uploaded back to the library benefit from having a place to make their data accessible in such a way that reusers of that data and peer reviewers have a concrete sense of where the data in question originated.
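
The push-pull scenario can be made concrete with a small sketch. The Python below is illustrative only: the collection identifier, the record fields, and the provenance field names (derived_from, source_record_ids, transformations) are assumptions made for this example and do not correspond to any particular repository platform or metadata standard. It subsets a hypothetical source collection, transforms the subset into a co-occurrence edge list suitable for network analysis, and builds a deposit record that points back to the source.

```python
import json
from itertools import combinations

# Hypothetical source collection; the identifier and fields are assumptions.
SOURCE_COLLECTION = {
    "id": "example-library/letters-collection",
    "records": [
        {"id": "letter-1", "year": 1850, "people": ["Ada", "Ben"]},
        {"id": "letter-2", "year": 1851, "people": ["Ada", "Cy", "Ben"]},
        {"id": "letter-3", "year": 1900, "people": ["Dee", "Eli"]},
    ],
}


def subset(records, start, end):
    """Keep only the records dated within [start, end]."""
    return [r for r in records if start <= r["year"] <= end]


def to_edge_list(records):
    """Count how often each pair of people is named in the same record."""
    weights = {}
    for record in records:
        for a, b in combinations(sorted(record["people"]), 2):
            weights[(a, b)] = weights.get((a, b), 0) + 1
    return [
        {"source": a, "target": b, "weight": w}
        for (a, b), w in sorted(weights.items())
    ]


def provenance_record(source_id, derived_id, record_ids, steps):
    """Describe a derived dataset and link it back to its source collection."""
    return {
        "id": derived_id,
        "derived_from": source_id,        # the link deposited as metadata
        "source_record_ids": record_ids,  # which records were used
        "transformations": steps,         # what was done to them
    }


selected = subset(SOURCE_COLLECTION["records"], 1850, 1860)
edges = to_edge_list(selected)
deposit = provenance_record(
    SOURCE_COLLECTION["id"],
    "example-course/letters-network",  # hypothetical derived dataset id
    [r["id"] for r in selected],
    ["subset to years 1850-1860", "converted to co-occurrence edge list"],
)

print(json.dumps({"provenance": deposit, "edges": edges}, indent=2))
```

Recording the link as a field on the derived dataset means that a repository could, in principle, index it in both directions: from the source collection to every dataset derived from it, and from a derived dataset back to its origin.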

This article has advanced a series of principles to inform Humanities data collection development in light of demand posed by digitally inflected scholarship and pedagogy. By paying keen attention to integrity, form, and access, librarians and others operating in the cultural heritage sector situate themselves well to adapt these principles and reshape their collections to become more amenable to computational methods and tools. Fluency gained with data throughout the process of Humanities data collection development positions the librarian as research partner as well as content provider. This article extended to review current practice, both practice explicitly focused on Humanities data provision and practice that inspires this effort, and concluded with suggestions for future directions in this space. While efforts in this area are nascent, they offer exciting opportunities for thinking anew about how librarians and the collections they steward can catalyze research, pedagogy, and our collective creative potential.



References

[1] "Grinnell College, University of Iowa Join Forces to Expand Use of Digital Technology."
[2] Livadas, Greg. "RIT to Offer Bachelor's Degree in Digital Humanities and Social Sciences."
[3] Roy Rosenzweig Center for History and New Media, George Mason University. "Getty Foundation Funds Institute for Art History Graduate Students at RRCHNM."
[4] "Humanidades Digitales - ACERCA DE."
[5] "Global Outlook::Digital Humanities."
[6] "Guidelines for Evaluating Work in Digital Humanities and Digital Media | Modern Language Association."
[7] Denbo, Seth. "AHA Council Approves Guidelines for Evaluation of Digital Projects."
[8] Green, Harriett E., and Angela Courtney. "Beyond the Scanned Image: A Needs Assessment of Scholarly Users of Digital Collections." College & Research Libraries, September 10, 2014, crl14-612. http://doi.org/10.5860/crl.76.5.690
[9] Underwood, Ted. Page-Level Genre Metadata for English-Language Volumes in HathiTrust, 1700-1922. figshare, 2014. https://doi.org/10.6084/m9.figshare.1279201
[10] "Tedunderwood/genre." GitHub.
[11] Jackson, Andy, "The Provenance of Web Archives — UK Web Archive Blog."
[12] Rosenthal, David. "DSHR's Blog: You Get What You Get and You Don't Get Upset."
[13] Francois, Pieter. "The Sample Generator — Part 1: Origins — Digital Scholarship Blog."
[14] Lincoln, Matthew D. "Some Problems with GLAM Data on GitHub." Matthew Lincoln, January 6, 2016.
[15] Posner, Miriam. How Did They Make That?, 2014.
[16] Stodden, Victoria. "The Scientific Method in Practice: Reproducibility in the Computational Sciences." SSRN Electronic Journal, 2010. http://doi.org/10.2139/ssrn.1550193
[17] Stodden, Victoria. "Trust Your Science? Open Your Data and Code."
[18] Risam, Roopika. "Beyond the Margins: Intersectionality and the Digital Humanities." Digital Humanities Quarterly 9, no. 2. 2015.
[19] Sadler, Bess and Chris Bourg. "Feminism and the Future of Library Discovery." Code4lib Journal 28. 2015.
[20] Masters, Christine L. "Women's Ways of Structuring Data." Ada: A Journal of Gender, New Media, and Technology, November 1, 2015.
[21] Bowker, Geoffrey C., and Susan Leigh Star. Sorting Things out: Classification and Its Consequences. Cambridge, Mass.: MIT Press, 2000.
[22] Drucker, Johanna and David Kim. "Introduction to Digital Humanities | Concepts, Methods, and Tutorials for Students and Instructors."
[23] Sinclair, Stéfan and Geoffrey Rockwell. Voyeur Tools (Home Page), 2009.
[24] "Cytoscape: An Open Source Platform for Complex Network Analysis and Visualization."
[25] Houston, Natalie M., "Visual Page."
[26] Elizabeth Lorang et al., "Developing an Image-Based Classifier for Detecting Poetic Content in Historic Newspaper Collections," D-Lib Magazine 21, no. 7/8 (July 2015), http://doi.org/10.1045/july2015-lorang
[27] British Library. "The Mechanical Curator."
[28] Kaufman, Micki. "Everything on Paper Will Be Used Against Me: Quantifying Kissinger."
[29] Manus, Susan. "Digging Up the Recent Past: An Interview With Doug Reside." The Signal: Digital Preservation, n.d.
[30] Greene, Mark, and Dennis Meissner. "More Product, Less Process: Revamping Traditional Archival Processing." The American Archivist 68, no. 2 (September 2005): 208-63. http://doi.org/10.17723/aarc.68.2.c741823776k65863
[31] Padilla, Thomas and Matthew Lincoln. "Data-Driven Art History: Framing, Adapting, Documenting." dh+lib Data Praxis.
[32] Dalmau, Michelle. "TEI and Plain Text from Digital Collections Services, Indiana University Libraries."
[33] Drass, Eric. "Tate Explorer."
[34] Kräutli, Florian. "The Tate Collection on Github."
[35] Nelson, Chad. "Color Browse."
[36] Farr, Dean. "Term Frequency Map."
[37] Farr, Dean. "DPLA Licenses."
[38] Capitanu, Boris, Ted Underwood, Peter Organisciak, Sayan Bhattacharyya, Loretta Auvil, Colleen Fallaw, and J. Stephen Downie. Extracted Feature Dataset from 4.8 Million HathiTrust Digital Library Public Domain Volumes (0.2) [Dataset]. HathiTrust Research Center, 2015. http://doi.org/10.13012/j8td9v7m
[39] Plale, Beth, Atul Prakash, and Robert McDonald, "The Data Capsule for Non-Consumptive Research: Final Report," February 4, 2015.
[40] Gore, Emily. "Getting It Right on Rights." Digital Public Library of America.
[41] Green, Harriett E., and Angela Courtney. "Beyond the Scanned Image: A Needs Assessment of Scholarly Users of Digital Collections." College & Research Libraries, September 10, 2014, crl14-612.
[42] Green, Harriett E. 2013. "An analysis of the use and preservation of MONK text mining research software." Literary and Linguistic Computing 29, no.1: 23-40. http://doi.org/10.1093/llc/fqt014
[43] Ixchel M. Faniel and Ann Zimmerman, "Beyond the Data Deluge: A Research Agenda for Large-Scale Data Sharing and Reuse," International Journal of Digital Curation 6, no. 1 (August 3, 2011): 58-69. http://doi.org/10.2218/ijdc.v6i1.172

About the Author


Thomas Padilla is Digital Scholarship Librarian at Michigan State University Libraries. He publishes, presents, and teaches widely on Humanities data, data curation, and data information literacy. Recent national and international presentation and teaching venues include but are not limited to: the annual meeting of the American Historical Association, the Humanities Intensive Learning and Teaching Institute, Digital Humanities, the Digital Library Federation, and Advancing Research Communication and Scholarship. Thomas serves as an Editor for DHCommons Journal and dh+lib Data Praxis. Thomas also currently serves as co-convener of the Association of College and Research Libraries Digital Humanities Interest Group.
