How a network of data curators can unlock the tremendous reuse value of research data

Lisa R. Johnston

Data reuse is a major focus for institutional research groups and their funders, and it’s easy to see why. After the (often) expensive process of collecting, analyzing, and mining research data for valuable new knowledge, any additional attention, such as the publication, reference, or reuse of that data, multiplies its value.

But understanding researchers’ behaviors and needs when it comes to data sharing and reuse is challenging. Each discipline has unique norms and practices for how they collect and manage data, when (and if) they share their data, and how they determine a dataset’s fitness for reuse. Data curators—as information science practitioners—make a wealth of decisions and take well-informed actions to ensure that selected data have meaningful and enduring value to future research.

It is nearly impossible, though, for any institution to hire curators skilled in all the format types and disciplinary backgrounds needed to properly ensure that data are complete and well-structured for reuse. That is why the Data Curation Network (DCN) team set out to create a cross-institutional network that would enable researchers to share their data in ethical and reusable ways, regardless of their final repository destination.

Many policies, many platforms: one seamless “human layer”

The DCN was born from a simple reality: many data repositories receive a wide range of datasets (from neuroscience to entomology data) in all conceivable file types and formats (such as Python files or 3D images).

The network expands the range of data curation experts that repositories rely on to work with data authors to clean, validate, document, and transform their data files for future users. Since our launch, more than 100 datasets have been matched to a curator across our ten partner institutions. Each year we hold in-person training events for data curators and present webinars to share our training and skill development with the broader librarian audience. We also teach no-charge “specialized data curation” workshops that have been attended by more than 150 librarians and curators in the last three years.

Networked curation is not always easy. Each of our repository platforms is unique. We use a variety of software (including DSpace, Samvera, Fedora, and Dataverse), and we have distinct policies and ingest workflows. This is why we modeled the DCN as a seamless “human layer” in the local repository stack. Each partner institution receives its own data and archives it locally. Partners decide what to accept and if/when to send data to the DCN for curation. They also provide the repository services of storage, DOI minting, and long-term preservation.

Adding value to the process

By running curation pilots (Johnston et al., 2017), surveying data repositories nationwide (Hudson Vitale et al., 2017), and hosting a series of researcher focus groups, we identified a set of 47 data curation activities that were happening at various repositories. These included actions such as reviewing code, contextualizing data with links to related publications, creating documentation files, and setting/enforcing embargoes (see full list at Johnston et al., 2016). In our focus groups, we asked researchers to rate each activity on a scale of 1-5 for importance.

We observed several activities that are of great importance to researchers but were simply not happening for the majority of them, such as having DOIs for data, auditing files for completeness, or contextualizing the data within a larger body of work. This was where the DCN could provide new value-add services. Additionally, there were several very important activities where researchers were not satisfied with the work that was being done, and here we saw an opportunity for the DCN to provide better tools and services to assist with documentation, quality assurance, data visualization, and file format transformation.
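To illustrate the kind of gap analysis described above, here is a minimal sketch in Python. It is not the analysis the DCN team actually ran: the ratings, the share-of-respondents figures, and both thresholds are hypothetical, and `find_gaps` is a name invented for this example. It simply flags activities that researchers rate as important on the 1-5 scale but that are happening for fewer than half of them.

```python
from statistics import mean

# Hypothetical focus-group data (not from the DCN study):
# activity -> (importance ratings on a 1-5 scale,
#              share of respondents who say the activity already happens for their data)
responses = {
    "DOI for data": ([5, 5, 4, 5], 0.25),
    "Audit files for completeness": ([4, 5, 4, 4], 0.40),
    "Create documentation files": ([5, 4, 4, 5], 0.75),
    "Set or enforce embargoes": ([2, 3, 2, 1], 0.50),
}


def find_gaps(responses, min_importance=4.0, max_share=0.5):
    """Return activities with high mean importance that happen for
    fewer than `max_share` of respondents. Both thresholds are assumptions."""
    gaps = []
    for activity, (ratings, share_happening) in responses.items():
        avg = mean(ratings)
        if avg >= min_importance and share_happening < max_share:
            gaps.append((activity, avg, share_happening))
    # List the most important unmet activities first.
    return sorted(gaps, key=lambda gap: gap[1], reverse=True)


for activity, avg, share in find_gaps(responses):
    print(f"{activity}: mean importance {avg:.2f}, happening for {share:.0%}")
```

With real survey data, the same comparison could be extended with a satisfaction rating to surface the second group of activities: those that are happening but not to researchers' satisfaction.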

CURATE Steps and Context for Reuse

Based on this early research and planning prior to launching the DCN, we developed our CURATE(D) steps (Data Curation Network, 2018). This conceptual workflow is the basis of our training (both for DCN curators and our workshops for the community). Each step has its own checklist outlining actions that any dataset might need, such as auditing the contents of a dataset and verifying the accuracy of all metadata provided. The steps represent a baseline set of actions to take for any given dataset, but the specific approach may vary.

For specialized formats, we use “data curation primers” to capture format-specific actions. These primers are community-authored protocols developed by attendees of our Data Curation Network workshops and published on GitHub so they may be updated and edited over time. So far our colleagues have authored primers on Microsoft Excel, Microsoft Access, netCDF, ATLAS.ti, confocal microscope images, geodatabases, Jupyter Notebook, SPSS, R, Tableau, and WordPress.com.

Faniel, Frank, and Yakel’s “Context from the Data Reuser’s Point of View” (2019) is an excellent case study to inform these curatorial decisions. Looking across three data-intensive domains (archeology, zoology, and qualitative social science), the project collected evidence of what data reusers look for when evaluating data produced by others. Based on that evidence, the paper assembles a framework of contextual information that data repository curators and developers can use to design better, more user-driven systems and curatorial processes.

I found it very useful to compare the findings from the Faniel et al. study with the steps taken by data curator practitioners in the Data Curation Network to think about how we could update our checklists and provide evidence of the value of our current activities.

Table 1: CURATE(D) Steps by the Data Curation Network with Additional Activities and Evidence based on Faniel et al. 2019

(1 FAIR Guiding Principles for scientific data management and stewardship)

Three conclusions … and many opportunities

It was gratifying to see how closely our work at the DCN tracked with the Faniel et al. study. For me, the comparison confirmed three major conclusions:

  1. Data curators can benefit from research in the information science community, which allows us to focus on the more essential facets of data curation, because
  2. Data curation is a pragmatic business. There is no such thing as a perfect dataset. Even embedded data curators working alongside data creators in labs or academic research settings are unable to capture all of the important contextual information necessary for reuse. However,
  3. Working together, we can continue to refine our standards and procedures.

We’ve had such tremendous success with our relatively small group of institutions and a few dozen curators. Imagine how much more we could accomplish with more voices and perspectives, more training, and more organization around this important work.

Librarians, archivists, data curators—information scientists of all kinds—are well-positioned to help leverage our institutions’ research data in valuable new ways. But since those reuse opportunities often occur across disciplines and institutions, we need to reach across those boundaries first.