All that Big Data Is Not Going to Manage Itself: Part One
On February 26, 2003, the National Institutes of Health released the “Final NIH Statement on Sharing Research Data.” As you’ll be reminded when you visit that link, 2003 was eons ago in “Internet time.” Yet the vision NIH had for the expanded sharing of research data couldn’t have been more prescient.
As the Open Government Data site notes, government data is a tremendous resource that can have a positive impact on democracy, civic participation, new products and services, innovation and governmental efficiency.
Since 2003, we’ve seen the National Science Foundation release its requirements for Data Management Plans (DMPs) and the White House address records management, open government data and “big data.” There are now data management and sharing requirements from NASA, the Department of Energy, the National Endowment for the Humanities and many others.
In this two-part series on government data management, we’ll take a look back at some of the guidance that is driving data management practices across the federal government. In the second part, we’ll look at the tools and services that have developed to meet the needs of this expanding data management infrastructure.
It’s 2014, and we’re still struggling to ensure that the outputs of government-funded research are secured and made accessible as building blocks for new knowledge. It’s not for lack of trying: federal agencies such as the NIH and the NSF have recognized the need to preserve data and keep it accessible through requirements tied to their grant funding.
The 2003 NIH Statement referenced above noted that “Starting with the October 1, 2003 receipt date, investigators submitting an NIH application seeking $500,000 or more in direct costs in any single year are expected to include a plan for data sharing or state why data sharing is not possible.”
They followed that up with Implementation Guidance a little later that year. The guidance was very high-level, but did suggest some possible data sharing methods, such as:
- Under the auspices of the PI [Principal Investigator]
- Data archive
- Data enclave
- Mixed mode sharing
They didn’t go into great detail on what constitutes a “data archive” or “enclave” (a “data enclave” looks like a “data archive” with restricted access), but they did include this helpful bit of information on what “under the auspices of the PI” might entail:
“Investigators sharing under their own auspices may simply mail a CD with the data to the requestor, or post the data on their institutional or personal Web site.”
We’re now pretty well aware of the challenges of over-relying on PIs to archive data and keep it accessible, concerns that are perfectly summed up in the “Data Sharing and Management Snafu in 3 Short Acts” video.
The NIH has continued to update its policies, now gathered together on the NIH Sharing Policies and Related Guidance on NIH-Funded Research Resources page. It’s important to note that the NIH has different requirements for “data” and for “publications.” Under section 8.2.1 “Rights in Data (Publication and Copyrighting)” in the 10/2013 version of the NIH Grants Policy Statement, “in all cases, NIH must be given a royalty-free, nonexclusive, and irrevocable license for the Federal government to reproduce, publish, or otherwise use the material and to authorize others to do so for Federal purposes.”
A little further on, in section 8.2.2 “NIH Public Access Policy,” they establish the requirements for access to the published results of NIH funded research. Under this part of their guidelines, NIH-funded investigators are required to submit an electronic version of any final, peer-reviewed manuscript to the PubMed Central archive, to be made available no later than 12 months after the official date of publication.
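That 12-month window is easy to misreckon when planning a submission schedule. As a purely illustrative sketch (this helper is our own, not an NIH tool; the policy text, not this code, governs compliance), the latest allowable public-release date can be computed from the official publication date:

```python
from datetime import date

def pmc_release_deadline(publication_date: date) -> date:
    """Latest date a final, peer-reviewed manuscript may become publicly
    available in PubMed Central: 12 months after the official date of
    publication. (Illustrative only.)"""
    try:
        return publication_date.replace(year=publication_date.year + 1)
    except ValueError:
        # A Feb 29 publication date has no anniversary in a non-leap
        # year; fall back to Feb 28.
        return publication_date.replace(year=publication_date.year + 1, day=28)

# A paper published March 15, 2013 must be public by March 15, 2014.
print(pmc_release_deadline(date(2013, 3, 15)))  # 2014-03-15
```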
On January 18, 2011, the National Science Foundation began requiring that Data Management Plans be submitted in conjunction with funding proposals. The most recent NSF “Proposal and Award Policies and Procedures Guide,” effective 2/24/14, describes, at a high level, the categories of information that might be included in the required NSF data management plans:
- The types of data, samples, physical collections, software, curriculum materials and other materials to be produced in the course of the project.
- The standards to be used for data and metadata format and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies).
- Policies for access and sharing including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements.
- Policies and provisions for re-use, re-distribution and the production of derivatives; and plans for archiving data, samples and other research products, and for preservation of access to them.
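The four categories above lend themselves to a simple completeness check. As a minimal sketch (the section names are our paraphrase of the NSF guide, not official NSF field names), a proposal team might flag the parts of a draft DMP that haven’t been addressed yet:

```python
# The four information categories from the NSF guide, paraphrased.
REQUIRED_SECTIONS = {
    "data_types",          # data, samples, software, curriculum materials
    "standards",           # data and metadata format and content
    "access_and_sharing",  # privacy, confidentiality, security, IP
    "reuse_and_archiving", # re-use, re-distribution, preservation
}

def missing_sections(dmp: dict) -> set:
    """Return the required sections a draft DMP has not yet filled in."""
    return {s for s in REQUIRED_SECTIONS if not dmp.get(s)}

draft = {
    "data_types": "Survey responses (CSV) and analysis scripts.",
    "standards": "DDI metadata; UTF-8 CSV files.",
}
print(sorted(missing_sections(draft)))
# ['access_and_sharing', 'reuse_and_archiving']
```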
The NSF has a “Data Management and Sharing Frequently Asked Questions” list, but it’s less prescriptive than it might be. For example, for the question “Am I required to deposit my data in a public database?” the NSF provides this response:
“What constitutes reasonable data management and access will be determined by the community of interest through the process of peer review and program management. In many cases, these standards already exist, but are likely to evolve as new technologies and resources become available.”
The development of data management infrastructures has accelerated over the past five years, catalyzed by wide-ranging guidance from the White House, starting with the December 2009 Open Government Directive, which directs executive departments and agencies to take specific actions to implement the principles of transparency, participation and collaboration in dealing with the data they create.
This was followed by the November 2011 Presidential Memorandum on “Managing Government Records” and a series of government directives on “big data,” including the “Big Data: Seizing Opportunities, Preserving Values” report released this month (though this report includes these problematic sentences: “Volumes of data that were once un-thinkably expensive to preserve are now easy and affordable to store on a chip the size of a grain of rice. As a consequence, data, once created, is in many cases effectively permanent.”)
These requirements are, slowly but surely, leading to a suite of tools and services designed to help researchers prepare plans, while also leading to the creation of repositories for the long-term storage of the resulting research output (or both, as in the case of Penn State University Library’s “Data Repository Tools and Services”).
In part two of this series, we’ll look at some of the data management support tools, but feel free to point them out now in the comments.