One approach to ensuring that data analysis projects and research reports are reproducible

Date and Time: Tuesday, April 5th, 2016
Location: Center Green
Speaker: Janine Aquino

Authors: Mike Daniels, William Cooper, Janine Aquino, Teresa Campos, William Brown (All from NCAR/EOL)

New tools enable new ways of achieving reproducibility in data analysis, especially as scientific workflows become more integrated with papers and reports. Reproducibility is growing in visibility because skepticism from colleagues, funding agencies, politicians, and professional societies can damage professional reputations and is hard to counter without good records. Reproducibility also benefits the original authors when it later becomes desirable to extend the work and memories have grown stale.

A reproducible project should at least have these characteristics:

  1. Data sources should be available in public repositories, with assigned DOIs, and there should be sufficient provenance to trace their origins and identify those responsible for their collection and curation.

  2. Analysis software should similarly be preserved and, where possible, assigned identifiers that ensure its preservation and public access.

  3. Workflow and/or provenance descriptions should accompany the work. Preserving the software alone is not sufficient, because reproduction depends on a reproducer being able to understand the program and run it in a suitable environment. A workflow description can record the requisite conditions (software packages, version numbers of commercial packages, the computing environment) and can also describe parts of the investigation that were pursued and found unproductive, how choices were made, how figures were constructed, etc. These descriptions need to be archived with the analysis programs so that others can duplicate or extend the work.
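As a concrete illustration of recording the computing environment (a sketch, not necessarily the exact workflow used in the talk), R can capture this information automatically, so the record of package versions never drifts out of date:

```r
# Capture the R version, operating system, and loaded package versions,
# and archive the result next to the analysis outputs so a reproducer can
# reconstruct a matching environment. (Illustrative; the file name is arbitrary.)
writeLines(capture.output(sessionInfo()), "session-info.txt")
```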

In this talk, we present one specific model for achieving these aspects of reproducibility and illustrate it with examples of varying complexity. The tool used in these examples is R with knitr, running in RStudio. This integrates the text and the analysis code so that the generation of numbers, figures, tables, etc. occurs in one process, removing the need for error-prone, reproducibility-breaking procedures such as copying numbers, cutting and pasting figures, or using separate software to generate results. We argue that even simple data-analysis tasks, when documented in a memo or report, should follow this or a similar structure to achieve reproducibility in the most integrated and automated way. The examples include a simple data-analysis memo, a journal-article manuscript, and a 170-page technical note, all following this approach to being thoroughly reproducible by others using archived material.
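To make the integration concrete, a minimal knitr / R Markdown source might look like the following (a hypothetical sketch; the title, variable, and chunk names are invented, not taken from the talk). Rendering the file re-runs the analysis, so every number and figure in the output is regenerated rather than pasted in:

````
---
title: "Example data-analysis memo"
output: pdf_document
---

The mean of the measured values was `r mean(x)` units.

```{r fig1, echo=FALSE}
plot(x, type = "l")
```
````

When the document is rendered, knitr evaluates the inline `r mean(x)` expression and the code chunk in one pass, so updating the data and re-rendering updates the text, the number, and the figure together.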

Speaker Description:

Janine manages research data from the two NCAR/EOL research aircraft: HIAPER, a modified Gulfstream V jet, and a four-engine C-130 turboprop. Data are made available online as part of comprehensive project websites that support cutting-edge atmospheric research.

Slides: Aquino_SEA2016.pdf (1.76 MB)
