Introduction

There are two main ways to use this template:

  • As a project starter. Clone the repository with git clone and change the remote to your own repository. This gives you a clean, fresh project.
  • As a reference. In that case, use the project template and its parts as a reference for your own work.

In both cases, you do not have to use all of the provided packages and patterns; you can improve reproducibility gradually. This section walks you through some of the core aspects of using this work.

To become reproducible and traceable, you should ideally version control documentation, code, data, and the execution environment.

knitr::opts_knit$set(root.dir = rprojroot::find_rstudio_root_file()) # knit relative to the project root

TL;DR

How to use this package:

  1. start by cloning/forking this git repository
  2. use git+github, git lfs, tidyverse and RMarkdown
  3. add/modify drake plans for your pipeline(s)
  4. Adapt the documentation in Vignettes
  5. (optional) start the RStudio IDE within Docker + use renv

Prerequisites

  • Your Project should use R - although it may be possible to adapt it to Python code as well, the template is native to R.

  • Your Project should use Git.

  • Your Project MUST have a README file

Getting Started

To get your reproducible Project started, use the following steps to get the template running:

  • Create an RStudio project in the structure of a package - the template already provides this structure.

  • Define a license for your Project. By default the template uses GPL-3 - usethis::use_gpl3_license() is the command that sets the license file.

  • Utilize an environment tool (renv). The project template comes with an renv.lock file and its accompanying files - see the renv sketch after this list.

  • Lint your code to follow good style guides. The project template recommends running lintr:::addin_lint_package() (or the public lintr::lint_package()) - if your project is in package structure, this lints all project files.

  • Decide where to host your data. The project template uses Git for version control, which may be extended with Git LFS. Use the data or data-raw folders.

  • Use version control for everything - including code, data, the environment, and even the packaged dependencies.

  • Publish your work in a repository. The template is hosted at https://github.com/g4challenge/repro-fair-neuro-ds-template - when using GitHub, change the path and name to your own template.

  • Use continuous integration. The template uses the .github/workflows folder, incorporating the pkgdown build of the documentation, codecov, and R-CMD-check of the execution in order to trace the results.

  • Use code tests. The template suggests using testthat and documentation-based tests like roxytest.

  • Use software engineering practices like docstrings. The template utilizes the vignettes folder, roxygen2, and roxytest for documentation and good practices - see the roxytest sketch after this list.

  • Use workflow tools like drake or targets. The vignettes/drake_spec.Rmd file is one example of how to use the drake package; a minimal plan is sketched after this list. A workflow captures the inputs and outputs of executions in order to trace results. The template even version controls the executions within the .drake folder; this must be curated for good release practices.

  • Utilize PROV tools (rdtLite), which create a PROV-based representation of the workflow executions.

  • Package your results - this work packages everything on Zenodo.org according to the RO-Crate specification.
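
For the renv step above, a minimal sketch of the day-to-day commands (all from the public renv API):

renv::restore()  # install the package versions pinned in renv.lock
renv::snapshot() # record newly added or updated dependencies in renv.lock
renv::status()   # check whether library and lockfile are in sync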
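
For the roxytest step, a hedged sketch of a documentation-based test; add_numbers is a hypothetical helper, not part of the template:

#' Add two numbers
#'
#' @param x A numeric value.
#' @param y A numeric value.
#' @return The sum of x and y.
#' @tests
#' expect_equal(add_numbers(1, 2), 3)
#' expect_error(add_numbers("a", 2))
#' @export
add_numbers <- function(x, y) {
  stopifnot(is.numeric(x), is.numeric(y))
  x + y
}

With the roxytest roclet registered in the DESCRIPTION Roxygen field, roxygen2::roxygenise() turns each @tests tag into a tests/testthat/test-roxytest-tests-*.R file, which is what the test output further below refers to.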
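
For the workflow step, a minimal drake plan sketch; the target and file names here are illustrative, not the template's actual plan:

library(drake)

plan <- drake_plan(
  raw_data   = readr::read_csv(file_in("data/raw_data.csv")),
  clean_data = dplyr::filter(raw_data, !is.na(value)),
  report = rmarkdown::render(
    knitr_in("vignettes/report.Rmd"),
    output_file = file_out("report.html"),
    quiet = TRUE
  )
)

make(plan)            # run only the outdated targets
vis_drake_graph(plan) # inspect the dependency graph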

Then you need to consider a few questions on how you want to start building your package:

  • Hypotheses you need to investigate
  • Long-term traceability
  • Base data quality
  • ETL-Process
  • Analytic processes

Scenario 1: walk-through

Scenario 1 is a simple scenario involving several steps in a linear manner.

The file is vignettes/scenario_1.Rmd; it uses R/packages.R for package loading, R/functions.R for general functions, and R/functions_scenario_1.R for scenario-specific functions, while the plan is defined in R/plan.R.

Open the file scenario_1.Rmd, where the statement make(scenario_1) performs the execution and the drake workflow is visualized as a graph.

This graph shows the sequence from the input question, to reading the dataset, to producing a report.
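
To reproduce this interactively, a hedged sketch (assuming the scenario_1 plan object is defined in R/plan.R as described above):

source("R/packages.R")             # load the required packages
source("R/functions.R")            # general helper functions
source("R/functions_scenario_1.R") # scenario-specific functions
source("R/plan.R")                 # defines the scenario_1 plan

drake::make(scenario_1)            # execute the outdated targets
drake::vis_drake_graph(scenario_1) # render the workflow graph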

Test-Results

testthat is able to test the local package of your analysis using test_local(); the following code chunk tests all targets and executions of this package’s workflows.

testthat::test_local(".")
## Loading required package: tidyverse
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.3.3     ✔ purrr   0.3.4
## ✔ tibble  3.1.1     ✔ dplyr   1.0.6
## ✔ tidyr   1.1.3     ✔ stringr 1.4.0
## ✔ readr   1.4.0     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()  masks stats::filter()
## ✖ purrr::is_null() masks testthat::is_null()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ dplyr::matches() masks tidyr::matches(), testthat::matches()
## Loading required package: rmarkdown
## Loading required package: knitr
## Loading required package: magrittr
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
## The following objects are masked from 'package:testthat':
## 
##     equals, is_less_than, not
## ✔ |  OK F W S | Context
## 
## ✔ |   4       | File R/functions_scenario_1.R: @tests [0.3 s]
## ✔ |   1       | File R/functions_scenario_2.R: @tests
## ✔ |   2       | File R/functions.R: @tests
## ✔ |   1       | File R/plan.R: @tests
## 
## ══ Results ═════════════════════════════════════════════════════════════════════
## Duration: 0.3 s
## 
## [ FAIL 0 | WARN 0 | SKIP 0 | PASS 8 ]

Provenance Curation

If you want to curate your results as a release, it may be useful to curate the traces as well. The following commands help to achieve this:

drake::clean()               # invalidate targets in the cache to force a rebuild
drake::clean(destroy = TRUE) # remove the cache entirely; re-running make() then curates "one" clean run

The second command removes the .drake folder, which allows version control of only the curated final release run.

Complete Provenance

To achieve a complete provenance record within the template, rdtLite is used in conjunction with a make file: make.R is comparable to scenario_1.Rmd, except that it can be captured by rdtLite as a workflow run.

getwd() # validate that the working directory is the project root
rdtLite::prov.run(r.script.path = "make.R", prov.dir = ".prov/")

This freshly creates a .prov subdirectory, which holds provenance execution traces according to the rdtLite specification.

Development Steps

You may want to continue development on this project and extend it further to your needs. Developing the project template is comparable to using it.

The steps during development are described below, starting with Docker + RStudio right away.

  1. clone this template:
git clone https://github.com/g4challenge/repro-fair-neuro-ds-template
  2. within the git repository, start the Docker container (using the my_fair_project image built locally, see below):
docker run --rm -p 8787:8787 -e PASSWORD="1234" -v $(pwd):/home/rstudio my_fair_project
  3. open your browser at http://localhost:8787

  4. log in with user=rstudio and password=1234

  5. click open on the .rproj file

  6. start using the project, adapt it to your needs, and change the git remote

  • Local Docker build

    docker build . -t my_fair_project

  • Local Docker run with the default user rstudio and PASSWORD="1234" - use a different password:

docker run --rm -p 8787:8787 -e PASSWORD="1234" -v $(pwd):/home/rstudio g4challenge/repro-fair-neuro-ds-template

Release Curation

  • Version the DESCRIPTION file
  • Write the NEWS.md file
  • Build the package (see the sketch after this list)
  • Create a signed, annotated Git tag: git tag -s v1.5 -m 'my signed 1.5 tag'
  • Push the release, including the tag: git push origin main --tags
  • Validate the documentation (the pkgdown action must complete)
  • Tag the DockerHub image with docker tag
  • Validate the Zenodo publication
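
Several of these steps can be run from R; the following is a hedged sketch using usethis and devtools (these helpers are suggestions, not mandated by the template):

usethis::use_version("minor") # bump Version: in DESCRIPTION
devtools::document()          # refresh man/ pages and NAMESPACE
devtools::check()             # run R CMD check before tagging
devtools::build()             # build the source package tarball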

Known Limitations

  • Currently, not all versions are working
  • Binder: Rsession-proxy has challenges with R 4.0.3 – unresolved
  • Renv: in continuous integration, Python environments are not restored
  • Provenance Integration – rdtLite not fully working
  • NeuroShapes: the INCF neuroshapes common data model for a defined data format is not yet integrated

FAQ

Q: “Where should I start my documentation?”

A: Start with the README.md, covering the base context and efforts involved in your project. Then, if necessary, adapt and change the vignettes to your project’s implementation.

Q: How do the various documentations relate to each other?

A: The README serves as the starting and entry point, while the main documentation is rendered by pkgdown and hosted as a GitHub page. There, all function documentation and vignettes are rendered and described.

Q: What adaptations do I have to make?

A: Ideally, you adapt/fork the template by clicking “use this template” on GitHub. This creates a new traceable analysis project. You need to maintain the links to GitHub Pages, Zenodo, and DockerHub yourself.

Q: How can I prioritize these adaptations?

A: Start with the most pressing needs first - getting the template working externally. Priority one is fixing the traces and validating them.

Q: What is the minimal set of adaptations that I need to make?

A: The minimum is the README, the links, and cleaning the vignettes folder down to one drake_spec, one functions.R, one packages.R, one plan.R, and finally one Docker build. Then your analysis is version controlled and mostly traceable with comparatively low effort.

Q: Can I also apply this to other contexts within medicine, or even outside of a medical context?

A: Yes, the main target is utilizing good clinical practice, STROBE and DAQCORD, but it may suit other needs as well.