Welcome
Targeted Learning in R: Causal Data Science with the tlverse Software Ecosystem is a fully reproducible, open source, electronic handbook for applying Targeted Learning methodology in practice using the software stack provided by the tlverse ecosystem. This work is currently in a draft phase and is publicly available to solicit input from the community. To view or contribute, visit the GitHub repository.
Outline
The contents of this handbook are meant to serve as a reference guide both for applied research and for the teaching of short courses illustrating successful applications of the Targeted Learning statistical paradigm. Each section introduces a set of distinct causal inference questions, often motivated by a case study, alongside statistical methodology and open source software for assessing the scientific (causal) claim of interest. The set of materials currently includes:
- Motivation: Why we need a statistical revolution
- The Roadmap and introductory case study: the WASH Benefits Bangladesh dataset
- Introduction to the tlverse software ecosystem
- Cross-validation with the origami package
- Ensemble machine learning with the sl3 package
- Targeted learning for causal inference with the tmle3 package
- Optimal treatment regimes and the tmle3mopttx package
- Stochastic treatment regimes and the tmle3shift package
- Causal mediation analysis with the tmle3mediate package
- Coda: Why we need a statistical revolution
What this book is not
This book does not focus on providing in-depth, technically sophisticated descriptions of modern statistical methodology or recent advancements in Targeted Learning. Instead, the goal is to convey key details of these state-of-the-art statistical techniques in a manner that is clear, complete, and intuitive, while avoiding the cognitive burden carried by extraneous details (e.g., mathematically niche theoretical arguments). Our aim is for the presentations herein to serve as a coherent reference for researchers, applied methodologists and domain specialists alike, one that empowers them to deploy the central statistical tools of Targeted Learning efficiently in their scientific pursuits.
For a mathematically sophisticated treatment of some of these topics, including in-depth technical details, the interested reader is invited to consult van der Laan and Rose (2011) and van der Laan and Rose (2018), among numerous other works, as appropriate. The primary literature in causal inference, machine learning, and non/semi-parametric statistical theory includes many of the most recent advances in Targeted Learning and related areas. For background in causal inference, Hernán and Robins (2022) serves as an introductory modern reference.
Reproducibility
The tlverse software ecosystem is a growing collection of packages, several of which are quite early in the software lifecycle. The team does its best to maintain backwards compatibility. Once this work reaches completion, the specific versions of the tlverse packages used to produce it will be archived and tagged.
This book was written using bookdown, and the complete source is available on GitHub. This version of the book was built with R version 4.3.1 (2023-06-16), pandoc version 2.19.2, and the following packages:
package | version | source |
---|---|---|
bookdown | 0.34.2 | Github (rstudio/bookdown@e3cae95282f497c55864057e9e8255e2aed75120) |
bslib | 0.3.1 | CRAN (R 4.3.1) |
dagitty | 0.3-1 | CRAN (R 4.3.1) |
data.table | 1.14.2 | CRAN (R 4.3.1) |
delayed | 0.3.0 | CRAN (R 4.3.1) |
downlit | 0.4.0 | CRAN (R 4.3.1) |
dplyr | 1.0.9 | CRAN (R 4.3.1) |
forecast | 8.16 | CRAN (R 4.3.1) |
future | 1.26.1 | CRAN (R 4.3.1) |
ggdag | 0.2.4 | CRAN (R 4.3.1) |
ggfortify | 0.4.14 | CRAN (R 4.3.1) |
ggplot2 | 3.3.6 | CRAN (R 4.3.1) |
kableExtra | 1.3.4.9000 | Github (kupietz/kableExtra@3bf9b21a769c9e6c21c955689bf5f8175dc83350) |
knitr | 1.42 | CRAN (R 4.3.1) |
mvtnorm | 1.1-3 | CRAN (R 4.3.1) |
origami | 1.0.5 | Github (tlverse/origami@e1b8fe6f5e75fff1d48eed115bb81475c9bd506e) |
randomForest | 4.7-1.1 | CRAN (R 4.3.1) |
readr | 2.1.2 | CRAN (R 4.3.1) |
rmarkdown | 2.14 | CRAN (R 4.3.1) |
skimr | 2.1.4 | CRAN (R 4.3.1) |
sl3 | 1.4.5 | Github (tlverse/sl3@de445c210eefa5aa9dd4c0d1fab8126f0d7c5eeb) |
stringr | 1.4.0 | CRAN (R 4.3.1) |
tibble | 3.1.7 | CRAN (R 4.3.1) |
tidyr | 1.2.0 | CRAN (R 4.3.1) |
tmle3 | 0.2.0 | Github (tlverse/tmle3@ed72f8a20e64c914ab25ffe015d865f7a9963d27) |
tmle3mediate | 0.0.3 | Github (tlverse/tmle3mediate@70d1151c4adb54d044f355d06d07bcaeb7f8ae07) |
tmle3mopttx | 1.0.0 | Github (tlverse/tmle3mopttx@c8c675f051bc5ee6d51fa535fe6dc80791d4d1b7) |
tmle3shift | 0.2.0 | Github (tlverse/tmle3shift@4ed52b50af501a5fa2e6257b568d17fd485d3f42) |
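To reproduce a build against the same tlverse package versions, one option is to install each package at the commit recorded in the table above. The following is a minimal sketch, not the book's official setup procedure: it assumes the remotes package is used for installation, and the names tlverse_pins, pinned_repos, and install_pinned are hypothetical helpers introduced here for illustration. The commit hashes are copied from the table.

```r
# Commit hashes copied from the package table above (tlverse packages only).
tlverse_pins <- c(
  origami      = "e1b8fe6f5e75fff1d48eed115bb81475c9bd506e",
  sl3          = "de445c210eefa5aa9dd4c0d1fab8126f0d7c5eeb",
  tmle3        = "ed72f8a20e64c914ab25ffe015d865f7a9963d27",
  tmle3mediate = "70d1151c4adb54d044f355d06d07bcaeb7f8ae07",
  tmle3mopttx  = "c8c675f051bc5ee6d51fa535fe6dc80791d4d1b7",
  tmle3shift   = "4ed52b50af501a5fa2e6257b568d17fd485d3f42"
)

# Build "owner/repo@ref" strings, the format remotes::install_github() accepts.
pinned_repos <- sprintf("tlverse/%s@%s", names(tlverse_pins), tlverse_pins)

# Hypothetical helper: by default only reports what would be installed,
# because a real installation needs network access and changes the library.
install_pinned <- function(repos, dry_run = TRUE) {
  if (dry_run) {
    message("Would install:\n", paste(" -", repos, collapse = "\n"))
    return(invisible(repos))
  }
  if (!requireNamespace("remotes", quietly = TRUE)) {
    stop("The 'remotes' package is required: install.packages('remotes')")
  }
  # upgrade = "never" avoids silently updating already-installed dependencies.
  remotes::install_github(repos, upgrade = "never")
  invisible(repos)
}

install_pinned(pinned_repos)  # dry run: lists the pinned repositories
```

Calling install_pinned(pinned_repos, dry_run = FALSE) would perform the installation. Pinning to a commit hash rather than a branch name means the installed code does not change as the repositories evolve; the CRAN packages in the table are not covered by this sketch.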
Learning resources
The reader need not be a fully trained statistician to begin understanding and applying the methods in this handbook. However, we highly recommend an understanding of basic statistical concepts such as confounding, probability distributions, confidence intervals, hypothesis tests, and regression. Advanced knowledge of mathematical statistics may be useful but is not necessary. Familiarity with the R programming language is essential. We also recommend an understanding of introductory causal inference.
For learning the R programming language, we recommend the following (free) introductory resources:
- Software Carpentry’s Programming with R
- Software Carpentry’s R for Reproducible Scientific Analysis
- Garrett Grolemund and Hadley Wickham’s R for Data Science
For a general, modern introduction to causal inference, we recommend:
- Miguel A. Hernán and James M. Robins’ Causal Inference: What If (2022)
- Jason A. Roy’s A Crash Course in Causality: Inferring Causal Effects from Observational Data on Coursera
Feel free to suggest a resource!
Want to help?
Any feedback on the book is very welcome. Feel free to open an issue, or to make a Pull Request if you spot a typo.