Information
This open source, reproducible vignette is for a four-day short course on use of
the Targeted Learning framework for statistical and causal inference with
machine learning in modern “Big Data” applications. Beyond introducing Targeted
Learning, this course focuses on applying the methodology in practice
using the tlverse
software ecosystem. These
materials are based on a working draft of the book Targeted Learning in R
:
Causal Data Science with the tlverse
Software
Ecosystem, which includes in-depth
discussion of these topics and much more, and will serve as a useful reference
to accompany these short course materials.
Preliminaries
All students in this short course will work with the software on their machines and learn how to use these tools in practice. With that in mind, we request all students complete steps 1 and 2 the week/weekend before the short course.
- Join the Slack page dedicated to this short course. Here is the invitation link to join: https://join.slack.com/t/paspe2021tlverse/shared_invite/zt-u4893l2d-n3jzgMvbBcllhpRrhqYNCg.
- Download the necessary software (instructions below). If you are experiencing installation issues, please reach out on the Slack “#install” channel at least a few days before the short course.
Software Download Information:
- Some of the
R
packages we are working with at the short course (e.g., glmnet) requireR
version 4.0 or greater. Links to install/updateR
can be found at https://cloud.r-project.org. We also provide instructions for installingR
and RStudio below in Setup instructions. - After updating
R
to version 4.0 or greater, this script should safely cover all package installations: https://github.com/tlverse/tmlcimx2021-workshop/blob/master/install.R. - Installation errors will likely occur. The first most common error is the
R
version issue mentioned above, and the second most common error relates to exceeding the API rate limit for GitHub packages. The error message and a fix for it can be found here: https://tlverse.org/tlverse-handbook/tlverse.html#installtlverse.
We can help with these installation issues and others on Slack’s #install channel.
About this short course
This short course will provide a comprehensive introduction to the field of Targeted Learning for causal inference, and the corresponding “tlverse” software ecosystem. Targeted Learning is a general statistical approach which unifies advances in causal inference, statistical theory, and machine learning into a cohesive framework to guide pre-specified statistical analysis plans. Emphasis will be placed on efficient targeted estimators of causal effects of a variety of interventions (personalized, treatment versus control, shifting exposures).
In addition to discussion, this short course will incorporate interactive
activities and programming exercises in the R
language with public health
data. After this course, students will be able to use these free tools to
answer scientifically meaningful questions in public health, medicine,
economics, and more.
After completing this short course, students will be able to:
- Follow the roadmap of targeted learning to translate meaningful research questions into realistic statistical estimation problems, and obtain valid inference in terms of confidence intervals and p-values.
- Reliably assess the performance of a machine learning algorithm, or compare
the performance of several algorithms by applying cross-validation schemes
using the
origami
R
package. - Ensemble a combination of algorithms into an improved super learner
algorithm using the
sl3
R
package, and explain why this combination is better than selecting a single algorithm to colleagues. - Estimate the (potentially causal) effect of an intervention on a population
using the
tmle3
R
package.
Schedule
16 August 2021 — The Roadmap
- Before class: Please read the article “Why We Need A Statistical Revolution”.
-
During class: Lecture by Berkeley Professors Mark van der Laan and Alan
Hubbard (slide deck
here),
The Roadmap for Targeted
Learning, and
Introduction to the
tlverse
. - After class: Please complete the Roadmap Chapter Exercises in a text editor (such as Microsoft Word) and by 22 August, send your solutions file to Slack by attaching it to a direct message (DM) with an instructor (Jeremy Coyle, Nima Hejazi, Ivana Malenica, or Rachael Phillips). For guidance on attaching files to Slack DMs, see the “Add files to Slack” help page.
17 August 2021 — Cross-validation
- Before class: Please read “Chapter 5: Cross-validation” of the Targeted Learning Handbook.
-
During class: Cross-validation with
origami
. - After class: Please complete the Cross-validation Exercises and by 22 August, send your solutions file to Slack by attaching it to a direct message (DM) with an instructor (Jeremy Coyle, Nima Hejazi, Ivana Malenica, or Rachael Phillips). For guidance on attaching files to Slack DMs, see the “Add files to Slack” help page.
18 August 2021 — Super Learning
- Before class: Please read “Chapter 6: Super (Machine) Learning” of the Targeted Learning Handbook.
-
During class: Super (Machine) Learning with
sl3
and - After class: Please complete the Super Learning Exercises and by 22 August, send your solutions file to Slack by attaching it to a direct message (DM) with an instructor (Jeremy Coyle, Nima Hejazi, Ivana Malenica, or Rachael Phillips). For guidance on attaching files to Slack DMs, see the “Add files to Slack” help page.
19 August 2021 — Targeted Maximum Likelihood Estimation (TMLE)
- Before class: Please read “Chapter 7: The TMLE Framework” of the Targeted Learning Handbook.
- During class: The TMLE Framework
- After class: Please complete the TMLE Exercises and by 22 August, send your solutions file to Slack by attaching it to a direct message (DM) with an instructor (Jeremy Coyle, Nima Hejazi, Ivana Malenica, or Rachael Phillips). For guidance on attaching files to Slack DMs, see the “Add files to Slack” help page.
Course resources
Slack: Slack will be used as a discussion forum, and students will submit exercise solutions to Slack by sending their solutions file as an attachment in a direct message (DM) with an instructor (Jeremy Coyle, Nima Hejazi, Ivana Malenica, or Rachael Phillips). For guidance on attaching files to Slack DMs, see the “Add files to Slack” help page. The link to join Slack is https://join.slack.com/t/paspe2021tlverse/shared_invite/zt-u4893l2d-n3jzgMvbBcllhpRrhqYNCg.
Code:
R
scripts for each section of the short course are available via the GitHub repository for the short course at https://github.com/tlverse/tmlcimx2021-workshop/tree/master/R_code.Course website: This is the main course website where we will be working during class, and where all course materials, schedule, etc. can be accessed.
Targeted Learning Handbook: These materials are based on a working draft of the book Targeted Learning in
R
: Causal Data Science with thetlverse
Software Ecosystem, which includes in-depth discussion of these topics and much more, and will serve as a useful reference to accompany these short course materials.
About the instructors
Mark van der Laan
Mark van der Laan, PhD, is Professor of Biostatistics and Statistics at UC
Berkeley. His research interests include statistical methods in computational
biology, survival analysis, censored data, adaptive designs, targeted maximum
likelihood estimation, causal inference, data-adaptive loss-based learning, and
multiple testing. His research group developed loss-based super learning in
semiparametric models, based on cross-validation, as a generic optimal tool for
the estimation of infinite-dimensional parameters, such as nonparametric density
estimation and prediction with both censored and uncensored data. Building on
this work, his research group developed targeted maximum likelihood estimation
for a target parameter of the data-generating distribution in arbitrary
semiparametric and nonparametric models, as a generic optimal methodology for
statistical and causal inference. Most recently, Mark’s group has focused in
part on the development of a centralized, principled set of software tools for
targeted learning, the tlverse
.
Alan Hubbard
Alan Hubbard is Professor of Biostatistics, former head of the Division of Biostatistics at UC Berkeley, and head of data analytics core at UC Berkeley’s SuperFund research program. His current research interests include causal inference, variable importance analysis, statistical machine learning, estimation of and inference for data-adaptive statistical target parameters, and targeted minimum loss-based estimation. Research in his group is generally motivated by applications to problems in computational biology, epidemiology, and precision medicine.
Jeremy Coyle
Jeremy Coyle, PhD, is a consulting data scientist and statistical programmer,
currently leading the software development effort that has produced the
tlverse
ecosystem of R packages and related software tools. Jeremy earned his
PhD in Biostatistics from UC Berkeley in 2016, primarily under the supervision
of Alan Hubbard.
Nima Hejazi
Nima Hejazi is a PhD candidate in biostatistics, working under the collaborative direction of Mark van der Laan and Alan Hubbard. Nima is affiliated with UC Berkeley’s Center for Computational Biology and NIH Biomedical Big Data training program, as well as with the Fred Hutchinson Cancer Research Center. Previously, he earned an MA in Biostatistics and a BA (with majors in Molecular and Cell Biology, Psychology, and Public Health), both at UC Berkeley. His research interests fall at the intersection of causal inference and machine learning, drawing on ideas from non/semi-parametric estimation in large, flexible statistical models to develop efficient and robust statistical procedures for evaluating complex target estimands in observational and randomized studies. Particular areas of current emphasis include mediation/path analysis, outcome-dependent sampling designs, targeted loss-based estimation, and vaccine efficacy trials. Nima is also passionate about statistical computing and open source software development for applied statistics.
Ivana Malenica
Ivana Malenica is a PhD student in biostatistics advised by Mark van der Laan. Ivana is currently a fellow at the Berkeley Institute for Data Science, after serving as a NIH Biomedical Big Data and Freeport-McMoRan Genomic Engine fellow. She earned her Master’s in Biostatistics and Bachelor’s in Mathematics, and spent some time at the Translational Genomics Research Institute. Very broadly, her research interests span non/semi-parametric theory, probability theory, machine learning, causal inference and high-dimensional statistics. Most of her current work involves complex dependent settings (dependence through time and network) and adaptive sequential designs.
Rachael Phillips
Rachael Phillips is a PhD student in biostatistics, advised by Alan Hubbard and Mark van der Laan. She has an MA in Biostatistics, BS in Biology, and BA in Mathematics. A student of targeted learning and causal inference, Rachael’s research integrates semiparametric statistical estimation and inference. She is motivated by applied projects and some of her current work involves personalized online learning from data streams of vital signs, human-computer interaction, automated machine learning, and developing statistical analysis plans using targeted learning.
Setup instructions
R and RStudio
R and RStudio are separate downloads and installations. R is the underlying statistical computing environment. RStudio is a graphical integrated development environment (IDE) that makes using R much easier and more interactive. You need to install R before you install RStudio.
0.0.0.1 Windows
0.0.0.1.1 If you already have R and RStudio installed
- Open RStudio, and click on “Help” > “Check for updates”. If a new version is available, quit RStudio, and download the latest version for RStudio.
- To check which version of R you are using, start RStudio and the first thing
that appears in the console indicates the version of R you are
running. Alternatively, you can type
sessionInfo()
, which will also display which version of R you are running. Go on the CRAN website and check whether a more recent version is available. If so, please download and install it. You can check here for more information on how to remove old versions from your system if you wish to do so.
0.0.0.1.2 If you don’t have R and RStudio installed
- Download R from the CRAN website.
- Run the
.exe
file that was just downloaded - Go to the RStudio download page
- Under Installers select RStudio x.yy.zzz - Windows XP/Vista/7/8 (where x, y, and z represent version numbers)
- Double click the file to install it
- Once it’s installed, open RStudio to make sure it works and you don’t get any error messages.
0.0.0.2 macOS / Mac OS X
0.0.0.2.1 If you already have R and RStudio installed
- Open RStudio, and click on “Help” > “Check for updates”. If a new version is available, quit RStudio, and download the latest version for RStudio.
- To check the version of R you are using, start RStudio and the first thing
that appears on the terminal indicates the version of R you are running.
Alternatively, you can type
sessionInfo()
, which will also display which version of R you are running. Go on the CRAN website and check whether a more recent version is available. If so, please download and install it.
0.0.0.2.2 If you don’t have R and RStudio installed
- Download R from the CRAN website.
- Select the
.pkg
file for the latest R version - Double click on the downloaded file to install R
- It is also a good idea to install XQuartz (needed by some packages)
- Go to the RStudio download page
- Under Installers select RStudio x.yy.zzz - Mac OS X 10.6+ (64-bit) (where x, y, and z represent version numbers)
- Double click the file to install RStudio
- Once it’s installed, open RStudio to make sure it works and you don’t get any error messages.
0.0.0.3 Linux
- Follow the instructions for your distribution
from CRAN, they provide information
to get the most recent version of R for common distributions. For most
distributions, you could use your package manager (e.g., for Debian/Ubuntu run
sudo apt-get install r-base
, and for Fedorasudo yum install R
), but we don’t recommend this approach as the versions provided by this are usually out of date. In any case, make sure you have at least R 3.3.1. - Go to the RStudio download page
- Under Installers select the version that matches your distribution, and
install it with your preferred method (e.g., with Debian/Ubuntu
sudo dpkg -i rstudio-x.yy.zzz-amd64.deb
at the terminal). - Once it’s installed, open RStudio to make sure it works and you don’t get any error messages.
These setup instructions are adapted from those written for Data Carpentry: R for Data Analysis and Visualization of Ecological Data.
Reproduciblity with the tlverse
The tlverse
software ecosystem is a growing collection of packages, several of
which are quite early on in the software lifecycle. The team does its best to
maintain backwards compatibility. Once this work reaches completion, the
specific versions of the tlverse
packages used will be archived and tagged to
produce it.
This book was written using bookdown, and the complete source is available on GitHub. This version of the book was built with R version 4.1.1 (2021-08-10), pandoc version 2.7.3, and the following packages:
package | version | source |
---|---|---|
bookdown | 0.23.1 | Github (rstudio/bookdown@6643bb9) |
bslib | 0.2.5.9002 | Github (rstudio/bslib@78960cb) |
daggity | NA | NA |
data.table | 1.14.0 | CRAN (R 4.1.0) |
delayed | 0.3.0 | CRAN (R 4.1.0) |
devtools | 2.4.2 | CRAN (R 4.1.0) |
downlit | 0.2.1 | CRAN (R 4.1.0) |
dplyr | 1.0.7 | CRAN (R 4.1.0) |
forecast | 8.15 | CRAN (R 4.1.0) |
ggdag | 0.2.3 | CRAN (R 4.1.1) |
ggfortify | 0.4.12 | CRAN (R 4.1.0) |
ggplot2 | 3.3.5 | CRAN (R 4.1.0) |
here | 1.0.1 | CRAN (R 4.1.0) |
kableExtra | 1.3.4 | CRAN (R 4.1.0) |
knitr | 1.33 | CRAN (R 4.1.0) |
mvtnorm | 1.1-2 | CRAN (R 4.1.0) |
origami | 1.0.3 | CRAN (R 4.1.0) |
readr | 2.0.1 | CRAN (R 4.1.1) |
rmarkdown | 2.10 | CRAN (R 4.1.1) |
skimr | 2.1.3 | CRAN (R 4.1.0) |
sl3 | 1.4.3 | Github (tlverse/sl3@aeeb7e9) |
stringr | 1.4.0 | CRAN (R 4.1.0) |
tibble | 3.1.3 | CRAN (R 4.1.0) |
tidyr | 1.1.3 | CRAN (R 4.1.0) |
tidyverse | 1.3.1 | CRAN (R 4.1.0) |
tmle3 | 0.2.0 | Github (tlverse/tmle3@425e21c) |