The tlverse
Software Ecosystem for Causal Inference
2019 Atlantic Causal Inference Conference
updated: May 22, 2019
Preface
This is an open-source, fully reproducible electronic vignette for a full-day
short course on applying the targeted learning methodology in practice using
the tlverse software ecosystem, given at the 2019 Atlantic Causal Inference
Conference in Montréal, Québec, Canada on 22 May 2019. The Hitchhiker’s Guide
to the tlverse, or a Targeted Learning Practitioner’s Handbook accompanies this
vignette and covers the same topics in more detail.
Important links
- Software installation: Please install the relevant software before the workshop. You will probably exceed the GitHub API rate limit during this installation, and this will throw an error. This issue and the solution are addressed here.
- Workshop surveys: These pre- and post-workshop surveys help us ensure the effectiveness of our teaching methodology.
- Etherpad: We will use the Etherpad for discussion, Q&A, and sharing URLs and bits of code.
- Code: R script files for each section of the workshop are available via the GitHub repository for the short course.
- RStudio Cloud: We created an RStudio Cloud Workspace for this workshop to serve as an alternative to installing the software locally.
About this workshop
This full-day workshop will provide a comprehensive introduction to the field of
targeted learning for causal inference and the corresponding tlverse software
ecosystem. In particular, we will focus on targeted minimum loss-based
estimators of causal effects, including those of static, dynamic, optimal
dynamic, and stochastic interventions. These multiply robust, efficient plug-in
estimators use state-of-the-art ensemble machine learning tools to flexibly
adjust for confounding while yielding valid statistical inference. We will
discuss the utility of this robust estimation strategy in comparison to
conventional techniques, which often rely on restrictive statistical models and
may therefore lead to severely biased inference. In addition to discussion, this
workshop will incorporate both interactive activities and hands-on, guided R
programming exercises, giving participants the opportunity to familiarize
themselves with methodology and tools that translate to real-world causal
inference analyses. We highly recommend that participants have an understanding
of basic statistical concepts such as confounding, probability distributions,
confidence intervals, hypothesis tests, and regression. Advanced knowledge of
mathematical statistics may be useful but is not necessary. Familiarity with the
R programming language will be essential.
Outline
This is a full-day (6-hour) workshop, featuring modules that introduce distinct causal questions, each motivated by a case study, alongside statistical methodology and software for assessing the causal claim of interest. A sample schedule may take the form:
- 08:30AM–09:00AM: Address software installation issues
- 09:00AM–09:10AM: Introductions
- 09:10AM–09:30AM: Introduction to the tlverse software ecosystem
- 09:30AM–10:00AM: The Roadmap of Targeted Learning, and the WASH Benefits data
- 10:00AM–10:20AM: Morning coffee break
- 10:20AM–10:50AM: Why we need a statistical revolution
- 10:50AM–11:50AM: Ensemble machine learning with the sl3 R package
- 11:50AM–12:10PM: Targeted learning for causal inference with the tmle3 R package
- 12:10PM–01:00PM: Lunch break
- 01:00PM–01:30PM: Targeted learning for causal inference with the tmle3 R package
- 01:30PM–02:00PM: Optimal treatment regimes and the tmle3mopttx R package
- 02:00PM–02:20PM: Afternoon coffee break
- 02:20PM–02:50PM: Optimal treatment regimes and the tmle3mopttx R package
- 02:50PM–04:00PM: Stochastic treatment regimes and the tmle3shift R package
About the instructors
Mark van der Laan
Mark van der Laan, Ph.D., is Professor of Biostatistics and Statistics at UC
Berkeley. His research interests include statistical methods in computational
biology, survival analysis, censored data, adaptive designs, targeted maximum
likelihood estimation, causal inference, data-adaptive loss-based learning, and
multiple testing. His research group developed loss-based super learning in
semiparametric models, based on cross-validation, as a generic optimal tool for
the estimation of infinite-dimensional parameters, such as nonparametric density
estimation and prediction with both censored and uncensored data. Building on
this work, his research group developed targeted maximum likelihood estimation
for a target parameter of the data-generating distribution in arbitrary
semiparametric and nonparametric models, as a generic optimal methodology for
statistical and causal inference. Most recently, Mark’s group has focused in
part on the development of a centralized, principled set of software tools for
targeted learning, the tlverse. For more information, see
https://vanderlaan-lab.org.
Alan Hubbard
Alan Hubbard, Ph.D., is Professor of Biostatistics, former head of the Division of Biostatistics at UC Berkeley, and head of the data analytics core at UC Berkeley’s Superfund Research Program. His current research interests include causal inference, variable importance analysis, statistical machine learning, estimation of and inference for data-adaptive statistical target parameters, and targeted minimum loss-based estimation. Research in his group is generally motivated by applications to problems in computational biology, epidemiology, and precision medicine.
Jeremy Coyle
Jeremy Coyle, Ph.D., is a consulting data scientist and statistical programmer,
currently leading the software development effort that has produced the
tlverse ecosystem of R packages and related software tools. Jeremy earned his
Ph.D. in Biostatistics from UC Berkeley in 2016, primarily under the supervision
of Alan Hubbard.
Nima Hejazi
Nima is a Ph.D. candidate in biostatistics with a designated emphasis in computational and genomic biology, working jointly with Mark van der Laan and Alan Hubbard. Nima is affiliated with UC Berkeley’s Center for Computational Biology and the NIH Biomedical Big Data Training Program. His research interests span causal inference, nonparametric inference and machine learning, targeted loss-based estimation, survival analysis, statistical computing, reproducible research, and high-dimensional biology. He is also passionate about software development for applied statistics, including software design, automated testing, and reproducible coding practices. For more information, see https://nimahejazi.org.
Ivana Malenica
Ivana is a Ph.D. student in biostatistics advised by Mark van der Laan. Ivana is currently a fellow at the Berkeley Institute for Data Science, after serving as an NIH Biomedical Big Data and Freeport-McMoRan Genomic Engine fellow. She earned her Master’s in Biostatistics and Bachelor’s in Mathematics, and spent some time at the Translational Genomics Research Institute. Very broadly, her research interests span non/semi-parametric theory, probability theory, machine learning, causal inference, and high-dimensional statistics. Most of her current work involves complex dependent settings (dependence through time and network) and adaptive sequential designs.
Rachael Phillips
Rachael is a Ph.D. student in biostatistics, advised by Alan Hubbard and Mark van der Laan. She has an M.A. in Biostatistics, a B.S. in Biology with a Chemistry minor, and a B.A. in Mathematics with a Spanish minor. Her research is applied and focused on human health. Motivated by issues arising in healthcare, Rachael leverages strategies rooted in causal inference and nonparametric estimation to build clinician-tailored, machine-driven solutions. Her accompanying statistical interests include high-dimensional statistics and experimental design. She is also passionate about free, online-mediated education. She is affiliated with the UC Berkeley Center for Computational Biology, NIH Biomedical Big Data Training Program, and Superfund Research Program.