Preface


This is an open source and fully-reproducible electronic vignette for the software workshops incorporated in the half-day tutorial (4 December 2019) and 2-day short course (5-6 December 2019) on applying Targeted Learning in practice given at the Deming Conference on Applied Statistics. The Hitchhiker’s Guide to the tlverse, or a Targeted Learning Practitioner’s Handbook is an in-draft book covering the tlverse software topics in greater detail and may serve as a useful accompanying resource to these workshop materials.

Abstract

Half-Day Tutorial – 9A-12P on December 4, 2019

Targeted Maximum Likelihood Estimation (TMLE) for Machine Learning: A Gentle Introduction

During this half-day tutorial, we will delve into the utility of the roadmap of targeted learning for translating real-world data applications to a mathematical and statistical formulation of the relevant research question of interest. Participants will perform hands-on implementation of state-of-the-art targeted maximum likelihood estimators using the tlverse software ecosystem in the R programming language. Participants will actively learn and apply the core principles of the Targeted Learning methodology, which (1) generalizes machine learning to any estimand of interest; (2) obtains an optimal estimator of the given estimand, grounded in theory; (3) integrates modern ensemble machine learning techniques; and (4) provides formal statistical inference in terms of confidence intervals and testing of specified null hypotheses of interest. It is highly recommended for participants to have an understanding of basic statistical concepts such as confounding, probability distributions, confidence intervals, hypothesis tests, and regression. Advanced knowledge of mathematical statistics may be useful but is not necessary. Familiarity with the R programming language will be essential.

2-Day Short Course – 8A-5P on December 5-6, 2019

Targeted Learning in Data Science: Causal Inference for Observational and Experimental Data

This 2-day short course will provide a comprehensive introduction to the field of targeted learning for causal inference and the corresponding tlverse software ecosystem. We will focus on targeted minimum loss-based estimators of causal effects, including those of static, dynamic, optimal dynamic, and stochastic interventions. These multiply robust, efficient plug-in estimators use state-of-the-art ensemble machine learning tools to flexibly adjust for confounding while yielding valid statistical inference. Estimators will be explored under various real-world scenarios: when the outcome is subject to missingness, when mediators are present on the causal pathway, in high dimensions, under two-phase sampling designs, and in right-censored survival settings possibly subject to competing risks. We will discuss the utility of this robust estimation strategy in comparison to conventional techniques, which often rely on restrictive statistical models and may therefore lead to severely biased inference. In addition to discussion, this course will incorporate both interactive activities and hands-on, guided R programming exercises, to allow participants the opportunity to familiarize themselves with methodology and tools that will translate to real-world analyses. It is highly recommended for participants to have an understanding of basic statistical concepts such as confounding, probability distributions, confidence intervals, hypothesis tests, and regression. Advanced knowledge of mathematical statistics may be useful but is not necessary. Familiarity with the R programming language will be essential.

Contents

These materials are feature modules centered around distinct causal questions, each motivated by a case study, alongside statistical methodology and software for assessing the causal claim of interest. Topics include

  • Why we need a statistical revolution
  • Introduction to the tlverse software ecosystem
  • Roadmap of statistical learning with causal inference
  • International Stroke Trial (IST), WASH Benefits, and Veterans’ Administration Lung Cancer Trial data
  • Super (ensemble machine) learning with the sl3 tlverse R package
  • Targeted learning for causal inference with the tmle3 tlverse R package
  • Optimal treatment regimes and the tmle3mopttx tlverse R package
  • Stochastic treatment regimes and the tmle3shift tlverse R package
  • One-step TMLE for time-to-event outcomes with the MOSS R package
  • Treatment specific mean outcome or marginal structural model for longitudinal data with the ltmle R package

About the instructors and authors

While this workshop is delivered by Mark van der Laan and Rachael Phillips, the majority of these materials are based on joint work with a team of six co-authors:

Mark van der Laan

Mark van der Laan, Ph.D., is Professor of Biostatistics and Statistics at UC Berkeley. His research interests include statistical methods in computational biology, survival analysis, censored data, adaptive designs, targeted maximum likelihood estimation, causal inference, data-adaptive loss-based learning, and multiple testing. His research group developed loss-based super learning in semiparametric models, based on cross-validation, as a generic optimal tool for the estimation of infinite-dimensional parameters, such as nonparametric density estimation and prediction with both censored and uncensored data. Building on this work, his research group developed targeted maximum likelihood estimation for a target parameter of the data-generating distribution in arbitrary semiparametric and nonparametric models, as a generic optimal methodology for statistical and causal inference. Most recently, Mark’s group has focused in part on the development of a centralized, principled set of software tools for targeted learning, the tlverse. For more information, see https://vanderlaan-lab.org.

Rachael Phillips

Rachael is a Ph.D. student in biostatistics, advised by Alan Hubbard and Mark van der Laan. She has an M.A. in Biostatistics, B.S. in Biology with a Chemistry minor and a B.A. in Mathematics with a Spanish minor. Rachael’s research focuses on narrowing the gap between the theory and application of modern statistics for real-world data science. Specifically, Rachael is motivated by issues arising in healthcare, and she leverages strategies rooted in causal inference and nonparametric estimation to build clinician-tailored, machine-driven solutions. Rachael is also passionate about free, online-mediated education and its corresponding pedagogy.

Jeremy Coyle

Jeremy Coyle, Ph.D., is a consulting data scientist and statistical programmer, currently leading the software development effort that has produced the tlverse ecosystem of R packages and related software tools. Jeremy earned his Ph.D. in Biostatistics from UC Berkeley in 2016, primarily under the supervision of Alan Hubbard.

Alan Hubbard

Alan Hubbard, Ph.D., is Professor of Biostatistics, former head of the Division of Biostatistics at UC Berkeley, and head of data analytics core at UC Berkeley’s SuperFund research program. His current research interests include causal inference, variable importance analysis, statistical machine learning, estimation of and inference for data-adaptive statistical target parameters, and targeted minimum loss-based estimation. Research in his group is generally motivated by applications to problems in computational biology, epidemiology, and precision medicine.

Nima Hejazi

Nima is a Ph.D. candidate in biostatistics with a designated emphasis in computational and genomic biology, working with Mark van der Laan and Alan Hubbard. Nima is affiliated with UC Berkeley’s Center for Computational Biology and is a former NIH Biomedical Big Data fellow. He earned is Master’s in Biostatistics (2017) and a Bachelor’s with a triple major in Molecular and Cell Biology (Neurobiology), Psychology, and Public Health (2015) at UC Berkeley. Nima’s interests span nonparametric estimation, high-dimensional inference, targeted learning, statistical computing, survival analysis, and computational biology, with an emphasis on the development of robust and efficient statistical methodologies that draw on the intersection of causal inference and statistical machine learning. For more information, see https://nimahejazi.org.

Ivana Malenica

Ivana is a Ph.D. candidate in biostatistics advised by Mark van der Laan. Ivana is currently a fellow at the Berkeley Institute for Data Science, after serving as a NIH Biomedical Big Data and Freeport-McMoRan Genomic Engine fellow. She earned her Master’s in Biostatistics and Bachelor’s in Mathematics, and spent some time at the Translational Genomics Research Institute. Very broadly, her research interests span non/semi-parametric theory, probability theory, machine learning, causal inference and high-dimensional statistics. Most of her current work involves complex dependent settings (dependence through time and network) and adaptive sequential designs.