Chapter 2 The Targeted Learning Roadmap

A central goal of the Targeted Learning statistical paradigm is to estimate scientifically relevant parameters in realistic (usually nonparametric) models, and to do so with robust finite-sample performance and valid statistical inference.

2.1 The Statistical Model

Assume we have an i.i.d. sample of confounders, a binary intervention of interest, and an outcome, so our observed data is
\[ O = (W, A, Y).\] The distribution of the observed data may be factorized as follows: \[P(O) = P(W, A, Y) = P(W)P (A \mid W) P(Y \mid A, W).\] To estimate a parameter of interest, a researcher need not necessarily specify each of these marginal or conditional distributions in full. Rather, each estimator only requires that certain parts of the distribution be estimated; for example, some may require estimates of \(\mathbb{E}(Y \mid A, W)\), the mean of \(Y\) within subgroups defined by \((A, W)\), that is, the regression of the outcome on the exposure and confounders.

At this stage in the roadmap, the researcher must specify the statistical model to be used in estimating \(\mathbb{E}(Y \mid A, W)\) or other elements of the probability distribution needed to estimate the parameter of interest. Here, statistical model means any constraints on the model form that may be imposed by knowledge about the data-generating process – that is, known aspects of how the data were generated. Typically, the true model is a very large model, placing few, if any, constraints on the data-generating distribution, or a semi-parametric model. With few constraints on the data-generating distribution, and a potentially large number of covariates, data-adaptive, machine-learning approaches are the only practical option for estimating components of the likelihood. The remainder of this workshop concerns how to do this as efficiently and robustly as possible, depending on the goal of the analysis.

2.2 The Causal Model

The next step in the roadmap is to use a causal framework to formalize the experiment and thereby define the parameter of interest. Causal graphs are one useful tool to express what we know about the causal relations among variables that are relevant to the question under study (Pearl 2009).

Ignoring exogenous error terms, we will assume the following temporal ordering of the variables in \(O\): first the confounders \(W\), then the intervention \(A\), and finally the outcome \(Y\).

While directed acyclic graphs (DAGs) provide a convenient means by which to visualize causal relations between variables, the causal relations among variables can also be represented via a set of structural equations: \[\begin{align*} W &= f_W(U_W) \\ A &= f_A(W, U_A) \\ Y &= f_Y(W, A, U_Y), \end{align*}\] where \(U_W\), \(U_A\), and \(U_Y\) represent the unmeasured exogenous background characteristics that influence the value of each variable. In these equations, \(f_W\), \(f_A\), and \(f_Y\) denote that each variable (\(W\), \(A\), and \(Y\), respectively) is a function of its parents and of unmeasured background characteristics, but one typically has little knowledge of particular functional constraints (e.g., linear, logit-linear, only one interaction, etc.). For this reason, such systems are called non-parametric structural equation models (NPSEMs). The DAG and the set of nonparametric structural equations represent exactly the same information and so may be used interchangeably.
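To make the structural equations concrete, below is a minimal sketch in R that simulates data from one hypothetical NPSEM; the particular functional forms and error distributions chosen for \(f_W\), \(f_A\), and \(f_Y\) are illustrative assumptions only, since the NPSEM itself places no such restrictions.

# Simulate one hypothetical data-generating process consistent with the NPSEM above.
# The specific functional forms and error distributions are illustrative assumptions.
set.seed(57)
n <- 10000
U_W <- runif(n); U_A <- runif(n); U_Y <- rnorm(n)   # exogenous background variables
W <- as.numeric(U_W < 0.5)                          # W = f_W(U_W): a binary confounder
A <- as.numeric(U_A < plogis(-0.5 + W))             # A = f_A(W, U_A): exposure depends on W
Y <- 1 + A + 2 * W + U_Y                            # Y = f_Y(W, A, U_Y)
obs <- data.frame(W, A, Y)                          # the observed data O = (W, A, Y)
head(obs)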

2.3 The Parameter of Interest

The first hypothetical experiment we will consider is assigning exposure to the whole population and observing the outcome, and then assigning no exposure to the whole population and observing the outcome. On the nonparametric structural equations, this corresponds to a comparison of the outcome distribution in the population under two interventions:

  1. \(A\) is set to \(1\) for all individuals, and
  2. \(A\) is set to \(0\) for all individuals.

These interventions imply two new nonparametric structural equation models. For the case \(A = 1\), we have \[\begin{align*} W &= f_W(U_W) \\ A &= 1 \\ Y(1) &= f_Y(W, 1, U_Y), \end{align*}\] and for the case \(A=0\), \[\begin{align*} W &= f_W(U_W) \\ A &= 0 \\ Y(0) &= f_Y(W, 0, U_Y). \end{align*}\]

In these equations, \(A\) is no longer a function of \(W\) because we have intervened on the system, setting \(A\) deterministically to either of the values \(1\) or \(0\). The new symbols \(Y(1)\) and \(Y(0)\) indicate the outcome variable in our population if it were generated by the respective NPSEMs above; these are often called counterfactuals. The difference between the means of the outcome under these two interventions defines a parameter that is often called the “average treatment effect” (ATE), denoted \[\begin{equation}\label{eqn:ate} ATE = \mathbb{E}_X(Y(1)-Y(0)), \end{equation}\] where \(\mathbb{E}_X\) is the mean under the theoretical (unobserved) full data \(X = (W, Y(1), Y(0))\).
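Continuing the hypothetical simulation above, the interventions correspond to evaluating \(f_Y\) with \(A\) fixed at \(1\) or \(0\) while holding the exogenous errors fixed; the true ATE is then simply the mean difference of the two counterfactual outcomes. The functional forms are, again, illustrative assumptions.

# Counterfactual outcomes under the two static interventions (same hypothetical f_Y as above)
set.seed(57)
n <- 10000
U_W <- runif(n); U_A <- runif(n); U_Y <- rnorm(n)
W <- as.numeric(U_W < 0.5)
f_Y <- function(W, A, U_Y) 1 + A + 2 * W + U_Y
Y1 <- f_Y(W, 1, U_Y)              # Y(1): A set to 1 for all individuals
Y0 <- f_Y(W, 0, U_Y)              # Y(0): A set to 0 for all individuals
mean(Y1 - Y0)                     # the ATE in the full data X = (W, Y(1), Y(0)); exactly 1 here by construction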

Note, we can define much more complicated interventions on NPSEMs, such as interventions based upon rules (themselves based upon covariates), stochastic rules, etc.; each results in a different targeted parameter and entails different identifiability assumptions, discussed below.

2.4 Identifiability

Because we can never observe both \(Y(0)\) (the counterfactual outcome when \(A=0\)) and \(Y(1)\) for the same unit, we cannot estimate the ATE directly. Instead, we must make assumptions under which this quantity can be estimated from the observed data \(O \sim P_0\), drawn from the data-generating distribution \(P_0\). Fortunately, given the causal model specified in the NPSEM above, we can, with a handful of untestable assumptions, estimate the ATE, even from observational data. These assumptions may be summarized as follows:

  1. In a randomized trial, the treatment assignment mechanism guarantees \(Y(a) \perp A\) for all \(a \in \mathcal{A}\), which is the randomization assumption. In the case of observational data, the analogous assumption is strong ignorability, or no unmeasured confounding: \(Y(a) \perp A \mid W\) for all \(a \in \mathcal{A}\);
  2. Although not represented in the causal graph, the assumption of no interference between units is also required; that is, the outcome for unit \(i\), \(Y_i\), is not affected by the exposure of unit \(j\), \(A_j\), unless \(i=j\);
  3. The consistency assumption is also required: the observed outcome for unit \(i\) is \(Y_i(a)\) whenever \(A_i = a\), an assumption also described as “no other versions of treatment”;
  4. It is also necessary that all observed units, across strata defined by \(W\), have a positive (non-deterministic) probability of receiving each level of treatment – that is, \(0 < P_0(A = a \mid W = w) < 1\) for all \(a\) and \(w\). This assumption is referred to as positivity.

Remark: Together, (2) and (3), the assumptions of no interference and consistency, respectively, are jointly referred to as the stable unit treatment value assumption (SUTVA).

Given these assumptions, the ATE may be re-written as a function of \(P_0\), specifically \[\begin{equation}\label{eqn:estimand} ATE = \mathbb{E}_0(Y(1) - Y(0)) = \mathbb{E}_0 \left(\mathbb{E}_0[Y \mid A = 1, W] - \mathbb{E}_0[Y \mid A = 0, W]\right), \end{equation}\] that is, the difference in the predicted outcome values for each subject under the two treatment conditions (\(A = 1\) vs. \(A = 0\)), averaged over the population. Thus, a parameter of a theoretical “full” data distribution can be represented as an estimand of the observed data distribution. Significantly, nothing about this representation requires parametric assumptions; thus, the regressions on the right-hand side may be estimated freely with machine learning. Different parameters come with potentially different identifiability assumptions, and the resulting estimands can be functions of different components of \(P_0\). We discuss several more complex estimands in later sections of this workshop.
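As a numerical illustration of this identification result, the sketch below computes the right-hand side of the g-formula from observed data alone in the hypothetical simulation used earlier: with a binary \(W\), the conditional means \(\mathbb{E}(Y \mid A = a, W = w)\) are just stratum averages, which are then weighted by the empirical distribution of \(W\). All data-generating choices remain illustrative assumptions.

# g-formula computed from observed data only (same hypothetical simulation as above)
set.seed(57)
n <- 1e5
W <- as.numeric(runif(n) < 0.5)
A <- as.numeric(runif(n) < plogis(-0.5 + W))   # positivity holds: 0 < P(A = 1 | W) < 1
Y <- 1 + A + 2 * W + rnorm(n)
q1 <- tapply(Y[A == 1], W[A == 1], mean)       # E(Y | A = 1, W = w) for w = 0, 1
q0 <- tapply(Y[A == 0], W[A == 0], mean)       # E(Y | A = 0, W = w) for w = 0, 1
pW <- as.numeric(prop.table(table(W)))         # empirical distribution of W
sum((q1 - q0) * pW)                            # close to the true ATE of 1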

2.5 Estimators: SuperLearning and Targeted Maximum Likelihood

Although we will discuss estimation in greater detail in later sections, the estimators we seek should satisfy two goals: among sensible (asymptotically consistent, regular) estimators,

  1. the estimator be asymptotically efficient in the statistical model of interest, and
  2. the estimator can be constructed for finite-sample performance improvements, relative to other estimators in the same class.

These principles guide our approach to estimation: Super Learning for prediction (more generally density estimation) and TMLE for estimation of our intervention parameters of interest.

2.5.1 SuperLearning

  • There is no universally optimal machine learning algorithm for density estimation or prediction.
  • When we have tested different algorithms on real data and examined their performance (e.g., MSE of prediction), no single algorithm consistently wins (see below).
  • For some data, one needs learners that can model a complex function.
  • For others, typically the result of noise or insufficient sample size, a simple, parametric model might fit best.
  • SuperLearner, an ensemble learner, addresses this issue by allowing a convex combination of learners, from the simplest (intercept-only) to the most complex (neural nets, random forests, SVMs, etc.).
  • It works by using cross-validation in a manner that guarantees the resulting fit will asymptotically perform as well as the best weighted combination of the learners provided (note, even a convex combination of relatively poor learners can sometimes result in a good fit, though it is better to supply good candidates).
knitr::include_graphics("img/misc/vs.pdf")

The figure above shows an example of 10-fold cross-validation.

The schematic below is a visualization of the following algorithm:

  • Break up the sample evenly into V-folds (say V=10).
  • For each of these 10 folds, remove that portion of the sample (kept out as the validation sample); the remainder will be used to fit the learners (the training sample).
  • Fit each learner on the training sample (note, some learners will have their own internal cross-validation procedure or other methods to select tuning parameters).
  • For each observation in the corresponding validation sample, predict the outcome using each of the learners, so that if there are \(p\) learners, there will be \(p\) predictions for each of the observations in the validation sample.
  • Take out another validation sample and repeat until each of the V sets of data has been held out once.
  • Compare the cross-validated fit of the learners across all observations based on specified loss function (e.g., squared error, negative log-likelihood, …) by calculating the corresponding average loss (risk).
  • Either:

    • choose the learner with smallest risk and apply that learner to entire data set (resulting SL fit),
    • do a weighted average of the learners to minimize the cross-validated risk (construct an ensemble of learners), by

      • re-fitting the learners on the original data set, and
      • use the weights above to combine their predictions into the final SL fit.

Note, this entire procedure can be itself cross-validated to get a consistent estimate of the future performance of the SL fit.

knitr::include_graphics("img/misc/SLKaiserNew.pdf")

For prediction, one can use the cross-validated risk to empirically determine the relative performance of SL and competing methods. The figure below shows the results of one such study, comparing the fits of several different learners, including SL algorithms.

knitr::include_graphics("img/misc/ericSL.pdf")

2.5.2 Substitution Estimators

  • Beyond a fit of the prediction function, one might also want to estimate more targeted parameters specific to certain scientific questions.
  • The approach is to plug estimates of the relevant parts of the distribution into the estimand of interest.
  • Sometimes, we can use simple empirical distributions, averaging some function over the observations (e.g., giving weight \(1/n\) to each observation).
  • For other parts of the distribution, like conditional means or probabilities, estimation will require some sort of smoothing due to the curse of dimensionality.

We give one example using the average treatment effect (see above):

  • \(\Psi(P_0) = \Psi(Q_0) = \mathbb{E}_0 \big[\mathbb{E}_0[Y \mid A = 1, W] - \mathbb{E}_0[Y \mid A = 0, W]\big]\), where \(Q_0\) represents both the distribution of \(Y \mid A,W\) and distribution of \(W\).
  • Let \(\bar{Q}_0(A,W) \equiv \mathbb{E}_0(Y \mid A,W)\) and \(Q_{0,W}(w) = P_0 (W=w)\), then \[ \Psi(Q_0) = \sum_w \{ \bar{Q}_0(1,w)-\bar{Q}_0(0,w)\} Q_{0,W}(w) \]
  • The Substitution Estimator plugs in the empirical distribution (weight \(1/n\) for each observation) for \(Q_{0,W}(W_i)\), and some estimate of the regression of \(Y\) on \((A,W)\) (say SL fit): \[ \Psi(Q_n) = \frac{1}{n} \sum_{i=1}^n \{ \bar{Q}_n(1,W_i)-\bar{Q}_n(0,W_i)\} \]
  • Thus, the estimator is the average of the differences in predictions from the fit, keeping the observed \(W_i\) but first setting \(A=1\) and then \(A=0\) for every observation; a minimal code sketch follows this list.
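Below is a minimal sketch of this substitution (plug-in) estimator on simulated data; a main-terms GLM stands in for the Super Learner fit of \(\bar{Q}_0(A,W)\), and the data-generating process is an illustrative assumption.

# Plug-in (substitution) estimator of the ATE on simulated data; a GLM stands in
# for the Super Learner fit of the outcome regression E(Y | A, W).
set.seed(486)
n <- 1000
W <- rnorm(n)
A <- rbinom(n, 1, plogis(0.4 * W))
Y <- rbinom(n, 1, plogis(-1 + A + 0.7 * W))
dat_sim <- data.frame(W, A, Y)

Qbar_fit <- glm(Y ~ A + W, family = binomial(), data = dat_sim)
Q1 <- predict(Qbar_fit, newdata = transform(dat_sim, A = 1), type = "response")  # predictions with A set to 1
Q0 <- predict(Qbar_fit, newdata = transform(dat_sim, A = 0), type = "response")  # predictions with A set to 0
mean(Q1 - Q0)                                   # the plug-in estimate Psi(Q_n)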

2.5.3 TMLE

  • Though using SL rather than an arbitrary parametric regression is an improvement, it is not by itself sufficient to give an estimator the properties needed for rigorous inference.
  • Because the bias-variance trade-off in SL is optimized for the prediction problem, it can, for instance, under-fit portions of the distribution that are critical for estimating the parameter of interest, \(\Psi(P_0)\).
  • TMLE keeps the benefits of substitution estimators (it is one), but augments the original estimates to correct for this issue and also results in an asymptotically linear (and thus normally-distributed) estimator with consistent Wald-style confidence intervals.
  • Produces a well-defined, unbiased, efficient substitution estimator of target parameters of a data-generating distribution.
  • Updates an initial (super learner) estimate of the relevant part of the data-generating distribution possibly using an estimate of a nuisance parameter (like the model of intervention given covariates).
  • Removes asymptotic residual bias of initial estimator for the target parameter, if it uses a consistent estimator of \(g_0\).
  • If the initial estimator was already consistent for the target parameter, the additional fitting of the data in the targeting step may remove finite-sample bias while preserving the consistency of the initial estimator.
  • If the initial estimator and the estimator of \(g_0\) are both consistent, then it is also asymptotically efficient according to semi-parametric statistical model efficiency theory.
  • Thus, every effort is made to achieve minimal bias and the asymptotic semi-parametric efficiency bound for the variance.
knitr::include_graphics("img/misc/TMLEimage.pdf")
  • There are different types of TMLE, sometimes for the same set of parameters, but below is an example of the algorithm for estimating the ATE.
  • In this case, one can present the estimator as:

\[ \Psi(Q^*_n) = \frac{1}{n} \sum_{i=1}^n \{ \bar{Q}^*_n(1,W_i)-\bar{Q}^*_n(0,W_i)\} \] where \(\bar{Q}^*_n(A,W)\) is the TMLE-updated (targeted) estimate, defined by \(f(\bar{Q}^*_n(A,W)) = f(\bar{Q}_n(A,W)) + \epsilon_n \cdot h_n(A,W)\), where \(f(\cdot)\) is the appropriate link function (e.g., logit), \(\epsilon_n\) is an estimated coefficient, and \(h_n(A,W)\) is a “clever covariate”.

  • In this case, \(h_n(A,W) = \frac{A}{g_n(W)}-\frac{1-A}{1-g_n(W)}\), with \(g_n(W) = P_n(A=1 \mid W)\) being the estimated (also by SL) propensity score, so the estimator depends both on the initial SL fit of the outcome regression (\(\bar{Q}_n\)) and on an SL fit of the propensity score (\(g_n\)); a minimal sketch of this targeting step appears after this list.
  • There are further robust augmentations used in tlverse, such as an added layer of cross-validation to avoid over-fitting bias (CV-TMLE), as well as so-called methods that can more robustly estimate several parameters simultaneously (e.g., the points on a survival curve).
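Below is a minimal, illustrative sketch of the targeting (fluctuation) step for the ATE with a binary outcome, using the same simulated data as the substitution-estimator sketch above; GLMs stand in for the Super Learner fits of the outcome regression and propensity score, and this is a sketch under those assumptions, not the tlverse implementation.

# TMLE targeting step for the ATE (illustrative; GLMs stand in for SL fits)
set.seed(486)
n <- 1000
W <- rnorm(n)
A <- rbinom(n, 1, plogis(0.4 * W))
Y <- rbinom(n, 1, plogis(-1 + A + 0.7 * W))

Qbar_fit <- glm(Y ~ A + W, family = binomial())               # initial outcome regression
g_fit <- glm(A ~ W, family = binomial())                      # propensity score g_n(W)
gn  <- predict(g_fit, type = "response")
QAW <- predict(Qbar_fit, type = "response")                   # Qbar_n(A, W) at observed A
Q1W <- predict(Qbar_fit, newdata = data.frame(A = 1, W = W), type = "response")
Q0W <- predict(Qbar_fit, newdata = data.frame(A = 0, W = W), type = "response")

h <- A / gn - (1 - A) / (1 - gn)                              # clever covariate h_n(A, W)
eps <- coef(glm(Y ~ -1 + h + offset(qlogis(QAW)), family = binomial()))
Q1W_star <- plogis(qlogis(Q1W) + eps / gn)                    # updated predictions under A = 1
Q0W_star <- plogis(qlogis(Q0W) - eps / (1 - gn))              # updated predictions under A = 0
ate_tmle <- mean(Q1W_star - Q0W_star)                         # Psi(Q*_n)
ate_tmle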

2.5.4 Inference

  • The estimators we discuss are asymptotically linear, meaning that the difference between the estimate \(\Psi(P_n)\) and the true parameter \(\Psi(P_0)\) can be represented, to first order, as an i.i.d. sum: \[\begin{equation}\label{eqn:IC} \Psi(P_n) - \Psi(P_0) = \frac{1}{n} \sum_{i=1}^n IC(O_i; \nu) + o_p(1/\sqrt{n}), \end{equation}\]

where \(IC(O_i; \nu)\) (the influence curve or function) is a function of the data and possibly other nuisance parameters \(\nu\). Importantly, such estimators have mean-zero Gaussian limiting distributions; thus, in the univariate case, one has that \[\begin{equation}\label{eqn:limit_dist} \sqrt{n}(\Psi(P_n) - \Psi(P_0)) \rightarrow_d N\big(0, \text{Var}[IC(O; \nu)]\big), \end{equation}\] so that inference for the estimator of interest may be obtained in terms of the influence function. For this simple case, a 95% confidence interval may be derived as: \[\begin{equation}\label{eqn:CI} \Psi(P^{\star}_n) \pm 1.96 \sqrt{\frac{\hat{\sigma}^2}{n}}, \end{equation}\] where \(SE=\sqrt{\frac{\hat{\sigma}^2}{n}}\) and \(\hat{\sigma}^2\) is the sample variance of the estimated ICs, \(IC(O; \hat{\nu})\). One can use the functional delta method to derive the influence curve if a parameter of interest may be written as a function of other asymptotically linear estimators.

  • Thus, we can derive robust inference for parameters that are estimated by fitting complex machine learning algorithms, and these methods are computationally quick (they do not rely on re-sampling based methods like the bootstrap); a minimal sketch of IC-based inference follows.
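Continuing the TMLE sketch above (the objects Q1W_star, Q0W_star, gn, A, Y, and ate_tmle are assumed to be in the workspace from that sketch), the estimated efficient influence curve for the ATE gives a standard error and Wald-style interval directly, with no resampling.

# IC-based Wald inference for the ATE, continuing the TMLE sketch above
QAW_star <- ifelse(A == 1, Q1W_star, Q0W_star)       # targeted predictions at the observed A
ic <- (A / gn - (1 - A) / (1 - gn)) * (Y - QAW_star) +
  (Q1W_star - Q0W_star) - ate_tmle                   # estimated influence curve values
se <- sqrt(var(ic) / length(ic))                     # SE = sqrt(sigma_hat^2 / n)
ate_tmle + c(-1.96, 1.96) * se                       # 95% Wald-style confidence interval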

2.6 The WASH Benefits Example Dataset

The data come from a study of the effect of water quality, sanitation, hand washing, and nutritional interventions on child development in rural Bangladesh (WASH Benefits Bangladesh): a cluster-randomised controlled trial (Tofail et al. 2018). The study enrolled pregnant women in their first or second trimester from the rural villages of Gazipur, Kishoreganj, Mymensingh, and Tangail districts of central Bangladesh, with an average of eight women per cluster. Groups of eight geographically adjacent clusters were block-randomised, using a random number generator, into six intervention groups (all of which received weekly visits from a community health promoter for the first 6 months and every 2 weeks for the next 18 months) and a double-sized control group (no intervention or health promoter visit). The six intervention groups were:

  1. chlorinated drinking water;
  2. improved sanitation;
  3. hand-washing with soap;
  4. combined water, sanitation, and hand washing;
  5. improved nutrition through counseling and provision of lipid-based nutrient supplements; and
  6. combined water, sanitation, handwashing, and nutrition.

In the workshop, we concentrate on child growth (size for age) as the outcome of interest. For reference, this trial was registered with ClinicalTrials.gov as NCT01590095.

library(tidyverse)

# read in data
dat <- read_csv("https://raw.githubusercontent.com/tlverse/tlverse-data/master/wash-benefits/washb_data.csv")
dat
# A tibble: 4,695 x 28
     whz tr    fracode month  aged sex   momage momedu momheight hfiacat Nlt18
   <dbl> <chr> <chr>   <dbl> <dbl> <chr>  <dbl> <chr>      <dbl> <chr>   <dbl>
 1  0    Cont… N05265      9   268 male      30 Prima…      146. Food S…     3
 2 -1.16 Cont… N05265      9   286 male      25 Prima…      149. Modera…     2
 3 -1.05 Cont… N08002      9   264 male      25 Prima…      152. Food S…     1
 4 -1.26 Cont… N08002      9   252 fema…     28 Prima…      140. Food S…     3
 5 -0.59 Cont… N06531      9   336 fema…     19 Secon…      151. Food S…     2
 6 -0.51 Cont… N06531      9   304 male      20 Secon…      154. Severe…     0
 7 -2.46 Cont… N08002      9   336 fema…     19 Prima…      151. Food S…     2
 8 -0.6  Cont… N06528      9   312 fema…     25 No ed…      142. Food S…     2
 9 -0.23 Cont… N06528      9   322 male      30 Secon…      153. Food S…     1
10 -0.14 Cont… N06453      9   376 male      30 No ed…      156. Modera…     2
# … with 4,685 more rows, and 17 more variables: Ncomp <dbl>, watmin <dbl>,
#   elec <dbl>, floor <dbl>, walls <dbl>, roof <dbl>, asset_wardrobe <dbl>,
#   asset_table <dbl>, asset_chair <dbl>, asset_khat <dbl>, asset_chouki <dbl>,
#   asset_tv <dbl>, asset_refrig <dbl>, asset_bike <dbl>, asset_moto <dbl>,
#   asset_sewmach <dbl>, asset_mobile <dbl>

For the purposes of this workshop, we start by treating the data as independent and identically distributed (i.i.d.) random draws from a very large target population. We could, with the appropriate options, account for the clustering of the data (within sampled geographic units), but, for simplification, we avoid these details in the workshop presentations, although modifications of our methodology for biased samples, repeated measures, etc., are available.

We have 28 variables measured, of which 1 variable is set to be the outcome of interest. This outcome, \(Y\), is the weight-for-height Z-score (whz in dat); the treatment of interest, \(A\), is the randomized treatment group (tr in dat); and the adjustment set, \(W\), consists simply of everything else. This results in our observed data structure being \(n\) i.i.d. copies of \(O_i = (W_i, A_i, Y_i)\), for \(i = 1, \ldots, n\).
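For concreteness, here is a minimal sketch of how the node assignments just described might be encoded, using the column names in dat:

# Assign observed-data nodes using the column names in `dat`
outcome_node   <- "whz"                                  # Y: weight-for-height Z-score
treatment_node <- "tr"                                   # A: randomized treatment group
covariate_nodes <- setdiff(names(dat), c(outcome_node, treatment_node))  # W: everything else
covariate_nodes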

Using the skimr package, we can quickly summarize the variables measured in the WASH Benefits data set:

library(skimr)
skim(dat)
Skim summary statistics
 n obs: 4695 
 n variables: 28 

── Variable type:character ─────────────────────────────────────────────────────
 variable missing complete    n min max empty n_unique
  fracode       0     4695 4695   2   6     0       20
  hfiacat       0     4695 4695  11  24     0        4
   momedu       0     4695 4695  12  15     0        3
      sex       0     4695 4695   4   6     0        2
       tr       0     4695 4695   3  15     0        7

── Variable type:numeric ───────────────────────────────────────────────────────
       variable missing complete    n    mean    sd     p0    p25   p50    p75
           aged       0     4695 4695 266.32  52.17  42    230    266   303   
     asset_bike       0     4695 4695   0.32   0.47   0      0      0     1   
    asset_chair       0     4695 4695   0.73   0.44   0      0      1     1   
   asset_chouki       0     4695 4695   0.78   0.41   0      1      1     1   
     asset_khat       0     4695 4695   0.61   0.49   0      0      1     1   
   asset_mobile       0     4695 4695   0.86   0.35   0      1      1     1   
     asset_moto       0     4695 4695   0.066  0.25   0      0      0     0   
   asset_refrig       0     4695 4695   0.079  0.27   0      0      0     0   
  asset_sewmach       0     4695 4695   0.065  0.25   0      0      0     0   
    asset_table       0     4695 4695   0.73   0.44   0      0      1     1   
       asset_tv       0     4695 4695   0.3    0.46   0      0      0     1   
 asset_wardrobe       0     4695 4695   0.17   0.37   0      0      0     0   
           elec       0     4695 4695   0.6    0.49   0      0      1     1   
          floor       0     4695 4695   0.11   0.31   0      0      0     0   
         momage      18     4677 4695  23.91   5.24  14     20     23    27   
      momheight      31     4664 4695 150.5    5.23 120.65 147.05 150.6 154.06
          month       0     4695 4695   6.45   3.33   1      4      6     9   
          Ncomp       0     4695 4695  11.04   6.35   2      6     10    14   
          Nlt18       0     4695 4695   1.6    1.25   0      1      1     2   
           roof       0     4695 4695   0.99   0.12   0      1      1     1   
          walls       0     4695 4695   0.72   0.45   0      0      1     1   
         watmin       0     4695 4695   0.95   9.48   0      0      0     1   
            whz       0     4695 4695  -0.59   1.03  -4.67  -1.28  -0.6   0.08
   p100     hist
 460    ▁▁▂▇▇▅▁▁
   1    ▇▁▁▁▁▁▁▃
   1    ▃▁▁▁▁▁▁▇
   1    ▂▁▁▁▁▁▁▇
   1    ▅▁▁▁▁▁▁▇
   1    ▁▁▁▁▁▁▁▇
   1    ▇▁▁▁▁▁▁▁
   1    ▇▁▁▁▁▁▁▁
   1    ▇▁▁▁▁▁▁▁
   1    ▃▁▁▁▁▁▁▇
   1    ▇▁▁▁▁▁▁▃
   1    ▇▁▁▁▁▁▁▂
   1    ▆▁▁▁▁▁▁▇
   1    ▇▁▁▁▁▁▁▁
  60    ▅▇▅▂▁▁▁▁
 168    ▁▁▁▂▇▇▂▁
  12    ▅▃▇▃▂▇▃▅
  52    ▇▇▃▁▁▁▁▁
  10    ▇▃▂▁▁▁▁▁
   1    ▁▁▁▁▁▁▁▇
   1    ▃▁▁▁▁▁▁▇
 600    ▇▁▁▁▁▁▁▁
   4.97 ▁▁▅▇▃▁▁▁

A convenient summary of the relevant variables is given just above, complete with a small visualization describing the marginal characteristics of each covariate. Note that the asset variables reflect socio-economic status of the study participants. Notice also the uniform distribution of the treatment groups (with twice as many controls); this is, of course, by design.

References

Pearl, Judea. 2009. Causality: Models, Reasoning, and Inference. Cambridge University Press.

Tofail, Fahmida, et al. 2018. “Effect of Water Quality, Sanitation, Hand Washing, and Nutritional Interventions on Child Development in Rural Bangladesh (WASH Benefits Bangladesh): A Cluster-Randomised Controlled Trial.” The Lancet Child & Adolescent Health 2 (4): 255–68.