Chapter 2 The Roadmap of Statistical Learning
A central goal of the Targeted Learning statistical paradigm is to estimate scientifically relevant parameters in realistic (usually nonparametric) models, and to do so with finite-sample robustness and valid statistical inference.
2.1 The Observed Data and Statistical Model
Assume we have an i.i.d. sample of confounders, a binary intervention of interest, and an outcome, so that our observed data are \[ O = (W, A, Y).\] The distribution of the observed data may be factorized as follows: \[P(O) = P(W, A, Y) = P(W)P(A \mid W) P(Y \mid A, W).\] To estimate a parameter of interest, a researcher need not be able to specify each of these marginal and conditional distributions in full. Rather, each estimator only requires certain parts of the distribution; for example, some may require estimates of \(\mathbb{E}(Y \mid A, W)\), the mean of \(Y\) within subgroups defined by \((A, W)\), or equivalently the regression of the outcome on the exposure and confounders.
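As a minimal illustration (not part of the original text; the data-generating mechanism and its coefficients are entirely hypothetical), the factorization above can be mimicked by simulating \(W\), then \(A\) given \(W\), then \(Y\) given \((A, W)\); with discrete \(W\) and \(A\), the regression \(\mathbb{E}(Y \mid A, W)\) can then be estimated by simple subgroup means:

```r
# Hypothetical data-generating process, simulated according to the
# factorization P(W) P(A | W) P(Y | A, W); coefficients are arbitrary.
set.seed(512)
n <- 5000
W <- rbinom(n, size = 1, prob = 0.5)                 # confounder ~ P(W)
A <- rbinom(n, size = 1, prob = plogis(-0.5 + W))    # exposure  ~ P(A | W)
Y <- rbinom(n, size = 1, prob = plogis(-1 + A + W))  # outcome   ~ P(Y | A, W)

# With discrete (A, W), E(Y | A, W) can be estimated by subgroup means
aggregate(Y, by = list(A = A, W = W), FUN = mean)
```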
At this stage in the roadmap, the researcher must specify the statistical model to be used in estimating \(\mathbb{E}(Y \mid A, W)\) or other components of the probability distribution needed to estimate the parameter of interest. Here, the statistical model consists of any constraints placed on the form of the data-generating distribution by actual knowledge about the data-generating process, that is, known aspects of how the data were generated. Typically, this true statistical model is very large, placing few constraints, if any, on the data-generating distribution (a nonparametric or, at most, semiparametric model). With few constraints on the data-generating distribution, and a potentially large number of covariates, data-adaptive, machine-learning approaches are the only practical option for estimating components of the likelihood. The remainder of this course concerns how to do this as efficiently and robustly as possible, depending on the goal of the analysis.
2.2 The Causal Model
The next step in the roadmap is to use a causal framework to formalize the experiment and thereby define the parameter of interest. Causal graphs are one useful tool to express what we know about the causal relations among variables that are relevant to the question under study (Pearl 2009).
Ignoring error terms, we will assume the following ordering of the variables in \(O\): \(W\) precedes \(A\), which in turn precedes \(Y\). This ordering can be visualized as a directed acyclic graph with edges \(W \rightarrow A\), \(W \rightarrow Y\), and \(A \rightarrow Y\).
While directed acyclic graphs (DAGs) like the one above provide a convenient means of visualizing causal relations between variables, the same causal relations can be represented by a set of structural equations: \[\begin{align*} W &= f_W(U_W) \\ A &= f_A(W, U_A) \\ Y &= f_Y(W, A, U_Y), \end{align*}\] where \(U_W\), \(U_A\), and \(U_Y\) represent the unmeasured exogenous background characteristics that influence the value of each variable. The functions \(f_W\), \(f_A\), and \(f_Y\) state only that each variable (\(W\), \(A\), and \(Y\), respectively) is some function of its parents and its unmeasured background characteristics; typically, one has little knowledge with which to restrict their functional form (e.g., linear, logit-linear, only one interaction). For this reason, such systems are called non-parametric structural equation models (NPSEMs). The DAG and the set of nonparametric structural equations represent exactly the same information and so may be used interchangeably.
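To make the correspondence concrete, here is a small sketch (not from the original text; the functional forms are entirely hypothetical) writing an NPSEM as R functions, where each variable is generated from its parents and an exogenous input:

```r
# A hypothetical NPSEM: each structural equation is an R function of the
# variable's parents and an exogenous (unmeasured) input U.
f_W <- function(U_W) as.numeric(U_W < 0.5)
f_A <- function(W, U_A) as.numeric(U_A < plogis(-0.5 + W))
f_Y <- function(W, A, U_Y) as.numeric(U_Y < plogis(-1 + A + W))

set.seed(231)
n <- 5000
U_W <- runif(n); U_A <- runif(n); U_Y <- runif(n)  # exogenous background

W <- f_W(U_W)
A <- f_A(W, U_A)
Y <- f_Y(W, A, U_Y)
O <- data.frame(W, A, Y)  # the observed data
head(O)
```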
2.3 The Parameter of Interest
The first hypothetical experiment we will consider is assigning exposure to the whole population and observing the outcome, and then assigning no exposure to the whole population and observing the outcome. On the nonparametric structural equations, this corresponds to a comparison of the outcome distribution in the population under two interventions:
- \(A\) is set to \(1\) for all individuals, and
- \(A\) is set to \(0\) for all individuals.
These interventions imply two new nonparametric structural equation models. For the case \(A = 1\), we have \[\begin{align*} W &= f_W(U_W) \\ A &= 1 \\ Y(1) &= f_Y(W, 1, U_Y), \end{align*}\] and for the case \(A=0\), \[\begin{align*} W &= f_W(U_W) \\ A &= 0 \\ Y(0) &= f_Y(W, 0, U_Y). \end{align*}\]
In these equations, \(A\) is no longer a function of \(W\) because we have intervened on the system, setting \(A\) deterministically to either of the values \(1\) or \(0\). The new symbols \(Y(1)\) and \(Y(0)\) indicate the outcome variable in our population if it were generated by the respective NPSEMs above; these are often called counterfactuals. The difference between the means of the outcome under these two interventions defines a parameter that is often called the “average treatment effect” (ATE), denoted \[\begin{equation}\label{eqn:ate} ATE = \mathbb{E}_X(Y(1)-Y(0)), \end{equation}\] where \(\mathbb{E}_X\) is the mean under the theoretical (unobserved) full data \(X = (W, Y(1), Y(0))\).
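Under the hypothetical NPSEM sketched earlier, these interventions amount to replacing the structural equation for \(A\) with a constant; the counterfactuals \(Y(1)\) and \(Y(0)\), and hence the ATE, can then be computed directly. This is only possible because we simulate the full system; in practice the counterfactuals are never jointly observed:

```r
# Continuing the hypothetical NPSEM from the earlier sketch: intervene by
# setting A deterministically, leaving the other equations unchanged.
f_W <- function(U_W) as.numeric(U_W < 0.5)
f_Y <- function(W, A, U_Y) as.numeric(U_Y < plogis(-1 + A + W))

set.seed(231)
n <- 1e6
U_W <- runif(n); U_Y <- runif(n)

W  <- f_W(U_W)
Y1 <- f_Y(W, A = 1, U_Y)  # counterfactual outcome Y(1)
Y0 <- f_Y(W, A = 0, U_Y)  # counterfactual outcome Y(0)

mean(Y1 - Y0)  # the ATE, E_X(Y(1) - Y(0)), under this simulation
```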
Note that we can define much more complicated interventions on NPSEMs, such as interventions based on rules (themselves functions of covariates), stochastic rules, etc. Each such intervention results in a different target parameter and entails different identifiability assumptions, discussed below.
2.4 Identifiability
Because we can never observe both \(Y(0)\) (the counterfactual outcome when \(A=0\)) and \(Y(1)\) on the same unit, we cannot estimate the ATE directly. Instead, we must make assumptions under which this quantity may be estimated from the observed data \(O \sim P_0\), where \(P_0\) denotes the data-generating distribution. Fortunately, given the causal model specified in the NPSEM above, we can, with a handful of untestable assumptions, estimate the ATE, even from observational data. These assumptions may be summarized as follows:
- In a randomized trial, the design guarantees \(Y(a) \perp A\) for all \(a \in \mathcal{A}\), which is the randomization assumption. In the case of observational data, the analogous (untestable) assumption is strong ignorability, or no unmeasured confounding: \(Y(a) \perp A \mid W\) for all \(a \in \mathcal{A}\);
- Although not represented in the causal graph, also required is the assumption of no interference between units; that is, the outcome for unit \(i\), \(Y_i\), is not affected by the exposure of unit \(j\), \(A_j\), unless \(i = j\);
- Consistency of treatment is also required, i.e., the observed outcome for unit \(i\) is \(Y_i = Y_i(a)\) whenever \(A_i = a\), an assumption also known as “no other versions of treatment”;
- It is also necessary that, within all strata defined by \(W\), every observed unit have a bounded (non-deterministic) probability of receiving treatment, that is, \(0 < P_0(A = a \mid W) < 1\) for all \(a \in \mathcal{A}\) and all \(W\). This assumption is referred to as positivity (a simple empirical check is sketched after the remark below).
Remark: Together, the second and third assumptions above (no interference and consistency, respectively) are jointly referred to as the stable unit treatment value assumption (SUTVA).
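One commonly used empirical check of the positivity assumption is to estimate the treatment mechanism \(P_0(A = 1 \mid W)\) and inspect whether the estimated probabilities are bounded away from 0 and 1. A minimal sketch follows, using simulated data like that in the earlier hypothetical examples and a logistic regression purely for illustration (any estimator of the treatment mechanism could be used):

```r
# Hypothetical simulated data (same arbitrary mechanism as earlier sketches)
set.seed(512)
n <- 5000
W <- rbinom(n, 1, 0.5)
A <- rbinom(n, 1, plogis(-0.5 + W))

# Estimate the treatment mechanism g(W) = P(A = 1 | W), then check that the
# estimated probabilities stay away from 0 and 1
g_fit <- glm(A ~ W, family = binomial())
g_hat <- predict(g_fit, type = "response")

summary(g_hat)                                # should lie strictly inside (0, 1)
quantile(g_hat, probs = c(0, 0.01, 0.99, 1))  # extreme values signal trouble
```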
Given these assumptions, the ATE may be re-written as a function of \(P_0\), specifically \[\begin{equation}\label{eqn:estimand} ATE = \mathbb{E}_0(Y(1) - Y(0)) = \mathbb{E}_0 \left(\mathbb{E}_0[Y \mid A = 1, W] - \mathbb{E}_0[Y \mid A = 0, W]\right), \end{equation}\] that is, the difference in the predicted outcome values for each subject under the two treatment conditions (\(A = 1\) vs. \(A = 0\)), averaged over the population. Thus, a parameter of the theoretical “full” data distribution can be represented as an estimand of the observed data distribution. Significantly, nothing about this representation requires parametric assumptions; thus, the regressions on the right-hand side may be estimated freely with machine learning. For different parameters, there will be potentially different identifiability assumptions, and the resulting estimands can be functions of different components of \(P_0\). We discuss several more complex estimands in later sections of this workshop.
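As a sketch of how the right-hand side of this estimand can be computed by plug-in (substitution) estimation, consider the following hypothetical example. A simple logistic regression stands in here for whatever estimator of \(\mathbb{E}_0(Y \mid A, W)\) one actually uses; it is a placeholder, not the estimation strategy advocated in this workshop:

```r
# Hypothetical simulated observed data (same arbitrary mechanism as above)
set.seed(512)
n <- 5000
W <- rbinom(n, 1, 0.5)
A <- rbinom(n, 1, plogis(-0.5 + W))
Y <- rbinom(n, 1, plogis(-1 + A + W))

# Step 1: estimate the outcome regression Qbar(A, W) = E(Y | A, W)
#         (a logistic regression stands in for any estimator)
Q_fit <- glm(Y ~ A + W, family = binomial())

# Step 2: predict each subject's outcome under A = 1 and under A = 0
Q1 <- predict(Q_fit, newdata = data.frame(A = 1, W = W), type = "response")
Q0 <- predict(Q_fit, newdata = data.frame(A = 0, W = W), type = "response")

# Step 3: average the difference over the observed distribution of W
mean(Q1 - Q0)  # plug-in (g-computation) estimate of the ATE estimand
```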
2.5 Estimation: Targeted Maximum Likelihood Estimation
Although we will discuss estimation in greater detail in later sections, the properties we desire of an estimator are that, among sensible (asymptotically consistent, regular) estimators,
- the estimator be asymptotically efficient in the statistical model of interest, and
- the estimator can be constructed for finite-sample performance improvements, relative to other estimators in the same class.
These principles guide our approach to estimation: Super Learning for prediction (and, more generally, density estimation) and TMLE for estimation of our intervention parameters of interest.
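As a minimal sketch of Super Learning for the outcome regression \(\mathbb{E}(Y \mid A, W)\), assuming the SuperLearner R package is available; the candidate learner library and the simulated data are purely illustrative and not prescribed by the text:

```r
# A small Super Learner for E(Y | A, W); the library here is deliberately
# tiny (an intercept-only learner and a main-terms GLM) for illustration.
library(SuperLearner)

set.seed(512)
n <- 1000
W <- rbinom(n, 1, 0.5)
A <- rbinom(n, 1, plogis(-0.5 + W))
Y <- rbinom(n, 1, plogis(-1 + A + W))

sl_fit <- SuperLearner(
  Y = Y,
  X = data.frame(A = A, W = W),
  family = binomial(),
  SL.library = c("SL.mean", "SL.glm")
)
sl_fit$coef    # weights assigned to each candidate learner
sl_fit$cvRisk  # cross-validated risk of each candidate
```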
2.6 Inference
- The estimators we discuss are asymptotically linear, meaning that the difference between the estimate \(\Psi(P_n)\) and the true parameter \(\Psi(P_0)\) can be represented, to first order, by an i.i.d. sum: \[\begin{equation}\label{eqn:IC} \Psi(P_n) - \Psi(P_0) = \frac{1}{n} \sum_{i=1}^n IC(O_i; \nu) + o_p(1/\sqrt{n}), \end{equation}\]
where \(IC(O_i; \nu)\) (the influence curve or influence function) is a function of the data and possibly of nuisance parameters \(\nu\). Importantly, such estimators have mean-zero Gaussian limiting distributions; thus, in the univariate case, \[\begin{equation}\label{eqn:limit_dist} \sqrt{n}(\Psi(P_n) - \Psi(P_0)) \xrightarrow[]{D} N(0, \mathbb{V}\,IC(O_i;\nu)), \end{equation}\] so that inference for the estimator of interest may be obtained in terms of the influence function. In this simple case, a 95% confidence interval may be derived as \[\begin{equation}\label{eqn:CI} \Psi(P^{\star}_n) \pm z_{1 - \frac{\alpha}{2}} \sqrt{\frac{\hat{\sigma}^2}{n}}, \end{equation}\] where \(SE = \sqrt{\frac{\hat{\sigma}^2}{n}}\) and \(\hat{\sigma}^2\) is the sample variance of the estimated influence curve values \(IC(O; \hat{\nu})\). If a parameter of interest may be written as a function of other asymptotically linear estimators, the functional delta method can be used to derive its influence curve. (A minimal numerical illustration of such an influence-curve-based interval follows this list.)
- Thus, we can obtain robust inference for parameters estimated by fitting complex machine learning algorithms, and these methods are computationally fast, since they do not rely on resampling-based methods like the bootstrap.
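As the simplest possible numerical illustration of an influence-curve-based (Wald-style) confidence interval, consider the sample mean, whose influence curve is \(IC(O) = Y - \mathbb{E}(Y)\); the interval is then computed exactly as in the display above. This toy example is not from the original text and is not the TMLE influence curve for the ATE:

```r
# Toy illustration: IC-based 95% CI for the simplest asymptotically linear
# estimator, the sample mean, whose influence curve is IC(O) = Y - E(Y).
set.seed(512)
n <- 1000
Y <- rbinom(n, 1, 0.3)

psi_hat <- mean(Y)          # the estimate Psi(P_n)
IC_hat  <- Y - psi_hat      # estimated influence curve values
se_hat  <- sqrt(var(IC_hat) / n)

psi_hat + c(-1, 1) * qnorm(0.975) * se_hat  # 95% Wald-style CI
```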
References
Pearl, Judea. 2009. Causality: Models, Reasoning, and Inference. Cambridge University Press.