\(\DeclareMathOperator{\expit}{expit}\) \(\DeclareMathOperator{\logit}{logit}\) \(\DeclareMathOperator*{\argmin}{\arg\!\min}\) \(\newcommand{\indep}{\perp\!\!\!\perp}\) \(\newcommand{\coloneqq}{\mathrel{=}}\) \(\newcommand{\R}{\mathbb{R}}\) \(\newcommand{\E}{\mathbb{E}}\) \(\newcommand{\M}{\mathcal{M}}\) \(\renewcommand{\P}{\mathbb{P}}\) \(\newcommand{\I}{\mathbb{I}}\) \(\newcommand{\1}{\mathbbm{1}}\)

4 Learning from Data: A Roadmap

Learning Objectives

  1. Translate scientific questions to statistical questions.
  2. Define a statistical model based on knowledge about the scientific experiment or study that generated the data.
  3. Identify a causal parameter as a function of the observed data distribution.
  4. Explain the following statistical and causal assumptions alongside their implications: independent and identically distributed (i.i.d.), consistency, no unmeasured confounding, no interference, positivity.


The roadmap of statistical learning is concerned with the process of translating real-world scientific questions to mathematical formalisms necessary for formulating relevant statistical inference problems. This involves viewing data as a random variable (complete with its own underlying probability distribution), incorporating scientific knowledge into the choice of statistical model, selecting a statistical target parameter that represents an answer to the scientific question of interest, and developing efficient estimators of the statistical estimand.

4.1 The Roadmap

The roadmap is a six-stage process:

  1. Define the data as a random variable with a probability distribution, \(O \sim P_0\)
  2. Specify the statistical model \(\M\) realistically, such that \(P_0 \in \M\)
  3. Translate the scientific question of interest into a statistical target parameter \(\Psi\) and establish the target population
  4. Choose an estimator \(\hat{\Psi}\) for \(\Psi\) under realistic \(\M\)
  5. Construct a measure of uncertainty for the estimate \(\hat{\Psi}(P_n)\)
  6. Make a substantive conclusion

(1) Data: A random variable with a probability distribution, \(O \sim P_0\)

The dataset we are confronted with is the collection of the results of a scientific (or natural) experiment. We can view the data as a random variable; that is, if the same experiment were to be repeated, we should expect to see a different realization of the data generated by the same underlying law governing the experiment in question. In particular, if the experiment were repeated many times, the underlying probability distribution generating the data, \(P_0\), would be revealed. The observed data on a single unit, \(O\), may be thought of as being drawn from this probability distribution \(P_0\). Most often, we have \(n\) independent and identically distributed (i.i.d.) observations of the random variable \(O\) in our dataset. Then, the observed data is the collection \(O_1, \ldots, O_n\), where the subscripts denote the individual observational units. While not all data are i.i.d., this is certainly the most common case in applied data analysis. There are a number of techniques for handling non-i.i.d. data, including establishing conditional independence, such that conditional on some variable (e.g., subject ID for repeated measures data) the i.i.d. assumption holds, and incorporating inferential corrections for repeated or clustered observations, to name but a few.

The empirical probability measure, \(P_n\)

With \(n\) i.i.d. observations in hand, we can define an empirical probability measure, \(P_n\). The empirical probability measure is an approximation of the true probability measure, \(P_0\), allowing us to learn from the observed data. For example, we can define the empirical probability measure of a set of possible values of \(O\) (an event), say \(W\), to be the proportion of observations that fall in \(W\). That is, \[\begin{equation*} P_n(W) = \frac{1}{n}\sum_{i=1}^{n} \I(O_i \in W), \end{equation*}\] where \(\I(\cdot)\) is the indicator function.
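The definition of \(P_n\) can be sketched in a few lines of base R. This is only an illustration: the Exponential data-generating law, the sample size, and the helper name `empirical_measure` are assumptions made for the example and are not part of the roadmap.

```r
# Hypothetical data: n draws of O from an ASSUMED Exponential(rate = 0.2) law
set.seed(1)
n <- 1000
o <- rexp(n, rate = 0.2)

# Empirical measure of an event: P_n(event) = (1/n) * sum_i I(O_i in event)
# `in_event` is a function returning TRUE/FALSE for each observation
empirical_measure <- function(obs, in_event) {
  mean(in_event(obs))
}

# Illustrative event: the interval (5, Inf)
p_n <- empirical_measure(o, function(x) x > 5)

# True probability of the same event under the assumed law, for comparison
p_0 <- pexp(5, rate = 0.2, lower.tail = FALSE)

print(c(empirical = p_n, truth = p_0))
```

As \(n\) grows, the empirical proportion should approach the true probability of the event, which is the sense in which \(P_n\) approximates \(P_0\).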

In order to understand the scope for learning from a particular dataset, we next need to ask “What do we know about the process that led to the data’s generation?” This brings us to Step 2.

(2) Defining the statistical model \(\M\) such that \(P_0 \in \M\)

The statistical model \(\M\) is the set of all possible probability distributions that could describe the process by which our observed data have been generated, appropriately constrained by background scientific knowledge. Often, \(\M\) is necessarily very large (i.e., non-parametric), reflecting the fact that statistical knowledge about \(P_0\) is limited.

If \(P_0\) is described by a finite number of parameters, then the statistical model is referred to as parametric. Such an assumption is made, for example, by the proposition that \(O\) has a Normal distribution with mean \(\mu\) and variance \(\sigma^2\). More generally, a parametric model may be defined as

\[\begin{equation*} \M(\theta) = \{P_{\theta} : \theta \in \R^d \}, \end{equation*}\] which describes a constrained statistical model consisting of all distributions \(P_{\theta}\) that are indexed by some finite, \(d\)-dimensional parameter \(\theta\).
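As a concrete instance of this definition (offered only as an illustration), the Normal model mentioned above corresponds to \(d = 2\) with \(\theta = (\mu, \sigma^2)\): \[\begin{equation*} \M_{\text{Normal}} = \{ N(\mu, \sigma^2) : \mu \in \R, \ \sigma^2 > 0 \}. \end{equation*}\] Here the index set is the subset \(\R \times (0, \infty)\) of \(\R^2\), since a variance must be positive; the label \(\M_{\text{Normal}}\) is notation introduced for this example only. Every distribution outside this two-parameter family (e.g., any skewed or multimodal distribution) is excluded from the model by assumption.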

The assumption that \(P_0\) has a specific, parametric form is made quite commonly. Unfortunately, this is even the case when such assumptions are not supported by domain knowledge about the data-generating process. This practice of oversimplification in the current, and traditional, culture of statistical data analysis typically complicates or entirely thwarts any attempt to reliably answer the scientific question at hand. Why, you ask? Consider how much knowledge one must have to know (beyond a shadow of a doubt) that the data-generating distribution underlying a given dataset is, in fact, governed by just two parameters, as is the case with the ubiquitously relied upon Normal distribution. Similarly, main terms Cox proportional hazards, logistic regression, and linear models imply a highly constrained statistical model, and if any of the assumptions are unwarranted then there will be bias in their result (except when treatment is randomized). The philosophy used to justify parametric assumptions is rooted in misinterpretations of the often-quoted saying of George Box, that “All models are wrong but some are useful,” which has been irresponsibly used to encourage the data analyst to make arbitrary modeling choices. However, when one makes such unfounded assumptions, it is more likely that \(\M\) does not contain \(P_0\), in which case the statistical model is said to be misspecified. Statistical model misspecification introduces a bias that leads to misleading, unreliable results and inference.

The result of unwarranted assumptions and oversimplifications is a practice of statistical data science in which starkly disparate answers to the same scientific problem emerge. Practically, this is owed to the application of distinct statistical techniques under differing modeling decisions and assumptions made (but not communicated well) by different data analysts. Even in the nascent days of statistical data analysis, it was recognized that it is “far better [to develop] an approximate answer to the right question…than an exact answer to the wrong question, which can always be made precise” (Tukey, 1962), though traditional statistics failed to heed this advice for a number of decades (Donoho, 2017). The roadmap avoids this bias by defining the statistical model through a representation of the true data-generating distribution underlying the observed data. The ultimate goal is to formulate the statistical estimation problem precisely (up to the constraints imposed by available scientific knowledge), so that one can then tailor the estimation procedure to the motivating scientific problem.

It is crucial that the domain scientist(s) have absolute clarity about what is actually known about the process/experiment that generated the data, and that this is communicated to data scientists with as much detail as possible. This knowledge is rarely ground truth itself, but instead comes in the form of scientific conventions, accepted hypotheses, and operational assumptions. It is then the data scientist’s responsibility to translate the domain knowledge into statistical knowledge about \(P_0\), and then to define the statistical model \(\M\) so that it respects what is known about \(P_0\) and makes no further restrictions. In this manner, we can ensure that \(P_0\) is contained in \(\M\), which we refer to generally as defining a realistic statistical model \(\M\).

Defining \(\M\) realistically requires a shift in the paradigm of statistical problem solving. Instead of considering the methods/software one is familiar with and then trying to solve most problems with that toolbox, one must first obtain a deep understanding of the experiment and scientific question and then formulate a plan for learning from the data in a way that respects this. This requires statisticians to have not only solid methodological and theoretical foundations but also good communication skills, as several meetings with domain experts are typically required to review details of the study, possibly refine the question of interest, translate technical details, and interpret the findings in a way that is statistically correct and agreeable to non-statistician domain experts. Unfortunately, communication between statisticians and non-statistician researchers is often fraught with misinterpretation. This is to be expected, as each has their own expertise, but proper communication about the underlying science and the motivating study can help to ensure each has appropriate context for a given statistical data analysis. The roadmap provides a principled mechanism for learning from data realistically, so that what is learned from the data represents a reliable and reproducible approximation of the answer to the scientific question of interest. As the roadmap provides a rigorous method for translating scientific knowledge and questions into a statistical framework that can be used to learn from data, it is an invaluable tool to guide communication between statisticians and non-statistician domain scientists. This brings us to our next step in the roadmap, “What are we trying to learn from the data?”

(3) The statistical target parameter \(\Psi\) and statistical estimand \(\psi_0\)

The statistical target parameter, \(\Psi\), is defined as a mapping from the statistical model, \(\M\), to the parameter space. Usually, the parameter space is the real line (but not necessarily so), in which case we can formally define the target parameter as the mapping \(\Psi: \M \rightarrow \R\). The statistical estimand, \(\psi_0 = \Psi(P_0)\), may be seen as a representation of the quantity that we wish to learn from the data: the answer to a well-specified (often causal) question of interest about a particular target population. In contrast to ordinary statistical estimands, causal estimands require an extra set of assumptions to allow for their identification from the observed data. Based on causal models (Hernán and Robins, 2022; Pearl, 2009), identification assumptions are untestable and must be justified through a combination of knowledge about the system under study and knowledge about the process by which the experiment was conducted. These assumptions are described in greater detail in the following section on causal target parameters.

For a simple example, consider a dataset containing observations of survival time (in years) on adults from the target population, for which our question of interest is “What’s the probability that an adult lives longer than five years?” We have,

\[\begin{equation*} \psi_0 = \Psi(P_0) = \P_{P_0}(O > 5) = \E_{P_0}[\I(O > 5)] = \int_5^{\infty} dP_0(o). \end{equation*}\]

The answer to this question is the statistical estimand, \(\Psi(P_0)=\psi_0\), which is the quantity we wish to learn from the data. As discussed above, back-and-forth communication between domain scientists and statisticians is often required to define \(\M\) realistically, and to finalize \(\Psi\) and the target population such that the question is supported in the data. For instance, say we are interested in learning the average effect of a headache medication for treating migraines in adults and we learn that no one with high blood pressure can receive the medication. In the next meeting with domain scientists, we might suggest that the target population be modified to adults without high blood pressure, or that the question instead involve a dynamic treatment such that, within \(\Psi\), adults with high blood pressure are never considered as individuals who could receive treatment. Once we have defined \(O\), a realistic \(\M\), and \(\Psi\), we have formally defined the statistical estimation problem. Next comes Step 4: “How do we learn from the data the approximate answer to the question of interest?”

(4) The estimator \(\hat{\Psi}\) and estimate \(\psi_n\)

To obtain a good approximation of the statistical estimand, we need an estimator \(\hat{\Psi}\), an a priori-specified algorithm defined as a mapping from the set of possible empirical distributions \(P_n\) (which live in a non-parametric statistical model \(\M_{NP}\)) to the parameter space for our target parameter of interest: \(\hat{\Psi} : \M_{NP} \rightarrow \R\). In other words, \(\hat{\Psi}\) is a function that takes as input the observed data, a realization of \(P_n\), and then outputs a value in the parameter space. Whereas the estimator is an operator that maps the empirical distribution of the observed data to a value in the parameter space, the numerical output produced by such a function is the estimate, \(\hat{\Psi}(P_n)=\psi_n\). Thus, \(\psi_n\) is an element of the parameter space as informed by the empirical probability distribution \(P_n\) of the observed data \(O_1, \ldots, O_n\). If we plug in a realization of \(P_n\) (based on a sample of size \(n\) of the random variable \(O\)), we get back an estimate \(\psi_n\) of the true parameter value \(\psi_0\). As we have motivated in Step 2, it is imperative to consider realistic statistical models for estimation. Therefore, flexible estimators that allow for parts of the data-generating process to be unrestricted are necessary. Semiparametric theory and empirical process theory provide a framework for constructing, benchmarking, and understanding the behavior of estimators that depend on flexible estimation strategies in realistic statistical models. In general, desirable properties of an estimator are that it is regular asymptotically linear (RAL) and efficient, thereby admitting a Normal limit distribution that has minimal variance. Substitution/plug-in RAL estimators are also advantageous: they are guaranteed to remain within the bounds of \(\M\) and, relative to estimators that are not plug-in, often have improved bias and variance in finite samples. In-depth discussion of the theory and these properties is available in the literature (e.g., Kennedy, 2016; van der Laan and Rose, 2011). We review a few key concepts in the following step.
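To make the distinction between estimator and estimate concrete, here is a minimal sketch in base R for the five-year survival example. The function name `psi_hat`, the Exponential data-generating law, and the sample size are assumptions chosen for illustration; the simple sample proportion used here is a plug-in estimator, not one of the more flexible estimators discussed above.

```r
# Estimator: an a priori-specified function mapping a sample (equivalently,
# its empirical distribution P_n) to a value in the parameter space [0, 1].
# Plugging P_n into Psi(P) = P(O > cutoff) gives the sample proportion.
psi_hat <- function(obs, cutoff = 5) {
  mean(obs > cutoff)
}

# Hypothetical sample from an ASSUMED Exponential(rate = 0.2) law
set.seed(2)
o <- rexp(500, rate = 0.2)

# Estimate: the number returned when the estimator is applied to the data
psi_n <- psi_hat(o)

# True estimand under the assumed law: P_0(O > 5) = exp(-0.2 * 5)
psi_0 <- exp(-0.2 * 5)

print(c(estimate = psi_n, estimand = psi_0))
```

The function `psi_hat` is fixed before seeing any data, whereas `psi_n` changes with every new sample.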

In order to quantify the uncertainty in our estimate of the target parameter, part of the process of conducting statistical inference, an understanding of the sampling distribution of our estimator is necessary. This brings us to Step 5: “How confident should we be in our statistical answer to the scientific question?”

(5) A measure of uncertainty for the estimate \(\psi_n\)

Since the estimator \(\hat{\Psi}\) is a function of the empirical distribution \(P_n\), the estimator itself is a random variable with a sampling distribution. Therefore, if we were to repeat the experiment of drawing \(n\) observations, each repetition would yield a different realization of our estimate. The hypothetical distribution of these estimates is the sampling distribution of the estimator.
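The idea of a sampling distribution can be illustrated by simulation, where, unlike in practice, the experiment can be repeated at will. The sketch below assumes a hypothetical Exponential data-generating law and uses the sample proportion of observations exceeding five as the estimator; both choices are for illustration only.

```r
set.seed(3)
n <- 200            # sample size of each hypothetical experiment
n_repeats <- 2000   # number of times the experiment is repeated
psi_0 <- exp(-1)    # true P(O > 5) under the ASSUMED Exponential(0.2) law

# Each repetition draws a fresh sample and recomputes the estimate
estimates <- replicate(n_repeats, mean(rexp(n, rate = 0.2) > 5))

# The spread of the estimates approximates the estimator's sampling
# distribution; for a sample proportion the theoretical standard deviation
# is sqrt(psi_0 * (1 - psi_0) / n)
theory_sd <- sqrt(psi_0 * (1 - psi_0) / n)
print(c(mean = mean(estimates), sd = sd(estimates), theory_sd = theory_sd))
```

A histogram of `estimates` (e.g., via `hist(estimates)`) should look approximately Normal and centered near `psi_0`.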

A primary goal in the construction of estimators is to be able to derive their asymptotic sampling distribution through a theoretical analysis involving empirical process theory. In this regard, an important property of the estimators on which we focus is their asymptotic linearity. In particular, asymptotic linearity states that the difference between the estimator and the target parameter (i.e., the truth) can be represented, asymptotically, as an average of i.i.d. random variables plus an asymptotically negligible remainder term:

\[\begin{equation*} \hat{\Psi}(P_n) - \Psi(P_0) = \frac{1}{n} \sum_{i=1}^n IC(P_0)(O_i) + o_p(n^{-1/2}), \end{equation*}\] where the influence curve (IC) is a function of the observed data \(O\) but the function itself is defined by the underlying data-generating distribution \(P_0\). Based on this asymptotic approximation, the Central Limit Theorem can be used to show

\[\begin{equation*} \sqrt{n} \left(\hat{\Psi}(P_n) - \Psi(P_0)\right) \xrightarrow{d} N(0, \sigma^2_{IC}), \end{equation*}\] where \(\xrightarrow{d}\) denotes convergence in distribution as \(n \to \infty\) and \(\sigma^2_{IC}\) is the variance of \(IC(P_0)(O)\). Given an estimate of \(\sigma^2_{IC}\), it is then possible to construct classic, asymptotically accurate Wald-type confidence intervals (CIs) and hypothesis tests. For example, a standard \((1 - \alpha)\) CI takes the form

\[\begin{equation*} \psi_n \pm Z \frac{\hat{\sigma}_{IC}}{\sqrt{n}} \ , \end{equation*}\] where \(Z\) is the \((1 - \alpha / 2)^\text{th}\) quantile of the standard Normal distribution. Following convention, we will often be interested in constructing 95% two-tailed CIs, corresponding to probability mass \(\alpha/2 = 0.025\) in each tail of the limit distribution; thus, we will take \(Z \approx 1.96\) as the quantile.
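The Wald-type interval can be sketched for the simplest possible case: the sample proportion estimating \(P_0(O > 5)\), whose influence curve is \(\I(O > 5) - \psi_0\). The data-generating law and variable names below are assumptions for illustration; estimators built on flexible regression fits have more involved influence curves, but the interval is constructed in the same way.

```r
set.seed(4)
n <- 1000
o <- rexp(n, rate = 0.2)   # hypothetical data from an ASSUMED law

# Point estimate: the sample proportion of observations exceeding 5
psi_n <- mean(o > 5)

# Estimated influence curve values, with psi_n plugged in for psi_0
ic_n <- as.numeric(o > 5) - psi_n

# Estimated standard deviation of the influence curve
sigma_n <- sqrt(mean(ic_n^2))

# (1 - alpha / 2) quantile of the standard Normal distribution
alpha <- 0.05
z <- qnorm(1 - alpha / 2)

# Wald-type 95% confidence interval: psi_n +/- z * sigma_n / sqrt(n)
ci <- psi_n + c(-1, 1) * z * sigma_n / sqrt(n)
print(c(estimate = psi_n, lower = ci[1], upper = ci[2]))
```

Because the standard error shrinks at rate \(1/\sqrt{n}\), quadrupling the sample size roughly halves the width of the interval.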

Steps (1)–(5) of the roadmap define the statistical analysis plan, all of which can be specified before any data are revealed. The last step of the roadmap involves interpreting the results obtained in steps (4) and (5) and therefore requires the data to be analyzed; however, any additional analysis that may take place as part of step (6) can be pre-specified as well. This final step of the roadmap addresses the question, “What is the interpretation and robustness of the study’s findings, and what conclusions can be drawn from them?”

(6) Make a substantive conclusion

Making the substantive conclusion involves interpreting the study findings. It also provides an opportunity to ask follow-up questions that might be addressed later and/or discuss issues that can inform future studies. Statistical estimands \(\psi_0\) can have statistical (noncausal) and causal interpretations. Both are often of interest and can be provided. The target population should be clearly mentioned in the interpretation, regardless of whether it’s a purely statistical or causal interpretation, to curtail extrapolation of results.

The major distinction between statistical and causal interpretations is that the latter rely on untestable, so-called “identifiability” assumptions. In the following section, we review these assumptions one by one. Here, we focus on the interpretation and robustness of the study findings with respect to them. Specifically, causal target parameters cannot be estimated from observed data without additional identifiability assumptions, and so the validity of a result’s causal interpretation hinges on them holding in the data. The more severely these assumptions are violated, the larger the causal gap, that is, the difference between the statistical estimand and the causal estimand. In a perfect randomized controlled trial with no loss to follow-up, the causal gap will be zero, as the statistical and causal estimands are equivalent. Díaz and van der Laan (2013) propose a non-parametric sensitivity analysis for assessing the impact of hypothesized causal gaps on estimates and inference. Gruber et al. (2022) and Gruber et al. (2023) provide example implementations of these methods; in particular, the difference between adjusted and unadjusted effect estimates is used to define a range of possible causal gaps relative to this difference. If the question of interest is causal, then such a model-free sensitivity analysis (possibly as a complement to other sensitivity analyses) is recommended to assess the robustness of the study findings.

4.2 Summary of the Roadmap

Data collected across \(n\) i.i.d. units, \(O_1, \ldots, O_n\), may be viewed as a collection of random variables arising from the same underlying probability distribution \(P_0\). This is expressed by denoting the collection of data as being generated as \(O_1, \ldots, O_n \sim P_0\). Domain knowledge about the experiment that generated the data (e.g., whether the treatment was randomized, whether the treatment decision or loss to follow-up depended on a subset of covariates, the time ordering in which the variables were added to the data) is translated by the statistician or data scientist to define the statistical model \(\M\), a postulated space of candidate probability distributions that is supposed to contain \(P_0\). In particular, the roadmap emphasizes the critical role of defining \(\M\) such that \(P_0\) is guaranteed to be contained in it, \(P_0 \in \M\). By limiting \(\M\) only on the basis of domain knowledge about the experiment (i.e., reality), as opposed to constraining it unrealistically (e.g., assuming that a restrictive functional form, like a main terms linear/logistic model, describes \(P_0\)), it can be ensured that \(P_0 \in \M\), and we refer to this as defining a realistic statistical model. Often, knowledge that can be used to constrain \(\M\) is very limited, and so \(\M\) must be very large for \(P_0 \in \M\) to hold; hence, realistic statistical models are often termed semi- or non-parametric, since they are too large to be indexed by a finite-dimensional set of parameters. Necessarily, our statistical query must begin with, “What are we trying to learn from the data?”, a question whose answer is captured by the statistical target parameter, \(\Psi\), a mapping from \(\M\) to the parameter space whose value at the true data-generating distribution \(P_0\) is the statistical estimand, \(\psi_0 = \Psi(P_0)\). At this stage, the statistical estimation problem is formally defined, allowing for the use of statistical theory to guide the construction of estimators, which are algorithms that approximate the answer to the question of interest by learning from the data. Desirable properties of an estimator are that it is unbiased, efficient, plug-in, and robust in finite samples. If the question of interest is causal, then a model-free sensitivity analysis is recommended to assess the robustness of the study’s findings under various hypothesized causal gaps.

4.3 Causal Target Parameters

In many cases, we are interested in problems that ask questions regarding the causal effect of an intervention, whether an assigned treatment (e.g., a prescribed drug) or a “naturally occurring” exposure (e.g., pollution from a nearby factory), on a future outcome of interest. These causal effects may be defined as summaries of the population of interest (e.g., population mean of a particular outcome) under contrasting interventions (e.g., comparing the treated to the untreated condition). For example, a causal effect could be defined as the mean difference of a disease outcome between two counterfactual scenarios: one in which the study population were set to uniformly experience low levels of some pollutant, and one in which the same population were set to uniformly experience high levels of the same pollutant.

There are different ways of operationalizing the theoretical experiments that generate the counterfactual data necessary for describing such causal contrasts of interest. We could simply assume that the counterfactual outcomes exist in theory for all treatment contrasts of interest (Imbens and Rubin, 2015; Neyman, 1938; Rubin, 2005), which may be encoded in so-called “science tables”. Alternatively, we could consider interventions on structural causal models (SCMs) (Pearl, 1995, 2009), which may be represented by directed acyclic graphs (DAGs). Both frameworks allow for the known or hypothesized set of relationships between variables in the system under study to be encoded and mathematically formalized.

The Causal Model

Throughout, we will focus on the use of DAGs and SCMs for the description of causal parameters. Estimators of statistical parameters that correspond, under standard but untestable identifiability assumptions, to these causal parameters are introduced below. DAGs are a particularly useful tool for visually expressing what we know about the causal relations among variables in the system under study. Ignoring exogenous \(U\) terms (explained below), we assume the following ordering of the variables that compose the observed data: \(O = (W, A, Y)\), with covariates \(W\), exposure \(A\), and outcome \(Y\). We demonstrate the construction of a DAG below using DAGitty (Textor et al., 2011):


# load packages for specifying and plotting DAGs
library(dagitty)
library(ggdag)

# make DAG by specifying dependence structure
dag <- dagitty(
  "dag {
    W -> A
    W -> Y
    A -> Y
    W -> A -> Y
  }"
)
exposures(dag) <- c("A")
outcomes(dag) <- c("Y")
tidy_dag <- tidy_dagitty(dag)

# visualize DAG
ggdag(tidy_dag) +
  theme_dag()

While DAGs like the above provide a convenient means by which to express the causal relations between variables, these same causal relations can be equivalently represented by an SCM: \[\begin{align*} W &= f_W(U_W) \\ A &= f_A(W, U_A) \\ Y &= f_Y(W, A, U_Y), \end{align*}\] where the \(f\)’s are unspecified deterministic functions that generate the corresponding random variables as a function of the variable’s “parents” (i.e., upstream nodes with arrows into the given random variable) in the DAG, and the unobserved, exogenous error terms (i.e., the \(U\)’s). An SCM may be thought of as a representation of the algorithm that produces the data, \(O\), in the population of interest. Much of statistics and data science is devoted to discovering properties of this system of equations (e.g., estimation of the functional form \(f_Y\) governing the outcome variable \(Y\)).
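Reading the SCM as "the algorithm that produces the data" suggests a direct simulation. The sketch below uses base R only; the specific functional forms of `f_w`, `f_a`, and `f_y`, the distributions of the exogenous errors, and their mutual independence are all hypothetical choices made for illustration, since the SCM itself leaves them unspecified.

```r
set.seed(5)
n <- 5000

# Exogenous errors (ASSUMED mutually independent for this illustration)
u_w <- runif(n)
u_a <- runif(n)
u_y <- rnorm(n)

# Hypothetical structural equations; each node is a deterministic function
# of its parents in the DAG and its own exogenous error
f_w <- function(u_w) as.numeric(u_w < 0.5)
f_a <- function(w, u_a) as.numeric(u_a < plogis(-0.5 + w))
f_y <- function(w, a, u_y) 1 + 2 * a + w + a * w + u_y

# Generate the data in the time ordering W, then A, then Y
w <- f_w(u_w)
a <- f_a(w, u_a)
y <- f_y(w, a, u_y)

obs <- data.frame(W = w, A = a, Y = y)
print(head(obs))
```

In this hypothetical system, exposure is more likely when \(W = 1\), and \(W\) also raises the outcome, so \(W\) acts as a confounder of the relationship between \(A\) and \(Y\).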

The first hypothetical experiment we will consider is assigning exposure to the entire population and observing the outcome, and then withholding exposure from the same population and observing the outcome. This corresponds to a comparison of the outcome distribution in the population under two distinct interventions:

  1. \(A\) is set to \(1\) for all individuals, and
  2. \(A\) is set to \(0\) for all individuals.

These interventions may be thought of as operations that imply changes to the structural equations in the system under study. For the case \(A = 1\), we have \[\begin{align*} W &= f_W(U_W) \\ A &= 1 \\ Y(1) &= f_Y(W, 1, U_Y) \ , \end{align*}\] while, for the case \(A=0\), \[\begin{align*} W &= f_W(U_W) \\ A &= 0 \\ Y(0) &= f_Y(W, 0, U_Y) \ . \end{align*}\]

In these equations, \(A\) is no longer a function of \(W\) because the intervention on the system set \(A\) deterministically to one of the values \(1\) or \(0\) consistent with the intervention performed. The new symbols \(Y(1)\) and \(Y(0)\) indicate the values the outcome variable would take in the population of interest when it is generated by removing the contribution of \(A\) to \(f_Y\) and instead setting \(A\) to the values \(1\) and \(0\), respectively. The variables \(Y(1)\) and \(Y(0)\) are often called counterfactuals (since they arise from interventions that run contrary to fact) and are, in other frameworks, called the potential outcomes of \(Y\) (Imbens and Rubin, 2015; Neyman, 1938; Rubin, 2005). The difference in the counterfactual means of the outcome under these two interventions defines a well-known causal parameter that is most often called the “average treatment effect” (ATE) and is denoted

\[\begin{equation} ATE = \E_X[Y(1) - Y(0)], \tag{4.1} \end{equation}\] where \(\E_X(\cdot)\) is the expectation taken over the theoretical (unobservable) full data (i.e., \(X = (W, Y(1), Y(0))\)) distribution \(P_X\). Note that the full data structure \(X\) is, by its very definition, unobservable since one can never observe both of \(Y(1)\) and \(Y(0)\) for the same observational unit.
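In a simulation, unlike in real data, the full data \(X = (W, Y(1), Y(0))\) can be generated by intervening on the structural equations, so the ATE can be computed directly. The sketch below assumes hypothetical forms for `f_w` and `f_y` and hypothetical error distributions; under these choices the individual effect is \(2 + W\), so the true ATE is \(2 + P(W = 1) = 2.5\).

```r
set.seed(6)
n <- 100000

# Exogenous errors for W and Y (ASSUMED independent); A needs no error
# term here because it is set by intervention rather than generated
u_w <- runif(n)
u_y <- rnorm(n)

# Hypothetical structural equations (illustrative choices only)
f_w <- function(u_w) as.numeric(u_w < 0.5)
f_y <- function(w, a, u_y) 1 + 2 * a + w + a * w + u_y

w <- f_w(u_w)

# Counterfactual outcomes: the same units and the same errors, with A set
# deterministically to 1 and then to 0
y1 <- f_y(w, 1, u_y)
y0 <- f_y(w, 0, u_y)

# ATE: the mean of the individual effects Y(1) - Y(0) over the full data
ate <- mean(y1 - y0)
print(ate)
```

Both `y1` and `y0` are available for every unit only because the data are simulated; in observed data at most one of them is seen per unit.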

We can define much more complicated interventions on SCMs, such as interventions based upon dynamic rules (which assign particular interventions based on a function of the covariates \(W\)), stochastic rules (which can even account for the natural value of \(A\) observed in the absence of the intervention), and much more. Each results in a different target causal parameter and entails different identifiability assumptions discussed below.


Since we can never simultaneously observe \(Y(0)\), the counterfactual outcome when \(A=0\), and \(Y(1)\), the counterfactual outcome when \(A=1\), we cannot directly compute their difference \(Y(1) - Y(0)\) (the individual treatment effect), which appears in Equation (4.1) (inside the expectation \(\E_X(\cdot)\) that defines the ATE). This is called the Fundamental Problem of Causal Inference (Holland, 1986). Thus, one of the primary activities in causal inference is to identify the assumptions necessary to express causal quantities of interest as functions of the data-generating distribution of the observed data, \(P_0\), so that such quantities may be estimated from the observed data \(O \sim P_0\). Fortunately, given the causal model specified in the SCM above, we can, with a handful of untestable assumptions, estimate the ATE from observational data. These assumptions may be summarized as follows.

Definition 4.1 (Consistency) The outcome for unit \(i\) is \(Y_i(a)\) whenever \(A_i = a\), which may be thought of as “no other versions of treatment” or “no side effects of treatment.”

Definition 4.2 (No Interference) The outcome for unit \(i\), \(Y_i\), cannot be affected by the exposure of unit \(j\), \(A_j\), for all \(i \neq j\).

Definition 4.3 (No Unmeasured Confounding) \(A \indep Y(a) \mid W\) for all \(a \in \mathcal{A}\), which states that the potential outcomes \((Y(a) : a \in \mathcal{A})\) arise independently from exposure status \(A\), conditional on the observed covariates \(W\). This is the analog of the randomization assumption in data arising from natural experiments, ensuring that the effect of \(A\) on \(Y\) can be disentangled from that of \(W\) on \(Y\), even though \(W\) affects both.

Definition 4.4 (Positivity/Overlap) All observed units, across strata defined by \(W\), must have a probability of receiving each level of treatment that is bounded away from 0 and 1; that is, \(\epsilon < \P(A = a \mid W) < 1 - \epsilon\) for all \(a\) and \(W\) and for some \(\epsilon > 0\).

Technically speaking, only the latter two of these assumptions are necessary when working within the SCM framework, as the first two are implied properties of an SCM for i.i.d. data (if you’re really curious, see the commentary of Pearl (2010) for an extended discussion). We introduce all four identification assumptions because they are most often considered together, and all four are necessary when working within the potential outcomes framework (Imbens and Rubin, 2015; Rubin, 2005).
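Of the four assumptions, positivity is the one for which the observed data can at least raise a warning flag. The sketch below, in base R, estimates \(\P(A = 1 \mid W)\) within strata of a binary covariate and compares it against a threshold. The simulated data, the threshold `epsilon = 0.05`, and the variable names are assumptions for illustration; an empirical check of this kind can reveal practical violations but cannot prove that positivity holds.

```r
set.seed(7)
n <- 5000

# Hypothetical observed data: binary covariate W and binary exposure A,
# with exposure probability depending on W (illustrative choices only)
w <- rbinom(n, 1, 0.5)
a <- rbinom(n, 1, plogis(-0.5 + w))

# Empirical estimate of P(A = 1 | W = w) within each stratum of W
g_n <- tapply(a, w, mean)

# Illustrative threshold; because A is binary, bounding P(A = 1 | W) away
# from 0 and 1 also bounds P(A = 0 | W) away from 0 and 1
epsilon <- 0.05
positivity_ok <- all(g_n > epsilon & g_n < 1 - epsilon)

print(g_n)
print(positivity_ok)
```

With continuous or high-dimensional \(W\), strata cannot be enumerated this way, and the conditional exposure probability would instead need to be estimated with a regression method.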

Under these assumptions, the ATE may be re-written as a function of \(P_0\), the distribution of the observed data:

\[\begin{align} \psi_{\text{ATE}} &= \E_X[Y(1) - Y(0)] \nonumber \\ &= \E_0 [\E_0[Y \mid A = 1, W] - \E_0[Y \mid A = 0, W]] \ , \tag{4.2} \end{align}\] where \(\E_0\) denotes expectation under the observed data distribution \(P_0\). In words, the ATE is the mean difference in the predicted outcome values for each subject, under the contrast of treatment conditions (\(A = 1\) versus \(A = 0\)), in the population (when averaged over all observations). Thus, a parameter of a theoretical complete (or “full”) data distribution can be represented as an estimand of the observed data distribution. Significantly, there is nothing about the representation in Equation (4.2) that requires parametric assumptions; thus, the regression functions on the right-hand side may be estimated without restrictive assumptions about their underlying functional forms. With different parameters, there will be potentially different identifiability assumptions and the resulting estimands can be functions of different components of \(P_0\). We discuss several more complex estimands in subsequent chapters.
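The right-hand side of Equation (4.2) suggests a simple plug-in recipe: estimate the outcome regression, predict for every unit under \(A = 1\) and \(A = 0\), and average the difference. The sketch below follows that recipe in base R on simulated data. The data-generating mechanism (with true ATE equal to 2.5) and the use of `lm` are assumptions for illustration; because \(W\) and \(A\) are both binary here, the model `Y ~ A * W` is saturated and imposes no functional-form restriction, whereas richer covariates would call for more flexible regression methods.

```r
set.seed(8)
n <- 5000

# Hypothetical observational data in which W confounds the A-Y relationship
w <- rbinom(n, 1, 0.5)
a <- rbinom(n, 1, plogis(-0.5 + w))
y <- 1 + 2 * a + w + a * w + rnorm(n)
obs <- data.frame(W = w, A = a, Y = y)

# Step 1: estimate the outcome regression E[Y | A, W]
fit <- lm(Y ~ A * W, data = obs)

# Step 2: predict each unit's outcome with A set to 1 and then to 0
q1 <- predict(fit, newdata = transform(obs, A = 1))
q0 <- predict(fit, newdata = transform(obs, A = 0))

# Step 3: average the difference over the empirical distribution of W
psi_n <- mean(q1 - q0)

# For comparison: the unadjusted difference in means, which ignores W
unadjusted <- mean(obs$Y[obs$A == 1]) - mean(obs$Y[obs$A == 0])

print(c(adjusted = psi_n, unadjusted = unadjusted, truth = 2.5))
```

In this simulation the adjusted estimate should land near the true value, while the unadjusted contrast is pulled upward because units with \(W = 1\) are both more likely to be exposed and have higher outcomes.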