`shift_tmle.Rmd`

Stochastic treatment regimes present a relatively simple manner in which to assess the effects of continuous treatments by way of parameters that examine the effects induced by the counterfactual shifting of the observed values of a treatment of interest. Here, we present an implementation of a new algorithm for computing targeted minimum loss-based estimates of treatment shift parameters defined based on a shifting function \(d(A,W)\). For a technical presentation of the algorithm, the interested reader is invited to consult Díaz and van der Laan (2018). For additional background on Targeted Learning and previous work on stochastic treatment regimes, please consider consulting van der Laan and Rose (2011), van der Laan and Rose (2018), and Díaz and van der Laan (2012).

To start, let’s load the packages we’ll use and set a seed for simulation:

Consider \(n\) observed units \(O_1, \ldots, O_n\), where each random variable \(O = (W, A, Y)\) corresponds to a single observational unit. Let \(W\) denote baseline covariates (e.g., age, sex, education level), \(A\) an intervention variable of interest (e.g., nutritional supplements), and \(Y\) an outcome of interest (e.g., disease status). Though it need not be the case, let \(A\) be continuous-valued, i.e. \(A \in \mathbb{R}\). Let \(O_i \sim \mathcal{P} \in \mathcal{M}\), where \(\mathcal{M}\) is the nonparametric statistical model defined as the set of continuous densities on \(O\) with respect to some dominating measure. To formalize the definition of stochastic interventions and their corresponding causal effects, we introduce a nonparametric structural equation model (NPSEM), based on Pearl (2000), to define how the system changes under posited interventions: \[\begin{align*}\label{eqn:npsem} W &= f_W(U_W) \\ A &= f_A(W, U_A) \\ Y &= f_Y(A, W, U_Y), \end{align*}\] We denote the observed data structure \(O = (W, A, Y)\)

Letting \(A\) denote a continuous-valued treatment, we assume that the distribution of \(A\) conditional on \(W = w\) has support in the interval \((l(w), u(w))\) – for convenience, let this support be *a.e.* That is, the minimum natural value of treatment \(A\) for an individual with covariates \(W = w\) is \(l(w)\); similarly, the maximum is \(u(w)\). Then, a simple stochastic intervention, based on a shift \(\delta\), may be defined \[\begin{equation}\label{eqn:shift}
d(a, w) =
\begin{cases}
a - \delta & \text{if } a > l(w) + \delta \\
a & \text{if } a \leq l(w) + \delta,
\end{cases}
\end{equation}\] where \(0 \leq \delta \leq u(w)\) is an arbitrary pre-specified value that defines the degree to which the observed value \(A\) is to be shifted, where possible. For the purpose of using such a shift in practice, the present software provides the functions `shift_additive`

and `shift_additive_inv`

, which define a variation of this shift, assuming that the density of treatment \(A\), conditional on the covariates \(W\), has support *a.e.*

```
# simulate simple data for tmle-shift sketch
n_obs <- 1000 # number of observations
n_w <- 1 # number of baseline covariates
tx_mult <- 2 # multiplier for the effect of W = 1 on the treatment
## baseline covariates -- simple, binary
W <- as.numeric(replicate(n_w, rbinom(n_obs, 1, 0.5)))
## create treatment based on baseline W
A <- as.numeric(rnorm(n_obs, mean = tx_mult * W, sd = 1))
## create outcome as a linear function of A, W + white noise
Y <- A + W + rnorm(n_obs, mean = 0, sd = 0.5)
```

The above composes our observed data structure \(O = (W, A, Y)\). To formally express this fact using the `tlverse`

grammar introduced by the `tmle3`

package, we create a single data object and specify the functional relationships between the nodes in the *directed acyclic graph* (DAG) via *nonparametric structural equation models* (NPSEMs), reflected in the node list that we set up:

```
# organize data and nodes for tmle3
data <- data.table(W, A, Y)
node_list <- list(W = "W", A = "A", Y = "Y")
head(data)
```

```
## W A Y
## 1: 1 2.4031607 3.7157578
## 2: 1 4.4973744 5.9651611
## 3: 1 2.0330871 2.2531970
## 4: 0 -0.8089023 -0.8849531
## 5: 1 1.8432067 2.7193091
## 6: 1 1.3555863 2.5705832
```

We now have an observed data structure (`data`

) and a specification of the role that each variable in the data set plays as the nodes in a DAG.

To start, we will initialize a specification for the TMLE of our parameter of interest (called a `tmle3_Spec`

in the `tlverse`

nomenclature) simply by calling `tmle_shift`

. We specify the argument `shift_val = 0.5`

when initializing the `tmle3_Spec`

object to communicate that we’re interested in a shift of \(0.5\) on the scale of the treatment \(A\) – that is, we specify \(\delta = 0.5\) (note that this is an arbitrarily chosen value for this example).

```
# initialize a tmle specification
tmle_spec <- tmle_shift(shift_val = 0.5,
shift_fxn = shift_additive,
shift_fxn_inv = shift_additive_inv)
```

As seen above, the `tmle_shift`

specification object (like all `tmle3_Spec`

objects) does *not* store the data for our specific analysis of interest. Later, we’ll see that passing a data object directly to the `tmle3`

wrapper function, alongside the instantiated `tmle_spec`

, will serve to construct a `tmle3_Task`

object internally (see the `tmle3`

documentation for details). Note that, by default, the `tmle_spec`

object is set up to facilitate cross-validated estimation of likelihood components, ensuring certain empirical process conditions may be circumvented by reducing the contribution of an empirical process term to the estimated influence function (Zheng and Laan 2011). In practice, this automatic incorporation of cross-validation (CV-TMLE) means that the user need not be concerned with these theoretical conditions being satisfied; moreover, cross-validated estimation of the efficient influence function is expected to control the estimated variance.

`sl3`

To easily incorporate ensemble machine learning into the estimation procedure, we rely on the facilities provided in the `sl3`

R package. For a complete guide on using the `sl3`

R package, consider consulting https://tlverse.org/sl3, or https://tlverse.org for the `tlverse`

ecosystem, of which `sl3`

is a core component.

Using the framework provided by the `sl3`

package, the nuisance parameters of the TML estimator may be fit with ensemble learning, using the cross-validation framework of the Super Learner algorithm of van der Laan, Polley, and Hubbard (2007). To estimate the treatment mechanism (denoted g, we must make use of learning algorithms specifically suited to conditional density estimation; a list of such learners may be extracted from `sl3`

using `sl3_list_learners()`

:

`sl3_list_learners("density")`

```
## [1] "Lrnr_condensier" "Lrnr_density_discretize"
## [3] "Lrnr_density_hse" "Lrnr_density_semiparametric"
## [5] "Lrnr_haldensify" "Lrnr_rfcde"
## [7] "Lrnr_solnp_density"
```

To proceed, we’ll select two of the above learners, `Lrnr_haldensify`

for using the highly adaptive lasso for conditional density estimation, based on an algorithm given by Díaz and van der Laan (2011), and `Lrnr_rfcde`

, an approach for using random forests for conditional density estimation (Pospisil and Lee 2018):

```
# learners used for conditional density regression (i.e., propensity score)
lrn_haldensify <- Lrnr_haldensify$new(
n_bins = 10, grid_type = "equal_mass",
lambda_seq = exp(seq(-1, -13, length = 500))
)
lrn_rfcde <- Lrnr_rfcde$new(
n_trees = 1000, node_size = 5,
n_basis = 31, output_type = "observed"
)
sl_lrn_dens <- Lrnr_sl$new(
learners = list(lrn_haldensify, lrn_rfcde),
metalearner = Lrnr_solnp_density$new()
)
```

We also require an approach for estimating the outcome regression (denoted Q). For this, we build a Super Learner composed of an intercept model, a main terms GLM, and the xgboost algorithm for gradient boosting:

```
# learners used for conditional expectation regression (e.g., outcome)
lrn_mean <- Lrnr_mean$new()
lrn_fglm <- Lrnr_glm_fast$new()
lrn_xgb <- Lrnr_xgboost$new(nrounds = 10)
sl_lrn <- Lrnr_sl$new(
learners = list(lrn_mean, lrn_fglm, lrn_xgb),
metalearner = Lrnr_nnls$new()
)
```

We can make the above explicit with respect to standard notation by bundling the ensemble learners into a `list`

object below. Note that in making the `learner_list`

object for our TML estimator, for the sake of example, we use only the conditional density estimation approach of the `haldensify`

package to estimate the treatment mechanism; as this is done in an effort to save computation time, we recommend using a stacked regression approach in standard data analytic practice.

```
# specify outcome and treatment regressions and create learner list
Q_learner <- sl_lrn
# in practice, we'd use an ensemble learner; here, we do not to save time
g_learner <- sl_lrn_dens
g_cv_learner <- Lrnr_cv$new(lrn_haldensify, full_fit = TRUE)
learner_list <- list(Y = Q_learner, A = g_cv_learner)
```

The `learner_list`

object above specifies the role that each of the ensemble learners we’ve generated is to play in computing initial estimators to be used in building a TMLE for the parameter of interest here. In particular, it makes explicit the fact that our `Q_learner`

is used in fitting the outcome regression while our `g_learner`

is used in fitting our treatment mechanism regression.

Note that, by default, the

`tmle_fit <- tmle3(tmle_spec, data, node_list, learner_list)`

```
## Warning: `lang_tail()` is deprecated as of rlang 0.2.0.
## This warning is displayed once per session.
```

```
## Warning: `mut_node_cdr()` is deprecated as of rlang 0.2.0.
## This warning is displayed once per session.
```

```
## A tmle3_Fit that took 1 step(s)
## type param init_est tmle_est se lower upper
## 1: TSM E[Y_{A=NULL}] 2.051405 2.052554 0.06028563 1.934397 2.170712
## psi_transformed lower_transformed upper_transformed
## 1: 2.052554 1.934397 2.170712
```

The `print`

method of the resultant `tmle_fit`

object conveniently displays the results from computing our TML estimator.

Recall that the asymptotic distribution of TML estimators has been studied thoroughly: \[\psi_n - \psi_0 = (P_n - P_0) \cdot D(\bar{Q}_n^*, g_n) + R(\hat{P}^*, P_0),\] which, provided the following two conditions:

- If \(D(\bar{Q}_n^{\star}, g_n)\) converges to \(D(P_0)\) in \(L_2(P_0)\) norm, and
- the size of the class of functions considered for estimation of \(\bar{Q}_n^{\star}\) and \(g_n\) is bounded (technically, \(\exists \mathcal{F}\) s.t. \(D(\bar{Q}_n^{\star}, g_n) \in \mathcal{F}\)
, where \(\mathcal{F}\) is a Donsker class), readily admits the conclusion that \(\psi_n - \psi_0 = (P_n - P_0) \cdot D(P_0) + R(\hat{P}^{\star}, P_0)\).**whp**

Under the additional condition that the remainder term \(R(\hat{P}^*, P_0)\) decays as \(o_P \left( \frac{1}{\sqrt{n}} \right),\) we have that \[\psi_n - \psi_0 = (P_n - P_0) \cdot D(P_0) + o_P \left( \frac{1}{\sqrt{n}} \right),\] which, by a central limit theorem, establishes a Gaussian limiting distribution for the estimator:

\[\sqrt{n}(\psi_n - \psi) \to N(0, V(D(P_0))),\] where \(V(D(P_0))\) is the variance of the efficient influence curve (canonical gradient) when \(\psi\) admits an asymptotically linear representation.

The above implies that \(\psi_n\) is a \(\sqrt{n}\)-consistent estimator of \(\psi\), that it is asymptotically normal (as given above), and that it is locally efficient. This allows us to build Wald-type confidence intervals in a straightforward manner:

\[\psi_n \pm z_{\alpha} \cdot \frac{\sigma_n}{\sqrt{n}},\] where \(\sigma_n^2\) is an estimator of \(V(D(P_0))\). The estimator \(\sigma_n^2\) may be obtained using the bootstrap or computed directly via the following

\[\sigma_n^2 = \frac{1}{n} \sum_{i = 1}^{n} D^2(\bar{Q}_n^{\star}, g_n)(O_i)\]

Having now re-examined these facts, let’s simply examine the results of computing our TML estimator:

```
## A tmle3_Fit that took 1 step(s)
## type param init_est tmle_est se lower upper
## 1: TSM E[Y_{A=NULL}] 2.051405 2.052554 0.06028563 1.934397 2.170712
## psi_transformed lower_transformed upper_transformed
## 1: 2.052554 1.934397 2.170712
```

Díaz, Iván, and Mark J van der Laan. 2011. “Super Learner Based Conditional Density Estimation with Application to Marginal Structural Models.” *The International Journal of Biostatistics* 7 (1). De Gruyter: 1–20.

———. 2012. “Population Intervention Causal Effects Based on Stochastic Interventions.” *Biometrics* 68 (2). Wiley Online Library: 541–49.

———. 2018. “Stochastic Treatment Regimes.” In *Targeted Learning in Data Science: Causal Inference for Complex Longitudinal Studies*, 167–80. Springer Science & Business Media.

Pearl, Judea. 2000. *Causality*. Cambridge university press.

Pospisil, Taylor, and Ann B Lee. 2018. “RFCDE: Random Forests for Conditional Density Estimation.” *arXiv Preprint arXiv:1804.05753*.

van der Laan, Mark J, Eric C Polley, and Alan E Hubbard. 2007. “Super Learner.” *Statistical Applications in Genetics and Molecular Biology* 6 (1).

van der Laan, Mark J, and Sherri Rose. 2011. *Targeted Learning: Causal Inference for Observational and Experimental Data*. Springer Science & Business Media.

———. 2018. *Targeted Learning in Data Science: Causal Inference for Complex Longitudinal Studies*. Springer Science & Business Media.

Zheng, Wenjing, and Mark J van der Laan. 2011. “Cross-Validated Targeted Minimum-Loss-Based Estimation.” In *Targeted Learning*, 459–74. Springer.