# Chapter 4 Ensemble Machine Learning

*Rachael Phillips*

Based on the `sl3`

`R`

package by *Jeremy
Coyle, Nima Hejazi, Ivana Malenica, and Oleg Sofrygin*.

Updated: 2020-01-17

## 4.1 Learning Objectives

By the end of this chapter you will be able to:

- Select a loss function that is appropriate for the functional parameter to be estimated.
- Assemble an ensemble of learners based on the properties that identify what features they support.
- Customize learner hyperparameters to incorporate a diversity of different settings.
- Select a subset of available covariates and pass only those variables to the modeling algorithm.
- Fit an ensemble with nested cross-validation to obtain an estimate of the performance of the ensemble itself.
- Obtain
`sl3`

variable importance metrics. - Interpret the discrete and continuous super learner fits.
- Rationalize the need to remove bias from the super learner to make an optimal bias–variance tradeoff for the parameter of interest.

## 4.2 Introduction

In Chapter 1, we introduced the road map for targeted learning as a
general template to translate real-world data applications into formal
statistical estimation problems. The first steps of this roadmap define the
*statistical estimation problem*, which establish

- Data as a realization of a random variable, or equivalently, an outcome of a particular experiment.
- A statistical model, representing the true knowledge about the data-generating experiment.
- A translation of the scientific question, which is often causal, into a target parameter.

Note that if the target parameter is causal, step 3 also requires establishing identifiability of the target quantity from the observed data distribution, under possible non-testable assumptions that may not necessarily be reasonable. Still, the target quantity does have a valid statistical interpretation. See causal target parameters for more detail on causal models and identifiability.

Now that we have defined the statistical estimation problem, we are ready to
construct the TMLE; an asymptotically linear and efficient substitution
estimator of this target quantity. The first step in this estimation procedure
is an initial estimate of the data-generating distribution, or the relevant part
of this distribution that is needed to evaluate the target parameter. For this
initial estimation, we use the Super Learner (van der Laan, Polley, and Hubbard 2007). The super learner
provides an important step in creating a robust estimator. It is a
loss-function-based tool that uses cross-validation to obtain the best
prediction of our target parameter, based on a weighted average of a library of
machine learning algorithms. This library of machine learning algorithms
consists of functions (“learners” in the `sl3`

nomenclature) that we think
might be consistent with the true data-generating distribution. The
ensembling of algorithms with weights (“metalearning” in the `sl3`

nomenclature)
has been shown to be adaptive and robust, even in small samples
(Polley and van der Laan 2010). The Super Learner has been proven to be asymptotically as
accurate as the best possible prediction algorithm in the library
(van der Laan and Dudoit 2003; van der Vaart, Dudoit, and van der Laan 2006).

### 4.2.1 Background

A *loss function* \(L\) is defined as a function of the observed data and a
candidate parameter value \(\psi\), which has unknown true value \(\psi_0\),
\(L(\psi)(O)\). We can estimate the loss by substituting the empirical
distribution \(P_n\) for the true (but unknown) distribution of the observed data
\(P_0\). A valid loss function will have expectation (risk) that is minimized at
the true value of the parameter \(\psi_0\). For example, the conditional mean
minimizes the risk of the squared error loss. Thus, it is a valid loss function
when estimating the conditional mean.

The *discrete super learner*, or *cross-validation selector*, is the algorithm
in the library that minimizes the cross-validated empirical risk. The
cross-validated empirical risk of an algorithm is defined as the empirical mean
over a validation sample of the loss of the algorithm fitted on the training
sample, averaged across the splits of the data.

The *continuous/ensemble super learner* is a weighted average of the library of
algorithms, where the weights are chosen to minimize the cross-validated
empirical risk of the library. Restricting the weights to be positive and sum to
one (i.e., a convex combination) has been shown to improve upon the discrete
super learner (Polley and van der Laan 2010; van der Laan, Polley, and Hubbard 2007). This notion of weighted
combinations was introduced in Wolpert (1992) for neural networks and
adapted for regressions in Breiman (1996).

For more detail on super learner we refer the reader to van der Laan, Polley, and Hubbard (2007) and Polley and van der Laan (2010). The optimality results for the cross-validation selector among a family of algorithms were established in van der Laan and Dudoit (2003) and extended in van der Vaart, Dudoit, and van der Laan (2006).

## 4.3 Basic Implementation

We begin by illustrating the basic functionality of the Super Learner
algorithm as implemented in `sl3`

. The `sl3`

implementation consists of the
following steps:

- Load the necessary libraries and data
- Define the machine learning task
- Make a super learner by creating library of base learners and a metalearner
- Train the super learner on the machine learning task
- Obtain predicted values

### 4.3.1 WASH Benefits Study Example

Using the WASH data, we are interested in predicting weight-for-height z-score
`whz`

using the available covariate data. Let’s begin!

### 0. Load the necessary libraries and data

First, we will load the relevant `R`

packages, set a seed, and load the data.

```
library(here)
library(data.table)
library(knitr)
library(kableExtra)
library(tidyverse)
library(origami)
library(SuperLearner)
library(sl3)
set.seed(7194)
# load data set and take a peek
washb_data <- fread("https://raw.githubusercontent.com/tlverse/tlverse-data/master/wash-benefits/washb_data.csv",
stringsAsFactors = TRUE)
head(washb_data) %>%
kable(digits = 4) %>%
kableExtra:::kable_styling(fixed_thead = T) %>%
scroll_box(width = "100%", height = "300px")
```

whz | tr | fracode | month | aged | sex | momage | momedu | momheight | hfiacat | Nlt18 | Ncomp | watmin | elec | floor | walls | roof | asset_wardrobe | asset_table | asset_chair | asset_khat | asset_chouki | asset_tv | asset_refrig | asset_bike | asset_moto | asset_sewmach | asset_mobile |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0.00 | Control | N05265 | 9 | 268 | male | 30 | Primary (1-5y) | 146.40 | Food Secure | 3 | 11 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |

-1.16 | Control | N05265 | 9 | 286 | male | 25 | Primary (1-5y) | 148.75 | Moderately Food Insecure | 2 | 4 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |

-1.05 | Control | N08002 | 9 | 264 | male | 25 | Primary (1-5y) | 152.15 | Food Secure | 1 | 10 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |

-1.26 | Control | N08002 | 9 | 252 | female | 28 | Primary (1-5y) | 140.25 | Food Secure | 3 | 5 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |

-0.59 | Control | N06531 | 9 | 336 | female | 19 | Secondary (>5y) | 150.95 | Food Secure | 2 | 7 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |

-0.51 | Control | N06531 | 9 | 304 | male | 20 | Secondary (>5y) | 154.20 | Severely Food Insecure | 0 | 3 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |

### 1. Define the machine learning task

To define the machine learning **“task”** (predict weight-for-height z-score `whz`

using the available covariate data), we need to create an `sl3_Task`

object. The
`sl3_Task`

keeps track of the roles the variables play in the machine learning
problem, the data, and any metadata (e.g., observational-level weights, id,
offset).

```
# specify the outcome and covariates
outcome <- "whz"
covars <- colnames(washb_data)[-which(names(washb_data) == outcome)]
# create the sl3 task
washb_task <- make_sl3_Task(
data = washb_data,
covariates = covars,
outcome = outcome
)
```

```
Warning in process_data(data, nodes, column_names = column_names, flag = flag, :
Missing covariate data detected: imputing covariates.
```

This warning is important. The task just imputed missing covariates for us.
Specifically, for each covariate column with missing values, `sl3`

uses the
median to impute missing continuous covariates, and the mode to impute binary or
categorical covariates. Also, for each covariate column with missing values,
`sl3`

adds an additional column indicating whether or not the value was imputed,
which is particularly handy when the missingness in the data might be
informative.

Also, notice that we did not specify the number of folds, or the loss function in the task. The default cross-validation scheme is V-fold, with the number of folds \(V=10\).

Let’s visualize our `washb_task`

.

```
A sl3 Task with 4695 obs and these nodes:
$covariates
[1] "tr" "fracode" "month" "aged"
[5] "sex" "momage" "momedu" "momheight"
[9] "hfiacat" "Nlt18" "Ncomp" "watmin"
[13] "elec" "floor" "walls" "roof"
[17] "asset_wardrobe" "asset_table" "asset_chair" "asset_khat"
[21] "asset_chouki" "asset_tv" "asset_refrig" "asset_bike"
[25] "asset_moto" "asset_sewmach" "asset_mobile" "delta_momage"
[29] "delta_momheight"
$outcome
[1] "whz"
$id
NULL
$weights
NULL
$offset
NULL
```

### 2. Make a super learner

Now that we have defined our machine learning problem with the task, we are
ready to **“make”** the super learner. This requires specification of

- A library of base learning algorithms that we think might be consistent with the true data-generating distribution.
- A metalearner, to ensemble the base learners.

We might also incorporate

- Feature selection, to pass only a subset of the predictors to the algorithm.
- Hyperparameter specification, to tune base learners.

Learners have properties that indicate what features they support. We may use
`sl3_list_properties()`

to get a list of all properties supported by at least
one learner.

```
[1] "binomial" "categorical" "continuous"
[4] "cv" "density" "ids"
[7] "multivariate_outcome" "offset" "preprocessing"
[10] "timeseries" "weights" "wrapper"
```

Since we have a continuous outcome, we may identify the learners that support
this outcome type with `sl3_list_learners()`

.

```
[1] "Lrnr_arima" "Lrnr_bartMachine"
[3] "Lrnr_bilstm" "Lrnr_caret"
[5] "Lrnr_condensier" "Lrnr_dbarts"
[7] "Lrnr_earth" "Lrnr_expSmooth"
[9] "Lrnr_gam" "Lrnr_gbm"
[11] "Lrnr_glm" "Lrnr_glm_fast"
[13] "Lrnr_glmnet" "Lrnr_grf"
[15] "Lrnr_h2o_glm" "Lrnr_h2o_grid"
[17] "Lrnr_hal9001" "Lrnr_HarmonicReg"
[19] "Lrnr_lstm" "Lrnr_mean"
[21] "Lrnr_nnls" "Lrnr_optim"
[23] "Lrnr_pkg_SuperLearner" "Lrnr_pkg_SuperLearner_method"
[25] "Lrnr_pkg_SuperLearner_screener" "Lrnr_polspline"
[27] "Lrnr_randomForest" "Lrnr_ranger"
[29] "Lrnr_rpart" "Lrnr_rugarch"
[31] "Lrnr_screener_corP" "Lrnr_screener_corRank"
[33] "Lrnr_screener_randomForest" "Lrnr_solnp"
[35] "Lrnr_stratified" "Lrnr_svm"
[37] "Lrnr_tsDyn" "Lrnr_xgboost"
```

Now that we have an idea of some learners, we can construct them using the
`make_learner`

function.

```
# choose base learners
lrnr_glm <- make_learner(Lrnr_glm)
lrnr_mean <- make_learner(Lrnr_mean)
lrnr_glmnet <- make_learner(Lrnr_glmnet)
```

We can customize learner hyperparameters to incorporate a diversity of different settings.

Documentation for the learners and their hyperparameters can be found
in the `sl3`

Learners
Reference.

We can also include learners from the `SuperLearner`

`R`

package.

```
lrnr_ranger100 <- make_learner(Lrnr_ranger, num.trees = 100)
lrnr_hal_simple <- make_learner(Lrnr_hal9001, degrees = 1, n_folds = 2)
lrnr_gam <- Lrnr_pkg_SuperLearner$new("SL.gam")
lrnr_bayesglm <- Lrnr_pkg_SuperLearner$new("SL.bayesglm")
```

*Are you interested in creating a new base learning algorithm?* If so,
instructions are provided in Defining New `sl3`

Learners.

In order to assemble the library of learners, we need to **“stack”** them
together. A `Stack`

is a special learner and it has the same interface as all
other learners. What makes a stack special is that it combines multiple learners
by training them simultaneously, so that their predictions can be either
combined or compared.

```
stack <- make_learner(
Stack,
lrnr_glm, lrnr_mean, lrnr_ranger100, lrnr_glmnet,
lrnr_gam, lrnr_bayesglm
)
```

We will fit a non-negative least squares metalearner using `Lrnr_nnls`

. Note
that any learner can be used as a metalearner. `Lrnr_nnls`

is a solid choice
for a metalearner, since it creates a convex combination of the learners when
combining them. To

We can optionally select a subset of available covariates and pass only those variables to the modeling algorithm.

Let’s consider screening covariates based on their correlation with our outcome
of interest (`cor.test`

p-value \(\leq 0.1\)).

```
screen_cor <- Lrnr_pkg_SuperLearner_screener$new("screen.corP")
# which covariates are selected on the full data?
screen_cor$train(washb_task)
```

```
[1] "Lrnr_pkg_SuperLearner_screener_screen.corP"
$selected
[1] "tr" "fracode" "aged" "momage"
[5] "momedu" "momheight" "hfiacat" "Nlt18"
[9] "elec" "floor" "walls" "asset_wardrobe"
[13] "asset_table" "asset_chair" "asset_khat" "asset_chouki"
[17] "asset_tv" "asset_refrig" "asset_moto" "asset_sewmach"
[21] "asset_mobile"
```

To **“pipe”** only the selected covariates to the modeling algorithm, we need to
make a `Pipeline`

, which is a just set of learners to be fit sequentially, where
the fit from one learner is used to define the task for the next learner. Note
the difference between `Pipeline`

and `Stack`

here- one is necessary in order
to define a sequential process, whereas the other one establishes parallel
function of learners.

Now our learners will be preceded by a screening step.

We also consider the original `stack`

, just to compare how the feature selection
methods perform in comparison to the methods without feature selection.

Analogous to what we have seen before, we have to stack the pipeline and
original `stack`

together, so we may use them as base learners in our super
learner.

```
fancy_stack <- make_learner(Stack, cor_pipeline, stack)
# we can visualize the stack
dt_stack <- delayed_learner_train(fancy_stack, washb_task)
plot(dt_stack, color = FALSE, height = "400px", width = "100%")
```

We have made a library/stack of base learners and a metalearner, so we are ready to make the super learner. The super learner algorithm fits a metalearner on the validation-set predictions.

### 3. Train the super learner on the machine learning task

The super learner algorithm fits a metalearner on the validation-set
predictions in a cross-validated manner, thereby avoiding overfitting. This
procedure is referred to as the *continuous* super learner. The cross-validation
selector, or *discrete* super learner, is the base learner with the lowest
cross-validated risk.

Now we are ready to **“train”** our super learner on our `sl3_task`

object,
`washb_task`

.

### 4. Obtain predicted values

Now that we have fit the super learner, we are ready to obtain our predicted values for each subject.

`[1] -0.5350742 -0.8963147 -0.7691624 -0.7668657 -0.6906887 -0.7220518`

We can also obtain a summary of the results.

```
[1] "SuperLearner:"
List of 2
$ : chr "Pipeline(Lrnr_pkg_SuperLearner_screener_screen.corP->Stack)"
$ : chr "Stack"
[1] "Lrnr_nnls_FALSE"
lrnrs
1: Pipeline(Lrnr_pkg_SuperLearner_screener_screen.corP->Stack)_Lrnr_glm_TRUE
2: Pipeline(Lrnr_pkg_SuperLearner_screener_screen.corP->Stack)_Lrnr_mean
3: Pipeline(Lrnr_pkg_SuperLearner_screener_screen.corP->Stack)_Lrnr_ranger_100_TRUE_1
4: Pipeline(Lrnr_pkg_SuperLearner_screener_screen.corP->Stack)_Lrnr_glmnet_NULL_deviance_10_1_100_TRUE
5: Pipeline(Lrnr_pkg_SuperLearner_screener_screen.corP->Stack)_Lrnr_pkg_SuperLearner_SL.gam
6: Pipeline(Lrnr_pkg_SuperLearner_screener_screen.corP->Stack)_Lrnr_pkg_SuperLearner_SL.bayesglm
7: Stack_Lrnr_glm_TRUE
8: Stack_Lrnr_mean
9: Stack_Lrnr_ranger_100_TRUE_1
10: Stack_Lrnr_glmnet_NULL_deviance_10_1_100_TRUE
11: Stack_Lrnr_pkg_SuperLearner_SL.gam
12: Stack_Lrnr_pkg_SuperLearner_SL.bayesglm
weights
1: 0.00000000
2: 0.00000000
3: 0.06062127
4: 0.18182887
5: 0.17455573
6: 0.00000000
7: 0.00000000
8: 0.01491966
9: 0.32501659
10: 0.00000000
11: 0.24360058
12: 0.00000000
[1] "Cross-validated risk (MSE, squared error loss):"
learner
1: Pipeline(Lrnr_pkg_SuperLearner_screener_screen.corP->Stack)_Lrnr_glm_TRUE
2: Pipeline(Lrnr_pkg_SuperLearner_screener_screen.corP->Stack)_Lrnr_mean
3: Pipeline(Lrnr_pkg_SuperLearner_screener_screen.corP->Stack)_Lrnr_ranger_100_TRUE_1
4: Pipeline(Lrnr_pkg_SuperLearner_screener_screen.corP->Stack)_Lrnr_glmnet_NULL_deviance_10_1_100_TRUE
5: Pipeline(Lrnr_pkg_SuperLearner_screener_screen.corP->Stack)_Lrnr_pkg_SuperLearner_SL.gam
6: Pipeline(Lrnr_pkg_SuperLearner_screener_screen.corP->Stack)_Lrnr_pkg_SuperLearner_SL.bayesglm
7: Stack_Lrnr_glm_TRUE
8: Stack_Lrnr_mean
9: Stack_Lrnr_ranger_100_TRUE_1
10: Stack_Lrnr_glmnet_NULL_deviance_10_1_100_TRUE
11: Stack_Lrnr_pkg_SuperLearner_SL.gam
12: Stack_Lrnr_pkg_SuperLearner_SL.bayesglm
13: SuperLearner
coefficients mean_risk SE_risk fold_SD fold_min_risk fold_max_risk
1: NA 1.015128 0.02363317 0.07629401 0.8927540 1.131594
2: NA 1.065282 0.02502664 0.09191791 0.9264292 1.196647
3: NA 1.024937 0.02365692 0.08231236 0.8778104 1.154654
4: NA 1.011705 0.02358588 0.07881693 0.8822292 1.130762
5: NA 1.011497 0.02357149 0.07449866 0.8919503 1.132290
6: NA 1.015119 0.02363328 0.07631510 0.8926608 1.131570
7: NA 1.018612 0.02380402 0.07799191 0.8956048 1.134940
8: NA 1.065282 0.02502664 0.09191791 0.9264292 1.196647
9: NA 1.016893 0.02349240 0.08138677 0.8775006 1.139465
10: NA 1.012390 0.02358515 0.07971601 0.8827155 1.130114
11: NA 1.012122 0.02358982 0.07486427 0.8981537 1.135950
12: NA 1.018596 0.02380414 0.07801948 0.8954820 1.134909
13: NA 1.005375 0.02338984 0.07833095 0.8754770 1.128689
```

From the table of the printed super learner fit, we note that the super learner
had a mean risk of 1.005375
and that this ensemble weighted the `ranger`

and `glmnet`

learners highest while
not weighting the `mean`

learner highly. We can also see that the `glmnet`

learner had the lowest cross-validated mean risk, thus making it the
cross-validated selector (or the *discrete* super learner). The mean risk of the
(continuous) super learner is calculated using the hold-out set that we
visualized in the plot above.

## 4.4 Cross-validated Super Learner

We can cross-validate the super learner to see how well the super learner performs on unseen data, and obtain an estimate of the cross-validated risk of the super learner.

This estimation procedure requires an “external” layer of cross-validation,
also called nested cross-validation, which involves setting aside a separate
holdout sample that we don’t use to fit the super learner. This
external cross validation procedure may also incorporate 10 folds, which is the
default in `sl3`

. However, we will incorporate 2 outer/external folds of
cross-validation for computational efficiency.

We also need to specify a loss function to evaluate super learner.
Documentation for the available loss functions can be found in the `sl3`

Loss
Function Reference.

```
washb_task_new <- make_sl3_Task(
data = washb_data,
covariates = covars,
outcome = outcome,
folds = make_folds(washb_data, fold_fun = folds_vfold, V = 2)
)
```

```
Warning in process_data(data, nodes, column_names = column_names, flag = flag, :
Missing covariate data detected: imputing covariates.
```

```
CVsl_fancy <- CV_lrnr_sl(sl_fit, washb_task_new, loss_squared_error)
CVsl_fancy %>%
kable(digits = 4) %>%
kableExtra:::kable_styling(fixed_thead = T) %>%
scroll_box(width = "100%", height = "300px")
```

learner | coefficients | mean_risk | SE_risk | fold_SD | fold_min_risk | fold_max_risk |
---|---|---|---|---|---|---|

Pipeline(Lrnr_pkg_SuperLearner_screener_screen.corP->Stack)_Lrnr_glm_TRUE | NA | 1.0300 | 0.0238 | 0.0075 | 1.0247 | 1.0353 |

Pipeline(Lrnr_pkg_SuperLearner_screener_screen.corP->Stack)_Lrnr_mean | NA | 1.0663 | 0.0250 | 0.0034 | 1.0639 | 1.0687 |

Pipeline(Lrnr_pkg_SuperLearner_screener_screen.corP->Stack)_Lrnr_ranger_100_TRUE_1 | NA | 1.0447 | 0.0239 | 0.0008 | 1.0441 | 1.0452 |

Pipeline(Lrnr_pkg_SuperLearner_screener_screen.corP->Stack)_Lrnr_glmnet_NULL_deviance_10_1_100_TRUE | NA | 1.0199 | 0.0236 | 0.0020 | 1.0185 | 1.0213 |

Pipeline(Lrnr_pkg_SuperLearner_screener_screen.corP->Stack)_Lrnr_pkg_SuperLearner_SL.gam | NA | 1.0309 | 0.0238 | 0.0076 | 1.0255 | 1.0363 |

Pipeline(Lrnr_pkg_SuperLearner_screener_screen.corP->Stack)_Lrnr_pkg_SuperLearner_SL.bayesglm | NA | 1.0299 | 0.0238 | 0.0073 | 1.0247 | 1.0351 |

Stack_Lrnr_glm_TRUE | NA | 1.0360 | 0.0246 | 0.0160 | 1.0246 | 1.0473 |

Stack_Lrnr_mean | NA | 1.0663 | 0.0250 | 0.0034 | 1.0639 | 1.0687 |

Stack_Lrnr_ranger_100_TRUE_1 | NA | 1.0297 | 0.0236 | 0.0034 | 1.0273 | 1.0321 |

Stack_Lrnr_glmnet_NULL_deviance_10_1_100_TRUE | NA | 1.0199 | 0.0236 | 0.0021 | 1.0184 | 1.0213 |

Stack_Lrnr_pkg_SuperLearner_SL.gam | NA | 1.0398 | 0.0277 | 0.0288 | 1.0195 | 1.0602 |

Stack_Lrnr_pkg_SuperLearner_SL.bayesglm | NA | 1.0359 | 0.0247 | 0.0159 | 1.0246 | 1.0472 |

SuperLearner | NA | 1.0156 | 0.0235 | 0.0017 | 1.0144 | 1.0169 |

## 4.5 Variable Importance Measures with `sl3`

Variable importance can be interesting and informative. The `sl3`

`varimp`

function returns a table with variables listed in decreasing order of
importance, in which the measure of importance is based on a risk difference
between the learner fit with a permuted covariate and the learner fit with the
true covariate, across all covariates. In this manner, the larger the risk
difference, the more important the variable is in the prediction. Let’s explore
the `sl3`

variable importance measurements for the `washb`

data.

```
washb_varimp <- varimp(sl_fit, loss_squared_error)
washb_varimp %>%
kable(digits = 4) %>%
kableExtra:::kable_styling(fixed_thead = T) %>%
scroll_box(width = "100%", height = "300px")
```

X | risk_diff |
---|---|

aged | 0.0358 |

momedu | 0.0071 |

asset_refrig | 0.0059 |

month | 0.0053 |

momheight | 0.0052 |

tr | 0.0049 |

asset_chair | 0.0027 |

Nlt18 | 0.0017 |

elec | 0.0015 |

asset_moto | 0.0014 |

momage | 0.0012 |

hfiacat | 0.0011 |

fracode | 0.0010 |

asset_khat | 0.0010 |

asset_chouki | 0.0009 |

asset_wardrobe | 0.0006 |

sex | 0.0006 |

asset_sewmach | 0.0005 |

walls | 0.0004 |

floor | 0.0003 |

delta_momage | -0.0001 |

watmin | -0.0002 |

asset_table | -0.0002 |

asset_bike | -0.0002 |

roof | -0.0002 |

asset_mobile | -0.0003 |

delta_momheight | -0.0003 |

asset_tv | -0.0010 |

Ncomp | -0.0017 |

## 4.6 Exercise 1 – Predicting Myocardial Infarction with `sl3`

Answer the questions below to predict myocardial infarction (`mi`

) using the
available covariate data. Thanks to Professor David Benkeser at Emory University
for making the this Cardiovascular Health Study (CHS) data easily accessible.

```
# load the data set
db_data <-
url("https://raw.githubusercontent.com/benkeser/sllecture/master/chspred.csv")
chspred <- read_csv(file = db_data, col_names = TRUE)
# take a quick peek
head(chspred) %>%
kable(digits = 4) %>%
kableExtra:::kable_styling(fixed_thead = T) %>%
scroll_box(width = "100%", height = "300px")
```

waist | alcoh | hdl | beta | smoke | ace | ldl | bmi | aspirin | gend | age | estrgn | glu | ins | cysgfr | dm | fetuina | whr | hsed | race | logcystat | logtrig | logcrp | logcre | health | logkcal | sysbp | mi |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

110.1642 | 0.0000 | 66.4974 | 0 | 0 | 1 | 114.2162 | 27.9975 | 0 | 0 | 73.5179 | 0 | 159.9314 | 70.3343 | 75.0078 | 1 | 0.1752 | 1.1690 | 1 | 1 | -0.3420 | 5.4063 | 2.0126 | -0.6739 | 0 | 4.3926 | 177.1345 | 0 |

89.9763 | 0.0000 | 50.0652 | 0 | 0 | 0 | 103.7766 | 20.8931 | 0 | 0 | 61.7723 | 0 | 153.3888 | 33.9695 | 82.7433 | 1 | 0.5717 | 0.9011 | 0 | 0 | -0.0847 | 4.8592 | 3.2933 | -0.5551 | 1 | 6.2071 | 136.3742 | 0 |

106.1941 | 8.4174 | 40.5059 | 0 | 0 | 0 | 165.7158 | 28.4554 | 1 | 1 | 72.9312 | 0 | 121.7145 | -17.3017 | 74.6989 | 0 | 0.3517 | 1.1797 | 0 | 1 | -0.4451 | 4.5088 | 0.3013 | -0.0115 | 0 | 6.7320 | 135.1993 | 0 |

90.0566 | 0.0000 | 36.1750 | 0 | 0 | 0 | 45.2035 | 23.9608 | 0 | 0 | 79.1191 | 0 | 53.9691 | 11.7315 | 95.7823 | 0 | 0.5439 | 1.1360 | 0 | 0 | -0.4807 | 5.1832 | 3.0243 | -0.5751 | 1 | 7.3972 | 139.0182 | 0 |

78.6143 | 2.9790 | 71.0642 | 0 | 1 | 0 | 131.3121 | 10.9656 | 0 | 1 | 69.0179 | 0 | 94.3153 | 9.7112 | 72.7109 | 0 | 0.4916 | 1.1028 | 1 | 0 | 0.3121 | 4.2190 | -0.7057 | 0.0053 | 1 | 8.2779 | 88.0470 | 0 |

91.6593 | 0.0000 | 59.4963 | 0 | 0 | 0 | 171.1872 | 29.1317 | 0 | 1 | 81.8346 | 0 | 212.9066 | -28.2269 | 69.2184 | 1 | 0.4621 | 0.9529 | 1 | 0 | -0.2872 | 5.1773 | 0.9705 | 0.2127 | 1 | 5.9942 | 69.5943 | 0 |

- Create an
`sl3`

task, setting myocardial infarction`mi`

as the outcome and using all available covariate data. - Make a library of seven relatively fast base learning algorithms (i.e., do
not consider BART or HAL). Customize hyperparameters for one of your
learners. Feel free to use learners from
`sl3`

or`SuperLearner`

. You may use the same base learning library that is presented above. - Incorporate feature selection with the
`SuperLearner`

screener`screen.corP`

. - Fit the metalearning step with non-negative least squares,
`Lrnr_nnls`

. - With the metalearner and base learners, make the super learner and train it on the task.
- Print your super learner fit by calling
`print()`

with`$`

. - Cross-validate your super learner fit to see how well it performs on unseen
data. Specify
`loss_squared_error`

as the loss function to evaluate the super learner.

## 4.7 Super Learning of a Conditional Density

## 4.8 Exercise 2 – Estimating the Propensity Score with `sl3`

## 4.9 Super Learning of an Optimal Individualized Treatment Rule

## 4.10 Exercise 3 – Estimating the Blip

## 4.11 Concluding Remarks

The general ensemble learning approach of super learner can be applied to a diversity of estimation and prediction problems that can be defined by a loss function. We just discussed conditional mean estimation, and in the appendix we delve into prediction of a conditional density, and the optimal individualized treatment rule. Plug-in estimators of the estimand are desirable because a plug-in estimator respects both the local and global constraints of the statistical model. We could just plug-in the estimator returned by Super Learner; however, this is problematic because the Super Learner estimators are trading off bias and variance in an optimal way and as a result their bias is essentially the rate of convergence of these algorithms, which is always slower than \(1/\sqrt{n}\). Therefore, if we plug-in the estimator returned by super learner into the target parameter mapping, we would end up with an estimator which has the same bias as what we plugged in, which is greater than \(1/\sqrt{n}\). Thus, we end up with an estimator which is not asymptotically normal, since it does not converge to the estimand at \(1/\sqrt{n}\) rate.

An asymptotically linear estimator has no meaningful bias ($ < 1/$), and can be written as an empirical mean in first order of a function of the data, the influence curve, plus some negligible remainder term. Once an estimator is asymptotically linear with an influence curve it’s normally distributed, so the standardized estimator converges to a normal distribution with mean 0 and variance is the variance of the influence curve. Thus, it is advantageous to construct asymptotically linear estimators since they permit formal statistical inference. Among the class of regular asymptotically linear estimators, there is an optimal estimator which is an efficient estimator, and that’s the one with influence curve equal to the canonical gradient of the path-wise derivative of the target parameter. The canonical gradient is the direction of the path through the data distribution where the parameter is steepest. An estimator is efficient if and only if is asymptotically linear with influence curve equal to the canonical gradient. One can calculate the canonical gradient with the statistical model and the statistical target parameter. Techniques for calculating the canonical gradient entail projecting an initial gradient on the tangent space of the model at the particular distribution in the model in which you want to calculate the canonical gradient.

Now we know what it takes to construct an efficient estimator. Namely, we need to construct an estimator which is asymptotically linear with influence curve the canonical gradient. There are three general classes of estimators which succeed in constructing asymptotically linear estimators: (1) the one-step estimator, but it is not a plug-in estimator; (2) the targeted maximum likelihood estimator, which is a super learner targeted towards the target parameter and it is a plug-in estimator; and (3) estimating equation based estimators, which use the canonical gradient but as an estimating function in the target parameter. In the chapters that follow, we focus on the targeted maximum likelihood estimator and the targeted minimum loss-based estimator, both referred to as TMLE.

## 4.12 Appendix

### 4.12.1 Exercise 1 Solution

Here is a potential solution to (Exercise 1 – Predicting Myocardial
Infarction with `sl3`

)(**???**).

```
chspred_task <- make_sl3_Task(
data = chspred,
covariates = head(colnames(chspred), -1),
outcome = "mi"
)
glm_learner <- Lrnr_glm$new()
lasso_learner <- Lrnr_glmnet$new(alpha = 1)
ridge_learner <- Lrnr_glmnet$new(alpha = 0)
enet_learner <- Lrnr_glmnet$new(alpha = 0.5)
curated_glm_learner <- Lrnr_glm_fast$new(formula = "mi ~ smoke + beta + waist")
mean_learner <- Lrnr_mean$new() # That is one mean learner!
glm_fast_learner <- Lrnr_glm_fast$new()
ranger_learner <- Lrnr_ranger$new()
svm_learner <- Lrnr_svm$new()
xgb_learner <- Lrnr_xgboost$new()
screen_cor <- Lrnr_pkg_SuperLearner_screener$new("screen.corP")
glm_pipeline <- make_learner(Pipeline, screen_cor, glm_learner)
stack <- make_learner(
Stack,
glm_pipeline, glm_learner,
lasso_learner, ridge_learner, enet_learner,
curated_glm_learner, mean_learner, glm_fast_learner,
ranger_learner, svm_learner, xgb_learner
)
metalearner <- make_learner(Lrnr_nnls)
sl <- Lrnr_sl$new(
learners = stack,
metalearner = metalearner
)
sl_fit <- sl$train(task)
sl_fit$print()
CVsl <- CV_lrnr_sl(sl_fit, chspred_task, loss_squared_error)
CVsl
```

### 4.12.2 Exercise 2 Solution

Here’s a potential solution to (Exercise 2)(**???**).

### 4.12.3 Exercise 3 Solution

Here’s a potential solution to the (Exercise 3)(**???**).

### References

Breiman, Leo. 1996. “Stacked Regressions.” *Machine Learning* 24 (1). Springer: 49–64.

Polley, Eric C, and Mark J van der Laan. 2010. “Super Learner in Prediction.” bepress.

van der Laan, Mark J, and Sandrine Dudoit. 2003. “Unified Cross-Validation Methodology for Selection Among Estimators and a General Cross-Validated Adaptive Epsilon-Net Estimator: Finite Sample Oracle Inequalities and Examples.” bepress.

van der Laan, Mark J, Eric C Polley, and Alan E Hubbard. 2007. “Super Learner.” *Statistical Applications in Genetics and Molecular Biology* 6 (1).

van der Vaart, Aad W, Sandrine Dudoit, and Mark J van der Laan. 2006. “Oracle Inequalities for Multi-Fold Cross Validation.” *Statistics & Decisions* 24 (3). Oldenbourg Wissenschaftsverlag: 351–71.

Wolpert, David H. 1992. “Stacked Generalization.” *Neural Networks* 5 (2). Elsevier: 241–59.