# Chapter 3 Super (Machine) Learning

Based on the `sl3` `R` package by *Jeremy Coyle, Nima Hejazi, Ivana Malenica, and Oleg Sofrygin*.

Updated: 2020-02-20

## 3.1 Learning Objectives

By the end of this chapter you will be able to:

- Select a loss function that is appropriate for the functional parameter to be estimated.
- Assemble an ensemble of learners based on the properties that identify what features they support.
- Customize learner hyperparameters to incorporate a diversity of different settings.
- Select a subset of available covariates and pass only those variables to the modeling algorithm.
- Fit an ensemble with nested cross-validation to obtain an estimate of the performance of the ensemble itself.
- Obtain `sl3` variable importance metrics.
- Interpret the discrete and continuous Super Learner fits.
- Rationalize the need to remove bias from the Super Learner to make an optimal bias–variance tradeoff for the parameter of interest.

## 3.2 Motivation

- A common task in statistical data analysis is estimator selection (e.g., for prediction).
- There is no universally optimal machine learning algorithm for density estimation or prediction.
- For some data, one needs learners that can model a complex function.
- For others, possibly as a result of noise or insufficient sample size, a simple, parametric model might fit best.
- The Super Learner, an ensemble learner, solves this issue by allowing a combination of learners, from the simplest (intercept-only) to the most complex (neural networks, random forests, SVMs, etc.).
- It works by using cross-validation in a manner that guarantees the resulting fit performs asymptotically as well as the best possible combination of the learners provided.

## 3.3 Introduction

In Chapter 1, we introduced the Roadmap for Targeted Learning as a
general template to translate real-world data applications into formal
statistical estimation problems. The first steps of this roadmap define the
*statistical estimation problem*, which establishes:

- Data as a realization of a random variable, or equivalently, an outcome of a particular experiment.
- A statistical model, representing the true knowledge about the data-generating experiment.
- A translation of the scientific question, which is often causal, into a target parameter.

Note that if the target parameter is causal, step 3 also requires establishing identifiability of the target quantity from the observed data distribution, under possibly non-testable assumptions that may not necessarily be reasonable. Still, the target quantity has a valid statistical interpretation. See the chapter on causal target parameters for more detail on causal models and identifiability.

Now that we have defined the statistical estimation problem, we are ready to
construct the TMLE, an asymptotically linear and efficient substitution
estimator of this target quantity. The first step in this estimation procedure
is an initial estimate of the data-generating distribution, or the relevant part
of this distribution that is needed to evaluate the target parameter. For this
initial estimation, we use the *Super Learner* (van der Laan, Polley, and Hubbard 2007).

The Super Learner provides an important step in creating a robust estimator. It is a loss-function-based tool that uses cross-validation to obtain the best prediction of our target parameter, based on a weighted average of a library of machine learning algorithms.

The library of machine learning algorithms consists of functions (“learners” in the `sl3` nomenclature) that we think might be consistent with the true data-generating distribution (i.e., algorithms selected based on contextual knowledge of the experiment that generated the data). The library should also contain a large set of “default” algorithms, ranging from a simple linear regression model to multi-step algorithms involving covariate screening, penalization, and tuning-parameter optimization.

The ensembling of the collection of algorithms with weights (“metalearning” in the `sl3` nomenclature) has been shown to be adaptive and robust, even in small samples (Polley and van der Laan 2010). The Super Learner is proven to be asymptotically as accurate as the best possible prediction algorithm in the library (van der Laan and Dudoit 2003; van der Vaart, Dudoit, and van der Laan 2006).

### 3.3.1 Background

**Defining the loss function**

A *loss function* \(L\) is a function of the observed data \(O\) and a candidate parameter value \(\psi\), written \(L(\psi)(O)\), where \(\psi\) has unknown true value \(\psi_0\). We can estimate the expected loss by substituting the empirical distribution \(P_n\) for the true (but unknown) distribution of the observed data \(P_0\).

A valid loss function has expectation (risk) that is minimized at the true value of the parameter, \(\psi_0\). For example, the conditional mean minimizes the risk of the squared error loss, so the squared error loss is a valid loss function when estimating the conditional mean.
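Written out for a regression setting with \(O = (W, Y)\), where \(W\) denotes covariates and \(Y\) the outcome, the squared error loss is

\[L(\psi)(O) = (Y - \psi(W))^2,\]

and its risk \(E_0[L(\psi)(O)]\) is minimized at the conditional mean \(\psi_0(W) = E_0(Y \mid W)\).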

**What is cross-validation and how does it work?**

- There are many different cross-validation schemes, designed to accommodate different study designs and data structures.
- The figure below shows an example of 10-fold cross-validation.
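For a concrete sense of what a \(V\)-fold scheme looks like, the `origami` package (which `sl3` uses for cross-validation) can generate the fold structure directly. Below is a minimal sketch using a hypothetical sample size of 100:

```
library(origami)

# 10-fold (V-fold) cross-validation scheme for 100 hypothetical observations
folds <- make_folds(n = 100, fold_fun = folds_vfold, V = 10)

length(folds)              # number of folds: 10
folds[[1]]$training_set    # indices used to train in fold 1
folds[[1]]$validation_set  # indices held out for validation in fold 1
```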

### 3.3.2 Why use the Super Learner?

- For prediction, one can use the cross-validated risk to empirically determine the relative performance of SL and competing methods.
- When we have tested different algorithms on actual data and compared their performance (e.g., MSE of prediction), no single algorithm always wins (see below).
- The results of one such study, comparing the fits of several different learners (including the SL algorithms), are shown below.

- The Super Learner performs asymptotically as well as the best possible weighted combination of the candidate learners.
- By including all competitors in the library of candidate estimators (glm, neural nets, SVMs, random forest, etc.), the Super Learner will asymptotically outperform any of its competitors, even if the set of competitors is allowed to grow polynomially in sample size.
- Motivates the name “Super Learner”: it provides a system of combining many estimators into an improved estimator.

For more detail on Super Learner we refer the reader to van der Laan, Polley, and Hubbard (2007) and Polley and van der Laan (2010). The optimality results for the cross-validation selector among a family of algorithms were established in van der Laan and Dudoit (2003) and extended in van der Vaart, Dudoit, and van der Laan (2006).

## 3.4 `sl3` “Microwave Dinner” Implementation

We begin by illustrating the core functionality of the Super Learner algorithm as implemented in `sl3`. For those who are interested in the internals of `sl3`, see this `sl3` introductory tutorial.

The `sl3` implementation consists of the following steps:

- Load the necessary libraries and data
- Define the machine learning task
- Make a Super Learner by creating library of base learners and a metalearner
- Train the Super Learner on the machine learning task
- Obtain predicted values

### WASH Benefits Study Example

Using the WASH Benefits data, we are interested in predicting weight-for-height z-score `whz` using the available covariate data. Let’s begin!

### 0. Load the necessary libraries and data

First, we will load the relevant `R` packages, set a seed, and load the data.

```
library(here)
library(data.table)
library(knitr)
library(kableExtra)
library(tidyverse)
library(origami)
library(SuperLearner)
library(sl3)

set.seed(7194)
# my lucky seed! or is it 9174? or 4917? many lucky seeds, thanks lysdexia!

# load data set and take a peek
washb_data <- fread(
  "https://raw.githubusercontent.com/tlverse/tlverse-data/master/wash-benefits/washb_data.csv",
  stringsAsFactors = TRUE
)
head(washb_data) %>%
  kable(digits = 4) %>%
  kableExtra::kable_styling(fixed_thead = TRUE) %>%
  scroll_box(width = "100%", height = "300px")
```

whz | tr | fracode | month | aged | sex | momage | momedu | momheight | hfiacat | Nlt18 | Ncomp | watmin | elec | floor | walls | roof | asset_wardrobe | asset_table | asset_chair | asset_khat | asset_chouki | asset_tv | asset_refrig | asset_bike | asset_moto | asset_sewmach | asset_mobile |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.00 | Control | N05265 | 9 | 268 | male | 30 | Primary (1-5y) | 146.40 | Food Secure | 3 | 11 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
-1.16 | Control | N05265 | 9 | 286 | male | 25 | Primary (1-5y) | 148.75 | Moderately Food Insecure | 2 | 4 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
-1.05 | Control | N08002 | 9 | 264 | male | 25 | Primary (1-5y) | 152.15 | Food Secure | 1 | 10 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
-1.26 | Control | N08002 | 9 | 252 | female | 28 | Primary (1-5y) | 140.25 | Food Secure | 3 | 5 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
-0.59 | Control | N06531 | 9 | 336 | female | 19 | Secondary (>5y) | 150.95 | Food Secure | 2 | 7 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
-0.51 | Control | N06531 | 9 | 304 | male | 20 | Secondary (>5y) | 154.20 | Severely Food Insecure | 0 | 3 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |

### 1. Define the machine learning task

To define the machine learning **“task”** (predict weight-for-height z-score `whz` using the available covariate data), we need to create an `sl3_Task` object.

The `sl3_Task` keeps track of the data, the roles the variables play in the machine learning problem, and any metadata (e.g., observation-level weights, id, offset).

Also, if we had missing outcomes, we would need to set `drop_missing_outcome = TRUE` when we create the task.

```
# specify the outcome and covariates
outcome <- "whz"
covars <- colnames(washb_data)[-which(names(washb_data) == outcome)]

# create the sl3 task
washb_task <- make_sl3_Task(
  data = washb_data,
  covariates = covars,
  outcome = outcome
)
```

```
Warning in process_data(data, nodes, column_names = column_names, flag = flag, :
Missing covariate data detected: imputing covariates.
```

*This warning is important.* The task just imputed missing covariates for us. Specifically, for each covariate column with missing values, `sl3` imputes missing continuous covariates with the median, and missing binary and categorical covariates with the mode.

Also, for each covariate column with missing values, `sl3` adds an additional column indicating whether or not the value was imputed, which is particularly handy when the missingness in the data might be informative.

Also, notice that we did not specify the number of folds or the loss function in the task. The default cross-validation scheme is \(V\)-fold, with the number of folds \(V=10\).

Let’s visualize our `washb_task`.

```
A sl3 Task with 4695 obs and these nodes:
$covariates
[1] "tr" "fracode" "month" "aged"
[5] "sex" "momage" "momedu" "momheight"
[9] "hfiacat" "Nlt18" "Ncomp" "watmin"
[13] "elec" "floor" "walls" "roof"
[17] "asset_wardrobe" "asset_table" "asset_chair" "asset_khat"
[21] "asset_chouki" "asset_tv" "asset_refrig" "asset_bike"
[25] "asset_moto" "asset_sewmach" "asset_mobile" "delta_momage"
[29] "delta_momheight"
$outcome
[1] "whz"
$id
NULL
$weights
NULL
$offset
NULL
```

### 2. Make a Super Learner

Now that we have defined our machine learning problem with the task, we are
ready to **“make”** the Super Learner. This requires specification of

- A library of base learning algorithms that we think might be consistent with the true data-generating distribution.
- A metalearner, to ensemble the base learners.

We might also incorporate

- Feature selection, to pass only a subset of the predictors to the algorithm.
- Hyperparameter specification, to tune base learners.

Learners have properties that indicate what features they support. We may use `sl3_list_properties()` to get a list of all properties supported by at least one learner.
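The call that produces the listing below is simply:

```
sl3_list_properties()
```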

```
[1] "binomial" "categorical" "continuous"
[4] "cv" "density" "ids"
[7] "multivariate_outcome" "offset" "preprocessing"
[10] "timeseries" "weights" "wrapper"
```

Since we have a continuous outcome, we may identify the learners that support this outcome type with `sl3_list_learners()`.
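For a continuous outcome, the query looks like this (producing the listing below):

```
sl3_list_learners(c("continuous"))
```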

```
[1] "Lrnr_arima" "Lrnr_bartMachine"
[3] "Lrnr_bilstm" "Lrnr_caret"
[5] "Lrnr_condensier" "Lrnr_dbarts"
[7] "Lrnr_earth" "Lrnr_expSmooth"
[9] "Lrnr_gam" "Lrnr_gbm"
[11] "Lrnr_glm" "Lrnr_glm_fast"
[13] "Lrnr_glmnet" "Lrnr_grf"
[15] "Lrnr_h2o_glm" "Lrnr_h2o_grid"
[17] "Lrnr_hal9001" "Lrnr_HarmonicReg"
[19] "Lrnr_lstm" "Lrnr_mean"
[21] "Lrnr_nnls" "Lrnr_optim"
[23] "Lrnr_pkg_SuperLearner" "Lrnr_pkg_SuperLearner_method"
[25] "Lrnr_pkg_SuperLearner_screener" "Lrnr_polspline"
[27] "Lrnr_randomForest" "Lrnr_ranger"
[29] "Lrnr_rpart" "Lrnr_rugarch"
[31] "Lrnr_screener_corP" "Lrnr_screener_corRank"
[33] "Lrnr_screener_randomForest" "Lrnr_solnp"
[35] "Lrnr_stratified" "Lrnr_svm"
[37] "Lrnr_tsDyn" "Lrnr_xgboost"
```

Now that we have an idea of some learners, we can construct them using the `make_learner` function.

We can customize learner hyperparameters to incorporate a diversity of different settings. Documentation for the learners and their hyperparameters can be found in the `sl3` Learners Reference.

```
lrnr_ranger50 <- make_learner(Lrnr_ranger, num.trees = 50)
lrnr_hal_simple <- make_learner(Lrnr_hal9001, max_degree = 2, n_folds = 2)
lrnr_lasso <- make_learner(Lrnr_glmnet) # alpha default is 1
lrnr_ridge <- make_learner(Lrnr_glmnet, alpha = 0)
lrnr_elasticnet <- make_learner(Lrnr_glmnet, alpha = .5)
```

We can also include learners from the `SuperLearner` `R` package.
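For example, wrappers from the `SuperLearner` package can be pulled in through `Lrnr_pkg_SuperLearner` (listed among the learners above); a minimal sketch wrapping `SL.glmnet`:

```
# use a SuperLearner package wrapper function as an sl3 learner
lrnr_sl_glmnet <- make_learner(Lrnr_pkg_SuperLearner, "SL.glmnet")
```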

Here is a fun trick to create customized learners over a grid of parameters.

```
# I like to crock pot my super learners
grid_params <- list(
  cost = c(0.01, 0.1, 1, 10, 100, 1000),
  gamma = c(0.001, 0.01, 0.1, 1),
  kernel = c("polynomial", "radial", "sigmoid"),
  degree = c(1, 2, 3)
)
grid <- expand.grid(grid_params, KEEP.OUT.ATTRS = FALSE)
params_default <- list(nthread = getOption("sl.cores.learners", 1))
svm_learners <- apply(grid, MARGIN = 1, function(params_tune) {
  do.call(Lrnr_svm$new, c(params_default, as.list(params_tune)))
})
```

```
grid_params <- list(
  max_depth = c(2, 4, 6, 8),
  eta = c(0.001, 0.01, 0.1, 0.2, 0.3),
  nrounds = c(20, 50)
)
grid <- expand.grid(grid_params, KEEP.OUT.ATTRS = FALSE)
params_default <- list(nthread = getOption("sl.cores.learners", 1))
xgb_learners <- apply(grid, MARGIN = 1, function(params_tune) {
  do.call(Lrnr_xgboost$new, c(params_default, as.list(params_tune)))
})
```

Did you see `Lrnr_caret` when we called `sl3_list_learners(c("continuous"))`? All we need to specify is the algorithm to use, which is passed as `method` to `caret::train()`. The default resampling method for parameter selection is set to `"CV"`, instead of the `caret::train()` default, `"boot"`. The summary metric used to select the optimal model is `RMSE` for continuous outcomes and `Accuracy` for categorical and binomial outcomes.

```
# I have no idea how to tune a neural net (or BART machine..)
lrnr_caret_nnet <- make_learner(Lrnr_caret, algorithm = "nnet")
lrnr_caret_bartMachine <- make_learner(
  Lrnr_caret,
  algorithm = "bartMachine",
  method = "boot", metric = "RMSE", tuneLength = 10
)
```

In order to assemble the library of learners, we need to **“stack”** them
together.

A `Stack` is a special learner with the same interface as all other learners. What makes a stack special is that it combines multiple learners by training them simultaneously, so that their predictions can be either combined or compared.
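The stack used in the remainder of this section is not constructed in the code shown above. Here is a minimal sketch, assuming the learners created earlier; the additional learners are guessed from the learner names that appear in the fit summaries below, so treat this as an illustrative reconstruction:

```
# learners whose names appear in the fit summaries below
# (hyperparameters guessed from those names)
lrnr_glm <- make_learner(Lrnr_glm)
lrnr_mean <- make_learner(Lrnr_mean)
lrnr_xgb <- make_learner(Lrnr_xgboost, nrounds = 20, max_depth = 4, eta = 0.1)

# combine the base learners into a single Stack
stack <- make_learner(
  Stack,
  lrnr_glm, lrnr_mean, lrnr_ridge, lrnr_lasso, lrnr_xgb
)
```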

We can optionally select a subset of available covariates and pass only those variables to the modeling algorithm.

Let’s consider screening covariates based on their `randomForest` variable importance ranking (ordered by mean decrease in accuracy).

```
screen_rf <- make_learner(Lrnr_screener_randomForest, nVar = 5, ntree = 20)
# which covariates are selected on the full data?
screen_rf$train(washb_task)
```

```
[1] "Lrnr_screener_randomForest_5_20"
$selected
[1] "month" "aged" "momage" "momheight" "Ncomp"
```

To **“pipe”** only the selected covariates to the modeling algorithm, we need to make a `Pipeline`, which is just a set of learners to be fit sequentially, where the fit from one learner is used to define the task for the next learner.

Now our learners will be preceded by a screening step.
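The pipeline itself is not shown in the extracted code above; a minimal sketch, assuming the `screen_rf` screener and the `stack` defined earlier:

```
# fit the random forest screener first, then fit the whole stack
# on only the covariates it selects
screen_rf_pipeline <- make_learner(Pipeline, screen_rf, stack)
```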

We also consider the original `stack`, to compare how the methods with feature selection perform relative to the methods without feature selection.

Analogous to what we have seen before, we have to stack the pipeline and the original `stack` together, so we may use them as base learners in our super learner.

```
fancy_stack <- make_learner(Stack, screen_rf_pipeline, stack)
# we can visualize the stack
dt_stack <- delayed_learner_train(fancy_stack, washb_task)
plot(dt_stack, color = FALSE, height = "400px", width = "100%")
```

We will use the default metalearner, which uses `Lrnr_solnp()` to provide fitting procedures for a pairing of loss function and metalearner function. This default metalearner selects a loss and metalearner pairing based on the outcome type. Note that any learner can be used as a metalearner.

We have made a library/stack of base learners, so we are ready to make the super learner. The Super Learner algorithm fits a metalearner on the validation-set predictions.
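The construction of the Super Learner object is not shown above; a minimal sketch, assuming the `fancy_stack` built earlier and relying on the default metalearner:

```
# make the super learner from the stacked base learners; omitting
# `metalearner` falls back to the default described above
sl <- make_learner(Lrnr_sl, learners = fancy_stack)
```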

We can also use `Lrnr_cv` to build a Super Learner, cross-validate a stack of learners to compare performance of the learners in the stack, or cross-validate any single learner (see the “Cross-validation” section of the `sl3` introductory tutorial).

Furthermore, we can define new `sl3` learners, which can be used in all the places you could otherwise use any other `sl3` learner, including `Pipelines`, `Stacks`, and the Super Learner.

### 3. Train the Super Learner on the machine learning task

The Super Learner algorithm fits a metalearner on the validation-set predictions in a cross-validated manner, thereby avoiding overfitting.

Now we are ready to **“train”** our Super Learner on our `sl3_Task` object, `washb_task`.
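The training call itself is a single line; a sketch, assuming the `sl` object from the previous step:

```
# cross-validates each base learner, then fits the metalearner
# on the validation-set predictions
sl_fit <- sl$train(washb_task)
```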

### 4. Obtain predicted values

Now that we have fit the Super Learner, we are ready to calculate the predicted outcome for each subject.
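A sketch of the prediction step, assuming the trained `sl_fit` from above; the first six predicted values are shown below:

```
# predicted outcomes for the observations in the task
sl_preds <- sl_fit$predict(washb_task)
head(sl_preds)
```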

`[1] -0.6569227 -0.7649573 -0.6537146 -0.6467686 -0.6210493 -0.6823442`

We can also obtain a summary of the results.
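The summary below can be produced by printing the fit (the same call appears in the exercise solution in the appendix):

```
# print metalearner coefficients and cross-validated risks
sl_fit$print()
```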

```
[1] "SuperLearner:"
List of 2
$ : chr "Pipeline(Lrnr_screener_randomForest_5_20->Stack)"
$ : chr "Stack"
[1] "Lrnr_solnp_TRUE_TRUE_FALSE_1e-05"
$pars
[1] 0.0006203277 0.0001715318 0.0004311380 0.0004202798 0.2262848916
[6] 0.2934473967 0.0001715318 0.1896181234 0.2882656097 0.0005691694
$convergence
[1] 0
$values
[1] 1.019988 1.009846 1.009837
$lagrange
[,1]
[1,] -0.04680127
$hessian
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.96652654 0.11335313 0.08438590 0.08855691 0.16062287 0.42110726
[2,] 0.11335313 0.71803718 0.05178784 0.05150834 0.01654545 0.08245393
[3,] 0.08438590 0.05178784 0.92419505 -0.08127786 0.07371078 0.32989625
[4,] 0.08855691 0.05150834 -0.08127786 0.91285262 0.07365489 0.33061505
[5,] 0.16062287 0.01654545 0.07371078 0.07365489 0.46907652 0.20074814
[6,] 0.42110726 0.08245393 0.32989625 0.33061505 0.20074814 0.87043769
[7,] 0.11335313 -0.28196282 0.05178784 0.05150834 0.01654545 0.08245393
[8,] 0.31989860 0.06096411 0.22546939 0.22567641 0.18242312 0.01190093
[9,] 0.36276774 0.08494153 0.29089366 0.29192314 -0.08167747 0.06143580
[10,] 0.13949328 0.04892777 0.06291865 0.06293138 0.46048231 0.08565623
[,7] [,8] [,9] [,10]
[1,] 0.11335313 0.31989860 0.36276774 0.13949328
[2,] -0.28196282 0.06096411 0.08494153 0.04892777
[3,] 0.05178784 0.22546939 0.29089366 0.06291865
[4,] 0.05150834 0.22567641 0.29192314 0.06293138
[5,] 0.01654545 0.18242312 -0.08167747 0.46048231
[6,] 0.08245393 0.01190093 0.06143580 0.08565623
[7,] 0.71803718 0.06096411 0.08494153 0.04892777
[8,] 0.06096411 1.06828583 0.14210810 0.04593034
[9,] 0.08494153 0.14210810 1.04967236 0.31144189
[10,] 0.04892777 0.04593034 0.31144189 0.73629909
$ineqx0
NULL
$nfuneval
[1] 198
$outer.iter
[1] 2
$elapsed
Time difference of 0.05998635 secs
$vscale
[1] 1.009846 0.000010 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
[9] 1.000000 1.000000 1.000000 1.000000
$coefficients
Pipeline(Lrnr_screener_randomForest_5_20->Stack)_Lrnr_glm_TRUE
0.0006205406
Pipeline(Lrnr_screener_randomForest_5_20->Stack)_Lrnr_mean
0.0000000000
Pipeline(Lrnr_screener_randomForest_5_20->Stack)_Lrnr_glmnet_NULL_deviance_10_0_100_TRUE
0.0004312859
Pipeline(Lrnr_screener_randomForest_5_20->Stack)_Lrnr_glmnet_NULL_deviance_10_1_100_TRUE
0.0004204241
Pipeline(Lrnr_screener_randomForest_5_20->Stack)_Lrnr_xgboost_20_1_4_0.1
0.2263625484
Stack_Lrnr_glm_TRUE
0.2935481024
Stack_Lrnr_mean
0.0000000000
Stack_Lrnr_glmnet_NULL_deviance_10_0_100_TRUE
0.1896831968
Stack_Lrnr_glmnet_NULL_deviance_10_1_100_TRUE
0.2883645371
Stack_Lrnr_xgboost_20_1_4_0.1
0.0005693647
$training_offset
[1] FALSE
$name
[1] "solnp"
[1] "Cross-validated risk (MSE, squared error loss):"
learner
1: Pipeline(Lrnr_screener_randomForest_5_20->Stack)_Lrnr_glm_TRUE
2: Pipeline(Lrnr_screener_randomForest_5_20->Stack)_Lrnr_mean
3: Pipeline(Lrnr_screener_randomForest_5_20->Stack)_Lrnr_glmnet_NULL_deviance_10_0_100_TRUE
4: Pipeline(Lrnr_screener_randomForest_5_20->Stack)_Lrnr_glmnet_NULL_deviance_10_1_100_TRUE
5: Pipeline(Lrnr_screener_randomForest_5_20->Stack)_Lrnr_xgboost_20_1_4_0.1
6: Stack_Lrnr_glm_TRUE
7: Stack_Lrnr_mean
8: Stack_Lrnr_glmnet_NULL_deviance_10_0_100_TRUE
9: Stack_Lrnr_glmnet_NULL_deviance_10_1_100_TRUE
10: Stack_Lrnr_xgboost_20_1_4_0.1
11: SuperLearner
coefficients mean_risk SE_risk fold_SD fold_min_risk fold_max_risk
1: 0.0006205406 1.035485 0.02446142 0.06008226 0.9352596 1.119394
2: 0.0000000000 1.065401 0.02503198 0.05999366 0.9689145 1.143488
3: 0.0004312859 1.035467 0.02446390 0.06007457 0.9355407 1.119012
4: 0.0004204241 1.035561 0.02445759 0.06023272 0.9352523 1.119315
5: 0.2263625484 1.044729 0.02405570 0.06265341 0.9211017 1.117049
6: 0.2935481024 1.018949 0.02372195 0.05817436 0.9095780 1.088981
7: 0.0000000000 1.065401 0.02503198 0.05999366 0.9689145 1.143488
8: 0.1896831968 1.014359 0.02362759 0.05643860 0.9191569 1.093618
9: 0.2883645371 1.012153 0.02348449 0.05727088 0.9187793 1.095675
10: 0.0005693647 1.035503 0.02371762 0.06206027 0.9341196 1.119005
11: NA 1.009848 0.02345284 0.05818497 0.9055731 1.087758
```

## 3.5 Cross-validated Super Learner

We can cross-validate the Super Learner to see how well the Super Learner performs on unseen data, and obtain an estimate of the cross-validated risk of the Super Learner.

This estimation procedure requires an “external” layer of cross-validation, also called nested cross-validation, which involves setting aside a separate holdout sample that we don’t use to fit the Super Learner. This external cross-validation procedure may also incorporate 10 folds, which is the default in `sl3`. However, we will use 2 outer/external folds of cross-validation for computational efficiency.

We also need to specify a loss function to evaluate the Super Learner. Documentation for the available loss functions can be found in the `sl3` Loss Function Reference.

```
washb_task_new <- make_sl3_Task(
  data = washb_data,
  covariates = covars,
  outcome = outcome,
  folds = make_folds(washb_data, fold_fun = folds_vfold, V = 2)
)
```

```
Warning in process_data(data, nodes, column_names = column_names, flag = flag, :
Missing covariate data detected: imputing covariates.
```

```
CVsl <- CV_lrnr_sl(sl_fit, washb_task_new, loss_squared_error)
CVsl %>%
  kable(digits = 4) %>%
  kableExtra::kable_styling(fixed_thead = TRUE) %>%
  scroll_box(width = "100%", height = "300px")
```

learner | coefficients | mean_risk | SE_risk | fold_SD | fold_min_risk | fold_max_risk |
---|---|---|---|---|---|---|
Pipeline(Lrnr_screener_randomForest_5_20->Stack)_Lrnr_glm_TRUE | 0.0694 | 1.0343 | 0.0244 | 0.0354 | 1.0092 | 1.0593 |
Pipeline(Lrnr_screener_randomForest_5_20->Stack)_Lrnr_mean | 0.0000 | 1.0653 | 0.0250 | 0.0378 | 1.0386 | 1.0920 |
Pipeline(Lrnr_screener_randomForest_5_20->Stack)_Lrnr_glmnet_NULL_deviance_10_0_100_TRUE | 0.0003 | 1.0344 | 0.0244 | 0.0358 | 1.0091 | 1.0598 |
Pipeline(Lrnr_screener_randomForest_5_20->Stack)_Lrnr_glmnet_NULL_deviance_10_1_100_TRUE | 0.0010 | 1.0344 | 0.0244 | 0.0356 | 1.0093 | 1.0596 |
Pipeline(Lrnr_screener_randomForest_5_20->Stack)_Lrnr_xgboost_20_1_4_0.1 | 0.2111 | 1.0486 | 0.0241 | 0.0369 | 1.0225 | 1.0747 |
Stack_Lrnr_glm_TRUE | 0.1007 | 1.0389 | 0.0242 | 0.0281 | 1.0190 | 1.0587 |
Stack_Lrnr_mean | 0.0000 | 1.0653 | 0.0250 | 0.0378 | 1.0386 | 1.0920 |
Stack_Lrnr_glmnet_NULL_deviance_10_0_100_TRUE | 0.2658 | 1.0216 | 0.0239 | 0.0357 | 0.9964 | 1.0468 |
Stack_Lrnr_glmnet_NULL_deviance_10_1_100_TRUE | 0.2942 | 1.0204 | 0.0238 | 0.0296 | 0.9995 | 1.0414 |
Stack_Lrnr_xgboost_20_1_4_0.1 | 0.0574 | 1.0378 | 0.0238 | 0.0327 | 1.0147 | 1.0609 |
SuperLearner | NA | 1.0173 | 0.0237 | 0.0322 | 0.9945 | 1.0401 |

## 3.6 Variable Importance Measures with `sl3`

Variable importance can be interesting and informative. It can also be contradictory and confusing. Nevertheless, we like it, and so do collaborators, so we created a variable importance function in `sl3`! The `sl3` `varimp` function returns a table with variables listed in decreasing order of importance (i.e., the most important variable is on the first row).

The measure of importance in `sl3` is based on the difference in risk between the learner fit with a permuted covariate and the learner fit with the true covariate, computed for each covariate. The larger the risk difference, the more important the variable is to the prediction.

Let’s explore the `sl3` variable importance measurements for the `washb` data.

```
washb_varimp <- varimp(sl_fit, loss_squared_error)
washb_varimp %>%
  kable(digits = 4) %>%
  kableExtra::kable_styling(fixed_thead = TRUE) %>%
  scroll_box(width = "100%", height = "300px")
```

X | risk_diff |
---|---|
aged | 0.0313 |
momedu | 0.0060 |
momheight | 0.0052 |
asset_refrig | 0.0047 |
asset_chair | 0.0043 |
month | 0.0039 |
asset_table | 0.0019 |
elec | 0.0019 |
floor | 0.0014 |
tr | 0.0011 |
fracode | 0.0010 |
asset_chouki | 0.0010 |
Nlt18 | 0.0009 |
asset_wardrobe | 0.0009 |
asset_sewmach | 0.0008 |
momage | 0.0007 |
walls | 0.0006 |
asset_mobile | 0.0006 |
asset_moto | 0.0004 |
Ncomp | 0.0003 |
delta_momheight | 0.0002 |
hfiacat | 0.0001 |
asset_khat | -0.0001 |
roof | -0.0001 |
sex | -0.0001 |
delta_momage | -0.0002 |
asset_tv | -0.0003 |
asset_bike | -0.0004 |
watmin | -0.0009 |

## 3.7 Exercises

### 3.7.1 Predicting Myocardial Infarction with `sl3`

Follow the steps below to predict myocardial infarction (`mi`) using the available covariate data. We thank Prof. David Benkeser at Emory University for making this Cardiovascular Health Study (CHS) data accessible.

```
# load the data set
db_data <-
  url("https://raw.githubusercontent.com/benkeser/sllecture/master/chspred.csv")
chspred <- read_csv(file = db_data, col_names = TRUE)

# take a quick peek
head(chspred) %>%
  kable(digits = 4) %>%
  kableExtra::kable_styling(fixed_thead = TRUE) %>%
  scroll_box(width = "100%", height = "300px")
```

waist | alcoh | hdl | beta | smoke | ace | ldl | bmi | aspirin | gend | age | estrgn | glu | ins | cysgfr | dm | fetuina | whr | hsed | race | logcystat | logtrig | logcrp | logcre | health | logkcal | sysbp | mi |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
110.1642 | 0.0000 | 66.4974 | 0 | 0 | 1 | 114.2162 | 27.9975 | 0 | 0 | 73.5179 | 0 | 159.9314 | 70.3343 | 75.0078 | 1 | 0.1752 | 1.1690 | 1 | 1 | -0.3420 | 5.4063 | 2.0126 | -0.6739 | 0 | 4.3926 | 177.1345 | 0 |
89.9763 | 0.0000 | 50.0652 | 0 | 0 | 0 | 103.7766 | 20.8931 | 0 | 0 | 61.7723 | 0 | 153.3888 | 33.9695 | 82.7433 | 1 | 0.5717 | 0.9011 | 0 | 0 | -0.0847 | 4.8592 | 3.2933 | -0.5551 | 1 | 6.2071 | 136.3742 | 0 |
106.1941 | 8.4174 | 40.5059 | 0 | 0 | 0 | 165.7158 | 28.4554 | 1 | 1 | 72.9312 | 0 | 121.7145 | -17.3017 | 74.6989 | 0 | 0.3517 | 1.1797 | 0 | 1 | -0.4451 | 4.5088 | 0.3013 | -0.0115 | 0 | 6.7320 | 135.1993 | 0 |
90.0566 | 0.0000 | 36.1750 | 0 | 0 | 0 | 45.2035 | 23.9608 | 0 | 0 | 79.1191 | 0 | 53.9691 | 11.7315 | 95.7823 | 0 | 0.5439 | 1.1360 | 0 | 0 | -0.4807 | 5.1832 | 3.0243 | -0.5751 | 1 | 7.3972 | 139.0182 | 0 |
78.6143 | 2.9790 | 71.0642 | 0 | 1 | 0 | 131.3121 | 10.9656 | 0 | 1 | 69.0179 | 0 | 94.3153 | 9.7112 | 72.7109 | 0 | 0.4916 | 1.1028 | 1 | 0 | 0.3121 | 4.2190 | -0.7057 | 0.0053 | 1 | 8.2779 | 88.0470 | 0 |
91.6593 | 0.0000 | 59.4963 | 0 | 0 | 0 | 171.1872 | 29.1317 | 0 | 1 | 81.8346 | 0 | 212.9066 | -28.2269 | 69.2184 | 1 | 0.4621 | 0.9529 | 1 | 0 | -0.2872 | 5.1773 | 0.9705 | 0.2127 | 1 | 5.9942 | 69.5943 | 0 |

- Create an `sl3` task, setting myocardial infarction `mi` as the outcome and using all available covariate data.
- Make a library of seven relatively fast base learning algorithms (i.e., do not consider BART or HAL). Customize hyperparameters for one of your learners. Feel free to use learners from `sl3` or `SuperLearner`. You may use the same base learning library that is presented above.
- Incorporate feature selection with the `SuperLearner` screener `screen.corP`.
- Fit the metalearning step with the default metalearner.
- With the metalearner and base learners, make the Super Learner and train it on the task.
- Print your Super Learner fit by calling `print()` with `$`.
- Cross-validate your Super Learner fit to see how well it performs on unseen data. Specify `loss_squared_error` as the loss function to evaluate the Super Learner.

### 3.7.2 Predicting Recurrent Ischemic Stroke in an RCT with `sl3`

For this exercise, we will work with a random sample of 5,000 patients who participated in the International Stroke Trial (IST). This data is described in Chapter 3.2 of the `tlverse` handbook.

- Train a Super Learner to predict recurrent stroke `DRSISC` with the available covariate data (the 25 other variables). Of course, you can consider feature selection in the machine learning algorithms. In this data, the outcome is occasionally missing, so be sure to specify `drop_missing_outcome = TRUE` when defining the task.
- Use the SL-based predictions to calculate the area under the ROC curve (AUC).
- Calculate the cross-validated AUC with cross-validated SL-based predictions. If you would like to decrease the number of outer cross-validation folds, then specify the task as shown below for 5 outer folds.

```
ist_data <- data.table(
  read.csv("https://raw.githubusercontent.com/tlverse/tlverse-handbook/master/data/ist_sample.csv")
)

# number 3 help
ist_task_CVsl <- make_sl3_Task(
  data = ist_data,
  outcome = "DRSISC",
  covariates = colnames(ist_data)[-which(names(ist_data) == "DRSISC")],
  drop_missing_outcome = TRUE,
  folds = make_folds(
    n = sum(!is.na(ist_data$DRSISC)),
    fold_fun = folds_vfold,
    V = 5
  )
)
```

## 3.8 Concluding Remarks

The general ensemble learning approach of Super Learner can be applied to a diversity of estimation and prediction problems that can be defined by a loss function.

We just discussed conditional mean estimation, outcome prediction and variable importance. In future updates of the handbook, we will delve into prediction of a conditional density, and the optimal individualized treatment rule.

If we plug in the estimator returned by Super Learner into the target parameter mapping, then we would end up with an estimator that has the same bias as what we plugged in, and would not be asymptotically linear. It also would not be a plug-in estimator or efficient.

- An asymptotically linear estimator is important to have, since such estimators converge to the estimand at a \(\frac{1}{\sqrt{n}}\) rate, and thereby permit formal statistical inference (i.e., confidence intervals and \(p\)-values).
- Plug-in estimators of the estimand are desirable because they respect both the local and global constraints of the statistical model (e.g., bounds), and they have better finite-sample properties.
- An efficient estimator is optimal in the sense that it has the lowest possible variance, and is thus the most precise. An estimator is efficient if and only if it is asymptotically linear with influence curve equal to the canonical gradient. The canonical gradient is a mathematical object that is specific to the target estimand, and it provides information on the level of difficulty of the estimation problem. The canonical gradient is shown in the chapters that follow. Practitioners do not need to know how to calculate a canonical gradient in order to understand efficiency and use Targeted Maximum Likelihood Estimation (TMLE). Metaphorically, you do not need to be Yoda in order to be a Jedi.

TMLE is a general strategy that succeeds in constructing efficient and asymptotically linear plug-in estimators.

Super Learner is fantastic for pure prediction, and for obtaining an initial estimate in the first step of TMLE, but we need the second step of TMLE to have the desirable statistical properties mentioned above.

In the chapters that follow, we focus on the targeted maximum likelihood estimator and the targeted minimum loss-based estimator, both referred to as TMLE.

## 3.9 Appendix

### 3.9.1 Exercise 1 Solution

Here is a potential solution to the `sl3` Exercise 1 – Predicting Myocardial Infarction with `sl3`.

```
# make task
chspred_task <- make_sl3_Task(
  data = chspred,
  covariates = head(colnames(chspred), -1),
  outcome = "mi"
)

# make learners
glm_learner <- Lrnr_glm$new()
lasso_learner <- Lrnr_glmnet$new(alpha = 1)
ridge_learner <- Lrnr_glmnet$new(alpha = 0)
enet_learner <- Lrnr_glmnet$new(alpha = 0.5)
curated_glm_learner <- Lrnr_glm_fast$new(formula = "mi ~ smoke + beta + waist")
mean_learner <- Lrnr_mean$new() # That is one mean learner!
glm_fast_learner <- Lrnr_glm_fast$new()
ranger_learner <- Lrnr_ranger$new()
svm_learner <- Lrnr_svm$new()
xgb_learner <- Lrnr_xgboost$new()
screen_cor <- make_learner(Lrnr_screener_corP)
glm_pipeline <- make_learner(Pipeline, screen_cor, glm_learner)

# stack learners together
stack <- make_learner(
  Stack,
  glm_pipeline, glm_learner,
  lasso_learner, ridge_learner, enet_learner,
  curated_glm_learner, mean_learner, glm_fast_learner,
  ranger_learner, svm_learner, xgb_learner
)

# choose metalearner
metalearner <- make_learner(Lrnr_nnls)

# make and train the super learner, then print the fit
sl <- Lrnr_sl$new(
  learners = stack,
  metalearner = metalearner
)
sl_fit <- sl$train(chspred_task)
sl_fit$print()

# cross-validate the super learner fit
CVsl <- CV_lrnr_sl(sl_fit, chspred_task, loss_squared_error)
CVsl
```

### References

Polley, Eric C, and Mark J van der Laan. 2010. “Super Learner in Prediction.” *bepress*.

van der Laan, Mark J, and Sandrine Dudoit. 2003. “Unified Cross-Validation Methodology for Selection Among Estimators and a General Cross-Validated Adaptive Epsilon-Net Estimator: Finite Sample Oracle Inequalities and Examples.” *bepress*.

van der Laan, Mark J, Eric C Polley, and Alan E Hubbard. 2007. “Super Learner.” *Statistical Applications in Genetics and Molecular Biology* 6 (1).

van der Vaart, Aad W, Sandrine Dudoit, and Mark J van der Laan. 2006. “Oracle Inequalities for Multi-Fold Cross Validation.” *Statistics & Decisions* 24 (3): 351–71.

Wolpert, David H. 1992. “Stacked Generalization.” *Neural Networks* 5 (2): 241–59.