5 Super (Machine) Learning
Rachael Phillips
Based on the sl3 R package by Jeremy Coyle, Nima Hejazi, Ivana Malenica, Rachael Phillips, and Oleg Sofrygin.
Updated: 2021-01-08
5.1 Learning Objectives
By the end of this chapter you will be able to:
1. Select a loss function that is appropriate for the functional parameter to be estimated.
2. Assemble an ensemble of learners based on the properties that identify what features they support.
3. Customize learner hyperparameters to incorporate a diversity of different settings.
4. Select a subset of available covariates and pass only those variables to the modeling algorithm.
5. Fit an ensemble with nested cross-validation to obtain an estimate of the performance of the ensemble itself.
6. Obtain sl3 variable importance metrics.
7. Interpret the discrete and continuous Super Learner fits.
8. Rationalize the need to remove bias from the Super Learner to make an optimal bias–variance tradeoff for the parameter of interest.
5.2 Motivation
A common task in statistical data analysis is estimator selection (e.g., for prediction).
There is no universally optimal machine learning algorithm for density estimation or prediction.
For some data, one needs learners that can model a complex function.
For others, possibly as a result of noise or insufficient sample size, a simple, parametric model might fit best.
The Super Learner, an ensemble learner, solves this issue by allowing a combination of learners from the simplest (intercept-only) to the most complex (neural nets, random forests, SVM, etc.).
It works by using cross-validation in a manner that guarantees the resulting fit will be as good as possible, given the learners provided.
5.3 Introduction
In Chapter 1, we introduced the Roadmap for Targeted Learning as a general template to translate real-world data applications into formal statistical estimation problems. The first steps of this roadmap define the statistical estimation problem, which establish:
1. Data as a realization of a random variable, or equivalently, an outcome of a particular experiment.
2. A statistical model, representing the true knowledge about the data-generating experiment.
3. A translation of the scientific question, which is often causal, into a target parameter.
Note that if the target parameter is causal, step 3 also requires establishing identifiability of the target quantity from the observed data distribution, under possible non-testable assumptions that may not necessarily be reasonable. Still, the target quantity does have a valid statistical interpretation. See causal target parameters for more detail on causal models and identifiability.
Now that we have defined the statistical estimation problem, we are ready to construct the TMLE: an asymptotically linear and efficient substitution estimator of this target quantity. The first step in this estimation procedure is an initial estimate of the data-generating distribution, or the relevant part of this distribution that is needed to evaluate the target parameter. For this initial estimation, we use the Super Learner (van der Laan, Polley, and Hubbard 2007).
The Super Learner provides an important step in creating a robust estimator. It is a loss-function-based tool that uses cross-validation to obtain the best prediction of our target parameter, based on a weighted average of a library of machine learning algorithms.
The library of machine learning algorithms consists of functions ("learners" in the sl3 nomenclature) that we think might be consistent with the true data-generating distribution (i.e., algorithms selected based on contextual knowledge of the experiment that generated the data). Also, the library should contain a large set of "default" algorithms that may range from a simple linear regression model to multi-step algorithms involving screening covariates, penalizations, optimizing tuning parameters, etc.
The ensembling of the collection of algorithms with weights ("metalearning" in the sl3 nomenclature) has been shown to be adaptive and robust, even in small samples (Polley and van der Laan 2010). The Super Learner is proven to be asymptotically as accurate as the best possible prediction algorithm in the library (van der Laan and Dudoit 2003; van der Vaart, Dudoit, and van der Laan 2006).
5.3.1 Background
A loss function \(L\) is defined as a function \(L(\psi)(O)\) of the observed data \(O\) and a candidate parameter value \(\psi\), which has unknown true value \(\psi_0\). We can estimate the expected loss (risk) by substituting the empirical distribution \(P_n\) for the true, but unknown, distribution of the observed data \(P_0\). A valid loss function will have expectation (risk) that is minimized at the true value of the parameter \(\psi_0\). For example, the conditional mean minimizes the risk of the squared error loss; thus, the squared error loss is a valid loss function when estimating the conditional mean.
The discrete Super Learner, or cross-validation selector, is the algorithm in the library that minimizes the cross-validated empirical risk.
The cross-validated empirical risk of an algorithm is defined as the empirical mean over a validation sample of the loss of the algorithm fitted on the training sample, averaged across the splits of the data.
The continuous/ensemble Super Learner, often referred to simply as the Super Learner, is a weighted average of the library of algorithms, where the weights are chosen to minimize the cross-validated empirical risk of the library. Restricting the weights to be positive and sum to one (i.e., a convex combination) has been shown to improve upon the discrete Super Learner (Polley and van der Laan 2010; van der Laan, Polley, and Hubbard 2007). This notion of weighted combinations was introduced in Wolpert (1992) for neural networks and adapted for regressions in Breiman (1996).
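To make these definitions concrete, below is a minimal base-R sketch of the cross-validated empirical risk under squared error loss and the resulting discrete Super Learner; the toy data, fold assignment, and two-candidate library here are our own, purely for illustration:
set.seed(34)
n <- 500
W <- runif(n)
Y <- 2 * W^2 + rnorm(n, sd = 0.3)
folds <- sample(rep(1:10, length.out = n)) # random 10-fold assignment
# two candidate learners: a simple linear model and the overall mean
candidates <- list(
  lm = function(train) {
    fit <- lm(Y ~ W, data = train)
    function(w) predict(fit, newdata = data.frame(W = w))
  },
  mean = function(train) {
    m <- mean(train$Y)
    function(w) rep(m, length(w))
  }
)
# cross-validated empirical risk: validation-sample loss, averaged over folds
cv_risk <- sapply(candidates, function(learner) {
  mean(sapply(1:10, function(v) {
    train <- data.frame(W = W[folds != v], Y = Y[folds != v])
    preds <- learner(train)(W[folds == v])
    mean((Y[folds == v] - preds)^2)
  }))
})
cv_risk
names(which.min(cv_risk)) # the discrete Super Learner (cross-validation selector)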
Cross-validation is proven to be optimal for selection among estimators. This result was established through the oracle inequality for the cross-validation selector among a collection of candidate estimators (van der Laan and Dudoit 2003; van der Vaart, Dudoit, and van der Laan 2006). The only condition is that the loss function is uniformly bounded, which is guaranteed in sl3.
5.3.2 Super Learner for Prediction
Say we observe a learning data set \(X_i = (Y_i, W_i)\), for \(i = 1, \ldots, n\), where \(Y_i\) is the outcome of interest, \(W_i\) is a \(p\)-dimensional set of covariates, and our objective is to estimate the function \(\psi_0(W) = E(Y \mid W)\). This function can be expressed as the minimizer of the expected loss: \(\psi_0(W) = \text{argmin}_{\psi} E[L(X, \psi(W))]\). Here, the loss function is represented as \(L\) (e.g., the squared error loss, \(L(X, \psi(W)) = (Y - \psi(W))^2\)).
For prediction, one can use the cross-validated risk to empirically determine the relative performance of the Super Learner. When we have tested different algorithms on actual data and looked at their performance (e.g., MSE of prediction), no single algorithm always wins. Below, we show the results of such a study, comparing the fits of several different learners, including the SL algorithms.
For more detail on the Super Learner, we refer the reader to van der Laan, Polley, and Hubbard (2007) and Polley and van der Laan (2010). The optimality results for the cross-validation selector among a family of algorithms were established in van der Laan and Dudoit (2003) and extended in van der Vaart, Dudoit, and van der Laan (2006).
5.4 sl3 “Microwave Dinner” Implementation
We begin by illustrating the core functionality of the Super Learner algorithm as implemented in sl3. For those who are interested in the internals of sl3, see this sl3 introductory tutorial.
The sl3 implementation consists of the following steps:
0. Load the necessary libraries and data
1. Define the machine learning task
2. Make a Super Learner by creating a library of base learners and a metalearner
3. Train the Super Learner on the machine learning task
4. Obtain predicted values
5.4.1 WASH Benefits Study Example
Using the WASH data, we are interested in predicting weight-for-height z-score whz using the available covariate data. More information on this dataset, and all other data that we will work with in this handbook, is contained in Chapter 3. Let’s begin!
0. Load the necessary libraries and data
First, we will load the relevant R packages, set a seed, and load the data. If you would like to use newer sl3 functionality that is available in the devel branch of the sl3 GitHub repository, you need to install that version of the package (e.g., devtools::install_github("tlverse/sl3@devel")), restart your R session, and then reload the sl3 package.
library(data.table)
library(tidyverse)
library(origami)
library(SuperLearner)
library(sl3)
library(ggplot2)
library(knitr)
library(kableExtra)
# load data set and take a peek
washb_data <- fread(
  paste0("https://raw.githubusercontent.com/tlverse/tlverse-data/master/",
         "wash-benefits/washb_data.csv"),
  stringsAsFactors = TRUE
)
head(washb_data) %>%
  kable(digits = 4) %>%
  kableExtra::kable_styling(fixed_thead = TRUE) %>%
  scroll_box(width = "100%", height = "300px")
whz  tr  fracode  month  aged  sex  momage  momedu  momheight  hfiacat  Nlt18  Ncomp  watmin  elec  floor  walls  roof  asset_wardrobe  asset_table  asset_chair  asset_khat  asset_chouki  asset_tv  asset_refrig  asset_bike  asset_moto  asset_sewmach  asset_mobile 

0.00  Control  N05265  9  268  male  30  Primary (1-5y)  146.40  Food Secure  3  11  0  1  0  1  1  0  1  1  1  0  1  0  0  0  0  1 
-1.16  Control  N05265  9  286  male  25  Primary (1-5y)  148.75  Moderately Food Insecure  2  4  0  1  0  1  1  0  1  0  1  1  0  0  0  0  0  1 
-1.05  Control  N08002  9  264  male  25  Primary (1-5y)  152.15  Food Secure  1  10  0  0  0  1  1  0  0  1  0  1  0  0  0  0  0  1 
-1.26  Control  N08002  9  252  female  28  Primary (1-5y)  140.25  Food Secure  3  5  0  1  0  1  1  1  1  1  1  0  0  0  1  0  0  1 
-0.59  Control  N06531  9  336  female  19  Secondary (>5y)  150.95  Food Secure  2  7  0  1  0  1  1  1  1  1  1  1  0  0  0  0  0  1 
-0.51  Control  N06531  9  304  male  20  Secondary (>5y)  154.20  Severely Food Insecure  0  3  1  1  0  1  1  0  0  0  0  1  0  0  0  0  0  1 
1. Define the machine learning task
To define the machine learning “task” (predict weight-for-height z-score whz using the available covariate data), we need to create an sl3_Task object. The sl3_Task keeps track of the roles the variables play in the machine learning problem, the data, and any metadata (e.g., observation-level weights, IDs, offset).
Also, if we had missing outcomes, we would need to set drop_missing_outcome = TRUE when we create the task. In the next analysis, with the IST stroke trial data, we do have a missing outcome. In the following chapter, we estimate the missingness mechanism and account for it in the TMLE.
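For illustration, such a task would be set up along these lines (a sketch only; some_data stands in for a hypothetical dataset with missing outcome values):
# hypothetical sketch: drop observations with missing outcomes at task creation
# task_missing_y <- make_sl3_Task(
#   data = some_data,
#   covariates = covars,
#   outcome = outcome,
#   drop_missing_outcome = TRUE
# )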
# specify the outcome and covariates
outcome <- "whz"
covars <- colnames(washb_data)[-which(names(washb_data) == outcome)]
# create the sl3 task
washb_task <- make_sl3_Task(
  data = washb_data,
  covariates = covars,
  outcome = outcome
)
#> Warning in process_data(data, nodes, column_names = column_names, flag = flag, :
#> Missing covariate data detected: imputing covariates.
This warning is important. The task just imputed missing covariates for us. Specifically, for each covariate column with missing values, sl3 uses the median to impute missing continuous covariates, and the mode to impute binary and categorical covariates.
Also, for each covariate column with missing values, sl3 adds an additional column indicating whether or not the value was imputed, which is particularly handy when the missingness in the data might be informative.
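As a rough illustration of this imputation scheme (a toy data.table of our own, not the actual sl3 internals):
# toy sketch: median imputation plus an indicator of imputation,
# mimicking the behavior described above
toy <- data.table(momage = c(25, NA, 30, NA, 28))
toy[, delta_momage := as.numeric(is.na(momage))] # 1 where the value is imputed
toy[is.na(momage), momage := median(toy$momage, na.rm = TRUE)]
toy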
Also, notice that we did not specify the number of folds or the loss function in the task. The default cross-validation scheme is V-fold, with the number of folds \(V = 10\).
Let’s visualize our washb_task.
washb_task
#> A sl3 Task with 4695 obs and these nodes:
#> $covariates
#> [1] "tr" "fracode" "month" "aged"
#> [5] "sex" "momage" "momedu" "momheight"
#> [9] "hfiacat" "Nlt18" "Ncomp" "watmin"
#> [13] "elec" "floor" "walls" "roof"
#> [17] "asset_wardrobe" "asset_table" "asset_chair" "asset_khat"
#> [21] "asset_chouki" "asset_tv" "asset_refrig" "asset_bike"
#> [25] "asset_moto" "asset_sewmach" "asset_mobile" "delta_momage"
#> [29] "delta_momheight"
#>
#> $outcome
#> [1] "whz"
#>
#> $id
#> NULL
#>
#> $weights
#> NULL
#>
#> $offset
#> NULL
#>
#> $time
#> NULL
2. Make a Super Learner
Now that we have defined our machine learning problem with the task, we are ready to “make” the Super Learner. This requires specification of
 A library of base learning algorithms that we think might be consistent with the true data-generating distribution.
 A metalearner, to ensemble the base learners.
We might also incorporate
 Feature selection, to pass only a subset of the predictors to the algorithm.
 Hyperparameter specification, to tune base learners.
Learners have properties that indicate what features they support. We may use sl3_list_properties() to get a list of all properties supported by at least one learner.
sl3_list_properties()
#> [1] "binomial" "categorical" "continuous"
#> [4] "cv" "density" "ids"
#> [7] "multivariate_outcome" "offset" "preprocessing"
#> [10] "sampling" "timeseries" "weights"
#> [13] "wrapper"
Since we have a continuous outcome, we may identify the learners that support this outcome type with sl3_list_learners("continuous").
sl3_list_learners("continuous")
#> [1] "Lrnr_arima" "Lrnr_bartMachine"
#> [3] "Lrnr_bilstm" "Lrnr_bound"
#> [5] "Lrnr_caret" "Lrnr_condensier"
#> [7] "Lrnr_cv_selector" "Lrnr_dbarts"
#> [9] "Lrnr_earth" "Lrnr_expSmooth"
#> [11] "Lrnr_gam" "Lrnr_gbm"
#> [13] "Lrnr_glm" "Lrnr_glm_fast"
#> [15] "Lrnr_glmnet" "Lrnr_grf"
#> [17] "Lrnr_h2o_glm" "Lrnr_h2o_grid"
#> [19] "Lrnr_hal9001" "Lrnr_HarmonicReg"
#> [21] "Lrnr_lstm" "Lrnr_mean"
#> [23] "Lrnr_nnls" "Lrnr_optim"
#> [25] "Lrnr_pkg_SuperLearner" "Lrnr_pkg_SuperLearner_method"
#> [27] "Lrnr_pkg_SuperLearner_screener" "Lrnr_polspline"
#> [29] "Lrnr_randomForest" "Lrnr_ranger"
#> [31] "Lrnr_rpart" "Lrnr_rugarch"
#> [33] "Lrnr_screener_corP" "Lrnr_screener_corRank"
#> [35] "Lrnr_screener_randomForest" "Lrnr_solnp"
#> [37] "Lrnr_stratified" "Lrnr_svm"
#> [39] "Lrnr_tsDyn" "Lrnr_xgboost"
Now that we have an idea of some learners, we can construct them using the make_learner function.
# choose base learners
lrnr_glm <- make_learner(Lrnr_glm)
lrnr_mean <- make_learner(Lrnr_mean)
We can customize learner hyperparameters to incorporate a diversity of different settings. Documentation for the learners and their hyperparameters can be found in the sl3 Learners Reference.
lrnr_ranger50 <- make_learner(Lrnr_ranger, num.trees = 50)
lrnr_hal_simple <- make_learner(Lrnr_hal9001, max_degree = 2, n_folds = 2)
lrnr_lasso <- make_learner(Lrnr_glmnet) # alpha default is 1
lrnr_ridge <- make_learner(Lrnr_glmnet, alpha = 0)
lrnr_elasticnet <- make_learner(Lrnr_glmnet, alpha = .5)
We can also include learners from the SuperLearner R package.
lrnr_bayesglm <- Lrnr_pkg_SuperLearner$new("SL.bayesglm")
Here is a fun trick to create customized learners over a grid of parameters.
# I like to crock pot my super learners
grid_params <- list(
  cost = c(0.01, 0.1, 1, 10, 100, 1000),
  gamma = c(0.001, 0.01, 0.1, 1),
  kernel = c("polynomial", "radial", "sigmoid"),
  degree = c(1, 2, 3)
)
grid <- expand.grid(grid_params, KEEP.OUT.ATTRS = FALSE)
params_default <- list(nthread = getOption("sl.cores.learners", 1))
svm_learners <- apply(grid, MARGIN = 1, function(params_tune) {
  do.call(Lrnr_svm$new, c(params_default, as.list(params_tune)))
})

grid_params <- list(
  max_depth = c(2, 4, 6, 8),
  eta = c(0.001, 0.01, 0.1, 0.2, 0.3),
  nrounds = c(20, 50)
)
grid <- expand.grid(grid_params, KEEP.OUT.ATTRS = FALSE)
params_default <- list(nthread = getOption("sl.cores.learners", 1))
xgb_learners <- apply(grid, MARGIN = 1, function(params_tune) {
  do.call(Lrnr_xgboost$new, c(params_default, as.list(params_tune)))
})
Did you see Lrnr_caret when we called sl3_list_learners("continuous")? All we need to specify is the algorithm to use, which is passed as method to caret::train(). The default method for parameter selection is set to "CV", instead of the caret::train() default, boot. The summary metric used to select the optimal model is RMSE for continuous outcomes and Accuracy for categorical and binomial outcomes.
# I have no idea how to tune a neural net (or BART machine..)
lrnr_caret_nnet <- make_learner(Lrnr_caret, algorithm = "nnet")
lrnr_caret_bartMachine <- make_learner(Lrnr_caret, algorithm = "bartMachine",
                                       method = "boot", metric = "Accuracy",
                                       tuneLength = 10)
In order to assemble the library of learners, we need to “stack” them together. A Stack is a special learner and it has the same interface as all other learners. What makes a stack special is that it combines multiple learners by training them simultaneously, so that their predictions can be either combined or compared.
stack <- make_learner(
  Stack,
  lrnr_glm, lrnr_mean, lrnr_ridge, lrnr_lasso, xgb_learners[[10]]
)
We can optionally select a subset of available covariates and pass only those variables to the modeling algorithm.
Let’s consider screening covariates based on their randomForest variable importance ranking (ordered by mean decrease in accuracy). We select the top 5 most important covariates according to this ranking, and we decrease ntree to 20.
Before you think it – I will confess. Bob Ross and I both know that 20 trees makes for a lonely forest, and I shouldn’t even consider it, but these are the sacrifices I have to make for this chapter to build in under 50 minutes!
screen_rf <- make_learner(Lrnr_screener_randomForest, nVar = 5, ntree = 20)
# which covariates are selected on the full data?
screen_rf$train(washb_task)
#> [1] "Lrnr_screener_randomForest_5_20"
#> $selected
#> [1] "month" "aged" "momage" "momheight" "Ncomp"
To “pipe” only the selected covariates to the modeling algorithm, we need to make a Pipeline, which is just a set of learners to be fit sequentially, where the fit from one learner is used to define the task for the next learner.
screen_rf_pipeline <- make_learner(Pipeline, screen_rf, stack)
Now our learners will be preceded by a screening step.
We also consider the original stack, to compare how the methods with feature selection perform relative to the methods without feature selection.
Analogous to what we have seen before, we have to stack the pipeline and the original stack together, so we may use them as base learners in our super learner.
fancy_stack <- make_learner(Stack, screen_rf_pipeline, stack)
# we can visualize the stack
dt_stack <- delayed_learner_train(fancy_stack, washb_task)
plot(dt_stack, color = FALSE, height = "400px", width = "90%")
We will use the default metalearner, which uses Lrnr_solnp() to provide fitting procedures for a pairing of loss function and metalearner function. This default metalearner selects a loss and metalearner pairing based on the outcome type. Note that any learner can be used as a metalearner; an explicit example follows the next code block.
We have made a library/stack of base learners, so we are ready to make the Super Learner. The Super Learner algorithm fits a metalearner on the validation-set predictions.
sl <- make_learner(Lrnr_sl,
  learners = fancy_stack
)
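Since any learner can serve as the metalearner, we could also pass one explicitly instead of relying on the default. A minimal sketch, using Lrnr_nnls (which appeared in the learner listing above) to fit a non-negative least squares combination of the base learners:
metalrnr_nnls <- make_learner(Lrnr_nnls)
sl_nnls <- make_learner(Lrnr_sl,
  learners = fancy_stack,
  metalearner = metalrnr_nnls
)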
We can also use Lrnr_cv to build a Super Learner, cross-validate a stack of learners to compare performance of the learners in the stack, or cross-validate any single learner (see the “Cross-validation” section of this sl3 introductory tutorial).
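For instance, a brief sketch of cross-validating a single learner and computing its cross-validated risk by hand (applying the squared error loss to the out-of-fold predictions; we assume the loss function takes predictions and observations in that order):
cv_glm <- make_learner(Lrnr_cv, lrnr_glm)
cv_glm_fit <- cv_glm$train(washb_task)
# predictions from a Lrnr_cv fit are validation-set (out-of-fold) predictions
mean(loss_squared_error(cv_glm_fit$predict(), washb_task$Y))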
Furthermore, we can define new sl3 learners, which can be used in all the places you could otherwise use any other sl3 learner, including Pipelines, Stacks, and the Super Learner.
In the plot below, we visualize the steps for executing the Super Learner in the tlverse/delayed framework. For those like myself who are not particularly keen on understanding the intricacies of delayed, let’s focus on the main point of this figure: we can see there are 10 realizations of the stack, which represent the 10 cross-validation folds, and there is a separate holdout (top branch of the figure) that will not be used to fit the Super Learner.
dt_sl <- delayed_learner_train(sl, washb_task)
plot(dt_sl, color = FALSE, height = "400px", width = "90%")
3. Train the Super Learner on the machine learning task
The Super Learner algorithm fits a metalearner on the validation-set predictions in a cross-validated manner, thereby avoiding overfitting.
Now we are ready to “train” our Super Learner on our sl3_Task object, washb_task.
sl_fit <- sl$train(washb_task)
4. Obtain predicted values
Now that we have fit the super learner, we are ready to calculate the predicted outcome for each subject.
# we did it! now we have super learner predictions
sl_preds <- sl_fit$predict()
head(sl_preds)
#> [1] -0.6594194 -0.7698511 -0.6601775 -0.6487316 -0.6288337 -0.6847972
We can also obtain a summary of the results.
sl_fit_summary <- sl_fit$print()
#> [1] "SuperLearner:"
#> List of 2
#> $ : chr "Pipeline(Lrnr_screener_randomForest_5_20>Stack)"
#> $ : chr "Stack"
#> [1] "Lrnr_solnp_TRUE_TRUE_FALSE_1e05"
#> $pars
#> [1] 6.184654e-05 1.720116e-05 5.544575e-05 5.898108e-05 2.106970e-01
#> [6] 2.915191e-01 1.720116e-05 1.178057e-01 3.793167e-01 4.507556e-04
#>
#> $convergence
#> [1] 0
#>
#> $values
#> [1] 1.020595 1.010955 1.010951
#>
#> $lagrange
#> [,1]
#> [1,] 0.04374293
#>
#> $hessian
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#> [1,] 1.6979665 0.9035073 0.8210602 0.8982773 0.8667451 1.0239159 0.9035073
#> [2,] 0.9035073 1.5706620 0.9105314 0.9739251 0.7008069 0.8278794 0.5706620
#> [3,] 0.8210602 0.9105314 1.7324722 0.8192425 0.8895203 1.0519616 0.9105314
#> [4,] 0.8982773 0.9739251 0.8192425 1.8711894 0.9460662 1.1077563 0.9739251
#> [5,] 0.8667451 0.7008069 0.8895203 0.9460662 0.8181850 0.9358299 0.7008069
#> [6,] 1.0239159 0.8278794 1.0519616 1.1077563 0.9358299 1.4877842 0.8278794
#> [7,] 0.9035073 0.5706620 0.9105314 0.9739251 0.7008069 0.8278794 1.5706620
#> [8,] 0.9823009 0.8668316 1.0268335 1.0794212 0.9382398 0.7654115 0.8668316
#> [9,] 0.9608695 0.7878954 0.9542916 1.0178275 0.5926686 0.5945269 0.7878954
#> [10,] 0.7806872 0.6019418 0.7778830 0.8412361 0.4721807 0.6097128 0.6019418
#> [,8] [,9] [,10]
#> [1,] 0.9823009 0.9608695 0.7806872
#> [2,] 0.8668316 0.7878954 0.6019418
#> [3,] 1.0268335 0.9542916 0.7778830
#> [4,] 1.0794212 1.0178275 0.8412361
#> [5,] 0.9382398 0.5926686 0.4721807
#> [6,] 0.7654115 0.5945269 0.6097128
#> [7,] 0.8668316 0.7878954 0.6019418
#> [8,] 1.8276642 0.9540252 0.4851260
#> [9,] 0.9540252 1.2662454 0.9510900
#> [10,] 0.4851260 0.9510900 0.8333466
#>
#> $ineqx0
#> NULL
#>
#> $nfuneval
#> [1] 263
#>
#> $outer.iter
#> [1] 2
#>
#> $elapsed
#> Time difference of 0.1036305 secs
#>
#> $vscale
#> [1] 1.010955 0.000010 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
#> [9] 1.000000 1.000000 1.000000 1.000000
#>
#> $coefficients
#> Pipeline(Lrnr_screener_randomForest_5_20->Stack)_Lrnr_glm_TRUE
#> 0.0000000000
#> Pipeline(Lrnr_screener_randomForest_5_20->Stack)_Lrnr_mean
#> 0.0000000000
#> Pipeline(Lrnr_screener_randomForest_5_20->Stack)_Lrnr_glmnet_NULL_deviance_10_0_100_TRUE
#> 0.0000000000
#> Pipeline(Lrnr_screener_randomForest_5_20->Stack)_Lrnr_glmnet_NULL_deviance_10_1_100_TRUE
#> 0.0000000000
#> Pipeline(Lrnr_screener_randomForest_5_20->Stack)_Lrnr_xgboost_20_1_4_0.1
#> 0.2107414178
#> Stack_Lrnr_glm_TRUE
#> 0.2915805478
#> Stack_Lrnr_mean
#> 0.0000000000
#> Stack_Lrnr_glmnet_NULL_deviance_10_0_100_TRUE
#> 0.1178305533
#> Stack_Lrnr_glmnet_NULL_deviance_10_1_100_TRUE
#> 0.3793966305
#> Stack_Lrnr_xgboost_20_1_4_0.1
#> 0.0004508506
#>
#> $training_offset
#> [1] FALSE
#>
#> $name
#> [1] "solnp"
#>
#> [1] "Crossvalidated risk (MSE, squared error loss):"
#> learner
#> 1: Pipeline(Lrnr_screener_randomForest_5_20>Stack)_Lrnr_glm_TRUE
#> 2: Pipeline(Lrnr_screener_randomForest_5_20>Stack)_Lrnr_mean
#> 3: Pipeline(Lrnr_screener_randomForest_5_20>Stack)_Lrnr_glmnet_NULL_deviance_10_0_100_TRUE
#> 4: Pipeline(Lrnr_screener_randomForest_5_20>Stack)_Lrnr_glmnet_NULL_deviance_10_1_100_TRUE
#> 5: Pipeline(Lrnr_screener_randomForest_5_20>Stack)_Lrnr_xgboost_20_1_4_0.1
#> 6: Stack_Lrnr_glm_TRUE
#> 7: Stack_Lrnr_mean
#> 8: Stack_Lrnr_glmnet_NULL_deviance_10_0_100_TRUE
#> 9: Stack_Lrnr_glmnet_NULL_deviance_10_1_100_TRUE
#> 10: Stack_Lrnr_xgboost_20_1_4_0.1
#> 11: SuperLearner
#> coefficients mean_risk SE_risk fold_SD fold_min_risk fold_max_risk
#> 1: 0.0000000000 1.036483 0.02445142 0.07264280 0.8893079 1.131805
#> 2: 0.0000000000 1.065371 0.02503479 0.06410509 0.9338939 1.135933
#> 3: 0.0000000000 1.036478 0.02445751 0.07234017 0.8898660 1.130969
#> 4: 0.0000000000 1.036446 0.02445051 0.07260733 0.8894397 1.131734
#> 5: 0.2107414178 1.047139 0.02402321 0.07080623 0.9154444 1.128126
#> 6: 0.2915805478 1.020204 0.02395530 0.06749990 0.8944174 1.120024
#> 7: 0.0000000000 1.065371 0.02503479 0.06410509 0.9338939 1.135933
#> 8: 0.1178305533 1.015819 0.02373426 0.06555599 0.8862805 1.109816
#> 9: 0.3793966305 1.013498 0.02361186 0.06559624 0.8914167 1.108439
#> 10: 0.0004508506 1.031917 0.02351705 0.06444276 0.9104842 1.114357
#> 11: NA 1.010944 0.02353629 0.06688407 0.8844468 1.103548
From the table of the printed Super Learner fit, we note that the Super Learner had a mean risk of 1.0109438, and that this ensemble weighted the glmnet and glm learners highest while giving the mean learner zero weight.
We can also see that the glmnet (lasso) learner had the lowest cross-validated mean risk, thus making it the cross-validation selector (or the discrete Super Learner). The mean risk of the Super Learner is calculated using the holdout set that we visualized in the dt_sl plot.
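If you prefer to work with these results programmatically rather than reading the printed summary, the fit object exposes them directly. A brief sketch (we assume here that the cv_risk method and the coefficients field, which the print method draws on, are the intended accessors):
# extract the cross-validated risk table and the metalearner weights
cv_risk_table <- sl_fit$cv_risk(loss_squared_error)
sl_fit$coefficients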
5.5 Cross-validated Super Learner
We can cross-validate the Super Learner to see how well it performs on unseen data, and obtain an estimate of the cross-validated risk of the Super Learner.
This estimation procedure requires an “external” layer of cross-validation, also called nested cross-validation, which involves setting aside a separate holdout sample that we don’t use to fit the Super Learner. This external cross-validation procedure may also incorporate 10 folds, which is the default in sl3. However, we will incorporate 2 outer/external folds of cross-validation for computational efficiency.
We also need to specify a loss function to evaluate the Super Learner. Documentation for the available loss functions can be found in the sl3 Loss Function Reference.
washb_task_new <- make_sl3_Task(
  data = washb_data,
  covariates = covars,
  outcome = outcome,
  folds = origami::make_folds(washb_data, fold_fun = folds_vfold, V = 2)
)
#> Warning in process_data(data, nodes, column_names = column_names, flag = flag, :
#> Missing covariate data detected: imputing covariates.
CVsl <- CV_lrnr_sl(sl_fit, washb_task_new, loss_squared_error)
CVsl %>%
  kable(digits = 4) %>%
  kableExtra::kable_styling(fixed_thead = TRUE) %>%
  scroll_box(width = "100%", height = "300px")
learner  coefficients  mean_risk  SE_risk  fold_SD  fold_min_risk  fold_max_risk 

Pipeline(Lrnr_screener_randomForest_5_20->Stack)_Lrnr_glm_TRUE  0.0709  1.0348  0.0244  0.0126  1.0259  1.0437 
Pipeline(Lrnr_screener_randomForest_5_20->Stack)_Lrnr_mean  0.0000  1.0656  0.0250  0.0184  1.0525  1.0786 
Pipeline(Lrnr_screener_randomForest_5_20->Stack)_Lrnr_glmnet_NULL_deviance_10_0_100_TRUE  0.0013  1.0350  0.0245  0.0130  1.0257  1.0442 
Pipeline(Lrnr_screener_randomForest_5_20->Stack)_Lrnr_glmnet_NULL_deviance_10_1_100_TRUE  0.0000  1.0353  0.0245  0.0133  1.0259  1.0447 
Pipeline(Lrnr_screener_randomForest_5_20->Stack)_Lrnr_xgboost_20_1_4_0.1  0.0799  1.0551  0.0242  0.0066  1.0504  1.0597 
Stack_Lrnr_glm_TRUE  0.0482  1.0350  0.0273  0.0411  1.0060  1.0641 
Stack_Lrnr_mean  0.0000  1.0656  0.0250  0.0184  1.0525  1.0786 
Stack_Lrnr_glmnet_NULL_deviance_10_0_100_TRUE  0.1845  1.0188  0.0238  0.0243  1.0016  1.0360 
Stack_Lrnr_glmnet_NULL_deviance_10_1_100_TRUE  0.5232  1.0145  0.0236  0.0236  0.9978  1.0311 
Stack_Lrnr_xgboost_20_1_4_0.1  0.0920  1.0417  0.0235  0.0159  1.0304  1.0529 
SuperLearner  NA  1.0144  0.0234  0.0226  0.9984  1.0304 
5.6 Variable Importance Measures with sl3
Variable importance can be interesting and informative. It can also be contradictory and confusing. Nevertheless, we like it, and so do collaborators, so we created a variable importance function in sl3! The sl3 varimp function returns a table with variables listed in decreasing order of importance (i.e., most important on the first row).
The measure of importance in sl3 is based on a risk difference between the learner fit with a permuted covariate and the learner fit with the true covariate, across all covariates. In this manner, the larger the risk difference, the more important the variable is in the prediction.
The intuition of this measure is that it calculates the risk (in terms of the average loss in predictive accuracy) of losing one covariate, while keeping everything else fixed, and compares it to the risk if the covariate was not lost. If this risk difference is zero, then losing that covariate had no impact, and the covariate is thus not important by this measure. We do this across all of the covariates. As stated above, we don’t actually remove the covariate; we just permute/shuffle it, but the idea is that this shuffling distorts potentially meaningful information that was present in the covariate. This idea of permuting instead of removing saves a lot of time, and is also incorporated in the randomForest variable importance measures.
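To see the idea in code, here is a conceptual sketch of permutation importance for a single covariate (aged), written by hand for illustration; the actual varimp internals may differ in details such as how imputation columns are handled:
# risk with the true covariate values
risk_full <- mean(loss_squared_error(sl_fit$predict(washb_task), washb_task$Y))
# shuffle one covariate, rebuild the task, and recompute the risk
washb_permuted <- data.table::copy(washb_data)
washb_permuted$aged <- sample(washb_permuted$aged)
permuted_task <- make_sl3_Task(
  data = washb_permuted,
  covariates = covars,
  outcome = outcome
)
risk_permuted <- mean(loss_squared_error(sl_fit$predict(permuted_task), washb_task$Y))
risk_permuted - risk_full # a larger risk difference means a more important covariate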
Let’s explore the sl3 variable importance measurements for the washb data.
washb_varimp <- varimp(sl_fit, loss_squared_error)
washb_varimp %>%
  kable(digits = 4) %>%
  kableExtra::kable_styling(fixed_thead = TRUE) %>%
  scroll_box(width = "100%", height = "300px")
X  risk_diff 

aged  0.0371 
momedu  0.0079 
asset_refrig  0.0049 
month  0.0038 
tr  0.0035 
momheight  0.0032 
Nlt18  0.0028 
asset_chair  0.0028 
asset_table  0.0022 
elec  0.0018 
floor  0.0015 
walls  0.0011 
momage  0.0009 
asset_mobile  0.0005 
delta_momheight  0.0003 
asset_sewmach  0.0002 
asset_bike  0.0002 
fracode  0.0002 
asset_moto  0.0001 
asset_chouki  0.0001 
asset_wardrobe  0.0000 
hfiacat  -0.0001 
sex  -0.0001 
delta_momage  -0.0002 
roof  -0.0002 
Ncomp  -0.0003 
asset_tv  -0.0003 
asset_khat  -0.0004 
watmin  -0.0008 
# plot variable importance
washb_varimp %>%
  mutate(name = forcats::fct_reorder(X, risk_diff)) %>%
  ggplot(aes(x = risk_diff, y = name)) +
  geom_dotplot(binaxis = "y") +
  labs(x = "Risk Difference", y = "Covariate",
       title = "sl3 Variable Importance for WASH Benefits Example Data")
5.7 Exercises
5.7.1 Predicting Myocardial Infarction with sl3
Follow the steps below to predict myocardial infarction (mi) using the available covariate data. We thank Prof. David Benkeser at Emory University for making this Cardiovascular Health Study (CHS) data accessible.
# load the data set
db_data <-
  url("https://raw.githubusercontent.com/benkeser/sllecture/master/chspred.csv")
chspred <- read_csv(file = db_data, col_names = TRUE)
# take a quick peek
head(chspred) %>%
  kable(digits = 4) %>%
  kableExtra::kable_styling(fixed_thead = TRUE) %>%
  scroll_box(width = "100%", height = "300px")
waist  alcoh  hdl  beta  smoke  ace  ldl  bmi  aspirin  gend  age  estrgn  glu  ins  cysgfr  dm  fetuina  whr  hsed  race  logcystat  logtrig  logcrp  logcre  health  logkcal  sysbp  mi 

110.1642  0.0000  66.4974  0  0  1  114.2162  27.9975  0  0  73.5179  0  159.9314  70.3343  75.0078  1  0.1752  1.1690  1  1  0.3420  5.4063  2.0126  0.6739  0  4.3926  177.1345  0 
89.9763  0.0000  50.0652  0  0  0  103.7766  20.8931  0  0  61.7723  0  153.3888  33.9695  82.7433  1  0.5717  0.9011  0  0  0.0847  4.8592  3.2933  0.5551  1  6.2071  136.3742  0 
106.1941  8.4174  40.5059  0  0  0  165.7158  28.4554  1  1  72.9312  0  121.7145  17.3017  74.6989  0  0.3517  1.1797  0  1  0.4451  4.5088  0.3013  0.0115  0  6.7320  135.1993  0 
90.0566  0.0000  36.1750  0  0  0  45.2035  23.9608  0  0  79.1191  0  53.9691  11.7315  95.7823  0  0.5439  1.1360  0  0  0.4807  5.1832  3.0243  0.5751  1  7.3972  139.0182  0 
78.6143  2.9790  71.0642  0  1  0  131.3121  10.9656  0  1  69.0179  0  94.3153  9.7112  72.7109  0  0.4916  1.1028  1  0  0.3121  4.2190  0.7057  0.0053  1  8.2779  88.0470  0 
91.6593  0.0000  59.4963  0  0  0  171.1872  29.1317  0  1  81.8346  0  212.9066  28.2269  69.2184  1  0.4621  0.9529  1  0  0.2872  5.1773  0.9705  0.2127  1  5.9942  69.5943  0 
1. Create an sl3 task, setting myocardial infarction mi as the outcome and using all available covariate data.
2. Make a library of seven relatively fast base learning algorithms (i.e., do not consider BART or HAL). Customize hyperparameters for one of your learners. Feel free to use learners from sl3 or SuperLearner. You may use the same base learning library that is presented above.
3. Incorporate feature selection with the screener screen.corP.
4. Fit the metalearning step with the default metalearner.
5. With the metalearner and base learners, make the Super Learner and train it on the task.
6. Print your Super Learner fit by calling print() with $.
7. Cross-validate your Super Learner fit to see how well it performs on unseen data. Specify loss_squared_error as the loss function to evaluate the Super Learner.
8. Use the varimp() function to identify the most important predictor of myocardial infarction.
5.7.2 Predicting Recurrent Ischemic Stroke in an RCT with sl3
For this exercise, we will work with a random sample of 5,000 patients who participated in the International Stroke Trial (IST). This data is described in Chapter 3.2 of the tlverse handbook.
1. Train a Super Learner to predict recurrent stroke DRSISC with the available covariate data (the 25 other variables). Of course, you can consider feature selection in the machine learning algorithms. In this data, the outcome is occasionally missing, so be sure to specify drop_missing_outcome = TRUE when defining the task.
2. Use the SL-based predictions to calculate the area under the ROC curve (AUC).
3. Calculate the cross-validated AUC to evaluate the performance of the Super Learner on unseen data.
4. Which covariates are the most predictive of 14-day recurrent stroke, according to sl3 variable importance measures?
ist_data <- paste0("https://raw.githubusercontent.com/tlverse/",
                   "tlverse-handbook/master/data/ist_sample.csv") %>% fread()
# number 3 help
ist_task_CVsl <- make_sl3_Task(
  data = ist_data,
  outcome = "DRSISC",
  covariates = colnames(ist_data)[-which(names(ist_data) == "DRSISC")],
  drop_missing_outcome = TRUE,
  folds = origami::make_folds(
    n = sum(!is.na(ist_data$DRSISC)),
    fold_fun = folds_vfold,
    V = 5
  )
)
5.8 Concluding Remarks
The general ensemble learning approach of Super Learner can be applied to a diversity of estimation and prediction problems that can be defined by a loss function.
We just discussed conditional mean estimation, outcome prediction, and variable importance. In future updates of this chapter, we will delve into prediction of a conditional density and the optimal individualized treatment rule.

If we plug in the estimator returned by the Super Learner into the target parameter mapping, then we would end up with an estimator that has the same bias as what we plugged in, and would not be asymptotically linear. It also would not be efficient.
 An asymptotically linear estimator is important to have, since such estimators converge to the estimand at a \(\frac{1}{\sqrt{n}}\) rate, and thereby permit formal statistical inference (i.e., confidence intervals and \(p\)-values); see the expansion displayed after this list.
 Plug-in estimators of the estimand are desirable because they respect both the local and global constraints of the statistical model (e.g., bounds), and they have better finite-sample properties.
 An efficient estimator is optimal in the sense that it has the lowest possible variance, and is thus the most precise. An estimator is efficient if and only if it is asymptotically linear with influence curve equal to the canonical gradient. The canonical gradient is a mathematical object that is specific to the target estimand, and it provides information on the level of difficulty of the estimation problem. The canonical gradient is shown in the chapters that follow. Practitioners do not need to know how to calculate a canonical gradient in order to understand efficiency and use Targeted Maximum Likelihood Estimation (TMLE). Metaphorically, you do not need to be Yoda in order to be a Jedi.
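For reference, the defining expansion of an asymptotically linear estimator \(\psi_n\) of \(\psi_0\) is
\[\psi_n - \psi_0 = \frac{1}{n} \sum_{i=1}^{n} IC(O_i) + o_P\left(\frac{1}{\sqrt{n}}\right),\]
where \(IC\) is the mean-zero influence curve. By the central limit theorem, \(\sqrt{n}(\psi_n - \psi_0)\) converges to a normal distribution with variance \(\text{Var}(IC(O))\), which is what justifies the Wald-type confidence intervals and \(p\)-values mentioned above.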
TMLE is a general strategy that succeeds in constructing efficient and asymptotically linear plug-in estimators.
Super Learner is fantastic for pure prediction, and for obtaining an initial estimate in the first step of TMLE, but we need the second step of TMLE to have the desirable statistical properties mentioned above.
In the chapters that follow, we focus on the targeted maximum likelihood estimator and the targeted minimum loss-based estimator, both referred to as TMLE.
5.9 Appendix
5.9.1 Exercise 1 Solution
Here is a potential solution to the sl3 Exercise 1 – Predicting Myocardial Infarction with sl3.
library(sl3)
library(origami)
library(data.table)
library(tidyverse)
library(ggplot2)
db_data <-
  url("https://raw.githubusercontent.com/benkeser/sllecture/master/chspred.csv")
chspred <- read_csv(file = db_data, col_names = TRUE)
# make task
chspred_task <- make_sl3_Task(
  data = chspred,
  covariates = head(colnames(chspred), -1),
  outcome = "mi"
)
# make learners
glm_learner <- Lrnr_glm$new()
lasso_learner <- Lrnr_glmnet$new(alpha = 1)
ridge_learner <- Lrnr_glmnet$new(alpha = 0)
enet_learner <- Lrnr_glmnet$new(alpha = 0.5)
# curated_glm_learner restricts the covariates to smoke, beta, and waist
curated_glm_learner <- Lrnr_glm_fast$new(covariates = c("smoke", "beta", "waist"))
mean_learner <- Lrnr_mean$new() # That is one mean learner!
glm_fast_learner <- Lrnr_glm_fast$new()
ranger_learner <- Lrnr_ranger$new()
svm_learner <- Lrnr_svm$new()
xgb_learner <- Lrnr_xgboost$new()
# screening
screen_cor <- make_learner(Lrnr_screener_corP)
glm_pipeline <- make_learner(Pipeline, screen_cor, glm_learner)
# stack learners together
stack <- make_learner(
  Stack,
  glm_pipeline, glm_learner,
  lasso_learner, ridge_learner, enet_learner,
  curated_glm_learner, mean_learner, glm_fast_learner,
  ranger_learner, svm_learner, xgb_learner
)
# make and train super learner
sl <- Lrnr_sl$new(
  learners = stack
)
sl_fit <- sl$train(chspred_task)
sl_fit$print()
CVsl <- CV_lrnr_sl(sl_fit, chspred_task, loss_squared_error)
CVsl
importance <- varimp(sl_fit, loss_squared_error)
importance %>%
  mutate(name = forcats::fct_reorder(X, risk_diff)) %>%
  ggplot(aes(x = risk_diff, y = name)) +
  geom_dotplot(binaxis = "y") +
  labs(x = "Risk Difference", y = "Covariate",
       title = "sl3 Variable Importance for Myocardial Infarction Prediction")
5.9.2 Exercise 2 Solution
Here is a potential solution to sl3 Exercise 2 – Predicting Recurrent Ischemic Stroke in an RCT with sl3.
library(sl3)
library(origami)
library(data.table)
library(tidyverse)
library(ROCR) # for AUC calculation
library(ggplot2)
ist_data <- paste0("https://raw.githubusercontent.com/tlverse/",
                   "tlverse-handbook/master/data/ist_sample.csv") %>% fread()
# make task
ist_task <- make_sl3_Task(
  data = ist_data,
  outcome = "DRSISC",
  covariates = colnames(ist_data)[-which(names(ist_data) == "DRSISC")],
  drop_missing_outcome = TRUE
)
# learner library
lrnr_glm <- Lrnr_glm$new()
lrnr_lasso <- Lrnr_glmnet$new(alpha = 1)
lrnr_ridge <- Lrnr_glmnet$new(alpha = 0)
lrnr_enet <- Lrnr_glmnet$new(alpha = 0.5)
lrnr_mean <- Lrnr_mean$new()
lrnr_ranger <- Lrnr_ranger$new()
lrnr_svm <- Lrnr_svm$new()
# xgboost grid
grid_params <- list(max_depth = c(2, 5, 8),
                    eta = c(0.01, 0.15, 0.3))
grid <- expand.grid(grid_params, KEEP.OUT.ATTRS = FALSE)
params_default <- list(nthread = getOption("sl.cores.learners", 1))
xgb_learners <- apply(grid, MARGIN = 1, function(params_tune) {
  do.call(Lrnr_xgboost$new, c(params_default, as.list(params_tune)))
})
learners <- unlist(list(xgb_learners, lrnr_ridge, lrnr_mean, lrnr_lasso,
                        lrnr_glm, lrnr_enet, lrnr_ranger, lrnr_svm),
                   recursive = TRUE)
# super learner
sl <- Lrnr_sl$new(learners)
sl_fit <- sl$train(ist_task)
# AUC
preds <- sl_fit$predict()
obs <- c(na.omit(ist_data$DRSISC))
AUC <- performance(prediction(preds, obs), measure = "auc")@y.values[[1]]
plot(performance(prediction(preds, obs), "tpr", "fpr"))
# CVsl
ist_task_CVsl <- make_sl3_Task(
  data = ist_data,
  outcome = "DRSISC",
  covariates = colnames(ist_data)[-which(names(ist_data) == "DRSISC")],
  drop_missing_outcome = TRUE,
  folds = origami::make_folds(
    n = sum(!is.na(ist_data$DRSISC)),
    fold_fun = folds_vfold,
    V = 5
  )
)
CVsl <- CV_lrnr_sl(sl_fit, ist_task_CVsl, loss_loglik_binomial)
CVsl
# sl3 variable importance plot
importance <- varimp(sl_fit, loss_loglik_binomial)
importance %>%
  mutate(name = forcats::fct_reorder(X, risk_diff)) %>%
  ggplot(aes(x = risk_diff, y = name)) +
  geom_dotplot(binaxis = "y") +
  labs(x = "Risk Difference", y = "Covariate",
       title = "Variable Importance for Predicting Recurrent Ischemic Stroke")