6 Super (Machine) Learning
Rachael Phillips
Based on the sl3
R
package by Jeremy
Coyle, Nima Hejazi, Ivana Malenica, Rachael Phillips, and Oleg Sofrygin.
Updated: 20210411
Learning Objectives
By the end of this chapter you will be able to:
 Select a loss function that is appropriate for the functional parameter to be estimated.
 Assemble an ensemble of learners based on the properties that identify what features they support.
 Customize learner tuning parameters to incorporate a diversity of different settings.
 Select a subset of available covariates and pass only those variables to the modeling algorithm.
 Fit an ensemble with nested crossvalidation to obtain an estimate of the performance of the ensemble itself.
 Obtain
sl3
variable importance metrics.  Interpret the discrete and continuous Super Learner fits.
 Rationalize the need to remove bias from the Super Learner to make an optimal biasvariance tradeoff for the parameter of interest.
Motivation
 A common task in data analysis is prediction – using the observed data to learn a function, which can be used to map new input variables into a predicted outcome.
 For some data, algorithms that can model a complex function are necessary to adequately model the data. For other data, a main terms regression model might fit the data quite well.
 The Super Learner (SL), an ensemble learner, solves this issue, by allowing a combination of algorithms from the simplest (interceptonly) to most complex (neural nets, random forests, SVM, etc).
 It works by using crossvalidation in a manner which guarantees that the resulting fit will be as good as possible, given the learners provided.
Introduction
In Chapter 1, we introduced the Roadmap for Targeted Learning as a general template to translate realworld data applications into formal statistical estimation problems. The first steps of this roadmap define the statistical estimation problem, which establish
 Data as a random variable, or equivalently, a realization of a particular experiment/study. We assume the observations in the data are independent and identically distributed.
 A statistical model as the set of possible probability distributions that could have given rise to the observed data.
 A translation of the scientific question, which is often causal, into a target estimand.
Note that if the estimand is causal, step 3 also requires establishing identifiability of the estimand from the observed data, under possible nontestable assumptions that may not necessarily be reasonable. Still, the target quantity does have a valid statistical interpretation. See causal target parameters for more detail on causal models and identifiability.
Now that we have defined the statistical estimation problem, we are ready to construct the TMLE; an asymptotically linear and efficient substitution estimator of this estimand. The first step in this estimation procedure is an initial estimate of the datagenerating distribution, or the relevant part of this distribution that is needed to evaluate the target parameter. For this initial estimation, we use the Super Learner (SL) (van der Laan, Polley, and Hubbard 2007).
The SL provides an important step in creating a robust estimator. It is a
lossfunctionbased tool that uses crossvalidation to obtain the best
prediction of our target parameter, based on a weighted average of a library of
machine learning algorithms. The library of machine learning algorithms consists
of functions (“learners” in the sl3
nomenclature) that we think might be
consistent with the true datagenerating distribution. By “consistent with the
true datagenerating distribution”, we mean that the algorithms selected should
not violate subjectmatter knowledge about the experiment that generated the
data. Also, the library should contain a diversity of algorithms that range from
parametric regression models to multistep algorithms involving screening
covariates, penalizations, optimizing tuning parameters, etc.
The ensembling of the collection of algorithms with weights (“metalearning” in
the sl3
nomenclature) has been shown to be adaptive and robust, even in small
samples (Polley and van der Laan 2010). The SL is proven to be asymptotically as accurate as
the best possible prediction algorithm in the library (van der Laan and Dudoit 2003; Van der Vaart, Dudoit, and Laan 2006).
Stepbystep overview
Consider the scenario in which we have \(n\) independent and identically distributed observations in the data, and our data structure is not a time series. Also, let’s say we have \(k\) number of candidate learners/algorithms.

Create the validation data for all \(V=v\) folds. Break up the data evenly into \(V=v\) splits; such that no observation is contained more than one split, and the splits contain about the same number of observations (e.g., about \(n/V\) observations in each split).
 If a rare binary outcome (or highly important binary predictor, such as
a treatment) is present in the data, we should consider making the
prevalence of this binary outcome in the splits similar to the
prevalence that exists in the data. We can achieve this by specifying,
for the
strata_ids
argument inorigami::make_folds()
, the vector of binary outcomes (or important binary covariate).  If we have repeated measures or clusterlevel dependence in the data, then all observations within a subject/cluster should be placed in the same split.
 If a rare binary outcome (or highly important binary predictor, such as
a treatment) is present in the data, we should consider making the
prevalence of this binary outcome in the splits similar to the
prevalence that exists in the data. We can achieve this by specifying,
for the

For each fold \(v\):
Separate (i) from (ii):
 The data that was selected for fold \(v\) in Step 1, which contains roughly \(n/V\) total observations. We will refer to this subset of the data as the “validation” data, and it’s also commonly referred to as the “test” data. Let’s call \(n_{\text{validation}}\) as the number of observations in then validation data,
 The data that was NOT selected for fold \(v\) in Step 1, which contains roughly \(n  n_{\text{validation}}\) total observations. We will refer to this subset of the data as the “training”, and \(n_{\text{training}}\) as the number of observations in the training data.
 Fit each of the \(k\) learners on the training data (ii).
 Using each of the \(k\) trained learners, predict the outcomes in the validation data (i). We can call these predictions “crossvalidated predictions”; since they were obtained from the validation sample’s covariate information, which was never seen while fitting these models. We end up with a \(n_{validation} \times k\) matrix of crossvalidated predictions.
 Bind together the rows of all \(v\) \(n_{validation} \times k\) matrices of crossvalidated predictions, to obtain an \(n \times k\) matrix of crossvalidated predictions. This matrix \(n \times k\) matrix of crossvalidated predictions is often referred to as the “levelone” or “Z” matrix.
 Retain the observed outcome \(Y\) for all of the \(n\) observations, using them to measure the “loss” of each crossvalidated prediction (e.g., (\(Y  \hat{Y}^2\)).
 For each \(k\) column, take the (potentially weighted) mean across all of the \(n\) losses, which we call the “crossvalidated empirical risk”. The crossvalidated empirical risk provides measure of performance, summarized across all \(n\), for each of the \(k\) learners. The weights that could be incorporated in the data, and used to calculate a weighted mean, serve to to up weight or down weight samples whose loss should be considered less or more important, respectively.

Establish the ensemble/combination of the \(k\) learners by fitting the socalled “metalearner”. The ensemble is just a weighted combination of the learners, so the weights here are just a \(k\)dimensional vector. The metalearner is a function that decides on the weights to be assigned to each of the \(k\) learners; taking as input the crossvalidated empirical risk for all \(k\) learners (a), or taking as input a loss function and the Z matrix (b).
 The discrete SL (or crossvalidated selector) employs a simple metalearner that takes as input the crossvalidated empirical risk for all \(k\) learners. This metalearner assigns weight of 1 to the single learner with smallest crossvalidated risk, and a weight of 0 to all other learners.
 The ensemble SL (often referred to as the “Super Learner”) employs metalearners that take as input the Z matrix, and the loss function of interest (unless the loss is implied by the metalearning function itself). These metalearners assign the weights such that the weighted combination of Z matrix predictions is optimized to minimize the crossvalidated empirical risk. This often results in more than one learner having positive weight. Aggressive metalearning (e.g., assigning negative weight) can be problematic, leading to overfitting.
 Fit the learners (or only the learners with nonzero weight from Step 5) on the entire sample of \(n\) observations, and use the weights that were obtained in Step 5 to get the SL fit. That’s it! The SL fit is just all of \(k\) learner fits — the weights don’t come in to play until we obtain the predictions from the SL; where the SL predictions are the weighted combination of the \(k\) learner predictions, as determined by the metalearner. Notice that, we use this rigorous, optimal, and fair procedure to derive the weights from the \(n \times k\) Z matrix of crossvalidated predictions; but once we’ve done that, we capitalize on the entire sample of observations, transitioning our focus to obtaining the best fit possible of our \(k\) learners.
SL predictions, variable importance, and/or a crossvalidated SL can be obtained from an SL fit (like most other learners). The crossvalidated SL provides an estimate of the performance of the SL on unseen data, and incorporates a outer layer of crossvalidation in order to crossvalidate this entire procedure.
Below is a figure from [ADD REF] describing the same stepbystep procedure. This figure considers \(k=16\) learners, and in the figure \(p=k\); and the squared error loss function, thus mean squared error (MSE) is the risk.
Theoretical Foundations and Further Reading
 Crossvalidation is proven to be optimal for selection among estimators. This
result was established through the oracle inequality for the crossvalidation
selector among a collection of candidate estimators (van der Laan and Dudoit 2003; Van der Vaart, Dudoit, and Laan 2006). The only condition is that loss function is uniformly
bounded, which is guaranteed in
sl3
. 
We use a loss function \(L\) to assign a measure of performance to each learner \(\psi\) when applied to the data \(O\), and subsequently compare performance across the learners. More generally, \(L\) maps every \(\psi \in \R\) to \(L(\psi) : (O) \mapsto L(\psi)(O)\). We use the terms “learner”, “algorithm”, and “estimator” interchangeably.
 It is important to recall that \(\psi\) is an estimator of \(\psi_0\), the unknown and true parameter value under \(P_0\).
 A valid loss function will have mean/expectation (risk) that is minimized at the true value of the parameter \(\psi_0\). Thus, minimizing the expected loss will bring an estimator \(\psi\) closer to the true \(\psi_0\).
 For example, say we observe a learning data set \(O_i=(Y_i,X_i)\), of \(i=1, \ldots, n\) independent and identically distributed observations, where \(Y_i\) is a continuous outcome of interest, \(X_i\) is a set of covariates. Also, let our objective be to estimate the function \(\psi_0: X \mapsto \psi_0(X) = E_0(Y \mid X)\), which is the conditional mean outcome given covariates. This function can be expressed as the minimizer of the expected squared error loss: \(\psi_0 = \text{argmin}_{\psi} E[L(O,\psi(X))]\), where \(L(O, \psi(X)) = (Y  \psi(X))^2\).
 We can estimate the loss by substituting the empirical distribution of the data \(P_n\) for the true and unknown distribution of the observed data \(P_0\).
 Also, we can use the crossvalidated risk to empirically determine the relative performance of an estimator (i.e., a candidate learner), and perhaps how it’s performance compares to other estimators.
 Once we have tested different estimators on actual data, and looked at the performance (e.g., MSE of predictions across all learners), we can see which algorithm (or weighted combination) has the lowest risk, and thus is closest to the true \(\psi_0\).

The crossvalidated empirical risk of an algorithm is defined as the empirical mean over a validation sample of the loss of the algorithm fitted on the training sample, averaged across the splits of the data.
 The discrete Super Learner, or crossvalidation selector, is the algorithm in the library that minimizes the crossvalidated empirical risk.
 The continuous/ensemble Super Learner, often referred to as Super Learner is a weighted average of the library of algorithm predictions, where the weights are chosen to minimize the crossvalidated empirical risk of the library. This notion of weighted combinations was introduced in Wolpert (1992) for neural networks and adapted for regressions in Breiman (1996). Restricting the weights to be positive and sum to one (i.e., a convex combination) has been shown to perform well in practice (Polley and van der Laan 2010; van der Laan, Polley, and Hubbard 2007).
sl3
“Microwave Dinner” Implementation
We begin by illustrating the core functionality of the SL algorithm as
implemented in sl3
.
The sl3
implementation consists of the following steps:
 Load the necessary libraries and data.
 Define the machine learning task.
 Make an SL by creating library of base learners and a metalearner.
 Train the SL on the machine learning task.
 Obtain predicted values.
WASH Benefits Study Example
Using the WASH Benefits Bangladesh data, we are interested in predicting
weightforheight zscore whz
using the available covariate data. More
information on this dataset, and all other data that we will work with in this
handbook, is contained in Chapter 3. Let’s begin!
0. Load the necessary libraries and data
First, we will load the relevant R
packages, set a seed, and load the data.
library(data.table)
library(dplyr)
library(readr)
library(ggplot2)
library(SuperLearner)
library(origami)
library(sl3)
library(knitr)
library(kableExtra)
# load data set and take a peek
washb_data < fread(
paste0(
"https://raw.githubusercontent.com/tlverse/tlversedata/master/",
"washbenefits/washb_data.csv"
),
stringsAsFactors = TRUE
)
A quick look at the data:
whz  tr  fracode  month  aged  sex  momage  momedu  momheight  hfiacat  Nlt18  Ncomp  watmin  elec  floor  walls  roof  asset_wardrobe  asset_table  asset_chair  asset_khat  asset_chouki  asset_tv  asset_refrig  asset_bike  asset_moto  asset_sewmach  asset_mobile 

0.00  Control  N05265  9  268  male  30  Primary (15y)  146.40  Food Secure  3  11  0  1  0  1  1  0  1  1  1  0  1  0  0  0  0  1 
1.16  Control  N05265  9  286  male  25  Primary (15y)  148.75  Moderately Food Insecure  2  4  0  1  0  1  1  0  1  0  1  1  0  0  0  0  0  1 
1.05  Control  N08002  9  264  male  25  Primary (15y)  152.15  Food Secure  1  10  0  0  0  1  1  0  0  1  0  1  0  0  0  0  0  1 
1.26  Control  N08002  9  252  female  28  Primary (15y)  140.25  Food Secure  3  5  0  1  0  1  1  1  1  1  1  0  0  0  1  0  0  1 
0.59  Control  N06531  9  336  female  19  Secondary (>5y)  150.95  Food Secure  2  7  0  1  0  1  1  1  1  1  1  1  0  0  0  0  0  1 
0.51  Control  N06531  9  304  male  20  Secondary (>5y)  154.20  Severely Food Insecure  0  3  1  1  0  1  1  0  0  0  0  1  0  0  0  0  0  1 
1. Define the machine learning task
To define the machine learning “task” (predict weightforheight Zscore
whz
using the available covariate data), we need to create an sl3_Task
object.
The sl3_Task
keeps track of the roles the variables play in the machine
learning problem, the data, and any metadata (e.g., observationallevel
weights, IDs, offset).
Also, if we had missing outcomes, we would need to set drop_missing_outcome = TRUE
when we create the task. In the next analysis, with the IST stroke trial
data, we do have a missing outcome. In the following chapter, we need to
estimate this “missingness mechanism”; which is the conditional probability that
the outcome is observed, given the history (i.e., variables that were measured
before the missingness). Estimating the missingness mechanism requires learning
a prediction function that outputs the predicted probability that a unit is
missing, given their history.
# specify the outcome and covariates
outcome < "whz"
covars < colnames(washb_data)[which(names(washb_data) == outcome)]
# create the sl3 task
washb_task < make_sl3_Task(
data = washb_data,
covariates = covars,
outcome = outcome
)
#> Warning in process_data(data, nodes, column_names = column_names, flag = flag, :
#> Missing covariate data detected: imputing covariates.
This warning is important. The task just imputed missing covariates for us.
Specifically, for each covariate column with missing values, sl3
uses the
median to impute missing continuous covariates, and the mode to impute binary
and categorical covariates.
Also, for each covariate column with missing values, sl3
adds an additional
column indicating whether or not the value was imputed, which is particularly
handy when the missingness in the data might be informative.
Also, notice that we did not specify the number of folds, or the loss function in the task. The default crossvalidation scheme is Vfold, with \(V=10\) number of folds.
Let’s visualize our washb_task
:
washb_task
#> A sl3 Task with 4695 obs and these nodes:
#> $covariates
#> [1] "tr" "fracode" "month" "aged"
#> [5] "sex" "momage" "momedu" "momheight"
#> [9] "hfiacat" "Nlt18" "Ncomp" "watmin"
#> [13] "elec" "floor" "walls" "roof"
#> [17] "asset_wardrobe" "asset_table" "asset_chair" "asset_khat"
#> [21] "asset_chouki" "asset_tv" "asset_refrig" "asset_bike"
#> [25] "asset_moto" "asset_sewmach" "asset_mobile" "delta_momage"
#> [29] "delta_momheight"
#>
#> $outcome
#> [1] "whz"
#>
#> $id
#> NULL
#>
#> $weights
#> NULL
#>
#> $offset
#> NULL
#>
#> $time
#> NULL
We can’t see when we print the task, but the default crossvalidation fold structure (\(V\)fold crossvalidation with \(V\)=10 folds) was created when we defined the task.
length(washb_task$folds) # how many folds?
#> [1] 10
head(washb_task$folds[[1]]$training_set) # row indexes for fold 1 training
#> [1] 1 2 3 4 5 6
head(washb_task$folds[[1]]$validation_set) # row indexes for fold 1 validation
#> [1] 12 21 29 41 43 53
any(
washb_task$folds[[1]]$training_set %in%
washb_task$folds[[1]]$validation_set
)
#> [1] FALSE
R6
Tip: If you type washb_task$
and then press the “tab” button (you will
need to press “tab” twice if you’re not in RStudio), you can view all of the
active and public fields and methods that can be accessed from the washb_task
object.
2. Make a Super Learner
Now that we have defined our machine learning problem with the sl3_Task
, we
are ready to “make” the Super Learner (SL). This requires specification of
 A set of candidate machine learning algorithms, also commonly referred to as a “library” of “learners”. The set should include a diversity of algorithms that are believed to be consistent with the true datagenerating distribution.
 A metalearner, to ensemble the base learners.
We might also incorporate
 Feature selection, to pass only a subset of the predictors to the algorithm.
 Hyperparameter specification, to tune base learners.
Learners have properties that indicate what features they support. We may use
sl3_list_properties()
to get a list of all properties supported by at least
one learner.
sl3_list_properties()
#> [1] "binomial" "categorical" "continuous" "cv"
#> [5] "density" "h2o" "ids" "importance"
#> [9] "offset" "preprocessing" "sampling" "screener"
#> [13] "timeseries" "weights" "wrapper"
Since we have a continuous outcome, we may identify the learners that support
this outcome type with sl3_list_learners()
.
sl3_list_learners("continuous")
#> [1] "Lrnr_arima" "Lrnr_bartMachine"
#> [3] "Lrnr_bayesglm" "Lrnr_bilstm"
#> [5] "Lrnr_bound" "Lrnr_caret"
#> [7] "Lrnr_cv_selector" "Lrnr_dbarts"
#> [9] "Lrnr_earth" "Lrnr_expSmooth"
#> [11] "Lrnr_gam" "Lrnr_gbm"
#> [13] "Lrnr_glm" "Lrnr_glm_fast"
#> [15] "Lrnr_glmnet" "Lrnr_grf"
#> [17] "Lrnr_gru_keras" "Lrnr_gts"
#> [19] "Lrnr_h2o_glm" "Lrnr_h2o_grid"
#> [21] "Lrnr_hal9001" "Lrnr_HarmonicReg"
#> [23] "Lrnr_hts" "Lrnr_lstm"
#> [25] "Lrnr_lstm_keras" "Lrnr_mean"
#> [27] "Lrnr_multiple_ts" "Lrnr_nnet"
#> [29] "Lrnr_nnls" "Lrnr_optim"
#> [31] "Lrnr_pkg_SuperLearner" "Lrnr_pkg_SuperLearner_method"
#> [33] "Lrnr_pkg_SuperLearner_screener" "Lrnr_polspline"
#> [35] "Lrnr_randomForest" "Lrnr_ranger"
#> [37] "Lrnr_rpart" "Lrnr_rugarch"
#> [39] "Lrnr_screener_correlation" "Lrnr_solnp"
#> [41] "Lrnr_stratified" "Lrnr_svm"
#> [43] "Lrnr_tsDyn" "Lrnr_xgboost"
Now that we have an idea of some learners, we can construct them using the
make_learner
function or the new
method.
# choose base learners
lrn_glm < make_learner(Lrnr_glm)
lrn_mean < Lrnr_mean$new()
We can customize learner hyperparameters to incorporate a diversity of
different settings. Documentation for the learners and their hyperparameters
can be found in the sl3
Learners
Reference.
lrn_lasso < make_learner(Lrnr_glmnet) # alpha default is 1
lrn_ridge < Lrnr_glmnet$new(alpha = 0)
lrn_enet.5 < make_learner(Lrnr_glmnet, alpha = 0.5)
lrn_polspline < Lrnr_polspline$new()
lrn_ranger100 < make_learner(Lrnr_ranger, num.trees = 100)
lrn_hal_faster < Lrnr_hal9001$new(max_degree = 2, reduce_basis = 0.05)
xgb_fast < Lrnr_xgboost$new() # default with nrounds = 20 is pretty fast
xgb_50 < Lrnr_xgboost$new(nrounds = 50)
We can use Lrnr_define_interactions
to define interaction terms among
covariates. The interactions should be supplied as list of character vectors,
where each vector specifies an interaction. For example, we specify
interactions below between (1) tr
(whether or not the subject received the
WASH intervention) and elec
(whether or not the subject had electricity); and
between (2) tr
and hfiacat
(the subject’s level of food security).
interactions < list(c("elec", "tr"), c("tr", "hfiacat"))
# main terms as well as the interactions above will be included
lrn_interaction < make_learner(Lrnr_define_interactions, interactions)
What we just defined above is incomplete. In order to fit learners with these
interactions, we need to create a Pipeline
. A Pipeline
is a set of learners
to be fit sequentially, where the fit from one learner is used to define the
task for the next learner. We need to create a Pipeline
with the interaction
defining learner and another learner that incorporate these terms when fitting
a model. Let’s create a learner pipeline that will fit a linear model with the
combination of main terms and interactions terms, as specified in
lrn_interaction_main
.
# we already instantiated a linear model learner above, no need to do it again
lrn_glm_interaction < make_learner(Pipeline, lrn_interaction, lrn_glm)
lrn_glm_interaction
#> [1] "Lrnr_define_interactions_TRUE"
#> [1] "Lrnr_glm_TRUE"
We can also include learners from the SuperLearner
R
package.
lrn_bayesglm < Lrnr_pkg_SuperLearner$new("SL.bayesglm")
Here is a fun trick to create customized learners over a grid of parameters.
# I like to crock pot my SLs
grid_params < list(
cost = c(0.01, 0.1, 1, 10, 100, 1000),
gamma = c(0.001, 0.01, 0.1, 1),
kernel = c("polynomial", "radial", "sigmoid"),
degree = c(1, 2, 3)
)
grid < expand.grid(grid_params, KEEP.OUT.ATTRS = FALSE)
svm_learners < apply(grid, MARGIN = 1, function(tuning_params) {
do.call(Lrnr_svm$new, as.list(tuning_params))
})
grid_params < list(
max_depth = c(2, 4, 6),
eta = c(0.001, 0.1, 0.3),
nrounds = 100
)
grid < expand.grid(grid_params, KEEP.OUT.ATTRS = FALSE)
grid
#> max_depth eta nrounds
#> 1 2 0.001 100
#> 2 4 0.001 100
#> 3 6 0.001 100
#> 4 2 0.100 100
#> 5 4 0.100 100
#> 6 6 0.100 100
#> 7 2 0.300 100
#> 8 4 0.300 100
#> 9 6 0.300 100
xgb_learners < apply(grid, MARGIN = 1, function(tuning_params) {
do.call(Lrnr_xgboost$new, as.list(tuning_params))
})
xgb_learners
#> [[1]]
#> [1] "Lrnr_xgboost_100_1_2_0.001"
#>
#> [[2]]
#> [1] "Lrnr_xgboost_100_1_4_0.001"
#>
#> [[3]]
#> [1] "Lrnr_xgboost_100_1_6_0.001"
#>
#> [[4]]
#> [1] "Lrnr_xgboost_100_1_2_0.1"
#>
#> [[5]]
#> [1] "Lrnr_xgboost_100_1_4_0.1"
#>
#> [[6]]
#> [1] "Lrnr_xgboost_100_1_6_0.1"
#>
#> [[7]]
#> [1] "Lrnr_xgboost_100_1_2_0.3"
#>
#> [[8]]
#> [1] "Lrnr_xgboost_100_1_4_0.3"
#>
#> [[9]]
#> [1] "Lrnr_xgboost_100_1_6_0.3"
Did you see Lrnr_caret
when we called sl3_list_learners(c("binomial"))
?
All we need to specify to use this popular algorithm as a candidate in our
SL is the algorithm
we want to tune, which is passed as method
to
caret::train()
. The default method for parameter selection criterion with is
set to “CV” instead of the caret::train()
default boot
. The summary metric
used to select the optimal model is RMSE
for continuous outcomes and
Accuracy
for categorical and binomial outcomes.
# Unlike xgboost, I have no idea how to tune a neural net or BART machine, so
# I let caret take the reins
lrnr_caret_nnet < make_learner(Lrnr_caret, algorithm = "nnet")
lrnr_caret_bartMachine < make_learner(Lrnr_caret,
algorithm = "bartMachine",
method = "boot", metric = "Accuracy",
tuneLength = 10
)
In order to assemble the library of learners, we need to “stack” them together.
A Stack
is a special learner and it has the same interface as all other
learners. What makes a stack special is that it combines multiple learners by
training them simultaneously, so that their predictions can be either combined
or compared.
stack < make_learner(
Stack, lrn_glm, lrn_polspline, lrn_enet.5, lrn_ridge, lrn_lasso, xgb_50
)
stack
#> [1] "Lrnr_glm_TRUE"
#> [2] "Lrnr_polspline_5"
#> [3] "Lrnr_glmnet_NULL_deviance_10_0.5_100_TRUE_FALSE"
#> [4] "Lrnr_glmnet_NULL_deviance_10_0_100_TRUE_FALSE"
#> [5] "Lrnr_glmnet_NULL_deviance_10_1_100_TRUE_FALSE"
#> [6] "Lrnr_xgboost_50_1"
We can also stack the learners by first creating a vector, and then instantiating the stack. I prefer this method, since it easily allows us to modify the names of the learners.
# named vector of learners first
learners < c(lrn_glm, lrn_polspline, lrn_enet.5, lrn_ridge, lrn_lasso, xgb_50)
names(learners) < c(
"glm", "polspline", "enet.5", "ridge", "lasso", "xgboost50"
)
# next make the stack
stack < make_learner(Stack, learners)
# now the names are pretty
stack
#> [1] "glm" "polspline" "enet.5" "ridge" "lasso" "xgboost50"
We’re jumping ahead a bit, but let’s check something out quickly. It’s straightforward, and just one more step, to set up this stack such that all of the learners will train in a crossvalidated manner.
cv_stack < Lrnr_cv$new(stack)
cv_stack
#> [1] "Lrnr_cv"
#> [1] "glm" "polspline" "enet.5" "ridge" "lasso" "xgboost50"
Screening Algorithms for Feature Selection
We can optionally select a subset of available covariates and pass only those variables to the modeling algorithm. The current set of learners that can be used for prescreening covariates is included below.

Lrnr_screener_importance
selectsnum_screen
(default = 5) covariates based on the variable importance ranking provided by thelearner
. Any learner with an “importance” method can be used inLrnr_screener_importance
; and this currently includesLrnr_ranger
,Lrnr_randomForest
, andLrnr_xgboost
. 
Lrnr_screener_coefs
, which provides screening of covariates based on the magnitude of their estimated coefficients in a (possibly regularized) GLM. Thethreshold
(default = 1e3) defines the minimum absolute size of the coefficients, and thus covariates, to be kept. Also, amax_retain
argument can be optionally provided to restrict the number of selected covariates to be no more thanmax_retain
. 
Lrnr_screener_correlation
provides covariate screening procedures by running a test of correlation (Pearson default), and then selecting the (1) top ranked variables (default), or (2) the variables with a pvalue lower than some prespecified threshold. 
Lrnr_screener_augment
augments a set of screened covariates with additional covariates that should be included by default, even if the screener did not select them. An example of how to use this screener is included below.
Let’s consider screening covariates based on their randomForest
variable
importance ranking (ordered by mean decrease in accuracy). To select the top
5 most important covariates according to this ranking, we can combine
Lrnr_screener_importance
with Lrnr_ranger
(limiting the number of trees by
setting ntree = 20
).
Hang on! Before you think it – I will confess: Bob Ross and I both know that 20 trees makes for a lonely forest, and I shouldn’t consider it, but these are the sacrifices I have to make for this chapter to build in time!
miniforest < Lrnr_ranger$new(
num.trees = 20, write.forest = FALSE,
importance = "impurity_corrected"
)
# learner must already be instantiated, we did this when we created miniforest
screen_rf < Lrnr_screener_importance$new(learner = miniforest, num_screen = 5)
screen_rf
#> [1] "Lrnr_screener_importance_5"
# which covariates are selected on the full data?
screen_rf$train(washb_task)
#> [1] "Lrnr_screener_importance_5"
#> $selected
#> [1] "aged" "month" "momedu" "asset_tv" "asset_chair"
An example of how to format Lrnr_screener_augment
is included below for
clarity.
keepme < c("aged", "momage")
# screener must already be instantiated, we did this when we created screen_rf
screen_augment_rf < Lrnr_screener_augment$new(
screener = screen_rf, default_covariates = keepme
)
screen_augment_rf
#> [1] "Lrnr_screener_augment_c(\"aged\", \"momage\")"
Selecting covariates with nonzero lasso coefficients is quite common. Let’s
construct Lrnr_screener_coefs
screener that does just that, and test it out.
# we already instantiated a lasso learner above, no need to do it again
screen_lasso < Lrnr_screener_coefs$new(learner = lrn_lasso, threshold = 0)
screen_lasso
#> [1] "Lrnr_screener_coefs_0_NULL_2"
To “pipe” only the selected covariates to the modeling algorithm, we need to
make a Pipeline
, similar to the one we built for the regression model with
interaction terms.
screen_rf_pipe < make_learner(Pipeline, screen_rf, stack)
screen_lasso_pipe < make_learner(Pipeline, screen_lasso, stack)
Now, these learners with no internal screening will be preceded by a screening step.
We also consider the original stack
, to compare how the feature selection
methods perform in comparison to the methods without feature selection.
Analogous to what we have seen before, we have to stack the pipeline and
original stack
together, so we may use them as base learners in our super
learner.
# pretty names again
learners2 < c(learners, screen_rf_pipe, screen_lasso_pipe)
names(learners2) < c(names(learners), "randomforest_screen", "lasso_screen")
fancy_stack < make_learner(Stack, learners2)
fancy_stack
#> [1] "glm" "polspline" "enet.5"
#> [4] "ridge" "lasso" "xgboost50"
#> [7] "randomforest_screen" "lasso_screen"
We will use the default
metalearner, which
uses Lrnr_solnp()
to
provide fitting procedures for a pairing of loss
function and
metalearner function.
This default metalearner selects a loss and metalearner pairing based on the
outcome type. Note that any learner can be used as a metalearner.
Now that we have made a diverse stack of base learners, we are ready to make the SL. The SL algorithm fits a metalearner on the validation set predictions/losses across all folds.
sl < make_learner(Lrnr_sl, learners = fancy_stack)
We can also use Lrnr_cv
to build a SL, crossvalidate a stack of
learners to compare performance of the learners in the stack, or crossvalidate
any single learner (see “Crossvalidation” section of this sl3
introductory tutorial).
Furthermore, we can Define New sl3
Learners which can be used
in all the places you could otherwise use any other sl3
learners, including
Pipelines
, Stacks
, and the SL.
Recall that the discrete SL, or crossvalidated selector, is a metalearner that
assigns a weight of 1 to the learner with the lowest crossvalidated empirical
risk, and weight of 0 to all other learners. This metalearner specification can
be invoked with Lrnr_cv_selector
.
discrete_sl_metalrn < Lrnr_cv_selector$new()
discrete_sl < Lrnr_sl$new(
learners = fancy_stack,
metalearner = discrete_sl_metalrn
)
3. Train the Super Learner on the machine learning task
The SL algorithm fits a metalearner on the validationset predictions in a crossvalidated manner, thereby avoiding overfitting.
Now we are ready to “train” our SL on our sl3_task
object, washb_task
.
set.seed(4197)
sl_fit < sl$train(washb_task)
4. Obtain predicted values
Now that we have fit the SL, we are ready to calculate the predicted outcome for each subject.
# we did it! now we have SL predictions
sl_preds < sl_fit$predict()
head(sl_preds)
#> [1] 0.64698 0.76514 0.64312 0.68991 0.68068 0.66422
We can also obtain a summary of the results.
sl_fit_summary < sl_fit$print()
#> [1] "SuperLearner:"
#> List of 8
#> $ glm : chr "Lrnr_glm_TRUE"
#> $ polspline : chr "Lrnr_polspline_5"
#> $ enet.5 : chr "Lrnr_glmnet_NULL_deviance_10_0.5_100_TRUE_FALSE"
#> $ ridge : chr "Lrnr_glmnet_NULL_deviance_10_0_100_TRUE_FALSE"
#> $ lasso : chr "Lrnr_glmnet_NULL_deviance_10_1_100_TRUE_FALSE"
#> $ xgboost50 : chr "Lrnr_xgboost_50_1"
#> $ randomforest_screen: chr "Pipeline(Lrnr_screener_importance_5>Stack)"
#> $ lasso_screen : chr "Pipeline(Lrnr_screener_coefs_0_NULL_2>Stack)"
#> [1] "Lrnr_solnp_TRUE_TRUE_FALSE_1e05"
#> $pars
#> [1] 0.055565 0.055551 0.055558 0.055564 0.055558 0.055583 0.055556 0.055573
#> [9] 0.055555 0.055555 0.055555 0.055546 0.055553 0.055553 0.055553 0.055552
#> [17] 0.055553 0.055516
#>
#> $convergence
#> [1] 0
#>
#> $values
#> [1] 1.01 1.01
#>
#> $lagrange
#> [,1]
#> [1,] 0.0050956
#>
#> $hessian
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
#> [1,] 1 0 0 0 0 0 0 0 0 0 0 0 0
#> [2,] 0 1 0 0 0 0 0 0 0 0 0 0 0
#> [3,] 0 0 1 0 0 0 0 0 0 0 0 0 0
#> [4,] 0 0 0 1 0 0 0 0 0 0 0 0 0
#> [5,] 0 0 0 0 1 0 0 0 0 0 0 0 0
#> [6,] 0 0 0 0 0 1 0 0 0 0 0 0 0
#> [7,] 0 0 0 0 0 0 1 0 0 0 0 0 0
#> [8,] 0 0 0 0 0 0 0 1 0 0 0 0 0
#> [9,] 0 0 0 0 0 0 0 0 1 0 0 0 0
#> [10,] 0 0 0 0 0 0 0 0 0 1 0 0 0
#> [11,] 0 0 0 0 0 0 0 0 0 0 1 0 0
#> [12,] 0 0 0 0 0 0 0 0 0 0 0 1 0
#> [13,] 0 0 0 0 0 0 0 0 0 0 0 0 1
#> [14,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [15,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [16,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [17,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [18,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [,14] [,15] [,16] [,17] [,18]
#> [1,] 0 0 0 0 0
#> [2,] 0 0 0 0 0
#> [3,] 0 0 0 0 0
#> [4,] 0 0 0 0 0
#> [5,] 0 0 0 0 0
#> [6,] 0 0 0 0 0
#> [7,] 0 0 0 0 0
#> [8,] 0 0 0 0 0
#> [9,] 0 0 0 0 0
#> [10,] 0 0 0 0 0
#> [11,] 0 0 0 0 0
#> [12,] 0 0 0 0 0
#> [13,] 0 0 0 0 0
#> [14,] 1 0 0 0 0
#> [15,] 0 1 0 0 0
#> [16,] 0 0 1 0 0
#> [17,] 0 0 0 1 0
#> [18,] 0 0 0 0 1
#>
#> $ineqx0
#> NULL
#>
#> $nfuneval
#> [1] 23
#>
#> $outer.iter
#> [1] 1
#>
#> $elapsed
#> Time difference of 0.012488 secs
#>
#> $vscale
#> [1] 1.01001 0.00001 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000
#> [10] 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000
#> [19] 1.00000 1.00000
#>
#> $coefficients
#> glm polspline
#> 0.055565 0.055551
#> enet.5 ridge
#> 0.055558 0.055564
#> lasso xgboost50
#> 0.055558 0.055583
#> randomforest_screen_glm randomforest_screen_polspline
#> 0.055556 0.055573
#> randomforest_screen_enet.5 randomforest_screen_ridge
#> 0.055555 0.055555
#> randomforest_screen_lasso randomforest_screen_xgboost50
#> 0.055555 0.055546
#> lasso_screen_glm lasso_screen_polspline
#> 0.055553 0.055553
#> lasso_screen_enet.5 lasso_screen_ridge
#> 0.055553 0.055552
#> lasso_screen_lasso lasso_screen_xgboost50
#> 0.055553 0.055516
#>
#> $training_offset
#> [1] FALSE
#>
#> $name
#> [1] "solnp"
#>
#> [1] "Crossvalidated risk:"
#> learner coefficients risk se fold_sd
#> 1: glm 0.055565 1.0202 0.023955 0.067500
#> 2: polspline 0.055551 1.0208 0.023577 0.067921
#> 3: enet.5 0.055558 1.0131 0.023598 0.065732
#> 4: ridge 0.055564 1.0153 0.023739 0.065299
#> 5: lasso 0.055558 1.0130 0.023592 0.065840
#> 6: xgboost50 0.055583 1.1136 0.025262 0.077580
#> 7: randomforest_screen_glm 0.055556 1.0173 0.023830 0.069913
#> 8: randomforest_screen_polspline 0.055573 1.0135 0.023814 0.069240
#> 9: randomforest_screen_enet.5 0.055555 1.0177 0.023842 0.070142
#> 10: randomforest_screen_ridge 0.055555 1.0176 0.023855 0.069787
#> 11: randomforest_screen_lasso 0.055555 1.0177 0.023840 0.070155
#> 12: randomforest_screen_xgboost50 0.055546 1.1277 0.026043 0.078673
#> 13: lasso_screen_glm 0.055553 1.0164 0.023542 0.065018
#> 14: lasso_screen_polspline 0.055553 1.0177 0.023520 0.065566
#> 15: lasso_screen_enet.5 0.055553 1.0163 0.023544 0.065017
#> 16: lasso_screen_ridge 0.055552 1.0166 0.023553 0.064869
#> 17: lasso_screen_lasso 0.055553 1.0163 0.023544 0.065020
#> 18: lasso_screen_xgboost50 0.055516 1.1256 0.025939 0.084270
#> 19: SuperLearner NA 1.0100 0.023524 0.068184
#> fold_min_risk fold_max_risk
#> 1: 0.89442 1.1200
#> 2: 0.89892 1.1255
#> 3: 0.88839 1.1058
#> 4: 0.88559 1.1063
#> 5: 0.88842 1.1060
#> 6: 0.96019 1.2337
#> 7: 0.88579 1.1119
#> 8: 0.89370 1.1190
#> 9: 0.88593 1.1137
#> 10: 0.88620 1.1128
#> 11: 0.88593 1.1136
#> 12: 1.00223 1.2465
#> 13: 0.90204 1.1156
#> 14: 0.89742 1.1162
#> 15: 0.90184 1.1154
#> 16: 0.90120 1.1146
#> 17: 0.90183 1.1154
#> 18: 0.96251 1.2327
#> 19: 0.88036 1.1041
From the table of the printed SL fit, we note that the SL had a mean risk of
1.01 and that
this ensemble weighted the ranger
and glmnet
learners highest while not
weighting the mean
learner highly.
We can also see that the glmnet
learner had the lowest crossvalidated mean
risk, thus making it the crossvalidated selector (or the discrete SL). The
mean risk of the SL is calculated using all of the data, and not a separate
holdout, so the SL’s mean risk that is reported here is an underestimation.
Crossvalidated Super Learner
We can crossvalidate the SL to see how well the SL performs on unseen data, and obtain an estimate of the crossvalidated risk of the SL.
This estimation procedure requires an “outer/external” layer of
crossvalidation, also called nested crossvalidation, which involves setting
aside a separate holdout sample that we don’t use to fit the SL. This external
crossvalidation procedure may also incorporate 10 folds, which is the default
in sl3
. However, we will incorporate 2 outer/external folds of
crossvalidation for computational efficiency.
We also need to specify a loss function to evaluate SL. Documentation for the
available loss functions can be found in the sl3
Loss
Function Reference.
washb_task_new < make_sl3_Task(
data = washb_data,
covariates = covars,
outcome = outcome,
folds = origami::make_folds(washb_data, fold_fun = folds_vfold, V = 2)
)
CVsl < CV_lrnr_sl(
lrnr_sl = sl_fit, task = washb_task_new, loss_fun = loss_squared_error
)
Let’s take a look at a table summarizing the performance:
if (knitr::is_latex_output()) {
CVsl %>%
kable(format = "latex")
} else if (knitr::is_html_output()) {
CVsl %>%
kable() %>%
kable_styling(fixed_thead = TRUE) %>%
scroll_box(width = "100%", height = "300px")
}
learner  coefficients  risk  se  fold_sd  fold_min_risk  fold_max_risk 

glm  0.05555  1.0494  0.02687  0.07973  0.99301  1.1058 
polspline  0.05557  1.0173  0.02354  0.06840  0.96890  1.0656 
enet.5  0.05556  1.0239  0.02413  0.07188  0.97310  1.0748 
ridge  0.05557  1.0271  0.02418  0.06797  0.97906  1.0752 
lasso  0.05556  1.0243  0.02417  0.07238  0.97314  1.0755 
xgboost50  0.05555  1.1789  0.02668  0.00524  1.17521  1.1826 
randomforest_screen_glm  0.05555  1.0324  0.02392  0.05939  0.99043  1.0744 
randomforest_screen_polspline  0.05556  1.0259  0.02372  0.07722  0.97129  1.0805 
randomforest_screen_enet.5  0.05555  1.0284  0.02390  0.06515  0.98234  1.0745 
randomforest_screen_ridge  0.05555  1.0310  0.02395  0.06113  0.98782  1.0743 
randomforest_screen_lasso  0.05555  1.0285  0.02390  0.06508  0.98244  1.0745 
randomforest_screen_xgboost50  0.05554  1.1721  0.02685  0.03247  1.14913  1.1950 
lasso_screen_glm  0.05556  1.0257  0.02381  0.05370  0.98771  1.0636 
lasso_screen_polspline  0.05556  1.0267  0.02390  0.05510  0.98771  1.0656 
lasso_screen_enet.5  0.05556  1.0261  0.02384  0.05463  0.98753  1.0648 
lasso_screen_ridge  0.05556  1.0255  0.02383  0.05439  0.98703  1.0639 
lasso_screen_lasso  0.05556  1.0262  0.02384  0.05464  0.98753  1.0648 
lasso_screen_xgboost50  0.05553  1.1519  0.02577  0.04334  1.12129  1.1826 
SuperLearner  NA  1.0187  0.02366  0.06090  0.97563  1.0618 
Variable Importance Measures with sl3
Variable importance can be interesting and informative. It can also be
contradictory and confusing. Nevertheless, we like it, and so do our
collaborators, so we created a variable importance function in sl3
! The sl3
importance
function returns a table with variables listed in decreasing order
of importance (i.e., most important on the first row).
The measure of importance in sl3
is based on a risk ratio, or risk
difference, between the learner fit with a removed, or permuted, covariate and
the learner fit with the true covariate, across all covariates. In this manner,
the larger the risk difference, the more important the variable is in the
prediction.
The intuition of this measure is that it calculates the risk (in terms of the
average loss in predictive accuracy) of losing one covariate, while keeping
everything else fixed, and compares it to the risk if the covariate was not
lost. If this risk ratio is one, or risk difference is zero, then losing that
covariate had no impact, and is thus not important by this measure. We do this
across all of the covariates. As stated above, we can remove the covariate and
refit the SL without it, or we just permute the covariate (faster, more risky)
and hope for the shuffling to distort any meaningful information that was
present in the covariate. This idea of permuting instead of removing saves a
lot of time, and is also incorporated in the randomForest
variable importance
measures. However, the permutation approach is risky, so the importance
function default is to remove and refit.
Let’s explore the sl3
variable importance measurements for the washb
data.
washb_varimp < importance(sl_fit, loss = loss_squared_error, type = "permute")
if (knitr::is_latex_output()) {
washb_varimp %>%
kable(format = "latex")
} else if (knitr::is_html_output()) {
washb_varimp %>%
kable() %>%
kable_styling(fixed_thead = TRUE) %>%
scroll_box(width = "100%", height = "300px")
}
X  risk_ratio 

aged  1.04130 
momedu  1.01392 
asset_refrig  1.00831 
asset_chair  1.00457 
month  1.00316 
momheight  1.00274 
elec  1.00204 
tr  1.00203 
Nlt18  1.00116 
momage  1.00061 
asset_chouki  1.00032 
asset_mobile  1.00030 
floor  1.00021 
delta_momheight  1.00008 
asset_table  1.00003 
Ncomp  1.00001 
sex  1.00000 
asset_moto  0.99999 
watmin  0.99997 
walls  0.99997 
delta_momage  0.99996 
roof  0.99993 
asset_tv  0.99992 
hfiacat  0.99990 
fracode  0.99983 
asset_wardrobe  0.99980 
asset_bike  0.99978 
asset_sewmach  0.99978 
asset_khat  0.99965 
# plot variable importance
importance_plot(
washb_varimp,
main = "sl3 Variable Importance for WASH Benefits Example Data"
)
6.1 Exercises
6.1.1 Predicting Myocardial Infarction with sl3
Follow the steps below to predict myocardial infarction (mi
) using the
available covariate data. We thank Prof. David Benkeser at Emory University for
making the this Cardiovascular Health Study (CHS) data accessible.
# load the data set
db_data < url(
paste0(
"https://raw.githubusercontent.com/benkeser/sllecture/master/",
"chspred.csv"
)
)
chspred < read_csv(file = db_data, col_names = TRUE)
Let’s take a quick peek at the data:
waist  alcoh  hdl  beta  smoke  ace  ldl  bmi  aspirin  gend  age  estrgn  glu  ins  cysgfr  dm  fetuina  whr  hsed  race  logcystat  logtrig  logcrp  logcre  health  logkcal  sysbp  mi 

110.164  0.0000  66.497  0  0  1  114.216  27.997  0  0  73.518  0  159.931  70.3343  75.008  1  0.17516  1.16898  1  1  0.34202  5.4063  2.01260  0.67385  0  4.3926  177.135  0 
89.976  0.0000  50.065  0  0  0  103.777  20.893  0  0  61.772  0  153.389  33.9695  82.743  1  0.57165  0.90114  0  0  0.08465  4.8592  3.29328  0.55509  1  6.2071  136.374  0 
106.194  8.4174  40.506  0  0  0  165.716  28.455  1  1  72.931  0  121.715  17.3017  74.699  0  0.35168  1.17971  0  1  0.44511  4.5088  0.30132  0.01152  0  6.7320  135.199  0 
90.057  0.0000  36.175  0  0  0  45.203  23.961  0  0  79.119  0  53.969  11.7315  95.782  0  0.54391  1.13599  0  0  0.48072  5.1832  3.02426  0.57507  1  7.3972  139.018  0 
78.614  2.9790  71.064  0  1  0  131.312  10.966  0  1  69.018  0  94.315  9.7112  72.711  0  0.49159  1.10276  1  0  0.31206  4.2190  0.70568  0.00534  1  8.2779  88.047  0 
91.659  0.0000  59.496  0  0  0  171.187  29.132  0  1  81.835  0  212.907  28.2269  69.218  1  0.46215  0.95291  1  0  0.28716  5.1773  0.97046  0.21268  1  5.9942  69.594  0 
 Create an
sl3
task, setting myocardial infarctionmi
as the outcome and using all available covariate data.  Make a library of seven relatively fast base learning algorithms (i.e., do
not consider BART or HAL). Customize hyperparameters for one of your
learners. Feel free to use learners from
sl3
orSuperLearner
. You may use the same base learning library that is presented above.  Incorporate at least one pipeline with feature selection. Any screener and learner(s) can be used.
 Fit the metalearning step with the default metalearner.
 With the metalearner and base learners, make the Super Learner (SL) and train it on the task.
 Print your SL fit by calling
print()
with$
.  Crossvalidate your SL fit to see how well it performs on unseen data. Specify a valid loss function to evaluate the SL.
 Use the
importance()
function to identify the “most important” predictor of myocardial infarction, according tosl3
importance metrics.
6.1.2 Predicting Recurrent Ischemic Stroke in an RCT with sl3
For this exercise, we will work with a random sample of 5,000 patients who
participated in the International Stroke Trial (IST). This data is described in
Chapter 3.2 of the tlverse
handbook.
 Train a SL to predict recurrent stroke
DRSISC
with the available covariate data (the 25 other variables). Of course, you can consider feature selection in the machine learning algorithms. In this data, the outcome is occasionally missing, so be sure to specifydrop_missing_outcome = TRUE
when defining the task.  Use the SLbased predictions to calculate the area under the ROC curve (AUC).
 Calculate the crossvalidated AUC to evaluate the performance of the SL on unseen data.
 Which covariates are the most predictive of 14day recurrent stroke,
according to
sl3
variable importance measures?
ist_data < paste0(
"https://raw.githubusercontent.com/tlverse/",
"tlversehandbook/master/data/ist_sample.csv"
) %>% fread()
# number 3 help
ist_task_CVsl < make_sl3_Task(
data = ist_data,
outcome = "DRSISC",
covariates = colnames(ist_data)[which(names(ist_data) == "DRSISC")],
drop_missing_outcome = TRUE,
folds = origami::make_folds(
n = sum(!is.na(ist_data$DRSISC)),
fold_fun = folds_vfold,
V = 5
)
)
6.2 Concluding Remarks
Super Learner (SL) is a general approach that can be applied to a diversity of estimation and prediction problems which can be defined by a loss function.
 It would be straightforward to plug in the estimator returned by SL into the
target parameter mapping.
 For example, suppose we are after the average treatment effect (ATE) of a binary treatment intervention: \(\Psi_0 = E_{0,W}[E_0(YA=1,W)  E_0(YA=0,W)]\).
 We could use the SL that was trained on the original data (let’s call
this
sl_fit
) to predict the outcome for all subjects under each intervention. All we would need to do is take the average difference between the counterfactual outcomes under each intervention of interest.  Considering \(\Psi_0\) above, we would first need two \(n\)length vectors of predicted outcomes under each intervention. One vector would represent the predicted outcomes under an intervention that sets all subjects to receive \(A=1\), \(Y_iA_i=1,W_i\) for all \(i=1,\ldots,n\). The other vector would represent the predicted outcomes under an intervention that sets all subjects to receive \(A=0\), \(Y_iA_i=0,W_i\) for all \(i=1,\ldots,n\).
 After obtaining these vectors of counterfactual predicted outcomes, all we would need to do is average and then take the difference in order to “plugin” the SL estimator into the target parameter mapping.
 In
sl3
and with our current ATE example, this could be achieved withmean(sl_fit$predict(A1_task))  mean(sl_fit$predict(A0_task))
; whereA1_task$data
would contain all 1’s (or the level that pertains to receiving the treatment) for the treatment column in the data (keeping all else the same), andA0_task$data
would contain all 0’s (or the level that pertains to not receiving the treatment) for the treatment column in the data.
 It’s a worthwhile exercise to obtain the predicted counterfactual outcomes
and create these counterfactual
sl3
tasks. It’s too biased; however, to plug the SL fit into the target parameter mapping, (e.g., calling the result ofmean(sl_fit$predict(A1_task))  mean(sl_fit$predict(A0_task))
the estimated ATE. We would end up with an estimator for the ATE that was optimized for estimation of the prediction function, and not the ATE! 
At the end of the “analysis day”, we want an estimator that is optimized for our target estimand of interest. We ultimately care about doing a good job estimating \(\psi_0\). The SL is an essential step to help us get there. In fact, we will use the counterfactual predicted outcomes that were explained at length above. However, SL is not the end of the estimation procedure. Specifically, the Super Learner would not be an asymptotically linear estimator of the target estimand; and it is not an efficient substitution estimator. This begs the question, why is it so important for an estimator to possess these properties?
 An asymptotically linear estimator converges to the estimand a \(\frac{1}{\sqrt{n}}\) rate, thereby permitting formal statistical inference (i.e., confidence intervals and \(p\)values).
 Substitution, or plugin, estimators of the estimand are desirable because they respect both the local and global constraints of the statistical model (e.g., bounds), and have they have better finitesample properties.

An efficient estimator is optimal in the sense that it has the lowest possible variance, and is thus the most precise. An estimator is efficient if and only if is asymptotically linear with influence curve equal to the canonical gradient.
 The canonical gradient is a mathematical object that is specific to the target estimand, and it provides information on the level of difficulty of the estimation problem. Various canonical gradient are shown in the chapters that follow.
 Practitioner’s do not need to know how to calculate a canonical gradient in order to understand efficiency and use Targeted Maximum Likelihood Estimation (TMLE). Metaphorically, you do not need to be Yoda in order to be a Jedi.
 TMLE is a general strategy that succeeds in constructing efficient and asymptotically linear plugin estimators.
 SL is fantastic for pure prediction, and for obtaining an initial estimate in the first step of TMLE, but we need the second step of TMLE to have the desirable statistical properties mentioned above.
In the chapters that follow, we focus on the targeted maximum likelihood estimator and the targeted minimum lossbased estimator, both referred to as TMLE.
Appendix
6.2.1 Exercise 1 Solution
Here is a potential solution to the sl3
Exercise 1 – Predicting Myocardial
Infarction with sl3
.
db_data < url(
"https://raw.githubusercontent.com/benkeser/sllecture/master/chspred.csv"
)
chspred < read_csv(file = db_data, col_names = TRUE)
# make task
chspred_task < make_sl3_Task(
data = chspred,
covariates = head(colnames(chspred), 1),
outcome = "mi"
)
# make learners
glm_learner < Lrnr_glm$new()
lasso_learner < Lrnr_glmnet$new(alpha = 1)
ridge_learner < Lrnr_glmnet$new(alpha = 0)
enet_learner < Lrnr_glmnet$new(alpha = 0.5)
# curated_glm_learner uses formula = "mi ~ smoke + beta + waist"
curated_glm_learner < Lrnr_glm_fast$new(covariates = c("smoke, beta, waist"))
mean_learner < Lrnr_mean$new() # That is one mean learner!
glm_fast_learner < Lrnr_glm_fast$new()
ranger_learner < Lrnr_ranger$new()
svm_learner < Lrnr_svm$new()
xgb_learner < Lrnr_xgboost$new()
# screening
screen_cor < make_learner(Lrnr_screener_correlation)
glm_pipeline < make_learner(Pipeline, screen_cor, glm_learner)
# stack learners together
stack < make_learner(
Stack,
glm_pipeline, glm_learner,
lasso_learner, ridge_learner, enet_learner,
curated_glm_learner, mean_learner, glm_fast_learner,
ranger_learner, svm_learner, xgb_learner
)
# make and train SL
sl < Lrnr_sl$new(
learners = stack
)
sl_fit < sl$train(chspred_task)
sl_fit$print()
CVsl < CV_lrnr_sl(sl_fit, chspred_task, loss_loglik_binomial)
CVsl
varimp < importance(sl_fit, type = "permute")
varimp %>%
importance_plot(
main = "sl3 Variable Importance for Myocardial Infarction Prediction"
)
6.2.2 Exercise 2 Solution
Here is a potential solution to sl3
Exercise 2 – Predicting Recurrent
Ischemic Stroke in an RCT with sl3
.
library(ROCR) # for AUC calculation
ist_data < paste0(
"https://raw.githubusercontent.com/tlverse/",
"tlversehandbook/master/data/ist_sample.csv"
) %>% fread()
# stack
ist_task < make_sl3_Task(
data = ist_data,
outcome = "DRSISC",
covariates = colnames(ist_data)[which(names(ist_data) == "DRSISC")],
drop_missing_outcome = TRUE
)
# learner library
lrn_glm < Lrnr_glm$new()
lrn_lasso < Lrnr_glmnet$new(alpha = 1)
lrn_ridge < Lrnr_glmnet$new(alpha = 0)
lrn_enet < Lrnr_glmnet$new(alpha = 0.5)
lrn_mean < Lrnr_mean$new()
lrn_ranger < Lrnr_ranger$new()
lrn_svm < Lrnr_svm$new()
# xgboost grid
grid_params < list(
max_depth = c(2, 5, 8),
eta = c(0.01, 0.15, 0.3)
)
grid < expand.grid(grid_params, KEEP.OUT.ATTRS = FALSE)
params_default < list(nthread = getOption("sl.cores.learners", 1))
xgb_learners < apply(grid, MARGIN = 1, function(params_tune) {
do.call(Lrnr_xgboost$new, c(params_default, as.list(params_tune)))
})
learners < unlist(list(
xgb_learners, lrn_ridge, lrn_mean, lrn_lasso,
lrn_glm, lrn_enet, lrn_ranger, lrn_svm
),
recursive = TRUE
)
# SL
sl < Lrnr_sl$new(learners)
sl_fit < sl$train(ist_task)
# AUC
preds < sl_fit$predict()
obs < c(na.omit(ist_data$DRSISC))
AUC < performance(prediction(sl_preds, obs), measure = "auc")@y.values[[1]]
plot(performance(prediction(sl_preds, obs), "tpr", "fpr"))
# CVsl
ist_task_CVsl < make_sl3_Task(
data = ist_data,
outcome = "DRSISC",
covariates = colnames(ist_data)[which(names(ist_data) == "DRSISC")],
drop_missing_outcome = TRUE,
folds = origami::make_folds(
n = sum(!is.na(ist_data$DRSISC)),
fold_fun = folds_vfold,
V = 5
)
)
CVsl < CV_lrnr_sl(sl_fit, ist_task_CVsl, loss_loglik_binomial)
CVsl
# sl3 variable importance plot
ist_varimp < importance(sl_fit, type = "permute")
ist_varimp %>%
importance_plot(
main = "Variable Importance for Predicting Recurrent Ischemic Stroke"
)