Chapter 4 Super (Ensemble Machine) Learning

Based on the sl3 R package by Jeremy Coyle, Nima Hejazi, Ivana Malenica, and Oleg Sofrygin.

Updated: 2019-12-06

4.1 Introduction

Once the statistical estimation problem is defined, as described in The Targeted Learning Roadmap, we are ready to construct the TMLE: an asymptotically efficient substitution estimator of this target quantity.

The first step in the estimation procedure is an initial estimate of the data-generating distribution, or the relevant part of this distribution that is needed to evaluate the target parameter. For this initial estimation, we use the super learner (van der Laan, Polley, and Hubbard 2007), an important step for creating a robust estimator.

4.1.1 Super Learning

  • A common task in statistical data analysis is estimator selection (e.g., for prediction).
  • There is no universally optimal machine learning algorithm for density estimation or prediction.
  • For some data, one needs learners that can model a complex function.
  • For others, possibly as a result of noise or insufficient sample size, a simple, parametric model might fit best.
  • Super Learner, an ensemble learner, solves this issue by allowing a combination of learners, from the simplest (intercept-only) to the most complex (neural nets, random forests, SVMs, etc.).
  • It works by using cross-validation in a manner that guarantees, asymptotically, that the resulting fit will be as good as the best of the learners provided.
  • Note: even a combination of poor learners can sometimes result in a good fit. It is very important to have good candidates in our library, possibly incorporating subject-matter knowledge about the system in question.

4.1.1.1 General Overview of the Algorithm

What is cross-validation and how does it work?

  • There are many different cross-validation schemes, designed to accommodate different study designs and data structures.
  • The figure below shows an example of 10-fold cross-validation.

General step-by-step overview of the Super Learner algorithm:

  • Break up the sample evenly into V-folds (say V=10).
  • For each of these 10 folds, set that portion of the sample aside as the validation sample and use the remaining observations to fit the learners (the training sample).
  • Fit each learner on the training sample (note, some learners will have their own internal cross-validation procedure or other methods to select tuning parameters).
  • For each observation in the corresponding validation sample, predict the outcome using each of the learners, so if there are \(p\) learners, then there would be \(p\) predictions.
  • Set aside the next validation sample and repeat until each of the V sets of data has served as the validation sample.
  • Compare the cross-validated fit of the learners across all observations based on specified loss function (e.g., squared error, negative log-likelihood, …) by calculating the corresponding average loss (risk).
  • Either:

    • choose the learner with the smallest risk and apply that learner to the entire data set (the resulting SL fit), or
    • compute a weighted average of the learners that minimizes the cross-validated risk (construct an ensemble of learners), by

      • re-fitting the learners on the original data set, and
      • using the weights above to obtain the SL fit.

Note, this entire procedure can be itself cross-validated to get a consistent estimate of the future performance of the SL fit.
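The step-by-step procedure above, specialized to the cross-validated selector (discrete super learner), can be sketched in a few lines of base R. This is a toy illustration of the algorithm with two candidate learners and squared-error loss, not sl3 code:

```r
set.seed(1)
n <- 500
W <- runif(n, -2, 2)
Y <- sin(W) + rnorm(n, sd = 0.3)
dat <- data.frame(W = W, Y = Y)

# break the sample into V folds
V <- 10
folds <- sample(rep(1:V, length.out = n))

# candidate learners: intercept-only mean vs. a cubic polynomial fit
learners <- list(
  mean = function(train) lm(Y ~ 1, data = train),
  poly = function(train) lm(Y ~ poly(W, 3), data = train)
)

# for each learner: fit on each training sample, predict on the held-out
# validation sample, and average the loss across folds (the CV risk)
cv_risk <- sapply(learners, function(fit_fun) {
  fold_losses <- sapply(1:V, function(v) {
    fit <- fit_fun(dat[folds != v, ])
    preds <- predict(fit, newdata = dat[folds == v, ])
    mean((dat$Y[folds == v] - preds)^2)  # squared-error loss
  })
  mean(fold_losses)
})

# the discrete super learner refits the risk-minimizing learner on all data
best <- names(which.min(cv_risk))
sl_fit <- learners[[best]](dat)
```

The full super learner would instead fit a weighted combination of the candidates, with weights chosen to minimize the cross-validated risk.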

How to pick a Super Learner library?

  • A library is simply a collection of algorithms.
  • The algorithms in the library should come from contextual knowledge and a large set of “default” algorithms.
  • The algorithms may range from a simple linear regression model to multi-step algorithms involving screening covariates, penalizations, optimizing tuning parameters, etc.

4.1.1.2 Example: Super Learner In Prediction

  • We observe a learning data set \(X_i=(Y_i,W_i)\), for \(i=1, ..., n\).
  • Here, \(Y_i\) is the outcome of interest, and \(W_i\) is a p-dimensional set of covariates.
  • Our objective is to estimate the function \(\psi_0(W) = E(Y|W)\).
  • This function can be expressed as the minimizer of the expected loss: \(\psi_0(W) = \text{argmin}_{\psi} E[L(X,\psi(W))]\).
  • Here, the loss function is denoted \(L\) (e.g., the squared error loss, \(L(X, \psi(W)) = (Y-\psi(W))^2\)).

4.1.1.3 Why use the Super Learner?

  • For prediction, one can use the cross-validated risk to empirically determine the relative performance of SL and competing methods.
  • When we have tested different algorithms on actual data and examined their performance (e.g., MSE of prediction), no single algorithm always wins (see below).
  • Below shows the results of such a study, comparing the fits of several different learners, including the SL algorithms.
  • Super Learner performs asymptotically as well as the best possible weighted combination of the candidate learners.
  • By including all competitors in the library of candidate estimators (glm, neural nets, SVMs, random forests, etc.), the Super Learner will asymptotically outperform any of its competitors, even if the set of competitors is allowed to grow polynomially in sample size.
  • Motivates the name “Super Learner”: it provides a system of combining many estimators into an improved estimator.

Review of the Super Learner

  • Loss-function-based tool that uses V-fold cross-validation to obtain the best prediction of the relevant part of the likelihood needed to evaluate the target parameter.

  • Requires expressing the estimand as the minimizer of an expected loss, and proposing a library of algorithms (“learners” in sl3 nomenclature) that we think might be consistent with the true data-generating distribution.

  • The discrete super learner, or cross-validated selector, is the algorithm in the library that minimizes the V-fold cross-validated empirical risk.

  • The super learner is a weighted average of the library of algorithms, where the weights are chosen to minimize the V-fold cross-validated empirical risk of the library. The weight-fitting procedure is the “metalearner” in sl3 nomenclature; restricting the weights to be non-negative and sum to one (a convex combination) has been shown to improve upon the discrete super learner (Polley and van der Laan 2010; van der Laan, Polley, and Hubbard 2007).

  • Proven to be asymptotically as accurate as the best possible prediction algorithm that is tested (van der Laan and Dudoit 2003; van der Vaart, Dudoit, and van der Laan 2006).

  • This background material is described in greater detail in the accompanying tlverse handbook sl3 chapter.

4.2 sl3 “Microwave Dinner” Implementation

We begin by illustrating the core functionality of the super learner algorithm as implemented in sl3. For those who are interested in the internals of sl3, see this sl3 introductory tutorial.

The sl3 implementation consists of the following steps:

  0. Load the necessary libraries and data
  1. Define the machine learning task
  2. Make a super learner by creating a library of base learners and a metalearner
  3. Train the super learner on the machine learning task
  4. Obtain predicted values

International Stroke Trial Example

Using the IST data, we are interested in predicting recurrent stroke (DRSISC) using the available covariate data.

0. Load the necessary libraries and data

RDELAY RCONSC SEX AGE RSLEEP RATRIAL RCT RVISINF RHEP24 RASP3 RSBP RDEF1 RDEF2 RDEF3 RDEF4 RDEF5 RDEF6 RDEF7 RDEF8 STYPE RXHEP REGION MISSING_RATRIAL_RASP3 MISSING_RHEP24 RXASP DRSISC
46 F F 85 N N N N Y N 150 N Y N N N N N N PACS N Europe and Central Asia 0 0 0 0
33 F M 71 Y Y Y Y N Y 180 Y Y Y Y Y N N N TACS L East Asia and Pacific 0 0 0 0
6 D M 88 N Y N N N N 140 Y Y Y C C C C C PACS N Europe and Central Asia 0 0 0 0
8 F F 68 Y N Y Y N N 118 Y Y N N N N N N LACS M Europe and Central Asia 0 0 0 0
13 F M 60 N N Y N N N 140 Y Y Y Y N N Y Y POCS N Europe and Central Asia 0 0 1 0
16 F F 71 Y N Y N N N 160 N Y N N N N N N PACS N Europe and Central Asia 0 0 1 0

1. Define the machine learning task

To define the machine learning “task” (predict stroke DRSISC using the available covariate data), we need to create an sl3_Task object.

The sl3_Task keeps track of the roles the variables play in the machine learning problem, the data, and any metadata (e.g., observation-level weights, id, offset).

Since we are not interested in predicting missing outcomes, we set drop_missing_outcome = TRUE when creating the task. In the next chapter, we estimate this missingness mechanism and account for it in the estimation.
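The task construction might look like the following sketch (the data object name ist_data is an assumption; the output it produces is shown below):

```r
library(sl3)

# covariates are all columns except the outcome
outcome <- "DRSISC"
covars <- setdiff(colnames(ist_data), outcome)

ist_task <- make_sl3_Task(
  data = ist_data,
  covariates = covars,
  outcome = outcome,
  drop_missing_outcome = TRUE  # drop rows with missing outcomes
)
ist_task
```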

Warning in process_data(data, nodes, column_names = column_names, flag = flag, :
Missing outcome data detected: dropping outcomes.
A sl3 Task with 4990 obs and these nodes:
$covariates
 [1] "RDELAY"                "RCONSC"                "SEX"                  
 [4] "AGE"                   "RSLEEP"                "RATRIAL"              
 [7] "RCT"                   "RVISINF"               "RHEP24"               
[10] "RASP3"                 "RSBP"                  "RDEF1"                
[13] "RDEF2"                 "RDEF3"                 "RDEF4"                
[16] "RDEF5"                 "RDEF6"                 "RDEF7"                
[19] "RDEF8"                 "STYPE"                 "RXHEP"                
[22] "REGION"                "MISSING_RATRIAL_RASP3" "MISSING_RHEP24"       
[25] "RXASP"                

$outcome
[1] "DRSISC"

$id
NULL

$weights
NULL

$offset
NULL

2. Make a super learner

Now that we have defined our machine learning problem with the task, we are ready to “make” the super learner. This requires specification of

  • Base learning algorithms, to establish a library of learners that we think might be consistent with the true data-generating distribution.
  • Metalearner, to ensemble the base learners.

We might also incorporate

  • Feature selection, to pass only a subset of the predictors to the algorithm.
  • Hyperparameter specification, to tune base learners.

Learners have properties that indicate what features they support. We may use sl3_list_properties() to get a list of all properties supported by at least one learner.

 [1] "binomial"             "categorical"          "continuous"          
 [4] "cv"                   "density"              "ids"                 
 [7] "multivariate_outcome" "offset"               "preprocessing"       
[10] "timeseries"           "weights"              "wrapper"             

Since we have a binomial outcome, we may identify the learners that support this outcome type with sl3_list_learners().

 [1] "Lrnr_bartMachine"               "Lrnr_caret"                    
 [3] "Lrnr_dbarts"                    "Lrnr_earth"                    
 [5] "Lrnr_gam"                       "Lrnr_gbm"                      
 [7] "Lrnr_glm"                       "Lrnr_glm_fast"                 
 [9] "Lrnr_glmnet"                    "Lrnr_grf"                      
[11] "Lrnr_h2o_glm"                   "Lrnr_h2o_grid"                 
[13] "Lrnr_hal9001"                   "Lrnr_mean"                     
[15] "Lrnr_optim"                     "Lrnr_pkg_SuperLearner"         
[17] "Lrnr_pkg_SuperLearner_method"   "Lrnr_pkg_SuperLearner_screener"
[19] "Lrnr_polspline"                 "Lrnr_randomForest"             
[21] "Lrnr_ranger"                    "Lrnr_rpart"                    
[23] "Lrnr_screener_corP"             "Lrnr_screener_corRank"         
[25] "Lrnr_screener_randomForest"     "Lrnr_solnp"                    
[27] "Lrnr_stratified"                "Lrnr_svm"                      
[29] "Lrnr_xgboost"                  

Now that we have an idea of some learners, we can construct them using the make_learner function.

We can customize learner hyperparameters to incorporate a diversity of different settings. Documentation for the learners and their hyperparameters can be found in the sl3 Learners Reference.
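For instance, a handful of base learners with customized glmnet hyperparameters might be constructed along these lines (a sketch; the mixing-parameter values are illustrative):

```r
library(sl3)

# individual learners are instantiated with make_learner(); extra
# arguments are passed to the learner as hyperparameters
lrn_glm  <- make_learner(Lrnr_glm)
lrn_mean <- make_learner(Lrnr_mean)

# vary glmnet's elastic-net mixing parameter to get ridge, elastic net,
# and lasso variants of penalized regression
lrn_ridge <- make_learner(Lrnr_glmnet, alpha = 0)
lrn_enet  <- make_learner(Lrnr_glmnet, alpha = 0.5)
lrn_lasso <- make_learner(Lrnr_glmnet, alpha = 1)
```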

We can also include learners from the SuperLearner R package.

Here is a fun trick to create customized learners over a grid of parameters.
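For example, a grid of xgboost learners could be generated like so (a sketch; the tuning values are arbitrary, and the hyperparameter names follow xgboost):

```r
library(sl3)

# every combination of the tuning parameters becomes its own learner
grid_params <- list(max_depth = c(2, 5, 8), eta = c(0.01, 0.1, 0.3))
grid <- expand.grid(grid_params, KEEP.OUT.ATTRS = FALSE)

xgb_learners <- apply(grid, MARGIN = 1, function(tuning_params) {
  do.call(Lrnr_xgboost$new, as.list(tuning_params))
})
```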

Did you see Lrnr_caret when we called sl3_list_learners(c("binomial"))? All we need to specify is the algorithm to use, which is passed as method to caret::train(). The default method for the parameter selection criterion is set to “CV”, instead of the caret::train() default, “boot”. The summary metric used to select the optimal model is RMSE for continuous outcomes and Accuracy for categorical and binomial outcomes.

In order to assemble the library of learners, we need to “stack” them together.

A Stack is a special learner with the same interface as all other learners. What makes a stack special is that it combines multiple learners by training them simultaneously, so that their predictions can be either combined or compared.
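A minimal stack might look like this (a sketch with a few illustrative learners):

```r
library(sl3)

# a Stack is itself a learner; training the stack trains every learner in it
stack <- make_learner(
  Stack,
  make_learner(Lrnr_glm),
  make_learner(Lrnr_mean),
  make_learner(Lrnr_glmnet, alpha = 1)
)
```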

We can optionally select a subset of available covariates and pass only those variables to the modeling algorithm.

Let’s consider screening covariates based on their randomForest variable importance ranking (ordered by mean decrease in accuracy).

[1] "Lrnr_screener_randomForest_5_100"
$selected
[1] "RDELAY" "SEX"    "AGE"    "RSLEEP" "RSBP"  

To “pipe” only the selected covariates to the modeling algorithm, we need to make a Pipeline, which is just a set of learners to be fit sequentially, where the fit from one learner is used to define the task for the next learner.

Now our learners will be preceded by a screening step.
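A sketch of the screener-then-stack pipeline (the screener's argument names, nVar and ntree, are assumptions here, chosen to match the learner name printed above):

```r
library(sl3)

# screen covariates with a random forest importance ranking, then pipe
# only the selected covariates into the learner stack
screener <- make_learner(Lrnr_screener_randomForest, nVar = 5, ntree = 100)
stack <- make_learner(Stack, make_learner(Lrnr_glm), make_learner(Lrnr_mean))

# the Pipeline fits the screener first; its output defines the task
# (reduced covariate set) seen by the stack
screened_stack <- make_learner(Pipeline, screener, stack)
```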

We also consider the original stack, just to compare how the feature selection methods perform in comparison to the methods without feature selection.

Analogous to what we have seen before, we have to stack the pipeline and original stack together, so we may use them as base learners in our super learner.

We will use the default metalearner, which uses Lrnr_solnp() to provide the fitting procedure and chooses the pairing of loss function and metalearner function based on the outcome type. Note that any learner can be used as a metalearner.

We have made a library/stack of base learners, so we are ready to make the super learner. The super learner algorithm fits a metalearner on the validation-set predictions.
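Making the super learner might look like the following (a sketch, shown with a minimal stack; omitting the metalearner argument uses the default pairing for the outcome type):

```r
library(sl3)

# a stack of base learners (illustrative)
stack <- make_learner(Stack, make_learner(Lrnr_glm), make_learner(Lrnr_mean))

# the super learner cross-validates the stack and fits the metalearner
# on the validation-set predictions
sl <- make_learner(Lrnr_sl, learners = stack)
```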

We can also use Lrnr_cv to build a super learner, cross-validate a stack of learners to compare performance of the learners in the stack, or cross-validate any single learner (see “Cross-validation” section of this sl3 introductory tutorial).

Furthermore, we can Define New sl3 Learners which can be used in all the places you could otherwise use any other sl3 learners, including Pipelines, Stacks, and the Super Learner.

3. Train the super learner on the machine learning task

Now we are ready to “train” our super learner on our sl3_task object, ist_task.
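Training is a single call (the object names sl and ist_task are assumed to be the super learner and task built in the prior steps):

```r
# train() runs the full cross-validation and metalearning procedure
sl_fit <- sl$train(ist_task)
```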

4. Obtain predicted values

Now that we have fit the super learner, we are ready to obtain our predicted values, and we can also obtain a summary of the results.
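A sketch of obtaining predictions and a summary (assuming the trained fit is named sl_fit; the output it produces is shown below):

```r
# with no arguments, predict() returns predictions for the task the
# learner was trained on
preds <- sl_fit$predict()
head(preds)

# a summary of the fit, including the metalearner coefficients and the
# cross-validated risk of each learner
sl_fit$print()
```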

[1] 0.02307531 0.02822206 0.01917227 0.02306946 0.02092219 0.02683632
[1] "SuperLearner:"
List of 2
 $ : chr "Pipeline(Lrnr_screener_randomForest_5_100->Stack)"
 $ : chr "Stack"
[1] "Lrnr_solnp_TRUE_TRUE_FALSE_1e-05"
$pars
 [1] 0.10000127 0.10000283 0.09999962 0.10000085 0.09999499 0.10001846
 [7] 0.10000283 0.09999529 0.10000012 0.09998375

$convergence
[1] 0

$values
[1] 0.0230736 0.0230736

$lagrange
            [,1]
[1,] 0.000182045

$hessian
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    1    0    0    0    0    0    0    0    0     0
 [2,]    0    1    0    0    0    0    0    0    0     0
 [3,]    0    0    1    0    0    0    0    0    0     0
 [4,]    0    0    0    1    0    0    0    0    0     0
 [5,]    0    0    0    0    1    0    0    0    0     0
 [6,]    0    0    0    0    0    1    0    0    0     0
 [7,]    0    0    0    0    0    0    1    0    0     0
 [8,]    0    0    0    0    0    0    0    1    0     0
 [9,]    0    0    0    0    0    0    0    0    1     0
[10,]    0    0    0    0    0    0    0    0    0     1

$ineqx0
NULL

$nfuneval
[1] 15

$outer.iter
[1] 1

$elapsed
Time difference of 0.03249311 secs

$vscale
 [1] 0.0230736 0.0000100 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
 [8] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000

$coefficients
                            Pipeline(Lrnr_screener_randomForest_5_100->Stack)_Lrnr_glm_TRUE 
                                                                                 0.10000127 
                                Pipeline(Lrnr_screener_randomForest_5_100->Stack)_Lrnr_mean 
                                                                                 0.10000283 
  Pipeline(Lrnr_screener_randomForest_5_100->Stack)_Lrnr_glmnet_NULL_deviance_10_1_100_TRUE 
                                                                                 0.09999962 
  Pipeline(Lrnr_screener_randomForest_5_100->Stack)_Lrnr_glmnet_NULL_deviance_10_0_100_TRUE 
                                                                                 0.10000085 
Pipeline(Lrnr_screener_randomForest_5_100->Stack)_Lrnr_glmnet_NULL_deviance_10_0.5_100_TRUE 
                                                                                 0.09999499 
                                                                        Stack_Lrnr_glm_TRUE 
                                                                                 0.10001846 
                                                                            Stack_Lrnr_mean 
                                                                                 0.10000283 
                                              Stack_Lrnr_glmnet_NULL_deviance_10_1_100_TRUE 
                                                                                 0.09999529 
                                              Stack_Lrnr_glmnet_NULL_deviance_10_0_100_TRUE 
                                                                                 0.10000012 
                                            Stack_Lrnr_glmnet_NULL_deviance_10_0.5_100_TRUE 
                                                                                 0.09998375 

$training_offset
[1] FALSE

$name
[1] "solnp"

[1] "Cross-validated risk (MSE, squared error loss):"
                                                                                        learner
 1:                             Pipeline(Lrnr_screener_randomForest_5_100->Stack)_Lrnr_glm_TRUE
 2:                                 Pipeline(Lrnr_screener_randomForest_5_100->Stack)_Lrnr_mean
 3:   Pipeline(Lrnr_screener_randomForest_5_100->Stack)_Lrnr_glmnet_NULL_deviance_10_1_100_TRUE
 4:   Pipeline(Lrnr_screener_randomForest_5_100->Stack)_Lrnr_glmnet_NULL_deviance_10_0_100_TRUE
 5: Pipeline(Lrnr_screener_randomForest_5_100->Stack)_Lrnr_glmnet_NULL_deviance_10_0.5_100_TRUE
 6:                                                                         Stack_Lrnr_glm_TRUE
 7:                                                                             Stack_Lrnr_mean
 8:                                               Stack_Lrnr_glmnet_NULL_deviance_10_1_100_TRUE
 9:                                               Stack_Lrnr_glmnet_NULL_deviance_10_0_100_TRUE
10:                                             Stack_Lrnr_glmnet_NULL_deviance_10_0.5_100_TRUE
11:                                                                                SuperLearner
    coefficients  mean_risk     SE_risk     fold_SD fold_min_risk fold_max_risk
 1:   0.10000127 0.02310653 0.002047872 0.005298033    0.01765364    0.03494974
 2:   0.10000283 0.02309466 0.002050044 0.005317404    0.01774964    0.03496153
 3:   0.09999962 0.02310366 0.002050655 0.005310963    0.01774964    0.03496153
 4:   0.10000085 0.02309954 0.002050446 0.005313563    0.01774964    0.03496123
 5:   0.09999499 0.02311787 0.002051626 0.005302995    0.01774964    0.03496153
 6:   0.10001846 0.02336042 0.002031224 0.005202322    0.01753999    0.03491601
 7:   0.10000283 0.02309466 0.002050044 0.005317404    0.01774964    0.03496153
 8:   0.09999529 0.02313214 0.002046856 0.005284860    0.01770783    0.03485090
 9:   0.10000012 0.02310411 0.002042884 0.005227175    0.01769421    0.03471470
10:   0.09998375 0.02317824 0.002049707 0.005265698    0.01770753    0.03486204
11:           NA 0.02307360 0.002051115 0.005308319    0.01764294    0.03492837

4.3 Extensions

4.3.1 Cross-validated Super Learner

We can cross-validate the super learner to see how well the super learner performs on unseen data, and obtain an estimate of the cross-validated risk of the super learner.

This estimation procedure requires an “external” layer of cross-validation, also called nested cross-validation, which involves setting aside a separate holdout sample that we don’t use to fit the super learner. This external cross-validation procedure may also incorporate 10 folds, which is the default in sl3. However, we will incorporate 2 outer/external folds of cross-validation for computational efficiency.

We also need to specify a loss function to evaluate the super learner. Documentation for the available loss functions can be found in the sl3 Loss Function Reference.
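A sketch of this procedure, with object names carried over from the earlier steps (ist_data, covars, sl_fit); treat the exact CV_lrnr_sl call as illustrative:

```r
library(sl3)

# a task with 2 external folds for the outer layer of cross-validation
ist_task_2folds <- make_sl3_Task(
  data = ist_data, covariates = covars, outcome = "DRSISC",
  drop_missing_outcome = TRUE,
  folds = origami::make_folds(ist_data, fold_fun = origami::folds_vfold, V = 2)
)

# cross-validate the fitted super learner under squared-error loss
cv_sl_fit <- CV_lrnr_sl(sl_fit, ist_task_2folds, loss_squared_error)
cv_sl_fit
```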

learner coefficients mean_risk SE_risk fold_SD fold_min_risk fold_max_risk
Pipeline(Lrnr_screener_randomForest_5_100->Stack)_Lrnr_glm_TRUE 0.0999914 0.0232048 0.0020484 0.0023337 0.0215546 0.0248550
Pipeline(Lrnr_screener_randomForest_5_100->Stack)_Lrnr_mean 0.1000060 0.0230958 0.0020500 0.0021600 0.0215684 0.0246232
Pipeline(Lrnr_screener_randomForest_5_100->Stack)_Lrnr_glmnet_NULL_deviance_10_1_100_TRUE 0.0999876 0.0231530 0.0020508 0.0022409 0.0215684 0.0247376
Pipeline(Lrnr_screener_randomForest_5_100->Stack)_Lrnr_glmnet_NULL_deviance_10_0_100_TRUE 0.0999936 0.0231230 0.0020504 0.0021984 0.0215684 0.0246775
Pipeline(Lrnr_screener_randomForest_5_100->Stack)_Lrnr_glmnet_NULL_deviance_10_0.5_100_TRUE 0.0999880 0.0231526 0.0020507 0.0022403 0.0215684 0.0247368
Stack_Lrnr_glm_TRUE 0.0999883 0.0235503 0.0020280 0.0022289 0.0219743 0.0251264
Stack_Lrnr_mean 0.1000060 0.0230958 0.0020500 0.0021600 0.0215684 0.0246232
Stack_Lrnr_glmnet_NULL_deviance_10_1_100_TRUE 0.1000127 0.0230849 0.0020459 0.0021755 0.0215466 0.0246232
Stack_Lrnr_glmnet_NULL_deviance_10_0_100_TRUE 0.1000131 0.0230590 0.0020452 0.0021860 0.0215133 0.0246047
Stack_Lrnr_glmnet_NULL_deviance_10_0.5_100_TRUE 0.1000132 0.0230842 0.0020456 0.0021765 0.0215451 0.0246232
SuperLearner NA 0.0230833 0.0020537 0.0021633 0.0215536 0.0246130

4.3.2 Variable Importance Measures with sl3

The sl3 varimp function returns a table with variables listed in decreasing order of importance, in which each covariate's importance is measured by the difference in risk between the learner fit evaluated with that covariate permuted and the fit evaluated with the original covariate.

In this manner, the larger the risk difference, the more important the variable is in the prediction.
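A sketch of the call producing the table below (assuming the trained super learner fit is named sl_fit; treat the varimp signature as illustrative):

```r
# permutation-based variable importance under squared-error loss
ist_varimp <- varimp(sl_fit, loss_squared_error)
ist_varimp
```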

covariate risk_diff
STYPE 8.58e-05
RDEF7 6.03e-05
RCT 1.59e-05
RCONSC 1.20e-05
RSBP 5.30e-06
RASP3 5.20e-06
REGION 5.10e-06
RDEF3 4.20e-06
RDEF6 2.90e-06
RDEF8 2.80e-06
RXASP 2.70e-06
RXHEP 1.90e-06
SEX 9.00e-07
AGE 7.00e-07
RDEF1 7.00e-07
RSLEEP 5.00e-07
MISSING_RHEP24 -1.00e-07
MISSING_RATRIAL_RASP3 -3.00e-07
RDEF4 -1.50e-06
RVISINF -1.60e-06
RDEF5 -4.50e-06
RATRIAL -5.70e-06
RDEF2 -6.10e-06
RDELAY -7.20e-06
RHEP24 -7.90e-06

4.4 Exercise

4.4.1 Predicting Myocardial Infarction with sl3

Follow the steps below to predict myocardial infarction (mi) using the available covariate data. We thank Prof. David Benkeser at Emory University for making this Cardiovascular Health Study (CHS) data accessible.

waist alcoh hdl beta smoke ace ldl bmi aspirin gend age estrgn glu ins cysgfr dm fetuina whr hsed race logcystat logtrig logcrp logcre health logkcal sysbp mi
110.1642 0.0000 66.4974 0 0 1 114.2162 27.9975 0 0 73.5179 0 159.9314 70.3343 75.0078 1 0.1752 1.1690 1 1 -0.3420 5.4063 2.0126 -0.6739 0 4.3926 177.1345 0
89.9763 0.0000 50.0652 0 0 0 103.7766 20.8931 0 0 61.7723 0 153.3888 33.9695 82.7433 1 0.5717 0.9011 0 0 -0.0847 4.8592 3.2933 -0.5551 1 6.2071 136.3742 0
106.1941 8.4174 40.5059 0 0 0 165.7158 28.4554 1 1 72.9312 0 121.7145 -17.3017 74.6989 0 0.3517 1.1797 0 1 -0.4451 4.5088 0.3013 -0.0115 0 6.7320 135.1993 0
90.0566 0.0000 36.1750 0 0 0 45.2035 23.9608 0 0 79.1191 0 53.9691 11.7315 95.7823 0 0.5439 1.1360 0 0 -0.4807 5.1832 3.0243 -0.5751 1 7.3972 139.0182 0
78.6143 2.9790 71.0642 0 1 0 131.3121 10.9656 0 1 69.0179 0 94.3153 9.7112 72.7109 0 0.4916 1.1028 1 0 0.3121 4.2190 -0.7057 0.0053 1 8.2779 88.0470 0
91.6593 0.0000 59.4963 0 0 0 171.1872 29.1317 0 1 81.8346 0 212.9066 -28.2269 69.2184 1 0.4621 0.9529 1 0 -0.2872 5.1773 0.9705 0.2127 1 5.9942 69.5943 0
  1. Create an sl3 task, setting myocardial infarction mi as the outcome and using all available covariate data.
  2. Make a library of seven relatively fast base learning algorithms (i.e., do not consider BART or HAL). Customize hyperparameters for one of your learners. Feel free to use learners from sl3 or SuperLearner. You may use the same base learning library that is presented above.
  3. Incorporate feature selection with the screener Lrnr_screener_corP.
  4. Fit the metalearning step with non-negative least squares, Lrnr_nnls.
  5. With the metalearner and base learners, make the super learner and train it on the task.
  6. Print your super learner fit by calling its print() method with $ (i.e., fit$print()). Which learner is the discrete super learner?
  7. Cross-validate your super learner fit to see how well it performs on unseen data. Specify loss_squared_error as the loss function to evaluate the super learner. Like above, create a new task with 2 folds of external cross-validation for computational efficiency. Report the cross-validated mean risk of the discrete super learner and the super learner.

4.5 Summary

  • The general ensemble learning approach of super learner can be applied to a diversity of estimation and prediction problems that can be defined by a loss function.

  • Plug-in estimators of the estimand are desirable because a plug-in estimator respects both the local and global constraints of the statistical model.

  • Asymptotically linear estimators are also advantageous, since they converge to the estimand at \(\frac{1}{\sqrt{n}}\) rate, and thereby permit formal statistical inference.

  • If we plug the estimator returned by the super learner into the target parameter mapping, we would end up with an estimator that has the same bias as the initial estimator; such an estimator would not be asymptotically linear.

  • Targeted maximum likelihood estimation (TMLE) is a general strategy that succeeds in constructing asymptotically linear plug-in estimators.

  • In the chapters that follow, we focus on the targeted maximum likelihood estimator and the targeted minimum loss-based estimator, both referred to as TMLE.

References

Polley, Eric C, and Mark J van der Laan. 2010. “Super Learner in Prediction.” bepress.

van der Laan, Mark J, and Sandrine Dudoit. 2003. “Unified Cross-Validation Methodology for Selection Among Estimators and a General Cross-Validated Adaptive Epsilon-Net Estimator: Finite Sample Oracle Inequalities and Examples.” bepress.

van der Laan, Mark J, Eric C Polley, and Alan E Hubbard. 2007. “Super Learner.” Statistical Applications in Genetics and Molecular Biology 6 (1).

van der Vaart, Aad W, Sandrine Dudoit, and Mark J van der Laan. 2006. “Oracle Inequalities for Multi-Fold Cross Validation.” Statistics & Decisions 24 (3). Oldenbourg Wissenschaftsverlag: 351–71.