Introduction

This document presents simple benchmarks for several choices of Super Learner implementation, wrapper functions, and parallelization schemes. Its purpose is twofold:

  1. Compare the computational performance of these methods
  2. Illustrate the use of these different methods

Test Setup

Test System

  • CPU model: Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
  • Physical cores: 12
  • Logical cores: 12
  • Clock speed: 2.6GHz
  • Memory: 252.2GB

Test Data

n <- 1e4
data(cpp_imputed)
cpp_big <- cpp_imputed[sample(nrow(cpp_imputed), n, replace = TRUE), ]
covars <- c("apgar1", "apgar5", "parity", "gagebrth", "mage", "meducyrs",
            "sexn")
outcome <- "haz"

task <- sl3_Task$new(cpp_big, covariates = covars, outcome = outcome,
                     outcome_type = "continuous")
  • Number of observations: 10000
  • Number of covariates: 7

Test Descriptions

Legacy SuperLearner

The legacy SuperLearner package serves as a suitable baseline. We can fit it sequentially (no parallelization):

time_SuperLearner_sequential <- system.time({
  SuperLearner(task$Y, as.data.frame(task$X), newX = NULL, family = gaussian(), 
               SL.library = c("SL.glmnet", "SL.randomForest", "SL.speedglm"),
               method = "method.NNLS", id = NULL, verbose = FALSE,
               control = list(), cvControl = list(), obsWeights = NULL,
               env = parent.frame())
})

We can also fit it with multicore parallelization via the mcSuperLearner function:

options(mc.cores = cpus_physical)
time_SuperLearner_multicore <- system.time({
  mcSuperLearner(task$Y, as.data.frame(task$X), newX = NULL,
                 family = gaussian(),
                 SL.library = c("SL.glmnet", "SL.randomForest", "SL.speedglm"),
                 method = "method.NNLS", id = NULL, verbose = FALSE,
                 control = list(), cvControl = list(), obsWeights = NULL,
                 env = parent.frame())
})

The SuperLearner package supports a number of other parallelization schemes, although these weren’t tested here.
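For instance, cluster-based parallelization is available via the snowSuperLearner function. A minimal sketch, reusing the task object and cpus_physical variable from above (not run as part of these benchmarks):

```r
# Sketch of cluster-based parallelization via snowSuperLearner (not benchmarked).
library(parallel)
cl <- makeCluster(cpus_physical, type = "PSOCK")
clusterSetRNGStream(cl, 1)  # reproducible parallel RNG streams
time_SuperLearner_snow <- system.time({
  snowSuperLearner(cl, task$Y, as.data.frame(task$X), family = gaussian(),
                   SL.library = c("SL.glmnet", "SL.randomForest", "SL.speedglm"),
                   method = "method.NNLS")
})
stopCluster(cl)
```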

sl3 with Legacy SuperLearner Wrappers

To maximize comparability with the legacy implementation, we can use sl3 with the SuperLearner wrappers, so that the actual computation used to train the learners is identical:

sl_glmnet <- Lrnr_pkg_SuperLearner$new("SL.glmnet")
sl_random_forest <- Lrnr_pkg_SuperLearner$new("SL.randomForest")
sl_speedglm <- Lrnr_pkg_SuperLearner$new("SL.speedglm")
nnls_lrnr <- Lrnr_nnls$new()

sl3_legacy <- Lrnr_sl$new(list(sl_random_forest, sl_glmnet, sl_speedglm),
                          nnls_lrnr)

sl3 with Native Learners

We can also use native sl3 learners, which have been rewritten to be performant on large sample sizes:

lrnr_glmnet <- Lrnr_glmnet$new()
random_forest <- Lrnr_randomForest$new()
glm_fast <- Lrnr_glm_fast$new()
nnls_lrnr <- Lrnr_nnls$new()

sl3_native <- Lrnr_sl$new(list(random_forest, lrnr_glmnet, glm_fast), nnls_lrnr)
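For reference, an sl3 super learner can also be trained eagerly, without the delayed scheduling used in the benchmarks below. A usage sketch (not timed here):

```r
# Eager (non-delayed) training and in-sample prediction, for illustration only.
sl_fit <- sl3_native$train(task)
preds <- sl_fit$predict(task)
head(preds)
```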

sl3 Parallelization Options

sl3 uses the delayed package to parallelize training tasks. Delayed, in turn, uses the future package to support a range of parallel back-ends. We test several of these, for both the legacy wrappers and native learners.

First, sequential evaluation (no parallelization):

plan(sequential)
test <- delayed_learner_train(sl3_legacy, task)
time_sl3_legacy_sequential <- system.time({
  sched <- Scheduler$new(test, SequentialJob)
  cv_fit <- sched$compute()
})

test <- delayed_learner_train(sl3_native, task)
time_sl3_native_sequential <- system.time({
  sched <- Scheduler$new(test, SequentialJob)
  cv_fit <- sched$compute()
})

Next, multicore parallelization:

plan(multicore, workers = cpus_physical)
test <- delayed_learner_train(sl3_legacy, task)
time_sl3_legacy_multicore <- system.time({
  sched <- Scheduler$new(test, FutureJob, nworkers = cpus_physical,
                         verbose = FALSE)
  cv_fit <- sched$compute()
})

test <- delayed_learner_train(sl3_native, task)
time_sl3_native_multicore <- system.time({
  sched <- Scheduler$new(test, FutureJob, nworkers = cpus_physical,
                         verbose = FALSE)
  cv_fit <- sched$compute()
})

We also test multicore parallelization with hyper-threading, using a number of workers equal to the number of logical (rather than physical) cores:

plan(multicore, workers = cpus_logical)
test <- delayed_learner_train(sl3_legacy, task)
time_sl3_legacy_multicore_ht <- system.time({
  sched <- Scheduler$new(test, FutureJob, nworkers = cpus_logical,
                         verbose = FALSE)
  cv_fit <- sched$compute()
})

test <- delayed_learner_train(sl3_native, task)
time_sl3_native_multicore_ht <- system.time({
  sched <- Scheduler$new(test, FutureJob, nworkers = cpus_logical,
                         verbose = FALSE)
  cv_fit <- sched$compute()
})

Finally, we test parallelization using multisession:

plan(multisession, workers = cpus_physical)
test <- delayed_learner_train(sl3_legacy, task)
time_sl3_legacy_multisession <- system.time({
  sched <- Scheduler$new(test, FutureJob, nworkers = cpus_physical,
                         verbose = FALSE)
  cv_fit <- sched$compute()
})

test <- delayed_learner_train(sl3_native, task)
time_sl3_native_multisession <- system.time({
  sched <- Scheduler$new(test, FutureJob, nworkers = cpus_physical,
                         verbose = FALSE)
  cv_fit <- sched$compute()
})

Results

We can see that using the native learners yields roughly a 4x speedup relative to the legacy wrappers. This can be at least partially explained by the fact that the legacy SL.randomForest wrapper uses randomForest.formula for continuous data, which in turn calls the model.matrix function, known to be slow on large datasets. Improvements to the legacy wrappers would probably reduce or eliminate this difference.

We can also see that multicore parallelization for the legacy SuperLearner function yields another 4x speedup on this system. Relative to that, the sl3_legacy_multicore test yields almost an additional 2x speedup, which can be explained by the use of delayed parallelization: while mcSuperLearner parallelizes only across the \(V\) cross-validation folds, delayed allows sl3 to parallelize across all training tasks that comprise the Super Learner, a total of \((V+1) \times n_{learners}\) training tasks, where \(n_{learners}\) is the number of learners in the library (here 4) and \((V+1)\) is one more than the number of cross-validation folds, accounting for the refit to the full data typically performed by the SuperLearner algorithm. We don't see a substantial difference between the three parallelization schemes for sl3.
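As a concrete check, assuming the default of \(V = 10\) cross-validation folds (the fold count is not set explicitly above): delayed sees \((V + 1) \times n_{learners} = 11 \times 4 = 44\) independent training tasks to schedule across workers, whereas mcSuperLearner distributes only the \(V = 10\) fold-level jobs.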

These effects appear multiplicative: the fastest implementation, sl3_native_multicore_ht (sl3 with native learners and hyper-threaded multicore parallelization), is about 32x faster than the slowest, SuperLearner_sequential (the legacy SuperLearner without parallelization). This is a dramatic reduction in the time required to run this Super Learner.

Session Information

R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.9 (Santiago)

Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblas-r0.2.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel stats graphics grDevices utils datasets base

other attached packages:
 [1] speedglm_0.3-2 MASS_7.3-49
 [3] randomForest_4.6-12 glmnet_2.0-13
 [5] foreach_1.4.4 Matrix_1.2-12
 [7] scales_0.5.0.9000 stringr_1.3.0
 [9] data.table_1.10.4-3 ggplot2_2.2.1.9000
[11] future_1.7.0 SuperLearner_2.0-23-9000
[13] nnls_1.4 delayed_0.2.1
[15] sl3_1.0.0 knitr_1.20
[17] nima_0.4.6 fcuk_0.1.21

loaded via a namespace (and not attached):
 [1] stringdist_0.9.4.7 origami_1.0.0 gtools_3.5.0
 [4] purrr_0.2.4 listenv_0.7.0 lattice_0.20-35
 [7] ggthemes_3.4.0 colorspace_1.3-2 htmltools_0.3.6
[10] yaml_2.1.18 rlang_0.2.0.9000 pillar_1.2.1
[13] withr_2.1.1.9000 uuid_0.1-2 ProjectTemplate_0.8
[16] plyr_1.8.4 munsell_0.4.3 gtable_0.2.0
[19] visNetwork_2.0.3 devtools_1.13.5 htmlwidgets_1.0
[22] codetools_0.2-15 evaluate_0.10.1 memoise_1.1.0
[25] rstackdeque_1.1.1 methods_3.4.4 Rcpp_0.12.16
[28] backports_1.1.2 checkmate_1.8.5 jsonlite_1.5
[31] abind_1.4-5 gridExtra_2.3 digest_0.6.15
[34] stringi_1.1.7 BBmisc_1.11 grid_3.4.4
[37] rprojroot_1.3-2 tools_3.4.4 magrittr_1.5
[40] lazyeval_0.2.1 tibble_1.4.2 future.apply_0.1.0
[43] pkgconfig_2.0.1 iterators_1.0.9 assertthat_0.2.0
[46] rmarkdown_1.9 R6_2.2.2 globals_0.11.0
[49] igraph_1.2.1 compiler_3.4.4