Takes a cross-validated fit (i.e., a cross-validated learner that has already been trained on a task), which may be a cross-validated single learner or a super learner, and generates a risk-based variable importance score for each covariate or each group of covariates in the task. The function returns a data.table in which each row gives the risk difference or the risk ratio between two risks: the risk when a covariate (or group of covariates) is permuted or removed, and the original risk (i.e., when all covariates are included as they were in the observed data). A higher risk ratio or difference corresponds to a more important covariate or group. A plot can be generated from the returned data.table by calling the companion function importance_plot.

importance(fit, eval_fun = NULL, fold_number = "validation",
  type = c("remove", "permute"), importance_metric = c("difference",
  "ratio"), covariate_groups = NULL)

Arguments

fit

A trained cross-validated (CV) learner (such as a CV stack or super learner), from which cross-validated predictions can be generated.

eval_fun

The evaluation function (risk or loss function) used to evaluate the risk. If NULL (the default), the evaluation function is chosen based on the outcome type, matching the defaults in default_metalearner. See loss_functions and risk_functions for options.

fold_number

The fold number to use for obtaining the predictions from the fit. Either a positive integer, for obtaining predictions from a specific fold's fit; "full", for obtaining predictions from a fit on all of the data; or "validation" (default), for obtaining cross-validated predictions, where the data used for training and prediction never overlap across the folds. Note that if a positive integer or "full" is supplied, there will be overlap between the data used for training and validation, so fold_number = "validation" is recommended.

type

The method used to obscure the relationship between each covariate / covariate group and the outcome. When type is "remove" (default), each covariate / covariate group is removed one at a time from the task, the cross-validated learner is refit to this modified task, and predictions are obtained from this refit. When type is "permute", each covariate / covariate group is permuted (sampled without replacement) one at a time, and predictions are then obtained from this modified data.
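The contrast between the two types can be sketched in base R. This is only an illustration under simplifying assumptions, not the sl3 implementation: it uses a single train/validation split instead of cross-validation, a linear model instead of a CV learner, and simulated data. The names dat, permute_risk, and remove_risk are hypothetical.

```r
# Illustrative sketch only; names and data are hypothetical, not part of sl3.
set.seed(1)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 * dat$x1 + rnorm(n) # only x1 drives the outcome

mse <- function(pred, obs) mean((pred - obs)^2)

# Hold out half of the rows so that risk is evaluated on unseen data
train <- dat[1:100, ]
valid <- dat[101:200, ]
fit <- lm(y ~ x1 + x2, data = train)
original_risk <- mse(predict(fit, valid), valid$y)

# "permute": shuffle one covariate in the validation data and reuse the fit
permute_risk <- function(covariate) {
  shuffled <- valid
  shuffled[[covariate]] <- sample(shuffled[[covariate]]) # without replacement
  mse(predict(fit, shuffled), valid$y)
}

# "remove": drop one covariate and refit before predicting
remove_risk <- function(covariate) {
  kept <- setdiff(c("x1", "x2"), covariate)
  refit <- lm(reformulate(kept, response = "y"), data = train)
  mse(predict(refit, valid), valid$y)
}

permute_diff <- sapply(c("x1", "x2"), permute_risk) - original_risk
remove_diff <- sapply(c("x1", "x2"), remove_risk) - original_risk
```

In this simulation, x1 should show a much larger risk increase than x2 under both approaches. "remove" costs one refit per covariate, whereas "permute" reuses the existing fit.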

importance_metric

Either "ratio" or "difference" (default). For each covariate / covariate group, "ratio" returns the risk with the covariate / covariate group permuted/removed divided by the observed/original risk (i.e., the risk with all covariates as they existed in the sample), and "difference" returns the risk with the covariate / covariate group permuted/removed minus the observed risk.
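The arithmetic behind the two metrics can be shown with a few made-up risks. The values and the names original_risk and modified_risk below are hypothetical and chosen only for illustration.

```r
# Hypothetical risks, chosen only to illustrate the arithmetic
original_risk <- 1.00
modified_risk <- c(x1 = 1.50, x2 = 1.00, x3 = 0.95)

# "difference": modified risk minus original risk
risk_difference <- modified_risk - original_risk

# "ratio": modified risk divided by original risk
risk_ratio <- modified_risk / original_risk
```

A difference above 0 (or a ratio above 1) means the risk increased when the covariate was obscured. A negative difference (or a ratio below 1) means the risk decreased slightly, which can happen for covariates that contribute little to prediction.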

covariate_groups

Optional named list of covariate groups, which invokes variable importance evaluation at the group level by removing/permuting all covariates in the same group together. Covariates in the task that are not specified in the list of groups are added as additional single-covariate groups.
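The handling of ungrouped covariates described above can be sketched in base R. This is an assumption about the resulting grouping, not the code sl3 runs internally; ungrouped and all_groups are hypothetical names.

```r
# Covariates in the task and a partial grouping that omits "gagebrth"
covariates <- c("apgar1", "apgar5", "parity", "gagebrth", "mage", "meducyrs")
groups <- list(
  scores = c("apgar1", "apgar5"),
  maternal = c("parity", "mage", "meducyrs")
)

# Covariates not named in any group become their own single-covariate groups
ungrouped <- setdiff(covariates, unlist(groups))
all_groups <- c(groups, setNames(as.list(ungrouped), ungrouped))

names(all_groups)
#> [1] "scores"   "maternal" "gagebrth"
```

Importance is then evaluated once per entry of the completed list, so a covariate left out of covariate_groups still receives its own row in the result.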

Value

A data.table of variable importance, with one row for each covariate or covariate group.

Examples

library(sl3)

# define ML task
data(cpp_imputed)
covs <- c("apgar1", "apgar5", "parity", "gagebrth", "mage", "meducyrs")
task <- sl3_Task$new(cpp_imputed, covariates = covs, outcome = "haz")

# build relatively fast learner library (not recommended for real analysis)
lasso_lrnr <- Lrnr_glmnet$new()
glm_lrnr <- Lrnr_glm$new()
ranger_lrnr <- Lrnr_ranger$new()
lrnrs <- c(lasso_lrnr, glm_lrnr, ranger_lrnr)
names(lrnrs) <- c("lasso", "glm", "ranger")
lrnr_stack <- make_learner(Stack, lrnrs)

# instantiate SL with default metalearner
sl <- Lrnr_sl$new(lrnr_stack)
sl_fit <- sl$train(task)

# importance of each covariate, using the default arguments
importance_result <- importance(sl_fit)
importance_result
#>    covariate MSE_difference
#> 1:  gagebrth    0.042704792
#> 2:      mage    0.031846864
#> 3:  meducyrs    0.030945845
#> 4:    apgar1    0.013604514
#> 5:    parity    0.004934802
#> 6:    apgar5   -0.008147488

# importance with groups of covariates
groups <- list(
  scores = c("apgar1", "apgar5"),
  maternal = c("parity", "mage", "meducyrs")
)
importance_result_groups <- importance(sl_fit, covariate_groups = groups)
importance_result_groups
#>    covariate_group MSE_difference
#> 1:          scores     6.53175726
#> 2:        maternal     0.14678546
#> 3:        gagebrth     0.03660152