importance
Source: R/Lrnr_randomForest.R, R/importance.R
importance.Rd
Function that takes a cross-validated fit (i.e., a cross-validated learner that has already been trained on a task), which could be a cross-validated single learner or super learner, and generates a risk-based variable importance score for either each covariate or each group of covariates in the task. This function outputs a data.table, where each row corresponds to the risk difference or the risk ratio between the following two risks: the risk when a covariate (or group of covariates) is permuted or removed, and the original risk (i.e., when all covariates are included as they were in the observed data). A higher risk ratio/difference corresponds to a more important covariate/group. A plot can be generated from the returned data.table by calling the companion function importance_plot.
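To illustrate the idea behind this comparison of two risks, here is a conceptual sketch in base R of a permutation-based risk difference for a single covariate, using a plain glm and squared-error loss. This is not sl3's internal code, and all object names below are illustrative.

# Conceptual sketch only -- not sl3's implementation
set.seed(1)
n <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 2 * x1 + rnorm(n)               # x1 is predictive, x2 is noise
dat <- data.frame(x1, x2, y)
fit_glm <- glm(y ~ x1 + x2, data = dat)
risk_original <- mean((dat$y - predict(fit_glm, dat))^2)
dat_perm <- dat
dat_perm$x1 <- sample(dat_perm$x1)    # permute x1 to break its link with y
risk_permuted <- mean((dat$y - predict(fit_glm, dat_perm))^2)
risk_permuted - risk_original         # large difference => x1 is important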
importance(
  fit,
  eval_fun = NULL,
  fold_number = "validation",
  type = c("remove", "permute"),
  importance_metric = c("difference", "ratio"),
  covariate_groups = NULL
)
fit: A trained cross-validated (CV) learner (such as a CV stack or super learner), from which cross-validated predictions can be generated.
eval_fun: The evaluation function (risk or loss function) for evaluating the risk. Defaults vary based on the outcome type, matching defaults in default_metalearner. See loss_functions and risk_functions for options. Default is NULL.
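For example, a squared-error loss can be supplied explicitly instead of relying on the outcome-type default. This sketch assumes a trained CV fit named sl_fit, as in the examples further below, and uses sl3's loss_squared_error:

# use squared-error loss explicitly rather than the outcome-type default
importance(sl_fit, eval_fun = loss_squared_error)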
fold_number: The fold number to use for obtaining the predictions from the fit. Either a positive integer for obtaining predictions from a specific fold's fit, "full" for obtaining predictions from a fit on all of the data, or "validation" (default) for obtaining cross-validated predictions, where the data used for training and prediction never overlap across the folds. Note that if a positive integer or "full" is supplied, then there will be overlap between the data used for training and validation, so fold_number = "validation" is recommended.
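For instance (again assuming a trained CV fit sl_fit, as in the examples below):

# cross-validated predictions (default, recommended)
importance(sl_fit, fold_number = "validation")
# predictions from the fit trained on all of the data
importance(sl_fit, fold_number = "full")
# predictions from the first fold's fit
importance(sl_fit, fold_number = 1)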
type: Which method should be used to obscure the relationship between each covariate / covariate group and the outcome? When type is "remove" (default), each covariate / covariate group is removed one at a time from the task; the cross-validated learner is refit to this modified task; and finally, predictions are obtained from this refit. When type is "permute", each covariate / covariate group is permuted (sampled without replacement) one at a time, and then predictions are obtained from this modified data.
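For instance (again assuming a trained CV fit sl_fit):

# remove each covariate / covariate group and refit the CV learner (default)
importance(sl_fit, type = "remove")
# permute each covariate / covariate group without refitting
importance(sl_fit, type = "permute")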
Either "ratio"
or "difference"
(default). For each covariate / covariate group, "ratio"
returns the
risk of the permuted/removed covariate / covariate group divided by
observed/original risk (i.e., the risk with all covariates as they existed
in the sample) and "difference"
returns the difference between the
risk with the permuted/removed covariate / covariate group and the observed
risk.
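For instance (again assuming a trained CV fit sl_fit):

# risk difference: permuted/removed risk minus observed risk (default)
importance(sl_fit, importance_metric = "difference")
# risk ratio: permuted/removed risk divided by observed risk
importance(sl_fit, importance_metric = "ratio")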
covariate_groups: Optional named list of covariate groups, which will invoke variable importance evaluation at the group level by removing/permuting all covariates in the same group together. If covariates in the task are not specified in the list of groups, then those covariates will be added as additional single-covariate groups.
A data.table of variable importance scores for each covariate (or covariate group).
library(sl3)

# define ML task
data(cpp_imputed)
covs <- c("apgar1", "apgar5", "parity", "gagebrth", "mage", "meducyrs")
task <- sl3_Task$new(cpp_imputed, covariates = covs, outcome = "haz")
# build relatively fast learner library (not recommended for real analysis)
lasso_lrnr <- Lrnr_glmnet$new()
glm_lrnr <- Lrnr_glm$new()
ranger_lrnr <- Lrnr_ranger$new()
lrnrs <- c(lasso_lrnr, glm_lrnr, ranger_lrnr)
names(lrnrs) <- c("lasso", "glm", "ranger")
lrnr_stack <- make_learner(Stack, lrnrs)
# instantiate SL with default metalearner
sl <- Lrnr_sl$new(lrnr_stack)
sl_fit <- sl$train(task)
importance_result <- importance(sl_fit)
importance_result
#> covariate MSE_difference
#> 1: gagebrth 0.042704792
#> 2: mage 0.031846864
#> 3: meducyrs 0.030945845
#> 4: apgar1 0.013604514
#> 5: parity 0.004934802
#> 6: apgar5 -0.008147488
# importance with groups of covariates
groups <- list(
scores = c("apgar1", "apgar5"),
maternal = c("parity", "mage", "meducyrs")
)
importance_result_groups <- importance(sl_fit, covariate_groups = groups)
importance_result_groups
#> covariate_group MSE_difference
#> 1: scores 6.53175726
#> 2: maternal 0.14678546
#> 3: gagebrth 0.03660152
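# The results above can be visualized with the companion function
# importance_plot mentioned in the description (shown here with its
# default arguments)
importance_plot(importance_result)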