Generalized Random Forests Learner

This learner implements Generalized Random Forests (GRF) using the grf package, a pluggable package for forest-based statistical estimation and inference. GRF currently provides non-parametric methods for least-squares regression, quantile regression, and treatment effect estimation (optionally using instrumental variables). The current implementation trains a regression forest that can be used to estimate quantiles of the conditional distribution of Y given X = x.

Format

R6Class object.

Value

Learner object with methods for training and prediction. See Lrnr_base for documentation on learners.

Parameters

num.trees = 2000: Number of trees grown in the forest. NOTE: Getting accurate confidence intervals generally requires more trees than getting accurate predictions.
quantiles = c(0.1, 0.5, 0.9): Vector of quantiles used to calibrate the forest.
regression.splitting = FALSE: Whether to use regression splits when growing trees instead of specialized splits based on the quantiles (the default). Setting this flag to TRUE corresponds to the approach to quantile forests from Meinshausen (2006).
clusters = NULL: Vector of integers or factors specifying which cluster each observation corresponds to.
equalize.cluster.weights = FALSE: If FALSE, each unit is given the same weight (so that bigger clusters get more weight). If TRUE, each cluster is given equal weight in the forest. In this case, during training, each tree uses the same number of observations from each drawn cluster: If the smallest cluster has K units, then when we sample a cluster during training, we only give a random K elements of the cluster to the tree-growing procedure. When estimating average treatment effects, each observation is given weight 1/cluster size, so that the total weight of each cluster is the same.
sample.fraction = 0.5: Fraction of the data used to build each tree. NOTE: If honesty = TRUE, these subsamples will further be cut by a factor of honesty.fraction.
mtry = NULL: Number of variables tried for each split. By default, this is set based on the dimensionality of the predictors.
min.node.size = 5: A target for the minimum number of observations in each tree leaf. Note that nodes with size smaller than min.node.size can occur, as in the randomForest package.
honesty = TRUE: Whether or not honest splitting (i.e., sub-sample splitting) should be used.
alpha = 0.05: A tuning parameter that controls the maximum imbalance of a split.
imbalance.penalty = 0: A tuning parameter that controls how harshly imbalanced splits are penalized.
num.threads = 1: Number of threads used in training. If set to NULL, the software automatically selects an appropriate number of threads.
quantiles_pred: Vector of quantiles used for prediction. This can differ from the vector of quantiles used for training.
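
The notes on sample.fraction and equalize.cluster.weights above involve a little arithmetic that is easier to see in code. The following base-R sketch is illustrative only: both helper functions are hypothetical (they are not part of grf or sl3), the value 0.5 for honesty.fraction is assumed purely for the example, and rounding down is an assumption that may not match how grf rounds internally.

```r
# Hypothetical helper (not part of grf or sl3): rough per-tree sample sizes.
# Assumption: sizes are rounded down; grf's exact rounding may differ.
tree_sample_sizes <- function(n, sample.fraction = 0.5, honesty = TRUE,
                              honesty.fraction = 0.5) {
  # Each tree is built on a subsample of the data.
  subsample <- floor(n * sample.fraction)
  if (!honesty) {
    return(c(subsample = subsample, honest_part = subsample))
  }
  # With honesty, the subsample is cut further by a factor of honesty.fraction.
  c(subsample = subsample, honest_part = floor(subsample * honesty.fraction))
}

# Hypothetical helper: with equalize.cluster.weights = TRUE, a drawn cluster
# hands only K randomly chosen units to the tree, where K is the size of the
# smallest cluster.
draw_equalized_cluster <- function(clusters, drawn_cluster) {
  k <- min(table(clusters))
  members <- which(clusters == drawn_cluster)
  # Index via sample.int() so a single-member cluster is handled correctly.
  members[sample.int(length(members), k)]
}

sizes <- tree_sample_sizes(n = 1000)
sizes  # subsample = 500, honest_part = 250

set.seed(1)
clusters <- rep(c("a", "b", "c"), times = c(3, 10, 50))
picked <- draw_equalized_cluster(clusters, "c")
length(picked)  # 3, the size of the smallest cluster ("a")
```

In this toy setup the largest cluster has 50 units, but only 3 of them reach the tree-growing procedure whenever that cluster is drawn, which is how each cluster ends up with equal weight.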

Common Parameters

Individual learners have their own sets of parameters. Below is a list of shared parameters, implemented by Lrnr_base, and shared by all learners.

covariates: A character vector of covariates. The learner will use this to subset the covariates for any specified task.
outcome_type: A variable_type object used to control the outcome_type used by the learner. Overrides the task outcome_type if specified.
...: All other parameters should be handled by the individual learner classes. See the documentation for the learner class you're instantiating.

Other Learners: Custom_chain, Lrnr_HarmonicReg, Lrnr_arima, Lrnr_bartMachine, Lrnr_base, Lrnr_bayesglm, Lrnr_bilstm, Lrnr_caret, Lrnr_cv_selector, Lrnr_cv, Lrnr_dbarts, Lrnr_define_interactions, Lrnr_density_discretize, Lrnr_density_hse, Lrnr_density_semiparametric, Lrnr_earth, Lrnr_expSmooth, Lrnr_gam, Lrnr_ga, Lrnr_gbm, Lrnr_glm_fast, Lrnr_glm_semiparametric, Lrnr_glmnet, Lrnr_glmtree, Lrnr_glm, Lrnr_grfcate, Lrnr_gru_keras, Lrnr_gts, Lrnr_h2o_grid, Lrnr_hal9001, Lrnr_haldensify, Lrnr_hts, Lrnr_independent_binomial, Lrnr_lightgbm, Lrnr_lstm_keras, Lrnr_mean, Lrnr_multiple_ts, Lrnr_multivariate, Lrnr_nnet, Lrnr_nnls, Lrnr_optim, Lrnr_pca, Lrnr_pkg_SuperLearner, Lrnr_polspline, Lrnr_pooled_hazards, Lrnr_randomForest, Lrnr_ranger, Lrnr_revere_task, Lrnr_rpart, Lrnr_rugarch, Lrnr_screener_augment, Lrnr_screener_coefs, Lrnr_screener_correlation, Lrnr_screener_importance, Lrnr_sl, Lrnr_solnp_density, Lrnr_solnp, Lrnr_stratified, Lrnr_subset_covariates, Lrnr_svm, Lrnr_tsDyn, Lrnr_ts_weights, Lrnr_xgboost, Pipeline, Stack, define_h2o_X(), undocumented_learner

Examples

# load the sl3 package and example data
library(sl3)
data(cpp_imputed)

# create sl3 task
task <- sl3_Task$new(
  cpp_imputed,
  covariates = c("apgar1", "apgar5", "parity", "gagebrth", "mage", "meducyrs"),
  outcome = "haz"
)

# train grf learner and make predictions
lrnr_grf <- Lrnr_grf$new(seed = 123)
lrnr_grf_fit <- lrnr_grf$train(task)
lrnr_grf_pred <- lrnr_grf_fit$predict()
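
As a further sketch, the quantiles_pred parameter described above can be set when the learner is created, so that predictions target a quantile other than the median. The example assumes that the sl3 and grf packages are installed and that a single prediction quantile yields one value per observation. The choice of the 0.9 quantile and the object names are illustrative only.

```r
library(sl3)

# load example data and create the sl3 task, as in the example above
data(cpp_imputed)
task <- sl3_Task$new(
  cpp_imputed,
  covariates = c("apgar1", "apgar5", "parity", "gagebrth", "mage", "meducyrs"),
  outcome = "haz"
)

# train on three quantiles, but predict only the 0.9 conditional quantile
lrnr_grf_q90 <- Lrnr_grf$new(
  quantiles = c(0.1, 0.5, 0.9),
  quantiles_pred = 0.9,
  seed = 123
)
fit_q90 <- lrnr_grf_q90$train(task)
pred_q90 <- fit_q90$predict()
head(pred_q90)
```

Since quantiles_pred may differ from the training quantiles, it can be worth checking that the requested quantile is one the forest was calibrated on, or close to one.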