Estimation procedure for HAL, the Highly Adaptive Lasso

fit_hal(
  X,
  Y,
  X_unpenalized = NULL,
  max_degree = 3,
  fit_type = c("glmnet", "lassi"),
  n_folds = 10,
  foldid = NULL,
  use_min = TRUE,
  reduce_basis = NULL,
  family = c("gaussian", "binomial", "cox"),
  return_lasso = TRUE,
  return_x_basis = FALSE,
  basis_list = NULL,
  lambda = NULL,
  id = NULL,
  offset = NULL,
  cv_select = TRUE,
  ...,
  yolo = TRUE
)

Arguments

X

An input matrix containing observations and covariates.

Y

A numeric vector of observations of the outcome variable.

X_unpenalized

An input matrix with the same format as X that is appended directly to the design matrix (no basis expansion). No L1 penalization is applied to these covariates.

max_degree

The highest order of interaction terms for which the basis functions ought to be generated. The default is 3, as shown in the usage above. Setting this to NULL corresponds to generating basis functions for the full dimensionality of the input matrix.

fit_type

The specific routine to be called when fitting the Lasso regression in a cross-validated manner. Choosing the glmnet option will result in a call to cv.glmnet while lassi will produce a (faster) call to a custom Lasso routine.

n_folds

Integer for the number of folds to be used when splitting the data for V-fold cross-validation. This defaults to 10.

foldid

An optional vector of values between 1 and n_folds identifying which fold each observation is in. If supplied, n_folds can be missing. When supplied, this is passed to cv.glmnet.
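
To illustrate the expected shape of foldid, here is a minimal base R sketch that assigns observations to balanced random folds. The values of n and n_folds are illustrative assumptions, and this is only one possible way to build such a vector.

```r
set.seed(1)
n <- 25        # number of observations (illustrative)
n_folds <- 5   # number of cross-validation folds (illustrative)

# Repeat the fold labels 1..n_folds out to length n, then shuffle them
foldid <- sample(rep(seq_len(n_folds), length.out = n))

# With n divisible by n_folds, every fold receives n / n_folds observations
table(foldid)
```

A vector built this way has one entry per row of X and could then be supplied as the foldid argument.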

use_min

Determines which lambda is selected from cv.glmnet. TRUE corresponds to "lambda.min" and FALSE corresponds to "lambda.1se".
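
The difference between the two choices can be sketched in base R. The cross-validation summaries below are made-up numbers for illustration only; the rule shown follows the glmnet definitions of "lambda.min" (smallest mean cross-validated error) and "lambda.1se" (largest lambda whose error is within one standard error of that minimum).

```r
# Hypothetical cross-validation summary (made-up numbers)
lambda <- c(1.00, 0.50, 0.25, 0.10, 0.05)  # candidate penalties
cvm    <- c(1.40, 1.10, 0.95, 0.90, 0.93)  # mean CV error at each lambda
cvsd   <- c(0.10, 0.09, 0.08, 0.08, 0.09)  # standard error of each mean

# "lambda.min": the lambda with the smallest mean CV error
i_min <- which.min(cvm)
lambda_min <- lambda[i_min]

# "lambda.1se": the largest lambda whose mean CV error is within
# one standard error of the minimum
threshold <- cvm[i_min] + cvsd[i_min]
lambda_1se <- max(lambda[cvm <= threshold])

c(lambda_min = lambda_min, lambda_1se = lambda_1se)
```

In this toy example the one-standard-error rule picks a larger penalty than the minimum-error rule, which generally yields a sparser fit.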

reduce_basis

A numeric value bounded in the open interval (0,1) indicating the minimum proportion of 1's in a basis function column needed for the basis function to be included in the procedure to fit the Lasso. Any basis functions with a lower proportion of 1's than the cutoff will be removed. This argument defaults to NULL, in which case all basis functions are used in the lasso-fitting stage of the HAL algorithm.
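
The filtering rule described above can be sketched in base R on a toy indicator matrix. The matrix x_basis and the cutoff value are illustrative assumptions; this is not the package's internal implementation.

```r
# Toy 0/1 basis matrix: 10 observations, 3 basis functions (made-up values)
x_basis <- cbind(
  b1 = c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0),  # 50% ones
  b2 = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0),  # 10% ones
  b3 = c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0)   # 30% ones
)
reduce_basis <- 0.2  # minimum proportion of 1's required (illustrative)

# The proportion of 1's in each column is its column mean
prop_ones <- colMeans(x_basis)

# Drop columns whose proportion of 1's falls below the cutoff
x_reduced <- x_basis[, prop_ones >= reduce_basis, drop = FALSE]
colnames(x_reduced)
```

Here b2 is removed because only 10% of its entries are 1, while b1 and b3 are retained.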

family

A character string corresponding to the error family for a generalized linear model. Options are limited to "gaussian" for fitting a standard linear model, "binomial" for penalized logistic regression, and "cox" for a penalized proportional hazards model. Note that for "binomial" and "cox" the argument fit_type is limited to "glmnet"; the documentation of the glmnet package should therefore be consulted for any errors arising in the Lasso fitting step in these cases.

return_lasso

A logical indicating whether or not to return the glmnet fit of the lasso model.

return_x_basis

A logical indicating whether or not to return the matrix of (possibly reduced) basis functions used in the HAL lasso fit.

basis_list

The full set of basis functions generated from the input data X (via a call to enumerate_basis). Without a cap on the interaction degree, this structure contains up to n * (2^d - 1) basis functions, where n is the number of observations and d is the number of columns in X.

lambda

User-specified array of values of the lambda tuning parameter of the Lasso L1 regression. If NULL, cv.glmnet will be used to automatically select a CV-optimal value of this regularization parameter. If specified, the Lasso L1 regression model will be fit via glmnet, returning regularized coefficient values for each value in the input array.

id

A vector of ID values, used to generate cross-validation folds for cross-validated selection of the regularization parameter lambda.
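
The purpose of ID-based folds is to keep all observations sharing an ID in the same fold. The base R sketch below shows one way to achieve that; the id values and n_folds are made up, and the package may construct its folds differently.

```r
set.seed(1)
id <- c(1, 1, 2, 2, 2, 3, 4, 4, 5, 6)  # made-up IDs with repeated measures
n_folds <- 3                           # illustrative number of folds

unique_ids <- unique(id)

# Assign each unique ID to a fold, then map the fold labels back to rows
id_fold <- sample(rep(seq_len(n_folds), length.out = length(unique_ids)))
foldid <- id_fold[match(id, unique_ids)]

# Number of distinct folds per ID: 1 everywhere means no ID is split
tapply(foldid, id, function(f) length(unique(f)))
```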

offset

A vector of offset values, used in fitting.

cv_select

A logical specifying how the values supplied in lambda are used. When TRUE, they are passed to cv.glmnet, which picks the optimal value by cross-validation. When FALSE, glmnet simply fits along the supplied sequence of values (or the single value).

...

Other arguments passed to cv.glmnet. Please consult its documentation for a full list of options.

yolo

A logical indicating whether to print one of a curated selection of quotes from the HAL9000 computer, from the critically acclaimed epic science-fiction film "2001: A Space Odyssey" (1968).

Value

Object of class hal9001, containing a list of basis functions, a copy map, coefficients estimated for basis functions, and timing results (for assessing computational efficiency).

Details

The procedure uses a custom C++ implementation to generate a design matrix consisting of basis functions corresponding to covariates and interactions of covariates and to remove duplicate columns of indicators. The Lasso regression is fit to this (usually) very wide matrix using either a custom implementation (based on origami) or by a call to cv.glmnet.
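
As a rough illustration of the two steps described above, the base R sketch below builds indicator basis functions for a single covariate (one knot per observation) and then removes duplicate columns. The package performs these steps in C++ and also handles interactions; the names x, knots, and basis are illustrative assumptions.

```r
# Toy covariate with a repeated value (made-up data)
x <- cbind(x1 = c(0.2, 0.5, 0.5, 0.9))

# Indicator basis for one covariate: column j holds I(x1 >= knot j),
# with one knot placed at each observed value
knots <- x[, "x1"]
basis <- outer(x[, "x1"], knots, FUN = ">=") * 1

# Columns 2 and 3 share the knot 0.5 and are therefore identical;
# keep only the first copy of each distinct column
basis_unique <- basis[, !duplicated(t(basis)), drop = FALSE]

dim(basis)         # 4 rows, 4 columns
dim(basis_unique)  # 4 rows, 3 columns
```

Tied covariate values are one source of duplicated indicator columns, which is why a deduplication step reduces the width of the design matrix before the Lasso fit.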

Examples

# \donttest{
n <- 100
p <- 3
x <- xmat <- matrix(rnorm(n * p), n, p)
y_prob <- plogis(3 * sin(x[, 1]) + sin(x[, 2]))
y <- rbinom(n = n, size = 1, prob = y_prob)
ml_hal_fit <- fit_hal(X = x, Y = y, family = "binomial", yolo = FALSE)
preds <- predict(ml_hal_fit, new_data = x)
# }