Estimation procedure for HAL, the Highly Adaptive LASSO

fit_hal(X, Y, degrees = NULL, fit_type = c("glmnet", "lassi"),
  n_folds = 10, use_min = TRUE, reduce_basis = NULL,
  family = c("gaussian", "binomial"), return_lasso = FALSE,
  return_x_basis = FALSE, basis_list = NULL, lambda = NULL, ...,
  yolo = TRUE)

Arguments

X

An input matrix of covariates, with one row per observation and one column per covariate, following standard conventions in problems of statistical learning.

Y

A numeric vector of observations of the outcome variable of interest, following standard conventions in problems of statistical learning.

degrees

The highest order of interaction terms for which the basis functions ought to be generated. The default (NULL) corresponds to generating basis functions for the full dimensionality of the input matrix.
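To see why capping the interaction order matters, the sketch below counts the covariate subsets (main terms and interactions) that would receive basis functions. It assumes one group of basis functions per non-empty subset of covariates of size at most degrees; count_subsets is a hypothetical helper written for illustration and is not part of the package.

```r
# Count the covariate subsets (main terms and interactions) that receive
# basis functions when interactions are capped at a given order.
# NOTE: count_subsets() is a hypothetical helper, not a hal9001 function.
count_subsets <- function(d, degrees = NULL) {
  # Assumption: NULL means "up to the full dimension d", as described above.
  max_order <- if (is.null(degrees)) d else min(degrees, d)
  sum(choose(d, seq_len(max_order)))
}

d <- 5
full <- count_subsets(d)                     # all orders: 2^d - 1 subsets
pairs_only <- count_subsets(d, degrees = 2)  # main terms + two-way interactions
```

With five covariates, restricting to two-way interactions cuts the number of subsets roughly in half, and the saving grows quickly as the number of columns in X increases.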

fit_type

The routine used to fit the cross-validated LASSO regression. Choosing "glmnet" results in a call to cv.glmnet, while "lassi" results in a (faster) call to a custom LASSO routine that uses the origami package.

n_folds

An integer giving the number of folds to use when splitting the data for cross-validation. Defaults to 10, a conventional choice for V-fold cross-validation.

use_min

Determines which lambda is selected from cv.glmnet. TRUE corresponds to "lambda.min" and FALSE corresponds to "lambda.1se".
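The difference between the two choices can be illustrated in base R. The sketch below applies the usual definitions (the minimizer of the mean cross-validated error, and the largest lambda whose error lies within one standard error of that minimum) to a small set of made-up numbers; the vectors are purely illustrative and are not output from cv.glmnet.

```r
# Hypothetical cross-validation summary (illustrative numbers, not real output).
lambda <- c(1.00, 0.50, 0.25, 0.10, 0.05)   # candidate penalties, decreasing
cvm    <- c(2.00, 1.40, 1.10, 1.00, 1.05)   # mean CV error at each lambda
cvsd   <- c(0.20, 0.15, 0.12, 0.12, 0.13)   # standard error of that mean

# "lambda.min": the penalty with the smallest mean CV error.
i_min <- which.min(cvm)
lambda_min <- lambda[i_min]

# "lambda.1se": the largest penalty whose mean CV error is within one
# standard error of the minimum.
threshold <- cvm[i_min] + cvsd[i_min]
lambda_1se <- max(lambda[cvm <= threshold])
```

Because lambda.1se is never smaller than lambda.min, setting use_min = FALSE generally yields a sparser, more heavily regularized fit.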

reduce_basis

A numeric value bounded in the open interval (0,1) indicating the minimum proportion of 1's in a basis function column needed for the basis function to be included in the procedure to fit the Lasso. Any basis functions with a lower proportion of 1's than the specified cutoff will be removed. This argument defaults to NULL, in which case all basis functions are used in the lasso-fitting stage of the HAL algorithm.
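The filtering rule can be sketched in a few lines of base R. The toy matrix and variable names below are assumptions chosen for illustration; the package applies the cutoff internally to its own basis matrix.

```r
# Toy 0/1 basis matrix: 10 observations, 4 basis function columns.
x_basis <- cbind(
  b1 = c(1, 1, 1, 1, 1, 1, 1, 1, 0, 0),  # 80% ones
  b2 = c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0),  # 30% ones
  b3 = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0),  # 10% ones
  b4 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)   #  0% ones
)

reduce_basis <- 0.2  # minimum proportion of 1's required to keep a column

# Columns are 0/1 indicators, so the column mean is the proportion of 1's.
prop_ones <- colMeans(x_basis)
keep <- prop_ones >= reduce_basis
x_reduced <- x_basis[, keep, drop = FALSE]
```

Here only b1 and b2 meet the cutoff, so the lasso-fitting stage would see two columns instead of four.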

family

A character string specifying the error family for a generalized linear model. Options are limited to "gaussian" for fitting a standard linear model and "binomial" for logistic regression.

return_lasso

A logical indicating whether or not to return the HAL lasso fit.

return_x_basis

A logical indicating whether or not to return the matrix of (possibly reduced) basis functions used in the HAL lasso fit.

basis_list

The full set of basis functions generated from the input data X (via a call to enumerate_basis). When all interaction orders are included, this structure contains n * (2^d - 1) basis functions, where n is the number of observations and d is the number of columns in X.

lambda

A user-specified array of values of the lambda tuning parameter of the L1-penalized (Lasso) regression. If NULL, cv.glmnet will be used to select a CV-optimal value of this parameter automatically. If specified, the Lasso regression model will be fit via glmnet.

...

Other arguments passed to cv.glmnet. Please consult the documentation for glmnet for a full list of options.

yolo

A logical indicating whether to print one of a curated selection of quotes from the HAL9000 computer, from the critically acclaimed epic science-fiction film "2001: A Space Odyssey" (1968).

Value

Object of class hal9001, containing a list of basis functions, a copy map, coefficients estimated for basis functions, and timing results (for assessing computational efficiency).
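A minimal usage sketch on simulated data follows. It uses only arguments listed in the usage above and assumes that the hal9001 package is installed; the call is guarded so the snippet still runs without it. The signature of fit_hal may differ between package versions, so treat this as an illustration rather than a canonical invocation.

```r
set.seed(2001)

# Simulated data (an assumption for illustration): 50 observations, 3 covariates.
n <- 50
x <- matrix(rnorm(n * 3), nrow = n, ncol = 3)
y <- sin(x[, 1]) + 0.5 * x[, 2] + rnorm(n, sd = 0.2)

fit <- NULL
if (requireNamespace("hal9001", quietly = TRUE)) {
  # yolo = FALSE suppresses the HAL9000 quote.
  fit <- hal9001::fit_hal(X = x, Y = y, family = "gaussian", yolo = FALSE)
  print(class(fit))   # class of the returned object
  print(names(fit))   # components of the returned list
}
```

Inspecting names(fit) is a quick way to locate the basis functions, copy map, coefficients, and timing results described above in whichever version is installed.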

Details

The procedure uses a custom C++ implementation to generate a design matrix of basis functions corresponding to covariates and interactions of covariates, and to remove duplicate columns of indicators. The LASSO regression is then fit to this (usually very wide) matrix, either with a custom implementation (based on the origami package) or by a call to cv.glmnet from the glmnet package.
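The two steps described above (building indicator basis functions and dropping duplicate columns) can be sketched in base R. This is a simplified illustration and not the package's C++ implementation; it assumes the standard HAL indicator basis, in which each observation serves as a knot point and a basis function equals 1 when every covariate in the subset is at or above the knot value.

```r
# Toy covariate matrix: 4 observations, 2 covariates (with tied values).
X <- cbind(x1 = c(1, 2, 2, 3), x2 = c(0, 1, 0, 1))
n <- nrow(X)

# One indicator basis function per (knot observation i, covariate subset s):
# phi(x) = product over j in s of I(x_j >= X[i, j]).
subsets <- list(1, 2, c(1, 2))  # main terms and the two-way interaction
basis_cols <- list()
for (s in subsets) {
  for (i in seq_len(n)) {
    ind <- rep(1, n)
    for (j in s) ind <- ind * as.numeric(X[, j] >= X[i, j])
    basis_cols[[length(basis_cols) + 1]] <- ind
  }
}
x_basis <- do.call(cbind, basis_cols)   # n rows, n * 3 columns

# Tied covariate values yield identical columns; keep one copy of each.
x_unique <- x_basis[, !duplicated(t(x_basis)), drop = FALSE]
```

Even in this tiny example most columns are duplicates, which is why removing them before the LASSO fit keeps the design matrix manageable.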