Estimation procedure for HAL, the Highly Adaptive Lasso

fit_hal(
  X,
  Y,
  formula = NULL,
  X_unpenalized = NULL,
  max_degree = ifelse(ncol(X) >= 20, 2, 3),
  smoothness_orders = 1,
  num_knots = num_knots_generator(max_degree = max_degree, smoothness_orders =
    smoothness_orders, base_num_knots_0 = 20, base_num_knots_1 = 10),
  reduce_basis = NULL,
  family = c("gaussian", "binomial", "poisson", "cox", "mgaussian"),
  lambda = NULL,
  id = NULL,
  weights = NULL,
  offset = NULL,
  fit_control = list(cv_select = TRUE, use_min = TRUE, lambda.min.ratio = 1e-04,
    prediction_bounds = "default"),
  basis_list = NULL,
  return_lasso = TRUE,
  return_x_basis = FALSE,
  yolo = FALSE
)

Arguments

X

An input matrix, of dimensions (number of observations) -by- (number of covariates), from which the design matrix of basis functions will be derived.

Y

A numeric vector of observations of the outcome variable. For family="mgaussian", Y is a matrix of observations of the outcome variables.

formula

A character string formula to be used in formula_hal. See its documentation for details.
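As a hedged sketch only (the h() term syntax and the column names x1 and x2 are assumptions here; consult the formula_hal documentation for the authoritative syntax), an additive fit restricted to two named covariates might be requested as:

# assumes the columns of X are named "x1" and "x2", and that formula_hal
# accepts h() terms as described in its documentation
hal_fit <- fit_hal(X = x, Y = y, formula = "~ h(x1) + h(x2)")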

X_unpenalized

An input matrix with the same number of rows as X, for which no L1 penalization will be performed. Note that X_unpenalized is directly appended to the design matrix; no basis expansion is performed on X_unpenalized.
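For instance (a sketch; w is a hypothetical numeric covariate measured on the same observations as X), a term that should enter the model without basis expansion or L1 penalization can be supplied as a one-column matrix:

# w is a hypothetical numeric vector with length equal to nrow(x);
# it is appended to the design matrix as-is, unpenalized and unexpanded
hal_fit <- fit_hal(X = x, Y = y, X_unpenalized = matrix(w, ncol = 1))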

max_degree

The highest order of interaction terms for which basis functions ought to be generated.

smoothness_orders

An integer, specifying the smoothness of the basis functions. See details for smoothness_orders for more information.

num_knots

An integer vector of length 1 or max_degree, specifying the maximum number of knot points (i.e., bins) per covariate for generating basis functions. If num_knots is a unit-length vector, then the same num_knots is used for each degree (this is not recommended). The default settings for num_knots are recommended; these defaults decrease num_knots as max_degree and smoothness_orders increase, which prevents (expensive) combinatorial explosions in the number of higher-degree and higher-order basis functions generated. This allows the complexity of the optimization problem to grow scalably. See the details on num_knots for more information.

reduce_basis

An optional numeric value in the open unit interval, indicating the minimum proportion of 1's a basis function column must contain for the basis function to be included in the lasso fit. Any basis function with a lower proportion of 1's than this cutoff is removed. Defaults to 1 over the square root of the number of observations. Only applicable to models fit with zero-order splines, i.e., smoothness_orders = 0.

family

A character string or a family object (supported by glmnet) specifying the error/link family for a generalized linear model. Character options are limited to "gaussian" for fitting a standard penalized linear model, "binomial" for penalized logistic regression, "poisson" for penalized Poisson regression, "cox" for a penalized proportional hazards model, and "mgaussian" for a multivariate penalized linear model. Note that passing in a family object leads to slower performance than passing in a character family (when supported); for example, one should set family = "binomial" instead of family = binomial() when calling fit_hal.
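As an illustration (assuming x is a covariate matrix and y a binary outcome vector), both calls below fit a penalized logistic regression, but the character form is faster:

fit_fast <- fit_hal(X = x, Y = y, family = "binomial")  # preferred: character family
fit_slow <- fit_hal(X = x, Y = y, family = binomial())  # supported, but slower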

lambda

User-specified sequence of values of the regularization parameter for the lasso L1 regression. If NULL, the default sequence in cv.glmnet will be used. The cross-validated optimal value of this regularization parameter will be selected with cv.glmnet. If fit_control's cv_select argument is set to FALSE, then the lasso model will be fit via glmnet, and regularized coefficient values for each lambda in the input array will be returned.
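As a sketch (the lambda grid below is illustrative, and fit_control entries not supplied are assumed to keep their defaults), a user-supplied sequence can either be cross-validated or fit in full:

lambda_seq <- exp(seq(log(1), log(1e-4), length.out = 50))  # illustrative grid
fit_cv  <- fit_hal(X = x, Y = y, lambda = lambda_seq)       # cv.glmnet selects one lambda
fit_all <- fit_hal(X = x, Y = y, lambda = lambda_seq,
                   fit_control = list(cv_select = FALSE))   # coefficients for every lambda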

id

A vector of ID values that is used to generate cross-validation folds for cv.glmnet. This argument is ignored when fit_control's cv_select argument is FALSE.

weights

Observation weights; defaults to 1 per observation.

offset

A vector of offset values, used in fitting.

fit_control

List of arguments, including the following, and any others to be passed to cv.glmnet or glmnet. An example call overriding several of these entries follows the list.

  • cv_select: A logical specifying if the sequence of specified lambda values should be passed to cv.glmnet in order for a single, optimal value of lambda to be selected according to cross-validation. When cv_select = FALSE, a glmnet model will be used to fit the sequence of (or single) lambda.

  • use_min: Specify the choice of lambda to be selected by cv.glmnet. When TRUE, "lambda.min" is used; otherwise, "lambda.1se". Only used when cv_select = TRUE.

  • lambda.min.ratio: A glmnet argument specifying the smallest value for lambda, as a fraction of lambda.max, the (data derived) entry value (i.e. the smallest value for which all coefficients are zero). We've seen that not setting lambda.min.ratio can lead to no lambda values that fit the data sufficiently well.

  • prediction_bounds: An optional vector of length two giving the lower and upper bounds for predictions; not used when family = "cox". When prediction_bounds = "default", predictions are bounded between min(Y) - sd(Y) and max(Y) + sd(Y) for each outcome (when family = "mgaussian", each outcome can have different bounds). Bounding ensures that there is no extrapolation.
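A hedged example of overriding these entries (x and y are placeholder data, and the bounds shown are purely illustrative):

hal_fit <- fit_hal(
  X = x, Y = y, family = "gaussian",
  fit_control = list(
    cv_select = TRUE,               # select lambda by cross-validation
    use_min = FALSE,                # use "lambda.1se" instead of "lambda.min"
    lambda.min.ratio = 1e-4,
    prediction_bounds = c(-10, 10)  # illustrative bounds on predictions
  )
)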

basis_list

The full set of basis functions generated from X.

return_lasso

A logical indicating whether or not to return the glmnet fit object of the lasso model.

return_x_basis

A logical indicating whether or not to return the matrix of (possibly reduced) basis functions used in fit_hal.

yolo

A logical indicating whether to print one of a curated selection of quotes from the HAL9000 computer, from the critically acclaimed epic science-fiction film "2001: A Space Odyssey" (1968).

Value

Object of class hal9001, containing a list of basis functions, a copy map, coefficients estimated for basis functions, and timing results (for assessing computational efficiency).

Details

The procedure uses a custom C++ implementation to generate a design matrix of spline basis functions of the covariates and their interactions. The lasso regression is fit to this design matrix via cv.glmnet or a custom implementation derived from origami. The maximum dimension of the design matrix is \(n\) -by- \(n \cdot 2^{d-1}\), where \(n\) is the number of observations and \(d\) is the number of covariates.
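For instance, with \(n = 100\) observations and \(d = 3\) covariates, the design matrix has at most \(100 \cdot 2^{2} = 400\) columns:

n <- 100
d <- 3
n * 2^(d - 1)  # upper bound on the number of basis-function columns: 400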

For smoothness_orders = 0, only zero-order splines (piece-wise constant) are generated, which assume the true regression function has no smoothness or continuity. When smoothness_orders = 1, first-order splines (piece-wise linear) are generated, which assume continuity of the true regression function. When smoothness_orders = 2, second-order splines (piece-wise quadratic and linear terms) are generated, which assume the true regression function has a single order of differentiability.
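For example (a sketch, with x and y as placeholder covariate and outcome data), the spline order is controlled entirely by the smoothness_orders argument:

fit_constant <- fit_hal(X = x, Y = y, smoothness_orders = 0)  # zero-order: piece-wise constant
fit_linear   <- fit_hal(X = x, Y = y, smoothness_orders = 1)  # first-order: piece-wise linear (default)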

The num_knots argument specifies the number of knot points for each covariate and each interaction degree (up to max_degree). Fewer knot points can significantly decrease runtime but might be overly simplistic. When smoothness_orders = 0, too few knot points (e.g., fewer than 50) can significantly reduce performance; when smoothness_orders is 1 or higher, fewer knot points (e.g., 10-30) are actually better for performance. We recommend specifying num_knots with respect to smoothness_orders, and as a vector of length max_degree with values decreasing exponentially; this prevents combinatorial explosions in the number of higher-degree basis functions generated. The default behavior of num_knots follows this logic: for smoothness_orders = 0, num_knots is set to \(500 / 2^{j-1}\), and for smoothness_orders = 1 or higher, num_knots is set to \(200 / 2^{j-1}\), where \(j\) is the interaction degree. We also include some other suitable settings for num_knots below, all of which are less complex than the default num_knots and will thus result in a faster runtime (an example call follows the list):

  • Some good settings for little to no cost in performance:

    • If smoothness_orders = 0 and max_degree = 3, num_knots = c(400, 200, 100).

    • If smoothness_orders = 1+ and max_degree = 3, num_knots = c(100, 75, 50).

  • Recommended settings for fairly fast runtime:

    • If smoothness_orders = 0 and max_degree = 3, num_knots = c(200, 100, 50).

    • If smoothness_orders = 1+ and max_degree = 3, num_knots = c(50, 25, 15).

  • Recommended settings for fast runtime:

    • If smoothness_orders = 0 and max_degree = 3, num_knots = c(100, 50, 25).

    • If smoothness_orders = 1+ and max_degree = 3, num_knots = c(40, 15, 10).

  • Recommended settings for very fast runtime:

    • If smoothness_orders = 0 and max_degree = 3, num_knots = c(50, 25, 10).

    • If smoothness_orders = 1+ and max_degree = 3, num_knots = c(25, 10, 5).
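For example, the "fairly fast" first-order setting above could be requested directly (a sketch; x and y are placeholder data with few enough covariates that max_degree = 3 applies):

hal_fit <- fit_hal(X = x, Y = y, smoothness_orders = 1,
                   max_degree = 3, num_knots = c(50, 25, 15))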

Examples

# simulate a small data set with a binary outcome
n <- 100
p <- 3
x <- matrix(rnorm(n * p), n, p)
y_prob <- plogis(3 * sin(x[, 1]) + sin(x[, 2]))
y <- rbinom(n = n, size = 1, prob = y_prob)

# fit HAL with penalized logistic regression, then predict on the training data
hal_fit <- fit_hal(X = x, Y = y, family = "binomial")
preds <- predict(hal_fit, new_data = x)