Estimation procedure for HAL, the Highly Adaptive Lasso
fit_hal(
X,
Y,
formula = NULL,
X_unpenalized = NULL,
max_degree = ifelse(ncol(X) >= 20, 2, 3),
smoothness_orders = 1,
num_knots = num_knots_generator(max_degree = max_degree, smoothness_orders =
smoothness_orders, base_num_knots_0 = 20, base_num_knots_1 = 10),
reduce_basis = NULL,
family = c("gaussian", "binomial", "poisson", "cox", "mgaussian"),
lambda = NULL,
id = NULL,
weights = NULL,
offset = NULL,
fit_control = list(cv_select = TRUE, use_min = TRUE, lambda.min.ratio = 1e-04,
prediction_bounds = "default"),
basis_list = NULL,
return_lasso = TRUE,
return_x_basis = FALSE,
yolo = FALSE
)An input matrix with dimensions number of observations -by-
number of covariates that will be used to derive the design matrix of basis
functions.
A numeric vector of observations of the outcome variable. For
family="mgaussian", Y is a matrix of observations of the
outcome variables.
A character string formula to be used in
formula_hal. See its documentation for details.
An input matrix with the same number of rows as
X, for which no L1 penalization will be performed. Note that
X_unpenalized is directly appended to the design matrix; no basis
expansion is performed on X_unpenalized.
The highest order of interaction terms for which basis functions ought to be generated.
An integer, specifying the smoothness of the
basis functions. See details for smoothness_orders for more
information.
An integer vector of length 1 or max_degree,
specifying the maximum number of knot points (i.e., bins) for any covariate
for generating basis functions. If num_knots is a unit-length
vector, then the same num_knots are used for each degree (this is
not recommended). The default settings for num_knots are
recommended, and these defaults decrease num_knots with increasing
max_degree and smoothness_orders, which prevents (expensive)
combinatorial explosions in the number of higher-degree and higher-order
basis functions generated. This allows the complexity of the optimization
problem to grow scalably. See details of num_knots more information.
Am optional numeric value bounded in the open
unit interval indicating the minimum proportion of 1's in a basis function
column needed for the basis function to be included in the procedure to fit
the lasso. Any basis functions with a lower proportion of 1's than the
cutoff will be removed. Defaults to 1 over the square root of the number of
observations. Only applicable for models fit with zero-order splines, i.e.
smoothness_orders = 0.
A character or a family object
(supported by glmnet) specifying the error/link
family for a generalized linear model. character options are limited
to "gaussian" for fitting a standard penalized linear model, "binomial" for
penalized logistic regression, "poisson" for penalized Poisson regression,
"cox" for a penalized proportional hazards model, and "mgaussian" for
multivariate penalized linear model. Note that passing in
family objects leads to slower performance relative to passing in a
character family (if supported). For example, one should set
family = "binomial" instead of family = binomial() when
calling fit_hal.
User-specified sequence of values of the regularization
parameter for the lasso L1 regression. If NULL, the default sequence
in cv.glmnet will be used. The cross-validated
optimal value of this regularization parameter will be selected with
cv.glmnet. If fit_control's cv_select
argument is set to FALSE, then the lasso model will be fit via
glmnet, and regularized coefficient values for each
lambda in the input array will be returned.
A vector of ID values that is used to generate cross-validation
folds for cv.glmnet. This argument is ignored when
fit_control's cv_select argument is FALSE.
observation weights; defaults to 1 per observation.
a vector of offset values, used in fitting.
List of arguments, including the following, and any
others to be passed to cv.glmnet or
glmnet.
cv_select: A logical specifying if the sequence of
specified lambda values should be passed to
cv.glmnet in order for a single, optimal value of
lambda to be selected according to cross-validation. When
cv_select = FALSE, a glmnet model will be
used to fit the sequence of (or single) lambda.
use_min: Specify the choice of lambda to be selected by
cv.glmnet. When TRUE, "lambda.min" is
used; otherwise, "lambda.1se". Only used when
cv_select = TRUE.
lambda.min.ratio: A glmnet argument
specifying the smallest value for lambda, as a fraction of
lambda.max, the (data derived) entry value (i.e. the smallest value
for which all coefficients are zero). We've seen that not setting
lambda.min.ratio can lead to no lambda values that fit the
data sufficiently well.
prediction_bounds: An optional vector of size two that provides
the lower and upper bounds predictions; not used when
family = "cox". When prediction_bounds = "default", the
predictions are bounded between min(Y) - sd(Y) and
max(Y) + sd(Y) for each outcome (when family = "mgaussian",
each outcome can have different bounds). Bounding ensures that there is
no extrapolation.
The full set of basis functions generated from X.
A logical indicating whether or not to return
the glmnet fit object of the lasso model.
A logical indicating whether or not to return
the matrix of (possibly reduced) basis functions used in fit_hal.
A logical indicating whether to print one of a curated
selection of quotes from the HAL9000 computer, from the critically
acclaimed epic science-fiction film "2001: A Space Odyssey" (1968).
Object of class hal9001, containing a list of basis
functions, a copy map, coefficients estimated for basis functions, and
timing results (for assessing computational efficiency).
The procedure uses a custom C++ implementation to generate a design
matrix of spline basis functions of covariates and interactions of
covariates. The lasso regression is fit to this design matrix via
cv.glmnet or a custom implementation derived from
origami. The maximum dimension of the design matrix is \(n\) -by-
\((n * 2^(d-1))\), where where \(n\) is the number of observations and
\(d\) is the number of covariates.
For smoothness_orders = 0, only zero-order splines (piece-wise
constant) are generated, which assume the true regression function has no
smoothness or continuity. When smoothness_orders = 1, first-order
splines (piece-wise linear) are generated, which assume continuity of the
true regression function. When smoothness_orders = 2, second-order
splines (piece-wise quadratic and linear terms) are generated, which assume
a the true regression function has a single order of differentiability.
num_knots argument specifies the number of knot points for each
covariate and for each max_degree. Fewer knot points can
significantly decrease runtime, but might be overly simplistic. When
considering smoothness_orders = 0, too few knot points (e.g., < 50)
can significantly reduce performance. When smoothness_orders = 1 or
higher, then fewer knot points (e.g., 10-30) is actually better for
performance. We recommend specifying num_knots with respect to
smoothness_orders, and as a vector of length max_degree with
values decreasing exponentially. This prevents combinatorial explosions in
the number of higher-degree basis functions generated. The default behavior
of num_knots follows this logic — for smoothness_orders = 0,
num_knots is set to \(500 / 2^{j-1}\), and for
smoothness_orders = 1 or higher, num_knots is set to
\(200 / 2^{j-1}\), where \(j\) is the interaction degree. We also
include some other suitable settings for num_knots below, all of
which are less complex than default num_knots and will thus result
in a faster runtime:
Some good settings for little to no cost in performance:
If smoothness_orders = 0 and max_degree = 3,
num_knots = c(400, 200, 100).
If smoothness_orders = 1+ and max_degree = 3,
num_knots = c(100, 75, 50).
Recommended settings for fairly fast runtime:
If smoothness_orders = 0 and max_degree = 3,
num_knots = c(200, 100, 50).
If smoothness_orders = 1+ and max_degree = 3,
num_knots = c(50, 25, 15).
Recommended settings for fast runtime:
If smoothness_orders = 0 and max_degree = 3,
num_knots = c(100, 50, 25).
If smoothness_orders = 1+ and max_degree = 3,
num_knots = c(40, 15, 10).
Recommended settings for very fast runtime:
If smoothness_orders = 0 and max_degree = 3,
num_knots = c(50, 25, 10).
If smoothness_orders = 1+ and max_degree = 3,
num_knots = c(25, 10, 5).