Estimation procedure for HAL, the Highly Adaptive Lasso
fit_hal(
  X,
  Y,
  formula = NULL,
  X_unpenalized = NULL,
  max_degree = ifelse(ncol(X) >= 20, 2, 3),
  smoothness_orders = 1,
  num_knots = num_knots_generator(max_degree = max_degree, smoothness_orders =
    smoothness_orders, base_num_knots_0 = 20, base_num_knots_1 = 10),
  reduce_basis = NULL,
  family = c("gaussian", "binomial", "poisson", "cox", "mgaussian"),
  lambda = NULL,
  id = NULL,
  weights = NULL,
  offset = NULL,
  fit_control = list(cv_select = TRUE, use_min = TRUE, lambda.min.ratio = 1e-04,
    prediction_bounds = "default"),
  basis_list = NULL,
  return_lasso = TRUE,
  return_x_basis = FALSE,
  yolo = FALSE
)
X: An input matrix of dimension (number of observations)-by-(number of covariates), used to derive the design matrix of basis functions.
Y: A numeric vector of observations of the outcome variable. For family = "mgaussian", Y is a matrix of observations of the outcome variables.
formula: A character string formula to be used in formula_hal. See its documentation for details.
X_unpenalized: An input matrix with the same number of rows as X, for which no L1 penalization will be performed. Note that X_unpenalized is appended directly to the design matrix; no basis expansion is performed on X_unpenalized.
max_degree: The highest order of interaction terms for which basis functions ought to be generated.
smoothness_orders: An integer specifying the smoothness of the basis functions. See the details on smoothness_orders for more information.
num_knots: An integer vector of length 1 or max_degree, specifying the maximum number of knot points (i.e., bins) for any covariate for generating basis functions. If num_knots is a unit-length vector, then the same num_knots is used for each degree (this is not recommended). The default settings for num_knots are recommended; these defaults decrease num_knots with increasing max_degree and smoothness_orders, which prevents (expensive) combinatorial explosions in the number of higher-degree and higher-order basis functions generated. This keeps the complexity of the optimization problem manageable as the number of covariates grows. See the details on num_knots for more information.
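As a rough illustration of what capping the number of knot points means for a single covariate, the sketch below picks at most num_knots candidate knots from the observed values using evenly spaced empirical quantiles. The helper cap_knots is hypothetical, and the quantile rule is an assumption made for illustration; the package's own knot selection may differ.

```r
# Hypothetical helper: choose at most `num_knots` knot points for one covariate.
# Assumption: knots are taken at evenly spaced empirical quantiles.
cap_knots <- function(x, num_knots) {
  ux <- sort(unique(x))
  if (length(ux) <= num_knots) {
    return(ux)  # few distinct values: every one can serve as a knot
  }
  probs <- seq(0, 1, length.out = num_knots)
  # type = 1 returns observed values rather than interpolated ones
  unique(as.numeric(quantile(ux, probs = probs, type = 1)))
}

set.seed(1)
x <- rnorm(500)
knots <- cap_knots(x, num_knots = 25)
length(knots)  # at most 25
```

Fewer knots per covariate means fewer basis-function columns, which is the mechanism by which smaller num_knots values reduce runtime.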
reduce_basis: An optional numeric value bounded in the open unit interval indicating the minimum proportion of 1's in a basis function column needed for the basis function to be included in the procedure to fit the lasso. Any basis functions with a lower proportion of 1's than the cutoff will be removed. Defaults to 1 over the square root of the number of observations. Only applicable for models fit with zero-order splines, i.e., smoothness_orders = 0.
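The filtering rule described for reduce_basis can be sketched in base R. The example below builds a toy zero-order (indicator) basis for one covariate and drops columns whose proportion of 1's falls below the 1 / sqrt(n) cutoff. The variable names and the toy basis construction are assumptions for illustration, not the package's internal code.

```r
set.seed(1)
n <- 200
x <- runif(n)

# Toy zero-order basis: one indicator column per knot, one knot per observation.
# Column j is 1 where x >= knot j and 0 otherwise.
knots <- sort(x)
x_basis <- outer(x, knots, FUN = ">=") * 1

# Cutoff described above: 1 over the square root of the number of observations.
cutoff <- 1 / sqrt(n)

# Proportion of 1's in each column; keep only columns at or above the cutoff.
prop_ones <- colMeans(x_basis)
x_reduced <- x_basis[, prop_ones >= cutoff, drop = FALSE]

dim(x_basis)
dim(x_reduced)
```

Columns built from knots near the upper end of the data are nearly all zeros, so they are the ones removed.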
family: A character or a family object (supported by glmnet) specifying the error/link family for a generalized linear model. character options are limited to "gaussian" for fitting a standard penalized linear model, "binomial" for penalized logistic regression, "poisson" for penalized Poisson regression, "cox" for a penalized proportional hazards model, and "mgaussian" for a multivariate penalized linear model. Note that passing in family objects leads to slower performance relative to passing in a character family (if supported). For example, one should set family = "binomial" instead of family = binomial() when calling fit_hal.
lambda: User-specified sequence of values of the regularization parameter for the lasso L1 regression. If NULL, the default sequence in cv.glmnet will be used. The cross-validated optimal value of this regularization parameter will be selected with cv.glmnet. If fit_control's cv_select argument is set to FALSE, then the lasso model will be fit via glmnet, and regularized coefficient values for each lambda in the input array will be returned.
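When supplying a custom sequence, a log-spaced, decreasing grid is a common choice, and glmnet's documentation asks for a decreasing sequence. The sketch below builds such a grid in base R; the values of lambda_max, lambda_min_ratio, and n_lambda are hypothetical and chosen only for illustration.

```r
# Hypothetical values chosen for illustration only.
lambda_max <- 1            # largest penalty in the grid
lambda_min_ratio <- 1e-4   # smallest penalty as a fraction of lambda_max
n_lambda <- 50

# Log-spaced, decreasing sequence from lambda_max down to lambda_max * ratio.
lambda_grid <- exp(seq(log(lambda_max), log(lambda_max * lambda_min_ratio),
                       length.out = n_lambda))

range(lambda_grid)

# The grid would then be supplied as, e.g.:
# fit_hal(X = X, Y = Y, lambda = lambda_grid)
```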
id: A vector of ID values that is used to generate cross-validation folds for cv.glmnet. This argument is ignored when fit_control's cv_select argument is FALSE.
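The purpose of id-based folds is to keep all observations that share an ID in the same cross-validation fold. The base R sketch below shows one way such fold labels might be built; it is not the package's internal fold-generation code, and the data and the names fold_of_id and foldid are hypothetical (cv.glmnet does accept a per-observation foldid vector).

```r
set.seed(1)
# Hypothetical data: 12 observations from 6 subjects, 2 observations each.
id <- rep(c("a", "b", "c", "d", "e", "f"), each = 2)
n_folds <- 3

# Assign each unique id to a fold, then map fold labels back to observations.
unique_ids <- unique(id)
fold_of_id <- sample(rep(seq_len(n_folds), length.out = length(unique_ids)))
names(fold_of_id) <- unique_ids
foldid <- unname(fold_of_id[id])

# Every id appears in exactly one fold.
table(id, foldid)
```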
weights: Observation weights; defaults to 1 per observation.
offset: A vector of offset values, used in fitting.
fit_control: List of arguments, including the following, and any others to be passed to cv.glmnet or glmnet.
  cv_select: A logical specifying whether the sequence of specified lambda values should be passed to cv.glmnet in order for a single, optimal value of lambda to be selected according to cross-validation. When cv_select = FALSE, a glmnet model will be used to fit the sequence of (or single) lambda.
  use_min: Specifies the choice of lambda to be selected by cv.glmnet. When TRUE, "lambda.min" is used; otherwise, "lambda.1se". Only used when cv_select = TRUE.
  lambda.min.ratio: A glmnet argument specifying the smallest value for lambda, as a fraction of lambda.max, the (data-derived) entry value (i.e., the smallest value for which all coefficients are zero). We've seen that not setting lambda.min.ratio can lead to no lambda values that fit the data sufficiently well.
  prediction_bounds: An optional vector of size two that provides the lower and upper bounds for predictions; not used when family = "cox". When prediction_bounds = "default", the predictions are bounded between min(Y) - sd(Y) and max(Y) + sd(Y) for each outcome (when family = "mgaussian", each outcome can have different bounds). Bounding ensures that there is no extrapolation.
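The default prediction bounds described above are easy to reproduce by hand. The sketch below computes them for a simulated outcome and truncates a few hypothetical raw predictions to the resulting interval; raw_preds is made up for illustration, and the truncation via pmin/pmax is an assumption about how bounding is applied.

```r
set.seed(1)
Y <- rnorm(100)

# Default bounds as described above: one standard deviation beyond the observed range.
bounds <- c(min(Y) - sd(Y), max(Y) + sd(Y))

# Hypothetical raw predictions, some of which fall outside the bounds.
raw_preds <- c(-10, 0, 0.5, 10)

# Truncate predictions to lie within [lower, upper].
bounded_preds <- pmin(pmax(raw_preds, bounds[1]), bounds[2])
bounded_preds
```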
basis_list: The full set of basis functions generated from X.
return_lasso: A logical indicating whether or not to return the glmnet fit object of the lasso model.
return_x_basis: A logical indicating whether or not to return the matrix of (possibly reduced) basis functions used in fit_hal.
yolo: A logical indicating whether to print one of a curated selection of quotes from the HAL9000 computer, from the critically acclaimed epic science-fiction film "2001: A Space Odyssey" (1968).
Object of class hal9001, containing a list of basis functions, a copy map, coefficients estimated for basis functions, and timing results (for assessing computational efficiency).
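A minimal usage sketch follows, assuming the hal9001 package is installed. It fits HAL to simulated data using only arguments documented above, then obtains predictions through the package's predict method with its new_data argument. The simulated data and the particular num_knots values are arbitrary choices for illustration.

```r
set.seed(1)
n <- 200
X <- matrix(rnorm(n * 2), nrow = n, ncol = 2)
Y <- sin(X[, 1]) + 0.5 * X[, 2] + rnorm(n, sd = 0.2)

# Only run the fit when hal9001 is available.
if (requireNamespace("hal9001", quietly = TRUE)) {
  hal_fit <- hal9001::fit_hal(
    X = X, Y = Y,
    family = "gaussian",
    max_degree = 2,          # main terms and two-way interactions
    smoothness_orders = 1,   # piece-wise linear basis functions
    num_knots = c(25, 10)    # fewer knots for the interaction terms
  )
  print(class(hal_fit))      # expected to include "hal9001"
  preds <- predict(hal_fit, new_data = X)
  print(head(preds))
}
```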
The procedure uses a custom C++ implementation to generate a design matrix of spline basis functions of covariates and interactions of covariates. The lasso regression is fit to this design matrix via cv.glmnet or a custom implementation derived from origami. The maximum dimension of the design matrix is \(n\)-by-\(n (2^d - 1)\), where \(n\) is the number of observations and \(d\) is the number of covariates.
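How quickly the design matrix can grow with the number of covariates can be made concrete by counting. The sketch below assumes one zero-order basis function per observation and per non-empty subset of covariates, which gives n * (2^d - 1) columns when every interaction degree is allowed. The helper max_basis_columns is hypothetical, and restricting the count to subsets of size at most max_degree is an assumption about how that argument limits the expansion.

```r
# Count zero-order basis functions under the assumption stated above:
# n knots for each non-empty subset of covariates up to size max_degree.
max_basis_columns <- function(n, d, max_degree = d) {
  subsets <- sum(choose(d, seq_len(min(max_degree, d))))
  n * subsets
}

max_basis_columns(n = 100, d = 3)                  # all interactions: 100 * (2^3 - 1)
max_basis_columns(n = 100, d = 3, max_degree = 2)  # main terms and pairs: 100 * 6
```

The count roughly doubles with each added covariate, which is why limiting max_degree and num_knots matters in practice.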
For smoothness_orders = 0, only zero-order splines (piece-wise constant) are generated, which assume the true regression function has no smoothness or continuity. When smoothness_orders = 1, first-order splines (piece-wise linear) are generated, which assume continuity of the true regression function. When smoothness_orders = 2, second-order splines (piece-wise quadratic and linear terms) are generated, which assume that the true regression function has a single order of differentiability.
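The three smoothness levels can be illustrated with truncated-power basis functions at a single knot. This is a sketch of the general idea only: the helper spline_basis is hypothetical, and the scaling by factorial(order) is an assumption, so the package's basis functions may be scaled or constructed differently.

```r
# Truncated-power basis functions at a single knot.
# order 0: indicator 1(x >= knot)   (piece-wise constant, discontinuous)
# order 1: (x - knot)_+             (piece-wise linear, continuous)
# order 2: (x - knot)_+^2 / 2       (piece-wise quadratic, once differentiable)
spline_basis <- function(x, knot, order) {
  pos <- pmax(x - knot, 0)
  if (order == 0) {
    as.numeric(x >= knot)
  } else {
    pos^order / factorial(order)
  }
}

x <- seq(-1, 1, by = 0.5)
# One column per smoothness order (0, 1, 2), one row per value of x.
basis_values <- sapply(0:2, function(k) spline_basis(x, knot = 0, order = k))
basis_values
```

The order-0 column jumps at the knot, the order-1 column has a kink there, and the order-2 column joins smoothly, matching the continuity assumptions described above.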
The num_knots argument specifies the number of knot points for each covariate and for each interaction degree up to max_degree. Fewer knot points can significantly decrease runtime, but might be overly simplistic. When considering smoothness_orders = 0, too few knot points (e.g., < 50) can significantly reduce performance. When smoothness_orders = 1 or higher, fewer knot points (e.g., 10-30) are actually better for performance. We recommend specifying num_knots with respect to smoothness_orders, and as a vector of length max_degree with values decreasing exponentially. This prevents combinatorial explosions in the number of higher-degree basis functions generated. The default behavior of num_knots follows this logic: for smoothness_orders = 0, num_knots is set to \(500 / 2^{j-1}\), and for smoothness_orders = 1 or higher, num_knots is set to \(200 / 2^{j-1}\), where \(j\) is the interaction degree. We also include some other suitable settings for num_knots below, all of which are less complex than the default num_knots and will thus result in a faster runtime:
Some good settings for little to no cost in performance:
  If smoothness_orders = 0 and max_degree = 3, num_knots = c(400, 200, 100).
  If smoothness_orders = 1+ and max_degree = 3, num_knots = c(100, 75, 50).
Recommended settings for fairly fast runtime:
  If smoothness_orders = 0 and max_degree = 3, num_knots = c(200, 100, 50).
  If smoothness_orders = 1+ and max_degree = 3, num_knots = c(50, 25, 15).
Recommended settings for fast runtime:
  If smoothness_orders = 0 and max_degree = 3, num_knots = c(100, 50, 25).
  If smoothness_orders = 1+ and max_degree = 3, num_knots = c(40, 15, 10).
Recommended settings for very fast runtime:
  If smoothness_orders = 0 and max_degree = 3, num_knots = c(50, 25, 10).
  If smoothness_orders = 1+ and max_degree = 3, num_knots = c(25, 10, 5).
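The halving rule stated above, a base value divided by 2^(j - 1) for interaction degree j, can be written out in a few lines. The helper knots_by_degree is hypothetical and only mirrors the formula in the text; rounding to whole numbers is an assumption.

```r
# Hypothetical helper mirroring the halving rule: base / 2^(j - 1) for degree j.
knots_by_degree <- function(base_num_knots, max_degree) {
  j <- seq_len(max_degree)
  round(base_num_knots / 2^(j - 1))
}

knots_by_degree(500, max_degree = 3)  # 500 250 125
knots_by_degree(200, max_degree = 3)  # 200 100 50
```

The hand-picked settings listed above follow the same decreasing pattern with smaller starting values, which is what makes them faster.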