Estimation procedure for HAL, the Highly Adaptive Lasso

```
fit_hal(
  X,
  Y,
  formula = NULL,
  X_unpenalized = NULL,
  max_degree = ifelse(ncol(X) >= 20, 2, 3),
  smoothness_orders = 1,
  num_knots = num_knots_generator(max_degree = max_degree,
    smoothness_orders = smoothness_orders, base_num_knots_0 = 200,
    base_num_knots_1 = 50),
  reduce_basis = 1/sqrt(length(Y)),
  family = c("gaussian", "binomial", "poisson", "cox"),
  lambda = NULL,
  id = NULL,
  offset = NULL,
  fit_control = list(cv_select = TRUE, n_folds = 10, foldid = NULL,
    use_min = TRUE, lambda.min.ratio = 1e-04, prediction_bounds = "default"),
  basis_list = NULL,
  return_lasso = TRUE,
  return_x_basis = FALSE,
  yolo = FALSE
)
```

- X
An input `matrix` with dimensions number of observations -by- number of covariates that will be used to derive the design matrix of basis functions.

- Y
A `numeric` vector of observations of the outcome variable.

- formula
A character string formula to be used in `formula_hal`. See its documentation for details.

- X_unpenalized
An input `matrix` with the same number of rows as `X`, for which no L1 penalization will be performed. Note that `X_unpenalized` is directly appended to the design matrix; no basis expansion is performed on `X_unpenalized`.

- max_degree
The highest order of interaction terms for which basis functions ought to be generated.

- smoothness_orders
An `integer` specifying the smoothness of the basis functions. See details for `smoothness_orders` for more information.

- num_knots
An `integer` vector of length 1 or `max_degree`, specifying the maximum number of knot points (i.e., bins) for any covariate for generating basis functions. If `num_knots` is a unit-length vector, then the same `num_knots` is used for each degree (this is not recommended). The default settings for `num_knots` are recommended; these defaults decrease `num_knots` with increasing `max_degree` and `smoothness_orders`, which prevents (expensive) combinatorial explosions in the number of higher-degree and higher-order basis functions generated. This keeps the size of the optimization problem manageable as `max_degree` grows. See details of `num_knots` for more information.

- reduce_basis
A `numeric` value bounded in the open unit interval indicating the minimum proportion of 1's in a basis function column needed for the basis function to be included in the procedure to fit the lasso. Any basis functions with a lower proportion of 1's than the cutoff will be removed. When `reduce_basis` is set to `NULL`, all basis functions are used in the lasso-fitting stage of `fit_hal`.

- family
A `character` or a `family` object (supported by `glmnet`) specifying the error/link family for a generalized linear model. `character` options are limited to "gaussian" for fitting a standard penalized linear model, "binomial" for penalized logistic regression, "poisson" for penalized Poisson regression, and "cox" for a penalized proportional hazards model. Note that passing in family objects leads to slower performance relative to passing in a character family (if supported). For example, one should set `family = "binomial"` instead of `family = binomial()` when calling `fit_hal`.

- lambda
User-specified sequence of values of the regularization parameter for the lasso L1 regression. If `NULL`, the default sequence in `cv.glmnet` will be used. The cross-validated optimal value of this regularization parameter will be selected with `cv.glmnet`. If `fit_control`'s `cv_select` argument is set to `FALSE`, then the lasso model will be fit via `glmnet`, and regularized coefficient values for each lambda in the input array will be returned.

- id
A vector of ID values that is used to generate cross-validation folds for `cv.glmnet`. This argument is ignored when `fit_control`'s `cv_select` argument is `FALSE`.

- offset
A vector of offset values, used in fitting.

- fit_control
List of arguments for fitting. Includes the following arguments, and any others to be passed to `cv.glmnet` or `glmnet`.

  - `cv_select`: A `logical` specifying if the sequence of specified `lambda` values should be passed to `cv.glmnet` in order for a single, optimal value of `lambda` to be selected according to cross-validation. When `cv_select = FALSE`, a `glmnet` model will be used to fit the sequence of (or single) `lambda`.
  - `n_folds`: Integer for the number of folds to be used when splitting the data for V-fold cross-validation. Only used when `cv_select = TRUE`.
  - `foldid`: An optional `numeric` containing values between 1 and `n_folds`, identifying the fold to which each observation is assigned. If supplied, `n_folds` can be missing. In such a case, this vector is passed directly to `cv.glmnet`. Only used when `cv_select = TRUE`.
  - `use_min`: Specify the choice of lambda to be selected by `cv.glmnet`. When `TRUE`, `"lambda.min"` is used; otherwise, `"lambda.1se"`. Only used when `cv_select = TRUE`.
  - `lambda.min.ratio`: A `glmnet` argument specifying the smallest value for `lambda`, as a fraction of `lambda.max`, the (data derived) entry value (i.e. the smallest value for which all coefficients are zero). We've seen that not setting `lambda.min.ratio` can lead to no `lambda` values that fit the data sufficiently well.
  - `prediction_bounds`: A vector of size two that provides the lower and upper bounds for predictions. When `prediction_bounds = "default"`, the predictions are bounded between `min(Y) - sd(Y)` and `max(Y) + sd(Y)`. Bounding ensures that there is no extrapolation, and it is necessary for cross-validation selection and/or Super Learning.

- basis_list
The full set of basis functions generated from `X`.

- return_lasso
A `logical` indicating whether or not to return the `glmnet` fit object of the lasso model.

- return_x_basis
A `logical` indicating whether or not to return the matrix of (possibly reduced) basis functions used in `fit_hal`.

- yolo
A `logical` indicating whether to print one of a curated selection of quotes from the HAL9000 computer, from the critically acclaimed epic science-fiction film "2001: A Space Odyssey" (1968).
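
The `reduce_basis` cutoff described in the argument list can be sketched in base R. This is an illustration under stated assumptions, not the package's internal code: the indicator basis built with `outer()` and the names `basis`, `prop_ones`, `keep`, and `reduced_basis` are hypothetical and exist only for this example.

```r
# Illustrative sketch of the reduce_basis filter (not hal9001 internals).
n <- 200
x <- seq_len(n) / n          # one covariate with n distinct values
knots <- x                   # assumption: one candidate knot per observation

# Zero-order (indicator) basis: column j is 1 where x >= knots[j]
basis <- outer(x, knots, FUN = ">=") * 1

# Proportion of 1's in each basis column
prop_ones <- colMeans(basis)

# Default cutoff from the usage above: reduce_basis = 1 / sqrt(length(Y))
cutoff <- 1 / sqrt(n)

# Columns with a lower proportion of 1's than the cutoff are removed
keep <- prop_ones >= cutoff
reduced_basis <- basis[, keep, drop = FALSE]

dim(basis)
dim(reduced_basis)
```

Columns whose knots sit near the upper end of the covariate range are nonzero for only a handful of observations, so those are the ones the cutoff drops.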

Object of class `hal9001`, containing a list of basis
functions, a copy map, coefficients estimated for basis functions, and
timing results (for assessing computational efficiency).
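
A minimal usage sketch follows, assuming the hal9001 package is installed (the call is skipped otherwise). The simulated data, the seed, and the specific argument values are assumptions chosen to keep the example small; they are not recommendations from this page beyond what the argument descriptions state. The sketch also assumes the package's `predict` method takes the new covariates through `new_data`.

```r
# Simulated data: the data-generating mechanism here is an arbitrary assumption.
set.seed(385971)
n <- 100
x <- matrix(rnorm(n * 3), nrow = n, ncol = 3)
y <- sin(x[, 1]) + x[, 2]^2 + rnorm(n, sd = 0.2)

if (requireNamespace("hal9001", quietly = TRUE)) {
  # Fit HAL with first-order splines and up to two-way interactions;
  # num_knots has one entry per interaction degree.
  fit <- hal9001::fit_hal(
    X = x,
    Y = y,
    family = "gaussian",
    max_degree = 2,
    smoothness_orders = 1,
    num_knots = c(25, 10)
  )

  # Predictions on the training covariates and in-sample mean squared error
  preds <- predict(fit, new_data = x)
  print(mean((preds - y)^2))
}
```

Passing `family` as a character string rather than a family object follows the advice in the `family` argument description.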

The procedure uses a custom C++ implementation to generate a design
matrix of spline basis functions of covariates and interactions of
covariates. The lasso regression is fit to this design matrix via
`cv.glmnet` or a custom implementation derived from
origami. The maximum dimension of the design matrix is \(n\) -by-
\(n \cdot 2^{d-1}\), where \(n\) is the number of observations and
\(d\) is the number of covariates.
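
To get a feel for how quickly that bound grows, the column count stated above can be evaluated directly. The helper name `max_basis_columns` is hypothetical; it only restates the formula from the preceding paragraph.

```r
# Upper bound on the number of design-matrix columns, as stated above:
# n * 2^(d - 1), with n observations and d covariates.
max_basis_columns <- function(n, d) {
  n * 2^(d - 1)
}

max_basis_columns(n = 500, d = 3)   # 500 * 4 = 2000

# The bound doubles with every additional covariate
sapply(1:5, function(d) max_basis_columns(n = 500, d = d))
```

This doubling is the reason the `max_degree`, `num_knots`, and `reduce_basis` arguments exist: each one limits how many of these candidate columns are actually generated or kept.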

For `smoothness_orders = 0`, only zero-order splines (piece-wise
constant) are generated, which assume the true regression function has no
smoothness or continuity. When `smoothness_orders = 1`, first-order
splines (piece-wise linear) are generated, which assume continuity of the
true regression function. When `smoothness_orders = 2`, second-order
splines (piece-wise quadratic and linear terms) are generated, which assume
that the true regression function has a single order of differentiability.
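
The difference between the three orders can be illustrated with a single knot. The sketch below uses a truncated-power form, which is an assumption made for illustration and may differ in scaling from the package's C++ implementation; `basis_fun` is a hypothetical helper.

```r
# One spline basis function at a single knot, in truncated-power form.
# order = 0: indicator (piece-wise constant, jumps at the knot)
# order = 1: hinge (piece-wise linear, continuous at the knot)
# order = 2: quadratic ramp (continuous first derivative at the knot)
basis_fun <- function(x, knot, order = 0) {
  active <- as.numeric(x >= knot)
  if (order == 0) {
    active
  } else {
    active * (x - knot)^order / factorial(order)
  }
}

x <- c(0, 0.5, 1, 2)
basis_fun(x, knot = 1, order = 0)
basis_fun(x, knot = 1, order = 1)
basis_fun(x, knot = 1, order = 2)
```

Each function is zero to the left of the knot; only the behaviour at and after the knot changes with the order, which is what determines the smoothness the fitted function can have.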

The `num_knots` argument specifies the number of knot points for each
covariate and for each `max_degree`. Fewer knot points can
significantly decrease runtime, but might be overly simplistic. When
considering `smoothness_orders = 0`, too few knot points (e.g., < 50)
can significantly reduce performance. When `smoothness_orders = 1` or
higher, fewer knot points (e.g., 10-30) are actually better for
performance. We recommend specifying `num_knots` with respect to
`smoothness_orders`, and as a vector of length `max_degree` with
values decreasing exponentially. This prevents combinatorial explosions in
the number of higher-degree basis functions generated. The default behavior
of `num_knots` follows this logic: for `smoothness_orders = 0`,
`num_knots` is set to \(500 / 2^{j-1}\), and for
`smoothness_orders = 1` or higher, `num_knots` is set to
\(200 / 2^{j-1}\), where \(j\) is the interaction degree. We also
include some other suitable settings for `num_knots` below, all of
which are less complex than default `num_knots` and will thus result
in a faster runtime:

Some good settings for little to no cost in performance:

- If `smoothness_orders = 0` and `max_degree = 3`, `num_knots = c(400, 200, 100)`.
- If `smoothness_orders = 1+` and `max_degree = 3`, `num_knots = c(100, 75, 50)`.

Recommended settings for fairly fast runtime:

- If `smoothness_orders = 0` and `max_degree = 3`, `num_knots = c(200, 100, 50)`.
- If `smoothness_orders = 1+` and `max_degree = 3`, `num_knots = c(50, 25, 15)`.

Recommended settings for fast runtime:

- If `smoothness_orders = 0` and `max_degree = 3`, `num_knots = c(100, 50, 25)`.
- If `smoothness_orders = 1+` and `max_degree = 3`, `num_knots = c(40, 15, 10)`.

Recommended settings for very fast runtime:

- If `smoothness_orders = 0` and `max_degree = 3`, `num_knots = c(50, 25, 10)`.
- If `smoothness_orders = 1+` and `max_degree = 3`, `num_knots = c(25, 10, 5)`.
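
The halving rule \(\text{base} / 2^{j-1}\) behind the default `num_knots` can be sketched as follows. The helper `halving_knots` is hypothetical and is not the package's `num_knots_generator`; rounding to whole knots is an assumption. The base value is left as a parameter because the text above quotes bases of 500 and 200, while the usage signature passes `base_num_knots_0 = 200` and `base_num_knots_1 = 50`.

```r
# Knot counts that halve with each interaction degree j = 1, ..., max_degree:
# base / 2^(j - 1), rounded to whole knots (rounding is an assumption here).
halving_knots <- function(base, max_degree) {
  round(base / 2^(seq_len(max_degree) - 1))
}

halving_knots(base = 500, max_degree = 3)   # 500 250 125
halving_knots(base = 200, max_degree = 3)   # 200 100 50
```

A vector produced this way, or any of the settings listed above, can be supplied as the `num_knots` argument when its length equals `max_degree`.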