`vignettes/generalizedCV.Rmd`

`generalizedCV.Rmd`

Cross-validation is an essential tool for evaluating how any given data analytic procedure extends from a sample to the target population from which the sample is derived. It has seen widespread application in all facets of statistics, perhaps most notably statistical machine learning. When used for model selection, cross-validation has powerful optimality properties (van der Vaart, Dudoit, and van der Laan 2006), (van der Laan, Polley, and Hubbard 2007).

Cross-validation works by partitioning a sample into complementary
subsets, applying a particular data analytic (statistical) routine on a
subset (the “training” set), and evaluating the routine of choice on the
complementary subset (the “testing” set). This procedure is repeated
across multiple partitions of the data. A variety of different
partitioning schemes exist, such as V-fold cross-validation and
bootstrap cross-validation, many of which are supported by
`origami`

. The `origami`

package provides a suite
of tools that generalize the application of cross-validation to
arbitrary data analytic procedures. The use of `origami`

is
best illustrated by example.

We’ll start by examining a fairly simple data set:

```
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
```

One might be interested in examining how the efficiency of a car, as measured by miles-per-gallon (mpg), is explained by various technical aspects of the car, with data across a variety of different models of cars. Linear regression is perhaps the simplest statistical procedure that could be used to make such deductions. Let’s try it out:

```
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4506 -1.6044 -0.1196 1.2193 4.6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337 18.71788 0.657 0.5181
## cyl -0.11144 1.04502 -0.107 0.9161
## disp 0.01334 0.01786 0.747 0.4635
## hp -0.02148 0.02177 -0.987 0.3350
## drat 0.78711 1.63537 0.481 0.6353
## wt -3.71530 1.89441 -1.961 0.0633 .
## qsec 0.82104 0.73084 1.123 0.2739
## vs 0.31776 2.10451 0.151 0.8814
## am 2.52023 2.05665 1.225 0.2340
## gear 0.65541 1.49326 0.439 0.6652
## carb -0.19942 0.82875 -0.241 0.8122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
```

We can assess how well the model fits the data by comparing the
predictions of the linear model to the true outcomes observed in the
data set. This is the well known (and standard) mean squared error. We
can extract that from the `lm`

model object like so:

The mean squared error is 4.6092009. There is an important problem that arises when we assess the model in this way – that is, we have trained our linear regression model on the full data set and assessed the error on the full data set, using up all of our data. We, of course, are generally not interested in how well the model explains variation in the observed data; rather, we are interested in how the explanation provided by the model generalizes to a target population from which the sample is presumably derived. Having used all of our available data, we cannot honestly evaluate how well the model fits (and thus explains) variation at the population level.

To resolve this issue, cross-validation allows for a particular procedure (e.g., linear regression) to be implemented over subsets of the data, evaluating how well the procedure fits on a testing (“validation”) set, thereby providing an honest evaluation of the error.

We can easily add cross-validation to our linear regression procedure
using `origami`

. First, let us define a new function to
perform linear regression on a specific partition of the data (called a
“fold”):

```
cv_lm <- function(fold, data, reg_form) {
# get name and index of outcome variable from regression formula
out_var <- as.character(unlist(str_split(reg_form, " "))[1])
out_var_ind <- as.numeric(which(colnames(data) == out_var))
# split up data into training and validation sets
train_data <- training(data)
valid_data <- validation(data)
# fit linear model on training set and predict on validation set
mod <- lm(as.formula(reg_form), data = train_data)
preds <- predict(mod, newdata = valid_data)
# capture results to be returned as output
out <- list(coef = data.frame(t(coef(mod))),
SE = ((preds - valid_data[, out_var_ind])^2))
return(out)
}
```

Our `cv_lm`

function is rather simple: we merely split the
available data into a training and validation sets, using the eponymous
functions provided in `origami`

, fit the linear model on the
training set, and evaluate the model on the testing set. This is a
simple example of what `origami`

considers to be
`cv_fun`

s – functions for using cross-validation to perform a
particular routine over an input data set. Having defined such a
function, we can simply generate a set of partitions using
`origami`

’s `make_folds`

function, and apply our
`cv_lm`

function over the resultant `folds`

object. Below, we replicate the resubstitution estimate of the error –
we did this “by hand” above – using the functions
`make_folds`

and `cv_lm`

.

`## origami v1.0.7: Generalized Framework for Cross-Validation`

```
# resubstitution estimate
resub <- make_folds(mtcars, fold_fun = folds_resubstitution)[[1]]
resub_results <- cv_lm(fold = resub, data = mtcars, reg_form = "mpg ~ .")
mean(resub_results$SE)
```

`## [1] 4.609201`

This (very nearly) matches the estimate of the error that we obtained above.

We can more honestly evaluate the error by *V-fold
cross-validation*, which partitions the data into **v
subsets**, fitting the model on \(v -
1\) of the subsets and evaluating on the subset that was held out
for testing. This is repeated such that each subset is used for testing.
We can easily apply our `cv_lm`

function using
`origami`

’s `cross_validate`

(n.b., by default
this performs 10-fold cross-validation):

```
# cross-validated estimate
folds <- make_folds(mtcars)
cvlm_results <- cross_validate(cv_fun = cv_lm, folds = folds, data = mtcars,
reg_form = "mpg ~ .")
mean(cvlm_results$SE)
```

`## [1] 15.17963`

Having performed 10-fold cross-validation, we quickly notice that our previous estimate of the model error (by resubstitution) was quite optimistic. The honest estimate of the error is several times larger.

Generally, `cross_validate`

usage will mirror the workflow
in the above example. First, the user must define folds and a function
that operates on each fold. Once these are passed to
`cross_validate`

, the function will map the function across
the folds, and combine the results in a reasonable way. More details on
each step of this process will be given below.

The `folds`

object passed to `cross_validate`

is a list of folds. Such lists can be generated using the
`make_folds`

function. Each `fold`

consists of a
list with a `training`

index vector, a
`validation`

index vector, and a `fold_index`

(its
order in the list of folds). This function supports a variety of
cross-validation schemes including *v-fold* and
*bootstrap* cross-validation as well as time series methods like
*“Rolling Window”*. It can balance across levels of a variable
(`stratify_ids`

), and it can also keep all observations from
the same independent unit together (`cluster_ids`

). See the
documentation of the `make_folds`

function for details about
supported cross-validation schemes and arguments.

The `cv_fun`

argument to `cross_validate`

is a
function that will perform some operation on each fold. The first
argument to this function must be `fold`

, which will receive
an individual fold object to operate on. Additional arguments can be
passed to `cv_fun`

using the `...`

argument to
`cross_validate`

. Within this function, the convenience
functions `training`

, `validation`

and
`fold_index`

can return the various components of a fold
object. They do this by retrieving the `fold`

object from
their calling environment. It can also be specified directly. If
`training`

or `validation`

is passed an object, it
will index into it in a sensible way. For instance, if it is a vector,
it will index the vector directly. If it is a `data.frame`

or
`matrix`

, it will index rows. This allows the user to easily
partition data into training and validation sets. This fold function
must return a named list of results containing whatever fold-specific
outputs are generated.

`cross_validate`

After defining folds, `cross_validate`

can be used to map
the `cv_fun`

across the `folds`

using
`future_lapply`

. This means that it can be easily
parallelized by specifying a parallelization scheme (i.e., a
`plan`

). See the `future`

package for more details.

The application of `cross_validate`

generates a list of
results. As described above, each call to `cv_fun`

itself
returns a list of results, with different elements for each type of
result we care about. The main loop generates a list of these individual
lists of results (a sort of “meta-list”). This “meta-list” is then
inverted such that there is one element per result type (this too is a
list of the results for each fold). By default,
`combine_results`

is used to combine these results type
lists.

For instance, in the above `mtcars`

example, the results
type lists contains one `coef`

`data.frame`

from
each fold. These are `rbind`

ed together to form one
`data.frame`

containing the `coefs`

from all folds
in different rows. How results are combined is determined automatically
by examining the data types of the results from the first fold. This can
be modified by specifying a list of arguments to
`.combine_control`

. See the help for
`combine_results`

for more details. In most cases, the
defaults should suffice.

To examine `origami`

further, let us return to our example
analysis using the `mtcars`

data set. Here, we will write a
new `cv_fun`

type object. As an example, we will use L.
Breiman’s `randomForest`

:

```
cv_rf <- function(fold, data, reg_form) {
# get name and index of outcome variable from regression formula
out_var <- as.character(unlist(str_split(reg_form, " "))[1])
out_var_ind <- as.numeric(which(colnames(data) == out_var))
# define training and validation sets based on input object of class "folds"
train_data <- training(data)
valid_data <- validation(data)
# fit Random Forest regression on training set and predict on holdout set
mod <- randomForest(formula = as.formula(reg_form), data = train_data)
preds <- predict(mod, newdata = valid_data)
# define output object to be returned as list (for flexibility)
out <- list(coef = data.frame(mod$coefs),
SE = ((preds - valid_data[, out_var_ind])^2))
return(out)
}
```

Above, in writing our `cv_rf`

function to cross-validate
`randomForest`

, we used our previous function
`cv_lm`

as an example. For now, individual
`cv_fun`

s must be written by hand; however, in future
releases, a wrapper may be available to support auto-generating
`cv_fun`

s to be used with `origami`

.

Below, we use `cross_validate`

to apply our new
`cv_rf`

function over the `folds`

object generated
by `make_folds`

.

`## randomForest 4.7-1.1`

`## Type rfNews() to see new features/changes/bug fixes.`

```
folds <- make_folds(mtcars)
cvrf_results <- cross_validate(cv_fun = cv_rf, folds = folds, data = mtcars,
reg_form = "mpg ~ .")
mean(cvrf_results$SE)
```

`## [1] 6.936715`

Using 10-fold cross-validation (the default), we obtain an honest
estimate of the prediction error of random forests. From this, we gather
that the use of `origami`

’s `cross_validate`

procedure can be generalized to arbitrary esimation techniques, given
availability of an appropriate `cv_fun`

function.

Cross-validation can also be used for forecast model selection in a
time series setting. Here, the partitioning scheme mirrors the
application of the forecasting model: We’ll train the data on past
observations (either all available or a recent subset), and then use the
model forecast (predict), the next few observations. Consider the
`AirPassengers`

dataset, a monthly time series of passenger
air traffic in thousands of people.

```
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118
## 1950 115 126 141 135 125 149 170 170 158 133 114 140
## 1951 145 150 178 163 172 178 199 199 184 162 146 166
## 1952 171 180 193 181 183 218 230 242 209 191 172 194
## 1953 196 196 236 235 229 243 264 272 237 211 180 201
## 1954 204 188 235 227 234 264 302 293 259 229 203 229
## 1955 242 233 267 269 270 315 364 347 312 274 237 278
## 1956 284 277 317 313 318 374 413 405 355 306 271 306
## 1957 315 301 356 348 355 422 465 467 404 347 305 336
## 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 1959 360 342 406 396 420 472 548 559 463 407 362 405
## 1960 417 391 419 461 472 535 622 606 508 461 390 432
```

Suppose we want to pick between two forecasting models,
`stl`

, and `arima`

(the details of these models
are not important for this example). We can do that by evaluating their
forecasting performance.

```
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
```

```
folds = make_folds(AirPassengers, fold_fun=folds_rolling_origin,
first_window = 36, validation_size = 24)
fold = folds[[1]]
# function to calculate cross-validated squared error
cv_forecasts <- function(fold, data) {
train_data <- training(data)
valid_data <- validation(data)
valid_size <- length(valid_data)
train_ts <- ts(log10(train_data), frequency = 12)
# borrowed from AirPassengers help
arima_fit <- arima(train_ts, c(0, 1, 1),
seasonal = list(order = c(0, 1, 1),
period = 12))
raw_arima_pred <- predict(arima_fit, n.ahead = valid_size)
arima_pred <- 10^raw_arima_pred$pred
arima_MSE <- mean((arima_pred - valid_data)^2)
# stl model
stl_fit <- stlm(train_ts, s.window = 12)
raw_stl_pred = forecast(stl_fit, h = valid_size)
stl_pred <- 10^raw_stl_pred$mean
stl_MSE <- mean((stl_pred - valid_data)^2)
out <- list(mse = data.frame(fold = fold_index(),
arima = arima_MSE, stl = stl_MSE))
return(out)
}
mses = cross_validate(cv_fun = cv_forecasts, folds = folds,
data = AirPassengers)$mse
colMeans(mses[, c("arima", "stl")])
```

```
## arima stl
## 667.2477 925.7137
```

```
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] forecast_8.17.0 randomForest_4.7-1.1 stringr_1.4.1
## [4] origami_1.0.7
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.9 lattice_0.20-45 listenv_0.8.0 zoo_1.8-10
## [5] assertthat_0.2.1 rprojroot_2.0.3 digest_0.6.29 lmtest_0.9-40
## [9] utf8_1.2.2 parallelly_1.32.1 R6_2.5.1 evaluate_0.16
## [13] ggplot2_3.3.6 pillar_1.8.1 rlang_1.0.5 curl_4.3.2
## [17] data.table_1.14.2 TTR_0.24.3 fracdiff_1.5-1 jquerylib_0.1.4
## [21] rmarkdown_2.16 pkgdown_2.0.6 textshaping_0.3.6 desc_1.4.1
## [25] munsell_0.5.0 compiler_4.2.1 xfun_0.32 pkgconfig_2.0.3
## [29] systemfonts_1.0.4 urca_1.3-3 globals_0.16.1 htmltools_0.5.3
## [33] nnet_7.3-17 tidyselect_1.1.2 tibble_3.1.8 quadprog_1.5-8
## [37] codetools_0.2-18 fansi_1.0.3 future_1.28.0 dplyr_1.0.10
## [41] grid_4.2.1 nlme_3.1-159 jsonlite_1.8.0 gtable_0.3.1
## [45] lifecycle_1.0.1 DBI_1.1.3 magrittr_2.0.3 scales_1.2.1
## [49] quantmod_0.4.20 future.apply_1.9.0 cli_3.3.0 stringi_1.7.8
## [53] cachem_1.0.6 tseries_0.10-51 fs_1.5.2 timeDate_4021.104
## [57] bslib_0.4.0 xts_0.12.1 ragg_1.2.2 vctrs_0.4.1
## [61] generics_0.1.3 tools_4.2.1 glue_1.6.2 purrr_0.3.4
## [65] abind_1.4-5 parallel_4.2.1 fastmap_1.1.0 yaml_2.3.5
## [69] colorspace_2.0-3 memoise_2.0.1 knitr_1.40 sass_0.4.2
```

van der Laan, Mark J, Eric C Polley, and Alan E Hubbard. 2007.
“Super Learner.” *Statistical Applications in Genetics
and Molecular Biology* 6 (1).

van der Vaart, Aad W, Sandrine Dudoit, and Mark J van der Laan. 2006.
“Oracle Inequalities for Multi-Fold Cross Validation.”
*Statistics & Decisions* 24 (3): 351–71.