4 Cross-validation
Ivana Malenica
Based on the origami R package by Jeremy Coyle, Nima Hejazi, Ivana Malenica and Rachael Phillips.
Updated: 2021-08-18
Learning Objectives
By the end of this chapter you will be able to:
- Reliably assess the performance of a machine learning algorithm, or compare the performance of several algorithms, by applying cross-validation schemes using the origami R package.
4.1 Introduction
In this chapter, we start elaborating on the estimation step outlined in the introductory chapter, which discussed the Roadmap for Targeted Learning. In order to generate an initial estimate of our target parameter — the focus of the following chapter on Super Learning — we first need to translate, and incorporate, our knowledge about the data-generating process into the estimation procedure, and decide how to evaluate our estimation performance.
The performance, or error, of any algorithm used in the estimation procedure directly relates to its generalizability to independent data. Proper assessment of the performance of proposed algorithms is extremely important: it guides the choice of the final learning method, and it gives us a quantitative assessment of how well the chosen algorithm performs. In order to assess the performance of an algorithm, we introduce the concept of a loss function, which helps us define the risk, also referred to as the expected prediction error.
As further specified in the next chapter, our goals consist of:
- Estimating the performance of different algorithms in order to choose the best one.
- Having chosen a winner, estimate the true risk of the proposed statistical learning method.
In the following, we propose a method to accomplish both goals using the observed data and a cross-validation procedure implemented in the origami package (Coyle and Hejazi 2018).
4.2 Background
Ideally, in a data-rich scenario, we would split our dataset into three parts:
- training set,
- validation set,
- test set.
The training set is used to fit the algorithm(s) of interest; we evaluate the performance of the fit(s) on a validation set, which can be used to estimate prediction error (e.g., for tuning and model selection). The final error of the chosen algorithm(s) is obtained by using the test set, which is kept separate and remains untouched until the final evaluation. One might wonder, with training data readily available, why not use the training error to evaluate the proposed algorithm’s performance? Unfortunately, the training error is not a good estimate of the true risk; it consistently decreases with model complexity, resulting in possible overfitting to the training data and low generalizability.
Since data are often scarce, separating them into training, validation and test sets is usually not possible. In the absence of a large dataset and a designated test set, we must resort to methods that estimate the true risk by efficient sample re-use. Re-sampling methods, in great generality, involve repeatedly sampling from the training set and fitting the proposed algorithms on the new samples. While often computationally intensive, re-sampling methods are particularly useful for model selection and estimation of the true risk. In addition, they might provide more insight into the variability and robustness of an algorithm fit than fitting the algorithm only once on all the training data.
4.2.1 What is cross-validation?
In this chapter, we focus on cross-validation — an essential tool for evaluating how any given algorithm extends from a sample to the target population from which the sample is derived. It has seen widespread application in all facets of statistics, perhaps most notably statistical machine learning. The cross-validation procedure can be used for model selection, as well as for estimation of the true risk associated with any statistical learning method in order to evaluate its performance. In particular, cross-validation directly estimates the true risk when the estimate is applied to an independent sample from the joint distribution of the predictors and outcome. When used for model selection, cross-validation has powerful optimality properties. The asymptotic optimality results state that the cross-validated selector performs (in terms of risk) asymptotically as well as an optimal oracle selector based on the true, unknown data-generating distribution. For further details on the theoretical results, we suggest van der Laan and Dudoit (2003), van der Laan, Dudoit, and Keles (2004), Dudoit and van der Laan (2005) and Van der Vaart, Dudoit, and Laan (2006).
In great generality, cross-validation works by partitioning a sample into complementary subsets, applying a particular algorithm(s) on one subset (the training set), and evaluating the method of choice on the complementary subset (the validation/test set). This procedure is repeated across multiple partitions of the data. A variety of different partitioning schemes exist, depending on the problem of interest, data size, prevalence of the outcome, and dependence structure. The origami package provides a suite of tools that generalize the application of cross-validation to arbitrary data analytic procedures. In the following, we describe different types of cross-validation schemes readily available in origami, introduce the general structure of the origami package, and show their use in applied settings.
4.3 Roadmap: How does it all fit together?
Similarly to how we defined the Roadmap for Targeted Learning, we can define the Estimation Roadmap to guide the estimation process. In particular, we have developed a unified loss-based cross-validation methodology for estimator construction, selection, and performance assessment in a series of articles (e.g., see van der Laan and Dudoit (2003), van der Laan, Dudoit, and Keles (2004), Dudoit and van der Laan (2005), Van der Vaart, Dudoit, and Laan (2006), and van der Laan, Polley, and Hubbard (2007)) that follow three main steps:
The loss function: Define the target parameter as the minimizer of the expected loss (risk) for a full data loss function chosen to represent the desired performance measure. Map the full data loss function into an observed data loss function, having the same expected value and leading to an efficient estimator of risk.
The algorithms: Construct a finite collection of candidate estimators for the parameter of interest.
The cross-validation scheme: Apply appropriate cross-validation to select an optimal estimator among the candidates, and assess the overall performance of the resulting estimator.
Step 1 of the Estimation Roadmap allows us to unify a broad range of problems that are traditionally treated separately in the statistical literature, including density estimation and prediction of polychotomous and continuous outcomes. For example, if we are interested in estimating the full joint conditional density, we could use the negative log-likelihood loss. If instead we are interested in the conditional mean of a continuous outcome, we could use the squared error loss; had the outcome been binary, we could resort to the indicator (0-1) loss. The unified loss-based framework also reconciles censored and full data estimation methods, by generalizing any loss-based learning for full data into loss-based learning for general censored data.
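To make these choices concrete, the sketch below writes the three losses just mentioned as plain R helpers; these are illustrative functions for this discussion, not functions from the origami package.
# squared error loss for a continuous outcome: L(X, psi) = (Y - psi(W))^2
loss_squared_error <- function(y, pred) {
  (y - pred)^2
}
# indicator (0-1) loss for a categorical outcome: L(X, psi) = I(Y != psi(W))
loss_indicator <- function(y, pred) {
  as.numeric(y != pred)
}
# negative log-likelihood loss for an estimated probability of a binary outcome
loss_neg_loglik <- function(y, prob) {
  -(y * log(prob) + (1 - y) * log(1 - prob))
}
# the empirical risk is simply the average loss over the sample
empirical_risk <- function(losses) {
  mean(losses)
}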
4.4 Example: cross-validation and prediction
Now that we have introduced the Estimation Roadmap, we can define our objective with more mathematical notation, using prediction as an example. Let the observed data be defined as \(X = (W,Y)\), where the unit-specific data can be written as \(X_i = (W_i,Y_i)\), for \(i = 1, \ldots, n\). For each of the \(n\) samples, we denote \(Y_i\) as the outcome of interest (polychotomous or continuous), and \(W_i\) as a \(p\)-dimensional set of covariates. Let \(\psi_0(W)\) denote the target parameter of interest we want to estimate; for this example, we are interested in estimating the conditional expectation of the outcome given the covariates, \(\psi_0(W) = E(Y \mid W)\). Following the Estimation Roadmap, we choose the appropriate loss function, \(L\), such that \(\psi_0(W) = \text{argmin}_{\psi} E[L(X,\psi(W))]\). But how do we know how well each \(\psi\) is doing? In order to pick the optimal estimator among the candidates, and assess the overall performance of the resulting estimator, we use cross-validation — dividing the available data into a training set and a validation set. Observations in the training set are used to fit (or train) the estimator, while the validation set is used to assess the risk of (or validate) it.
To derive a general representation for cross-validation, we define a split vector, \(B_n = (B_n(i): i = 1, \ldots, n) \in \{0,1\}^n\). Note that the split vector is independent of the empirical distribution, \(P_n\). A realization of \(B_n\) defines a random split of the data into a training and validation set such that if
\[B_n(i) = 0, \ \ \text{the } i\text{-th sample is in the training set}\] \[B_n(i) = 1, \ \ \text{the } i\text{-th sample is in the validation set.}\] We can further define \(P_{n,B_n}^0\) and \(P_{n,B_n}^1\) as the empirical distributions of the training and validation sets, respectively. Then, \(n_0 = \sum_i (1-B_n(i))\) and \(n_1 = \sum_i B_n(i)\) denote the number of samples in each set. The particular distribution of the split vector \(B_n\) defines the type of cross-validation scheme, tailored to the problem and data set at hand.
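To make the notation concrete, the short sketch below (plain R, not origami code) draws one realization of a split vector \(B_n\) for \(n = 30\) samples and recovers the corresponding training and validation indices.
# one realization of the split vector B_n, holding out roughly 20% of samples
set.seed(4197)
n <- 30
B_n <- rbinom(n, size = 1, prob = 0.2)  # B_n(i) = 1 puts sample i in the validation set
training_set <- which(B_n == 0)
validation_set <- which(B_n == 1)
n_0 <- sum(1 - B_n)  # number of training samples
n_1 <- sum(B_n)      # number of validation samples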
4.5 Cross-validation schemes in origami
As we specified earlier, the particular distribution of the split vector \(B_n\) defines the type of cross-validation method. In the following, we describe the different types of cross-validation schemes available in the origami package, and show their use in the sequel.
WASH Benefits study example
In order to illustrate different cross-validation schemes, we will be using the WASH Benefits data. Detailed information on the WASH Benefits example dataset can be found in Chapter 3. In particular, we are interested in predicting the weight-for-height z-score whz using the available covariate data. For this illustration, we will start by treating the data as independent and identically distributed (i.i.d.) random draws. To see what each cross-validation scheme is doing, we will subset the data to only \(n=30\). Note that each row represents an i.i.d. sample, indexed by the row number.
library(data.table)
library(origami)
library(knitr)
library(kableExtra)
# load data set and take a peek
washb_data <- fread(
  paste0(
    "https://raw.githubusercontent.com/tlverse/tlverse-data/master/",
    "wash-benefits/washb_data.csv"
  ),
  stringsAsFactors = TRUE
)
# subset to the first 30 rows for this illustration, so each scheme is easy to inspect
washb_data <- washb_data[1:30, ]
whz | tr | fracode | month | aged | sex | momage | momedu | momheight | hfiacat | Nlt18 | Ncomp | watmin | elec | floor | walls | roof | asset_wardrobe | asset_table | asset_chair | asset_khat | asset_chouki | asset_tv | asset_refrig | asset_bike | asset_moto | asset_sewmach | asset_mobile |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.00 | Control | N05265 | 9 | 268 | male | 30 | Primary (1-5y) | 146.40 | Food Secure | 3 | 11 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
-1.16 | Control | N05265 | 9 | 286 | male | 25 | Primary (1-5y) | 148.75 | Moderately Food Insecure | 2 | 4 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
-1.05 | Control | N08002 | 9 | 264 | male | 25 | Primary (1-5y) | 152.15 | Food Secure | 1 | 10 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
-1.26 | Control | N08002 | 9 | 252 | female | 28 | Primary (1-5y) | 140.25 | Food Secure | 3 | 5 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
-0.59 | Control | N06531 | 9 | 336 | female | 19 | Secondary (>5y) | 150.95 | Food Secure | 2 | 7 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
-0.51 | Control | N06531 | 9 | 304 | male | 20 | Secondary (>5y) | 154.20 | Severely Food Insecure | 0 | 3 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
Above is a look at the first rows of the \(n = 30\) subsample of the data.
4.5.1 Cross-validation for i.i.d. data
4.5.1.1 Re-substitution
The re-substitution method is the simplest strategy for estimating the risk associated with fitting a proposed algorithm on a set of observations. Here, all of the observed data is used for both the training and validation sets.
We illustrate the usage of the re-substitution method with the origami package below; we will use the function folds_resubstitution(n). In order to set up folds_resubstitution(n), we just need the total number of samples we want to allocate to the training and validation sets; remember that each row of data is a unique i.i.d. sample. Notice the structure of the origami output:
- v: the cross-validation fold
- training_set: the indices of the samples in the training set
- validation_set: the indices of the samples in the validation set.
This structure of the origami output (aka, the fold(s)) will persist for each of the cross-validation schemes we present in this chapter. Below, we show the fold generated by the re-substitution method:
folds_resubstitution(nrow(washb_data))
#> [[1]]
#> $v
#> [1] 1
#>
#> $training_set
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26 27 28 29 30
#>
#> $validation_set
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26 27 28 29 30
#>
#> attr(,"class")
#> [1] "fold"
4.5.1.2 Holdout method
The holdout method, or the validation set approach, consists of randomly dividing the available data into the training set and validation set (holdout set). The model is then fitted on the training set, and further evaluated on the observations in the validation set. Typically, the data is split into \(60/40\), \(70/30\), \(80/20\) or \(90/10\) splits.
The holdout method is intuitive, conceptually easy, and computationally not too demanding. However, if we repeat the process of randomly splitting the data into the training and validation set, we might get a very different cross-validated empirical risk. In particular, the empirical mean of the loss over the validation set might be highly variable, depending on which samples were included in the training/validation split. Overall, the cross-validated empirical risk for the holdout method is more variable, since it includes the variability of the random split as well, which is not what we want. For classification problems, there is a possibility of an uneven distribution of the different classes between the training and validation sets unless the data is stratified. Finally, note that we are not using all of the data to train and evaluate the performance of the proposed algorithm, which might result in bias.
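For completeness, a single holdout split can be constructed by hand; the sketch below uses base R, for illustration only, to make one 70/30 split of the \(n = 30\) subsample (a single Monte Carlo fold, described later in this section, gives an equivalent split).
# a single 70/30 holdout split of the n = 30 subsample
set.seed(67)
n <- nrow(washb_data)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))  # 21 samples for training
valid_idx <- setdiff(seq_len(n), train_idx)             # remaining 9 samples for validation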
4.5.1.3 Leave-one-out
The leave-one-out cross-validation scheme is closely related to the holdout method. In particular, it also involves splitting the data into a training and validation set; however, instead of partitioning the observed data into sets of similar size, a single observation is used as the validation set. With that, the majority of the units are employed for training (fitting) the proposed algorithm. Since only one unit (for example, \(x_1 = (w_1, y_1)\)) is not used in the fitting process, leave-one-out cross-validation results in a possibly less biased estimate of the true risk; typically, the leave-one-out approach will not overestimate the risk as much as the holdout method. On the other hand, since the estimate of risk is based on a single sample, it is typically a highly variable estimate.
We can repeat the process of splitting the data into training and validation sets until all samples have been part of the validation set at some point. For example, the next iteration of the cross-validation might use \(x_2 = (w_2,y_2)\) as the validation set and the remaining \(n-1\) samples as the training set. Repeating this approach \(n\) times results in, for example, \(n\) squared errors, \(MSE_1, MSE_2, \ldots, MSE_n\). The estimate of the true risk is the average over these \(n\) squared errors. While leave-one-out cross-validation results in a less biased (albeit more variable) estimate of risk than the holdout method, it can be expensive to implement if \(n\) is large.
We illustrate the usage of leave-one-out cross-validation with the origami package below; we will use the function folds_loo(n). In order to set up folds_loo(n), similarly to the re-substitution method, we just need the total number of samples we want to cross-validate. We show the first two folds generated by leave-one-out cross-validation below.
folds <- folds_loo(nrow(washb_data))
folds[[1]]
#> $v
#> [1] 1
#>
#> $training_set
#> [1] 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
#> [26] 27 28 29 30
#>
#> $validation_set
#> [1] 1
#>
#> attr(,"class")
#> [1] "fold"
folds[[2]]
#> $v
#> [1] 2
#>
#> $training_set
#> [1] 1 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
#> [26] 27 28 29 30
#>
#> $validation_set
#> [1] 2
#>
#> attr(,"class")
#> [1] "fold"
4.5.1.4 V-fold
An alternative to leave-one-out is V-fold cross-validation. This cross-validation scheme randomly divides the data into \(v\) sets (folds) of equal size; for each fold, the number of samples in the validation set is the same. For V-fold cross-validation, one of the folds is treated as a validation set, whereas the proposed algorithm is fit on the remaining \(v-1\) folds in the training set. The loss, for example the MSE, is computed on the samples in the validation set. With the proposed algorithm trained and its performance evaluated on the first fold, we repeat this process \(v\) times; each time, a different group of samples is treated as the validation set. Note that with V-fold cross-validation we effectively use all of the data to train and evaluate the proposed algorithm without overfitting to the training data. In the end, V-fold cross-validation results in \(v\) estimates of the validation error. The final V-fold CV estimate is computed as the average over all the validation losses.
For a dataset with \(n\) samples, V-fold cross-validation with \(v=n\) is just leave-one-out; at the other extreme, a single random split of the data recovers the holdout method’s estimate of the algorithm’s performance. Despite its computational advantages over leave-one-out, V-fold cross-validation often gives more accurate estimates of the true risk. The reason for this is the bias-variance trade-off between the two methods: while leave-one-out might be less biased, it has higher variance. This difference becomes more obvious for \(v \ll n\) (but not too small, as that increases bias). With V-fold cross-validation, we end up averaging outputs from \(v\) fits that are typically less correlated than the outputs from leave-one-out fits. Since the mean of many highly correlated quantities has higher variance than the mean of less correlated ones, the leave-one-out estimate of the risk will have higher variance than the estimate based on V-fold cross-validation.
Let’s see V-fold cross-validation with origami in action! In the next chapter we will study the Super Learner — an actual algorithm that we will fit and whose performance we will evaluate. The Super Learner relies on V-fold cross-validation as its default cross-validation scheme. In order to set up V-fold CV, we need to call the function folds_vfold(n, V). The arguments to folds_vfold(n, V) are the total number of samples to be cross-validated and the number of folds we want.
At \(V=2\), we get 2 folds with \(n/2\) samples in each of the training and validation sets.
folds <- folds_vfold(nrow(washb_data), V = 2)
folds[[1]]
#> $v
#> [1] 1
#>
#> $training_set
#> [1] 2 3 4 6 7 8 11 12 14 15 19 22 23 24 28
#>
#> $validation_set
#> [1] 1 5 9 10 13 16 17 18 20 21 25 26 27 29 30
#>
#> attr(,"class")
#> [1] "fold"
folds[[2]]
#> $v
#> [1] 2
#>
#> $training_set
#> [1] 1 5 9 10 13 16 17 18 20 21 25 26 27 29 30
#>
#> $validation_set
#> [1] 2 3 4 6 7 8 11 12 14 15 19 22 23 24 28
#>
#> attr(,"class")
#> [1] "fold"
4.5.1.5 Monte Carlo
With Monte Carlo cross-validation, we randomly select some fraction of the data (without replacement) to form the training set; we assign the rest of the samples to the validation set. With that, the data is repeatedly and randomly divided into two sets, a training set of \(n_0 = n \cdot (1-p)\) observations and a validation set of \(n_1 = n \cdot p\) observations. This process is then repeated multiple times, generating (at random) new training and validation partitions each time.
Since the partitions are independent across folds, the same sample can appear in the validation set multiple times — note that this is a stark difference from V-fold cross-validation. For a given \(p\), Monte Carlo cross-validation would be optimal if repeated an infinite number of times, but this is not computationally feasible. With Monte Carlo cross-validation, one is able to explore many more of the available partitions than with V-fold cross-validation, resulting in a possibly less variable estimate of the risk, at the cost of an increase in bias. Because the splits overlap, however, we often need more of them (and thus more computational time) to achieve the performance that V-fold cross-validation attains with only \(V\) splits.
We illustrate the usage of Monte Carlo cross-validation with the origami package below using the function folds_montecarlo(n, V, pvalidation). In order to set up folds_montecarlo(n, V, pvalidation), we need:
- the total number of samples we want to cross-validate (n);
- the number of folds (V);
- the proportion of observations to be placed in the validation set (pvalidation).
At V = 2 and pvalidation = 0.2, we obtain 2 folds with approximately \(6\) samples in the validation set per fold.
folds <- folds_montecarlo(nrow(washb_data), V = 2, pvalidation = 0.2)
folds[[1]]
#> $v
#> [1] 1
#>
#> $training_set
#> [1] 19 27 16 29 23 12 1 3 18 11 5 7 8 6 9 22 10 25 20 28 15 2 24 26
#>
#> $validation_set
#> [1] 4 13 14 17 21 30
#>
#> attr(,"class")
#> [1] "fold"
folds[[2]]
#> $v
#> [1] 2
#>
#> $training_set
#> [1] 19 15 28 25 29 11 20 17 14 4 9 12 30 8 27 18 16 10 13 6 24 3 26 1
#>
#> $validation_set
#> [1] 2 5 7 21 22 23
#>
#> attr(,"class")
#> [1] "fold"
4.5.1.6 Bootstrap
Bootstrap cross-validation also consists of randomly selecting samples, with replacement, for the training set; the samples not picked for the training set are allocated to the validation set. This process is then repeated multiple times, generating (at random) new training and validation partitions each time. In contrast to Monte Carlo cross-validation, the number of samples in the training and validation sets is not constant across folds. We also sample with replacement, hence the same sample can appear multiple times in a given training set. The proportion of observations in the validation sets is a random variable, with expectation \(\sim 0.368\).
We illustrate the usage of bootstrap cross-validation with the origami package below using the function folds_bootstrap(n, V). In order to set up folds_bootstrap(n, V), we need:
- the total number of samples we want to cross-validate (n); and
- the number of folds (V).
At \(V=2\), we obtain \(2\) folds with a different number of samples in the validation set across folds.
folds <- folds_bootstrap(n = nrow(washb_data), V = 2)
folds[[1]]
#> $v
#> [1] 1
#>
#> $training_set
#> [1] 2 5 30 1 29 16 10 11 8 25 28 2 11 2 16 28 15 28 1 27 9 19 20 30 18
#> [26] 11 13 2 18 12
#>
#> $validation_set
#> [1] 3 4 6 7 14 17 21 22 23 24 26
#>
#> attr(,"class")
#> [1] "fold"
folds[[2]]
#> $v
#> [1] 2
#>
#> $training_set
#> [1] 12 16 10 29 22 15 27 9 27 16 12 28 10 28 26 1 14 6 23 14 21 16 5 20 8
#> [26] 23 25 8 27 5
#>
#> $validation_set
#> [1] 2 3 4 7 11 13 17 18 19 24 30
#>
#> attr(,"class")
#> [1] "fold"
4.5.2 Cross-validation for dependent data
The origami package also supports numerous cross-validation schemes for time-series data, for both single and multiple time-series with arbitrary time and network dependence.
AirPassengers Example
In order to illustrate different cross-validation schemes for time-series, we will be using the AirPassengers data; this is a widely used, freely available dataset. The AirPassengers dataset in R provides monthly totals of international airline passengers from 1949 to 1960. This dataset is already of a time-series class, therefore no further class or date manipulation is required.
Goal: we want to forecast the number of airline passengers at some future time (i.e., horizon) \(h\) using the historical data from 1949 to 1960.
library(ggfortify)
data(AirPassengers)
AP <- AirPassengers
autoplot(AP) +
labs(
x = "Date",
y = "Passenger numbers (1000's)",
title = "Air Passengers from 1949 to 1961"
)
t <- length(AP)
4.5.2.1 Rolling origin
The rolling origin cross-validation scheme lends itself to “online” algorithms, where large streams of data have to be fit continually and the fit is constantly updated as more data are acquired. In general, the rolling origin scheme defines an initial training set, and with each iteration the size of the training set grows by \(m\) observations until we reach time \(t\) for a particular fold. The time points included in the training set always lie behind the validation set time points; in addition, there might be a gap of size \(h\) between the training and validation times.
To further illustrate rolling origin cross-validation, we show below an example with 3 folds. Here, the first window size is 15 time points, on which we first train the proposed algorithm. We then evaluate its performance on 10 time points, with a gap of size 5 between the training and validation time points. For the following fold, we train the algorithm on a longer stream of data, 25 time points, including the original 15 we started with. We then evaluate its performance on 10 time points in the future.
We illustrate the usage of rolling origin cross-validation with the origami package below using the function folds_rolling_origin(n, first_window, validation_size, gap, batch). In order to set up folds_rolling_origin(n, first_window, validation_size, gap, batch), we need:
- the total number of time points we want to cross-validate (n);
- the size of the first training set (first_window);
- the size of the validation set (validation_size);
- the gap between the training and validation set (gap);
- the size of the update to the training set per iteration of CV (batch).
Our time-series has \(t=144\) time points. Setting first_window to \(50\), validation_size to 10, gap to 5 and batch to 20, we get 4 time-series folds; we show the first two below.
folds <- folds_rolling_origin(
t,
first_window = 50, validation_size = 10, gap = 5, batch = 20
)
folds[[1]]
#> $v
#> [1] 1
#>
#> $training_set
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
#>
#> $validation_set
#> [1] 56 57 58 59 60 61 62 63 64 65
#>
#> attr(,"class")
#> [1] "fold"
folds[[2]]
#> $v
#> [1] 2
#>
#> $training_set
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
#> [51] 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70
#>
#> $validation_set
#> [1] 76 77 78 79 80 81 82 83 84 85
#>
#> attr(,"class")
#> [1] "fold"
4.5.2.2 Rolling window
Instead of adding more time points to the training set at each iteration, the rolling window cross-validation scheme “rolls” the training sample forward by \(m\) time units. The rolling window scheme might be considered in parametric settings when one wishes to guard against moment or parameter drift that is difficult to model explicitly; it is also more efficient in computationally demanding settings such as streaming data, in which large amounts of training data cannot be stored. In contrast to rolling origin CV, the size of the training sample for each iteration of the rolling window scheme is always the same.
To illustrate rolling window cross-validation, we consider an example with 3 time-series folds below. The first window size is 15 time points, on which we first train the proposed algorithm. As in the previous illustration, we evaluate its performance on 10 time points, with a gap of size 5 between the training and validation time points. However, for the next fold, we train the algorithm on time points further away from the origin (here, by 10 time points). Note that the size of the training set in the new fold is the same as in the first fold (15 time points). This setup keeps the training sets comparable over time (and fold), as compared to rolling origin CV. We then evaluate the performance of the proposed algorithm on 10 time points in the future.
We illustrate the usage of rolling window cross-validation with the origami package below using the function folds_rolling_window(n, window_size, validation_size, gap, batch). In order to set up folds_rolling_window(n, window_size, validation_size, gap, batch), we need:
- the total number of time points we want to cross-validate (n);
- the size of the training sets (window_size);
- the size of the validation set (validation_size);
- the gap between the training and validation set (gap); and
- the size of the update to the training set per iteration of CV (batch).
Setting window_size to \(50\), validation_size to 10, gap to 5 and batch to 20, we also get 4 time-series folds; we show the first two below.
folds <- folds_rolling_window(
t,
window_size = 50, validation_size = 10, gap = 5, batch = 20
)
folds[[1]]
#> $v
#> [1] 1
#>
#> $training_set
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
#>
#> $validation_set
#> [1] 56 57 58 59 60 61 62 63 64 65
#>
#> attr(,"class")
#> [1] "fold"
folds[[2]]
#> $v
#> [1] 2
#>
#> $training_set
#> [1] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
#> [26] 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70
#>
#> $validation_set
#> [1] 76 77 78 79 80 81 82 83 84 85
#>
#> attr(,"class")
#> [1] "fold"
4.5.2.3 Rolling origin with V-fold
A variant of the rolling origin scheme which accounts for sample dependence is rolling-origin-\(V\)-fold cross-validation. In contrast to the canonical rolling origin CV, samples in the training and validation sets are not the same, as this variant encompasses \(V\)-fold CV in addition to the time-series setup. The predictions are evaluated on the future times of time-series units not seen during the training step, allowing for dependence in both samples and time. One can use rolling-origin-\(V\)-fold cross-validation with the origami package via the function folds_vfold_rolling_origin_pooled(n, t, id, time, V, first_window, validation_size, gap, batch). In the figure below, we show \(V=2\) \(V\)-folds, and 2 time-series CV folds.
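No worked example accompanies this scheme here, so the sketch below suggests what a call might look like for a hypothetical long-format dataset of 5 independent time-series, each observed over 20 time points. The argument names follow the signature listed above; their exact semantics should be checked against the origami documentation.
# hypothetical long-format data: 5 independent units, each with 20 time points
# (data.table was loaded earlier in this chapter)
n_ids <- 5
n_times <- 20
long_data <- data.table(
  id = rep(seq_len(n_ids), each = n_times),
  time = rep(seq_len(n_times), times = n_ids)
)
# rolling-origin-V-fold CV pooled across units: V = 2 sample folds,
# a first training window of 10 time points, validating on the next 4
folds <- folds_vfold_rolling_origin_pooled(
  n = nrow(long_data), t = n_times, id = long_data$id, time = long_data$time,
  V = 2, first_window = 10, validation_size = 4, gap = 0, batch = 5
)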
4.5.2.4 Rolling window with V-fold
Analogous to the previous section, we can extend rolling window CV to support multiple time-series with arbitrary sample dependence. One can use rolling-window-\(V\)-fold cross-validation with the origami package via the function folds_vfold_rolling_window_pooled(n, t, id, time, V, window_size, validation_size, gap, batch). In the figure below, we show \(V=2\) \(V\)-folds, and 2 time-series CV folds.
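Under the same assumptions, and reusing the hypothetical long_data from the sketch above, the rolling-window variant swaps first_window for window_size:
# rolling-window-V-fold CV pooled across units (a sketch; check the origami docs)
folds <- folds_vfold_rolling_window_pooled(
  n = nrow(long_data), t = n_times, id = long_data$id, time = long_data$time,
  V = 2, window_size = 10, validation_size = 4, gap = 0, batch = 5
)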
4.6 General workflow of origami
Before we dive into more details, let’s take a moment to review some of the basic functionality of the origami R package. The main function in origami is cross_validate. To start off, the user must define the fold structure and a function that operates on each fold. Once these are passed to cross_validate, cross_validate will apply the specified function to each fold and combine the fold-specific results in a meaningful way. We will see this in action in later sections; for now, we provide specific details on each step of this process below.
(1) Define folds
The folds object passed to cross_validate is a list of folds; such lists can be generated using the make_folds function. Each fold consists of a list with a training index vector, a validation index vector, and a fold_index (its order in the list of folds). This function supports the variety of cross-validation schemes described in the previous section. make_folds can also balance across levels of a variable (strata_ids), and it can keep all observations from the same independent unit together (cluster).
(2) Define fold function
The cv_fun argument to cross_validate is a function that will perform some operation on each fold. The first argument to this function must be fold, which will receive an individual fold object to operate on. Additional arguments can be passed to cv_fun using the ... argument to cross_validate. Within this function, the convenience functions training, validation and fold_index can return the various components of a fold object. If training or validation is passed an object, it will index it in a sensible way. For instance, if it is a vector, it will index the vector directly; if it is a data.frame or matrix, it will index rows. This allows the user to easily partition data into training and validation sets. The fold function must return a named list of results containing whatever fold-specific outputs are generated.
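Schematically, a cv_fun therefore looks like the following skeleton; the body and the returned my_result element are placeholders.
cv_example <- function(fold, data, ...) {
  # extract the training and validation portions of the data for this fold
  train_data <- training(data)
  valid_data <- validation(data)
  # ... fit something on train_data and evaluate it on valid_data here ...
  # return a named list of fold-specific results
  out <- list(
    fold = fold_index(),            # which fold we are on
    my_result = nrow(valid_data)    # placeholder fold-specific output
  )
  return(out)
}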
(3) Apply cross_validate
After defining folds, cross_validate can be used to map the cv_fun across the folds using future_lapply. This means that it can be easily parallelized by specifying a parallelization scheme (i.e., a plan from the future parallelization framework for R (Bengtsson 2020)). The application of cross_validate generates a list of results. As described above, each call to cv_fun itself returns a list of results, with different elements for each type of result we care about. The main loop generates a list of these individual lists of results (a sort of “meta-list”). This “meta-list” is then inverted such that there is one element per result type (this, too, is a list of the results for each fold). By default, combine_results is used to combine these result-type lists in a sensible manner. How results are combined is determined automatically by examining the data types of the results from the first fold. This can be modified by specifying a list of arguments to .combine_control.
4.7 Cross-validation in action
Let’s see origami in action! In the following chapter we will learn how to use cross-validation with the Super Learner, and how we can utilize the power of cross-validation to build optimal ensembles of algorithms rather than applying it to a single statistical learning method only.
4.7.1 Cross-validation with linear regression
First, we will load the relevant R packages, set a seed, and load the full WASH Benefits data once again. In order to illustrate cross-validation with origami and linear regression, we will focus on predicting the weight-for-height Z-score whz using all of the available covariate data. As stated previously, we will assume the data is independent and identically distributed, ignoring the cluster structure imposed by the clinical trial design. For the sake of illustration, we will work with a subset of the data, and remove all samples with missing data from the dataset; we will learn in the next chapter how to deal with missingness.
library(stringr)
library(dplyr)
library(tidyr)
# load data set and take a peek
washb_data <- fread(
paste0(
"https://raw.githubusercontent.com/tlverse/tlverse-data/master/",
"wash-benefits/washb_data.csv"
),
stringsAsFactors = TRUE
)
# Remove missing data, then pick just the first 500 rows
washb_data <- washb_data %>%
drop_na() %>%
slice(1:500)
outcome <- "whz"
covars <- colnames(washb_data)[-which(names(washb_data) == outcome)]
Here’s a look at the data:
whz | tr | fracode | month | aged | sex | momage | momedu | momheight | hfiacat | Nlt18 | Ncomp | watmin | elec | floor | walls | roof | asset_wardrobe | asset_table | asset_chair | asset_khat | asset_chouki | asset_tv | asset_refrig | asset_bike | asset_moto | asset_sewmach | asset_mobile |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.00 | Control | N05265 | 9 | 268 | male | 30 | Primary (1-5y) | 146.40 | Food Secure | 3 | 11 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
-1.16 | Control | N05265 | 9 | 286 | male | 25 | Primary (1-5y) | 148.75 | Moderately Food Insecure | 2 | 4 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
-1.05 | Control | N08002 | 9 | 264 | male | 25 | Primary (1-5y) | 152.15 | Food Secure | 1 | 10 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
-1.26 | Control | N08002 | 9 | 252 | female | 28 | Primary (1-5y) | 140.25 | Food Secure | 3 | 5 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
-0.59 | Control | N06531 | 9 | 336 | female | 19 | Secondary (>5y) | 150.95 | Food Secure | 2 | 7 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
-0.51 | Control | N06531 | 9 | 304 | male | 20 | Secondary (>5y) | 154.20 | Severely Food Insecure | 0 | 3 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
We can see the covariates used in the prediction:
outcome
#> [1] "whz"
covars
#> [1] "tr" "fracode" "month" "aged"
#> [5] "sex" "momage" "momedu" "momheight"
#> [9] "hfiacat" "Nlt18" "Ncomp" "watmin"
#> [13] "elec" "floor" "walls" "roof"
#> [17] "asset_wardrobe" "asset_table" "asset_chair" "asset_khat"
#> [21] "asset_chouki" "asset_tv" "asset_refrig" "asset_bike"
#> [25] "asset_moto" "asset_sewmach" "asset_mobile"
Next, we fit a linear model on the full data, with the goal of predicting the weight-for-height Z-score whz using all of the available covariate data. Let’s try it out:
lm_mod <- lm(whz ~ ., data = washb_data)
summary(lm_mod)
#>
#> Call:
#> lm(formula = whz ~ ., data = washb_data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.8890 -0.6799 -0.0169 0.6595 3.1005
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.89006 1.72022 -1.10 0.2725
#> trHandwashing -0.25276 0.17032 -1.48 0.1385
#> trNutrition -0.09695 0.15696 -0.62 0.5371
#> trNutrition + WSH -0.09587 0.16528 -0.58 0.5622
#> trSanitation -0.27702 0.15846 -1.75 0.0811 .
#> trWSH -0.02846 0.15967 -0.18 0.8586
#> trWater -0.07148 0.15813 -0.45 0.6515
#> fracodeN05160 0.62355 0.30719 2.03 0.0430 *
#> fracodeN05265 0.38762 0.31011 1.25 0.2120
#> fracodeN05359 0.10187 0.31329 0.33 0.7452
#> fracodeN06229 0.30933 0.29766 1.04 0.2993
#> fracodeN06453 0.08066 0.30006 0.27 0.7882
#> fracodeN06458 0.43707 0.29970 1.46 0.1454
#> fracodeN06473 0.45406 0.30912 1.47 0.1426
#> fracodeN06479 0.60994 0.31463 1.94 0.0532 .
#> fracodeN06489 0.25923 0.31901 0.81 0.4169
#> fracodeN06500 0.07539 0.35794 0.21 0.8333
#> fracodeN06502 0.36748 0.30504 1.20 0.2290
#> fracodeN06505 0.20038 0.31560 0.63 0.5258
#> fracodeN06516 0.55455 0.29807 1.86 0.0635 .
#> fracodeN06524 0.49429 0.31423 1.57 0.1164
#> fracodeN06528 0.75966 0.31060 2.45 0.0148 *
#> fracodeN06531 0.36856 0.30155 1.22 0.2223
#> fracodeN06862 0.56932 0.29293 1.94 0.0526 .
#> fracodeN08002 0.36779 0.26846 1.37 0.1714
#> month 0.17161 0.10865 1.58 0.1149
#> aged -0.00336 0.00112 -3.00 0.0029 **
#> sexmale 0.12352 0.09203 1.34 0.1802
#> momage -0.01379 0.00973 -1.42 0.1570
#> momeduPrimary (1-5y) -0.13214 0.15225 -0.87 0.3859
#> momeduSecondary (>5y) 0.12632 0.16041 0.79 0.4314
#> momheight 0.00512 0.00919 0.56 0.5776
#> hfiacatMildly Food Insecure 0.05804 0.19341 0.30 0.7643
#> hfiacatModerately Food Insecure -0.01362 0.12887 -0.11 0.9159
#> hfiacatSeverely Food Insecure -0.13447 0.25418 -0.53 0.5970
#> Nlt18 -0.02557 0.04060 -0.63 0.5291
#> Ncomp 0.00179 0.00762 0.23 0.8145
#> watmin 0.01347 0.00861 1.57 0.1182
#> elec 0.08906 0.10700 0.83 0.4057
#> floor -0.17763 0.17734 -1.00 0.3171
#> walls -0.03001 0.21445 -0.14 0.8888
#> roof -0.03716 0.49214 -0.08 0.9399
#> asset_wardrobe -0.05754 0.13736 -0.42 0.6755
#> asset_table -0.22079 0.12276 -1.80 0.0728 .
#> asset_chair 0.28012 0.13750 2.04 0.0422 *
#> asset_khat 0.02306 0.11766 0.20 0.8447
#> asset_chouki -0.13943 0.14084 -0.99 0.3227
#> asset_tv 0.17723 0.12972 1.37 0.1726
#> asset_refrig 0.12613 0.23162 0.54 0.5863
#> asset_bike -0.02568 0.10083 -0.25 0.7990
#> asset_moto -0.32094 0.19944 -1.61 0.1083
#> asset_sewmach 0.05090 0.17795 0.29 0.7750
#> asset_mobile 0.01420 0.14972 0.09 0.9245
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.984 on 447 degrees of freedom
#> Multiple R-squared: 0.129, Adjusted R-squared: 0.0277
#> F-statistic: 1.27 on 52 and 447 DF, p-value: 0.104
We can assess how well the model fits the data by comparing the predictions of the linear model to the true outcomes observed in the data set. This is the well-known (and standard) mean squared error. We can extract that from the lm model object as follows:
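The corresponding code chunk appears to have been dropped here; a minimal way to compute this in-sample error from the lm fit is the mean of its squared residuals.
# re-substitution (in-sample) mean squared error of the linear model
err <- mean(resid(lm_mod)^2)
err
#> [1] 0.86568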
The mean squared error is 0.86568. There is an important problem that arises when we assess the model in this way: we have trained our linear regression model on the full data set and assessed the error on the full data set, using up all of our data. We, of course, are generally not interested in how well the model explains variation in the observed data; rather, we are interested in how the explanation provided by the model generalizes to a target population from which the sample is presumably derived. Having used all of our available data, we cannot honestly evaluate how well the model fits (and thus explains) variation at the population level.
To resolve this issue, cross-validation allows for a particular procedure (e.g., linear regression) to be implemented over subsets of the data, evaluating how well the procedure fits on a testing (“validation”) set, thereby providing an honest evaluation of the error.
We can easily add cross-validation to our linear regression procedure using origami. First, let us define a new function to perform linear regression on a specific partition of the data (called a “fold”):
cv_lm <- function(fold, data, reg_form) {
# get name and index of outcome variable from regression formula
out_var <- as.character(unlist(str_split(reg_form, " "))[1])
out_var_ind <- as.numeric(which(colnames(data) == out_var))
# split up data into training and validation sets
train_data <- training(data)
valid_data <- validation(data)
# fit linear model on training set and predict on validation set
mod <- lm(as.formula(reg_form), data = train_data)
preds <- predict(mod, newdata = valid_data)
valid_data <- as.data.frame(valid_data)
# capture results to be returned as output
out <- list(
coef = data.frame(t(coef(mod))),
SE = (preds - valid_data[, out_var_ind])^2
)
return(out)
}
Our cv_lm function is rather simple: we merely split the available data into training and validation sets (using the eponymous functions provided by origami), fit the linear model on the training set, and evaluate the model on the validation set. This is a simple example of what origami considers to be a cv_fun — a function for using cross-validation to perform a particular routine over an input data set. Having defined such a function, we can simply generate a set of partitions using origami’s make_folds function, and apply our cv_lm function over the resultant folds object. Below, we replicate the re-substitution estimate of the error — we did this “by hand” above — using the functions make_folds and cv_lm.
# re-substitution estimate
resub <- make_folds(washb_data, fold_fun = folds_resubstitution)[[1]]
resub_results <- cv_lm(fold = resub, data = washb_data, reg_form = "whz ~ .")
mean(resub_results$SE, na.rm = TRUE)
#> [1] 0.86568
This (nearly) matches the estimate of the error that we obtained above.
We can more honestly evaluate the error via V-fold cross-validation, which partitions the data into \(v\) subsets, fitting the model on \(v - 1\) of the subsets and evaluating on the subset that was held out for testing. This is repeated such that each subset is used for validation. We can easily apply our cv_lm function using origami’s cross_validate (n.b., by default this performs 10-fold cross-validation):
# cross-validated estimate
folds <- make_folds(washb_data)
cvlm_results <- cross_validate(
cv_fun = cv_lm, folds = folds, data = washb_data, reg_form = "whz ~ .",
use_future = FALSE
)
mean(cvlm_results$SE, na.rm = TRUE)
#> [1] 1.35
Having performed 10-fold cross-validation, we quickly notice that our previous estimate of the model error (by resubstitution) was a bit optimistic. The honest estimate of the error is larger!
4.7.2 Cross-validation with ARIMA
Cross-validation can also be used for forecast model selection in a time-series setting. Here, the partitioning scheme mirrors the application of the forecasting model: we train on past observations (either all available or a recent subset), and then use the fitted ARIMA (AutoRegressive Integrated Moving Average) model to predict the next few observations. We consider the AirPassengers dataset again, a monthly time series of passenger air traffic in thousands of people.
data(AirPassengers)
print(AirPassengers)
#> Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
#> 1949 112 118 132 129 121 135 148 148 136 119 104 118
#> 1950 115 126 141 135 125 149 170 170 158 133 114 140
#> 1951 145 150 178 163 172 178 199 199 184 162 146 166
#> 1952 171 180 193 181 183 218 230 242 209 191 172 194
#> 1953 196 196 236 235 229 243 264 272 237 211 180 201
#> 1954 204 188 235 227 234 264 302 293 259 229 203 229
#> 1955 242 233 267 269 270 315 364 347 312 274 237 278
#> 1956 284 277 317 313 318 374 413 405 355 306 271 306
#> 1957 315 301 356 348 355 422 465 467 404 347 305 336
#> 1958 340 318 362 348 363 435 491 505 404 359 310 337
#> 1959 360 342 406 396 420 472 548 559 463 407 362 405
#> 1960 417 391 419 461 472 535 622 606 508 461 390 432
Suppose we want to pick between two forecasting models with different arima
configurations. We can do that by evaluating their forecasting performance.
First, we set up the appropriate cross-validation scheme for time-series.
folds <- make_folds(AirPassengers,
fold_fun = folds_rolling_origin,
first_window = 36, validation_size = 24, batch = 10
)
# How many folds where generated?
length(folds)
#> [1] 9
# Examine the first 2 folds.
folds[[1]]
#> $v
#> [1] 1
#>
#> $training_set
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26 27 28 29 30 31 32 33 34 35 36
#>
#> $validation_set
#> [1] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
#>
#> attr(,"class")
#> [1] "fold"
folds[[2]]
#> $v
#> [1] 2
#>
#> $training_set
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
#>
#> $validation_set
#> [1] 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70
#>
#> attr(,"class")
#> [1] "fold"
By default, folds_rolling_origin will increase the size of the training set by one time point per fold. Had we followed the default option, we would have 85 folds to train on! Luckily, we can pass the batch option to folds_rolling_origin to tell it to increase the size of the training set by 10 points each iteration. Since we want to forecast the immediate next point, the gap argument remains at its default (0).
# make sure to load the package!
library(forecast)
# function to calculate cross-validated squared error
cv_forecasts <- function(fold, data) {
# Get training and validation data
train_data <- training(data)
valid_data <- validation(data)
valid_size <- length(valid_data)
train_ts <- ts(log10(train_data), frequency = 12)
# First arima model
arima_fit <- arima(train_ts, c(0, 1, 1),
seasonal = list(
order = c(0, 1, 1),
period = 12
)
)
raw_arima_pred <- predict(arima_fit, n.ahead = valid_size)
arima_pred <- 10^raw_arima_pred$pred
arima_MSE <- mean((arima_pred - valid_data)^2)
# Second arima model
arima_fit2 <- arima(train_ts, c(5, 1, 1),
seasonal = list(
order = c(0, 1, 1),
period = 12
)
)
raw_arima_pred2 <- predict(arima_fit2, n.ahead = valid_size)
arima_pred2 <- 10^raw_arima_pred2$pred
arima_MSE2 <- mean((arima_pred2 - valid_data)^2)
out <- list(mse = data.frame(
fold = fold_index(),
arima = arima_MSE, arima2 = arima_MSE2
))
return(out)
}
mses <- cross_validate(
cv_fun = cv_forecasts, folds = folds, data = AirPassengers,
use_future = FALSE
)
mses$mse
#> fold arima arima2
#> 1 1 68.21 137.28
#> 2 2 319.68 313.15
#> 3 3 578.35 713.36
#> 4 4 428.69 505.31
#> 5 5 407.33 371.27
#> 6 6 281.82 250.99
#> 7 7 827.56 910.12
#> 8 8 2099.59 2213.15
#> 9 9 398.37 293.38
colMeans(mses$mse[, c("arima", "arima2")])
#> arima arima2
#> 601.07 634.22
The arima model with no AR component seems to be a better fit for this data.
4.8 Exercises
4.8.1 Real World Data and origami
In this section, we review all the concepts learned via a real world data application.
We will be using the WASH Benefits data, from a trial assessing the effect of water quality, sanitation, hand washing, and nutritional interventions on child development in rural Bangladesh. The main aim of the WASH Benefits cluster-randomized controlled trial was to assess the impact of six intervention groups. In this problem set, we aim to estimate the conditional mean of the weight-for-height Z-score given treatment and all baseline covariates; the goal is also to report the performance of a Random Forest algorithm fit (Breiman 2001). We review the list of covariates used below:
outcome
#> [1] "whz"
covars
#> [1] "tr" "fracode" "month" "aged"
#> [5] "sex" "momage" "momedu" "momheight"
#> [9] "hfiacat" "Nlt18" "Ncomp" "watmin"
#> [13] "elec" "floor" "walls" "roof"
#> [17] "asset_wardrobe" "asset_table" "asset_chair" "asset_khat"
#> [21] "asset_chouki" "asset_tv" "asset_refrig" "asset_bike"
#> [25] "asset_moto" "asset_sewmach" "asset_mobile"
Questions:
1. Using Breiman’s randomForest package, predict the weight-for-height Z-score (whz) based on a subset of data (n = 500) using all of the available covariates. Which loss function is appropriate for this statistical problem? Report the error of the model using all the available data to fit and evaluate the model.
2. Write a function to obtain the random forest fit on a specific partition of the data that aligns with the cv_fun argument to cross_validate in the origami package. The function should return the appropriate loss per fold.
3. Using the make_folds function in the origami package, define:
   - a V-fold CV scheme with V = 5,
   - a Monte Carlo CV scheme with V = 5 and pvalidation = 0.2,
   - a re-substitution scheme.
4. Report the risk (mean loss) and variance of the loss per fold for each of the CV schemes set up in the previous question. How does the estimated risk corresponding to each CV scheme compare to the risk in Question 1?
4.8.2 Review of key concepts
- Compare and contrast V-fold cross-validation with re-substitution cross-validation. What are some of the differences between the two methods? How are they similar? Describe a scenario when you would use one over the other.
- What are the advantages and disadvantages of \(v\)-fold CV relative to:
  - holdout CV?
  - leave-one-out CV?
- Why can’t we use V-fold cross-validation for time-series data?
- Would you use the rolling window or the rolling origin scheme for a non-stationary time-series? Why?
4.8.3 Ideas in action
- Let \(Y\) be a binary variable with \(P(Y=1 \mid W) = 0.01\). What kind of cross-validation scheme should be used for a rare outcome? How can we do this with the origami package?
4.8.4 Advanced topics
- Consider a classification problem with a large number of predictors. A statistician proposes the following analysis:
  - First screen the predictors, leaving only the covariates with a strong correlation with the class labels.
  - Fit some algorithm using only the subset of highly correlated covariates.
  - Use cross-validation to estimate the tuning parameters and the performance of the proposed algorithm.
Is this a correct application of cross-validation? Why?