Run a cross-validation

cv() executes a cross-validation procedure. For each fold (specified in argument nfold or folds), the original model is re-fitted using the complement of the fold as training data. Cross-validations of multiple models are executed using identical folds.

Usage

cv(x, ...)

# S3 method for model
cv(
  x,
  nfold = getOption("cv_nfold"),
  folds = NULL,
  ...,
  metric = NULL,
  iter = getOption("cv_iter"),
  param = TRUE,
  keep_fits = FALSE,
  verbose = getOption("cv_verbose")
)

# S3 method for multimodel
cv(
  x,
  nfold = getOption("cv_nfold"),
  folds = NULL,
  metric = NULL,
  iter = getOption("cv_iter"),
  param = TRUE,
  keep_fits = FALSE,
  verbose = getOption("cv_verbose"),
  ...
)

# S3 method for default
cv(
  x,
  nfold = getOption("cv_nfold"),
  folds = NULL,
  ...,
  metric = NULL,
  iter = getOption("cv_iter"),
  param = TRUE,
  keep_fits = FALSE
)

# S3 method for cv
print(
  x,
  what = c("class", "formula", "weights"),
  show_metric = TRUE,
  abbreviate = TRUE,
  n = getOption("print_max_model"),
  width = getOption("width"),
  param = TRUE,
  ...
)

Arguments

x: A model, multimodel or fitted model (see section “Methods”).
...: These arguments are passed internally to methods of cv_simple(), a currently undocumented generic that runs the cross-validation on a single model.
nfold, folds: Passed to make_folds.
metric: A metric (see metrics). metric=NULL selects the default metric, see default_metric.
iter: A preference criterion, or a list of several criteria. Only relevant for iteratively fitted models (see ifm), ignored otherwise.
param: Logical. Include parameter table in output? See ?multimodel.
keep_fits: Logical: Keep the cross-validation's model fits?
verbose: Logical: Output information on execution progress in console?
what: Which elements of the multimodel should be printed? See print.model.
show_metric: Logical: Whether to print the cross-validated models' metric.
abbreviate: Logical. If TRUE (the default), long formulas and calls are printed in abbreviated mode, such that they usually fit on 4 or fewer output lines; otherwise they are printed entirely, no matter how long they are.
n: Integer: Model details are printed for the first n models in print.cv().
width: Integer: Width of printed output.

Value

The output from cv() is a list of class “cv” having the following elements:

multimodel: a multimodel;
folds: the folds, as defined in nfold or folds (see make_folds);
fits: if keep_fits=FALSE (the default): NULL; if keep_fits=TRUE: the list of the model fits resulting from the cross-validation (see extract_fits);
metric: a list: the default evaluation metrics, not necessarily the same for all models;
predictions: a list of matrices of dimension \(n \times k\) where \(n\) is the number of observations in the model data and \(k\) is the number of folds; each of these list entries corresponds to a model;
performance: a list of performance tables (see cv_performance), that are saved only for certain model classes; often NULL;
timing: execution time of cross-validation;
extras: a list of extra results from cross-validation, which are saved only for certain model classes; often NULL. If the model x is an iteratively fitted model (ifm), extras contain the cross-validated model's evaluation log and information on preferred iterations.

Details

The same cross-validation groups (folds) are used for all models.

Each model in x is processed separately with the function cv_simple(), a generic function for internal use. Besides the standard method cv_simple.model(), there are currently specific methods of cv_simple() for models generated with fm_xgb() and fm_glmnet().
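
The procedure described above can be illustrated with a minimal base-R sketch. It is not the package's implementation: the helpers make_shared_folds() and cv_rmse() are hypothetical names introduced here, the models are plain lm() fits, and averaging the per-fold RMSE is an assumption made only for this illustration. The point it shows is that the folds are created once and then reused, so every model is re-fitted on the same training sets and evaluated on the same test sets.

```r
# Base-R sketch of k-fold cross-validation with folds shared across models.
# make_shared_folds() and cv_rmse() are hypothetical helpers for illustration;
# they are not functions of the package documented here.
set.seed(1)

make_shared_folds <- function(n, nfold) {
  # Randomly assign each of the n row indices to one of nfold test sets
  split(sample(n), rep_len(seq_len(nfold), n))
}

cv_rmse <- function(formula, data, folds) {
  response <- all.vars(formula)[1]
  fold_rmse <- vapply(folds, function(test_idx) {
    # Re-fit on the complement of the fold, then predict on the fold itself
    fit <- lm(formula, data = data[-test_idx, , drop = FALSE])
    pred <- predict(fit, newdata = data[test_idx, , drop = FALSE])
    sqrt(mean((data[[response]][test_idx] - pred)^2))
  }, numeric(1))
  # Assumption: summarise by averaging the per-fold RMSE values
  mean(fold_rmse)
}

# Create the folds once ...
folds <- make_shared_folds(nrow(mtcars), nfold = 5)

# ... and evaluate both models on exactly the same folds
results <- c(
  simpleLinear = cv_rmse(mpg ~ cyl, mtcars, folds),
  linear       = cv_rmse(mpg ~ ., mtcars, folds)
)
results
```

Because both calls receive the same folds object, differences between the two results reflect the models rather than different random splits.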

Methods

cv.multimodel(), the core method.
cv.model(x, ...) corresponds to x %>% multimodel %>% cv(...).
The default method essentially executes x %>% model %>% cv(...) and thus expects a fitted model as its x.

Examples

mm <- multimodel(model(fm_knn(Sepal.Length ~ ., iris)), k = 1:5)
cv(mm)
#> --- A “cv” object containing 5 validated models ---
#> 
#> Validation procedure: Complete k-fold Cross-Validation
#>   Number of obs in data:  150
#>   Number of test sets:     10
#>   Size of test sets:       15
#>   Size of training sets:  135
#> 
#> Models:
#> 
#> ‘model1’:
#>   model class:  fm_knn
#>   formula:      Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + 
#>                     Species
#>   metric:       rmse
#> 
#> ‘model2’:
#>   model class:  fm_knn
#>   formula:      Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + 
#>                     Species
#>   metric:       rmse
#> 
#> ‘model3’:
#>   model class:  fm_knn
#>   formula:      Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + 
#>                     Species
#>   metric:       rmse
#> 
#> and 2 models more, labelled:
#>   ‘model4’, ‘model5’
#> 
#> 
#> Parameter table:
#>        k
#> model1 1
#> model2 2
#> model3 3
#> ... 2 rows omitted (nrow=5)

mm_cars <- c(simpleLinear = model(lm(mpg ~ cyl, mtcars)),
             linear = model(lm(mpg ~ ., mtcars)),
             if (require(ranger)) model(ranger(mpg ~ ., mtcars), label = "forest"))
#> Loading required package: ranger
mm_cars
#> --- A “multimodel” object containing 3 models ---
#> 
#> ‘simpleLinear’:
#>   model class:  lm
#>   formula:      mpg ~ cyl
#>   data:         data.frame [32 x 11], 
#>                 input as: ‘data = mtcars’
#>   call:         lm(formula = mpg ~ cyl, data = data)
#> 
#> ‘linear’:
#>   model class:  lm
#>   formula:      mpg ~ cyl + disp + hp + drat + wt + qsec + vs + 
#>                     am + gear + carb
#>   data:         data.frame [32 x 11], 
#>                 input as: ‘data = mtcars’
#>   call:         lm(formula = mpg ~ ., data = data)
#> 
#> ‘forest’:
#>   model class:  ranger
#>   formula:      mpg ~ cyl + disp + hp + drat + wt + qsec + vs + 
#>                     am + gear + carb
#>   data:         data.frame [32 x 11], 
#>                 input as: ‘data = mtcars’
#>   call:         ranger(formula = mpg ~ ., data = data)
cv_cars <- cv(mm_cars, nfold = 5)
cv_cars
#> --- A “cv” object containing 3 validated models ---
#> 
#> Validation procedure: Complete k-fold Cross-Validation
#>   Number of obs in data:   32
#>   Number of test sets:      5
#>   Size of test sets:       ~6
#>   Size of training sets:  ~26
#> 
#> Models:
#> 
#> ‘simpleLinear’:
#>   model class:  lm
#>   formula:      mpg ~ cyl
#>   metric:       rmse
#> 
#> ‘linear’:
#>   model class:  lm
#>   formula:      mpg ~ cyl + disp + hp + drat + wt + qsec + vs + 
#>                     am + gear + carb
#>   metric:       rmse
#> 
#> ‘forest’:
#>   model class:  ranger
#>   formula:      mpg ~ cyl + disp + hp + drat + wt + qsec + vs + 
#>                     am + gear + carb
#>   metric:       rmse
cv_performance(cv_cars)
#> --- Performance table ---
#> Metric: rmse
#>              train_rmse test_rmse time_cv
#> simpleLinear     3.0495    3.5248   0.005
#> linear           2.0169    3.7303   0.009
#> forest           1.3057    2.4861   0.046

# Non-default metric:
cv_performance(cv_cars, metric = "medae")
#> --- Performance table ---
#> Metric: medae
#>              train_medae test_medae time_cv
#> simpleLinear      1.7604     2.3055   0.005
#> linear            1.4328     2.3144   0.009
#> forest            0.9158     2.0434   0.046
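
The two metrics shown in the performance tables above can be written down in a few lines of base R. This sketch assumes the names follow the usual convention, with rmse as root mean squared error and medae as median absolute error; the functions below are stand-ins defined here for illustration, not the package's own metric functions, and they are applied to in-sample predictions rather than to cross-validated ones.

```r
# Stand-in definitions, assuming the conventional meaning of the metric names
rmse  <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
medae <- function(actual, predicted) median(abs(actual - predicted))

# In-sample illustration on a simple linear model (no cross-validation here)
fit  <- lm(mpg ~ cyl, data = mtcars)
pred <- predict(fit)
c(rmse = rmse(mtcars$mpg, pred), medae = medae(mtcars$mpg, pred))
```

Since medae uses the median of the absolute errors, a few large residuals affect it less than they affect rmse, which is one reason to look at a non-default metric.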