The cv.lm() function in R is used to perform k-fold cross-validation on linear regression models. It is primarily available in the DAAG R package, though similar functions exist in other R packages, such as lmvar and cv.
The main purpose of the cv.lm() function is to evaluate how well a regression model will generalize to an independent dataset, helping to assess predictive accuracy and detect overfitting. In this post, I will explain the use, applications, and visualizations of the cv.lm() function.
K-Fold Cross-Validation using cv.lm() Function
Before diving into the cv.lm() function, it is helpful to understand the underlying mechanics. The cv.lm() function performs k-fold cross-validation by following these steps:
- Data Splitting: The dataset is randomly shuffled and split into $m$ equal-sized groups, called folds (this count is often called $k$, hence the name k-fold).
- Iterative Training and Testing: The model is trained and evaluated $m$ times. In each iteration:
- A different fold is held out as the validation set.
- The model is re-fitted using the data from the remaining $m-1$ folds (the training set).
- This fitted model is then used to predict the outcomes for the observations in the held-out validation fold.
- Result Compilation: This process yields a set of cross-validated predictions (cvpred) for every observation in the original dataset, since each data point serves in the validation set exactly once.
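The steps above can be sketched in base R. This is a rough illustration of what CVlm() automates (using a made-up data frame df with response y and predictor x; CVlm() additionally handles printing and plotting):

```r
set.seed(29)
m <- 3
df <- data.frame(x = runif(30), y = rnorm(30))        # toy data
folds <- sample(rep(1:m, length.out = nrow(df)))       # random fold assignment

df$cvpred <- NA
for (i in 1:m) {
  fit <- lm(y ~ x, data = df[folds != i, ])            # refit on the m - 1 training folds
  df$cvpred[folds == i] <- predict(fit, newdata = df[folds == i, ])  # predict held-out fold
}
mean((df$y - df$cvpred)^2)                             # cross-validated mean square
```

Every observation receives exactly one cross-validated prediction, since each fold is held out exactly once.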
cv.lm() Function Syntax and Important Arguments
The primary implementation of this function is CVlm() (or its alias cv.lm()) in the DAAG package.
CVlm(data, form.lm, m = 3, dots = FALSE, seed = 29,
     plotit = c("Observed", "Residual"),
     col.folds = NULL, main = NULL, legend.pos = "topleft",
     printit = TRUE, ...)

The important arguments are:
- data: The data frame containing the variables to be used in the model
- form.lm: A formula, such as y ~ x1 + x2, which defines the linear regression model to be evaluated.
- m: The number of folds to use. For example, $m = 5$ performs 5-fold cross-validation. The default value is 3.
- seed: An integer to set the random number generator seed. This is crucial for reproducibility, ensuring that the random assignment of data points to folds is the same each time you run the function.
- plotit: Controls the graphical output. It can be set to "Observed" (the default), "Residual", or FALSE to suppress plotting.
- printit: A logical value. If set to TRUE (the default), the function prints a detailed summary of the cross-validation results, including the fold number and cross-validated predictions.
- …: Additional arguments to be passed to other functions, such as legend(), for customizing the plot.
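Putting these arguments together, the sketch below (assuming the DAAG package and its houseprices dataset are available) runs the cross-validation silently; CVlm() returns the data frame augmented with a cvpred column of cross-validated predictions, and — per DAAG's documentation — the overall cross-validated mean square is stored in the "ms" attribute of the result:

```r
library(DAAG)  # provides CVlm() and the houseprices dataset

res <- CVlm(data = houseprices, form.lm = formula(sale.price ~ area),
            m = 5, seed = 123, plotit = FALSE, printit = FALSE)

head(res$cvpred)   # cross-validated predictions, one per observation
attr(res, "ms")    # overall cross-validated mean square
```

Capturing the result this way is convenient when you want the numbers without the printed per-fold summary or the plot.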
Practical Applications of cv.lm() Function
cv.lm() is a valuable function in the model-building process. Its primary applications are:
- Estimating Prediction Error: It provides a more reliable estimate of a model’s prediction error on unseen data than the training error (Residual Sum of Squares), which is often overly optimistic.
- Model Comparison: One can use it to compare the cross-validated prediction error of different models (such as a simple model vs a complex model with many predictors). The model with the lower cross-validated error is generally preferred.
- Detecting Overfitting: If a model performs very well on the training data but yields a much higher cross-validated error, it is a strong sign of overfitting.
- Evaluating Model Stability: By repeating cross-validation (or looking at the predictions per fold), one can assess how stable the model’s predictions are across different subsets of the data.
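Model comparison, for instance, can be sketched as follows (assuming the DAAG package; its houseprices data also contains a bedrooms column, and the "ms" attribute holds the overall cross-validated mean square per DAAG's documentation):

```r
library(DAAG)

# Cross-validated mean square for two candidate models fitted with the
# same folds; the model with the lower value is generally preferred.
m1 <- CVlm(data = houseprices, form.lm = formula(sale.price ~ area),
           m = 5, seed = 123, plotit = FALSE, printit = FALSE)
m2 <- CVlm(data = houseprices, form.lm = formula(sale.price ~ area + bedrooms),
           m = 5, seed = 123, plotit = FALSE, printit = FALSE)

c(area.only = attr(m1, "ms"), area.bedrooms = attr(m2, "ms"))
```

Using the same seed for both calls ensures the two models are evaluated on identical fold assignments, which makes the comparison fair.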
Understanding the Visual Output
One of the most helpful features of the cv.lm() function is its ability to visualize the results, which makes it easier to spot patterns and potential problems.
| plotit Argument | Description of Visualization | Interpretation and Use |
|---|---|---|
| "Observed" (or TRUE) | Shows the observed values (y-axis) against the cross-validated predicted values (x-axis). Points from different folds are marked with distinct symbols or colors, and a reference line (usually y = x) represents perfect prediction. | For a good model, the points should cluster closely around the diagonal line. Systematic deviations from the line, such as a curve, can indicate non-linearity or model misspecification. The distinct colors/symbols per fold help you see whether the model's performance is consistent across all data subsets. |
| "Residual" | Displays the residuals (observed minus predicted) against the fitted values. For models with more than one predictor, the lines shown for each fold are approximations. | This is essentially a cross-validated residual plot. Ideally, the residuals should be randomly scattered around zero; a clear pattern, such as a funnel shape, suggests non-constant variance (heteroscedasticity) or another violation of the regression assumptions. |
To get a more concrete idea of how to use the function, here is a conceptual example (assuming the DAAG package is installed and loaded). Let us perform a 5-fold cross-validation on a model predicting sale.price from area.
CVlm(data = houseprices, form.lm = formula(sale.price ~ area),
m = 5, seed = 123, plotit = "Observed", printit = TRUE)
## Output
fold 1
Observations in test set: 3
9 12 15
area 694.000000 1366.0000 821.00000
cvpred 190.426899 380.6042 226.36815
sale.price 192.000000 274.0000 212.00000
CV residual 1.573101 -106.6042 -14.36815
Sum of squares = 11573.38 Mean square = 3857.79 n = 3
fold 2
Observations in test set: 3
14 19 20
area 963.00000 790.000000 696.00000
cvpred 254.26927 216.149108 195.43642
sale.price 185.00000 221.500000 255.00000
CV residual -69.26927 5.350892 59.56358
Sum of squares = 8374.68 Mean square = 2791.56 n = 3
fold 3
Observations in test set: 3
13 17 18
area 716.0000 1018.00000 887.00000
cvpred 216.4961 261.86662 242.18604
sale.price 112.7000 276.00000 260.00000
CV residual -103.7961 14.13338 17.81396
Sum of squares = 11290.73 Mean square = 3763.58 n = 3
fold 4
Observations in test set: 3
10 21 22
area 905.0000 771.00000 1006.00000
cvpred 236.4021 210.77943 255.71471
sale.price 215.0000 260.00000 293.00000
CV residual -21.4021 49.22057 37.28529
Sum of squares = 4270.91 Mean square = 1423.64 n = 3
fold 5
Observations in test set: 3
11 16 23
area 802.000000 714.00000 1191.00
cvpred 218.659431 207.12203 269.66
sale.price 215.000000 220.00000 375.00
CV residual -3.659431 12.87797 105.34
Sum of squares = 11275.75 Mean square = 3758.58 n = 3
Overall (Sum over all 5 folds)
ms
3119.03

One can plot a residual plot instead of "Observed":
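As a quick sanity check before looking at the residual plot, the overall mean square printed above can be reproduced by hand from the five per-fold sums of squares:

```r
# Per-fold sums of squares copied from the printed output above
ss <- c(11573.38, 8374.68, 11290.73, 4270.91, 11275.75)
sum(ss) / 15   # 15 held-out predictions in total; gives ~3119.03
```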
CVlm(data = houseprices, form.lm = formula(sale.price ~ area),
m = 5, seed = 123, plotit = "Residual", printit = TRUE)
The printed output is identical to that of the previous call, since the same seed produces the same folds; only the plot changes, now showing the cross-validated residuals against the fitted values.

Important Considerations for using cv.lm() Function
- Package Source: Be aware that cv.lm() exists in multiple packages (e.g., DAAG, lmvar). Their implementations and outputs can differ. The version in the DAAG package is the most common for basic linear regression.
- For Other Model Types: If you need to perform cross-validation for other types of models, such as logistic regression (glm) or mixed models, consider using the more modern and flexible cv package with its generic cv() function.
- Advanced Computational Methods: For large datasets, refitting the model for each fold can be slow. More advanced implementations, like those discussed in the cv package's vignettes, use computational tricks (such as the Woodbury matrix identity) to calculate cross-validated results more efficiently without completely refitting the model each time.
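For illustration, here is a sketch of the cv package route: fit the model first, then pass it to the generic cv(). (Argument names here follow the cv package's documented interface, but check ?cv against your installed version, as details may differ.)

```r
library(cv)   # generic cross-validation for many model classes

# Fit an ordinary linear model, then cross-validate it
fit <- lm(sale.price ~ area, data = DAAG::houseprices)
cv(fit, k = 5, seed = 123)   # 5-fold CV; reports mean-squared error by default
```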



