K-Fold Cross-Validation

The cv.lm() function in R is used to perform k-fold cross-validation on linear regression models. It is primarily available in the DAAG R package, though similar functions exist in other R packages, such as lmvar and cv.

The main purpose of the cv.lm() function is to evaluate how well a regression model will generalize to an independent dataset, helping to assess predictive accuracy and detect overfitting. In this post, I will explain the use, application, and visualizations of the cv.lm() function.

K-Fold Cross-Validation using cv.lm() Function

Before diving into the cv.lm() function, it is helpful to understand the underlying mechanics. The cv.lm() function performs k-fold cross-validation by following these steps:

  1. Data Splitting: The dataset is randomly shuffled and split into $m$ (often denoted $k$) equal-sized groups, called folds.
  2. Iterative Training and Testing: The model is trained and evaluated $m$ times. In each iteration:
    • A different fold is held out as the validation set.
    • The model is re-fitted using the data from the remaining $m-1$ folds (the training set).
    • This fitted model is then used to predict the outcomes for the observations in the held-out validation fold.
  3. Result Compilation: This process yields a set of cross-validated predictions (cvpred) for every observation in the original dataset, since each data point serves in the validation set exactly once.
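
The steps above can be sketched in a few lines of base R. This is a minimal illustration of the mechanics (not the DAAG implementation), using the built-in mtcars data and a simple mpg ~ wt model:

```r
set.seed(29)          # reproducible fold assignment
data(mtcars)
m <- 5                # number of folds
n <- nrow(mtcars)

# 1. Data splitting: randomly assign each observation to one of m folds
folds <- sample(rep(1:m, length.out = n))

# 2. Iterative training and testing: hold out one fold at a time
cvpred <- numeric(n)
for (i in 1:m) {
  test_idx <- which(folds == i)
  fit <- lm(mpg ~ wt, data = mtcars[-test_idx, ])
  cvpred[test_idx] <- predict(fit, newdata = mtcars[test_idx, ])
}

# 3. Result compilation: every observation now has one cross-validated prediction
cv_mse <- mean((mtcars$mpg - cvpred)^2)
cv_mse
```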

cv.lm() Function Syntax and Important Arguments

The primary implementation of this function is CVlm() (or its alias cv.lm()) in the DAAG package.

CVlm(data, form.lm, m = 3, dots = FALSE, seed = 29, 
     plotit = c("Observed", "Residual"), 
     col.folds = NULL, main = NULL, legend.pos = "topleft", 
     printit = TRUE, ...)

The important arguments are:

  • data: The data frame containing the variables to be used in the model
  • form.lm: A formula, such as y ~ x1 + x2, which defines the linear regression model to be evaluated.
  • m: The number of folds to use. For example, $m=5$ performs 5-fold cross-validation. The default value is 3.
  • seed: An integer to set the random number generator seed. This is crucial for reproducibility, ensuring that the random assignment of data points to folds is the same each time you run the function.
  • plotit: Controls the graphical output. It can be set to “Observed” (the default), “Residual”, or FALSE to suppress plotting.
  • printit: A logical value. If set to TRUE (the default), the function prints a detailed summary of the cross-validation results, including the fold number and cross-validated predictions.
  • …: Additional arguments to be passed to other functions, such as legend() for customizing the plot.

Practical Applications of cv.lm() Function

cv.lm() is a valuable function in the model-building process. Its primary applications are:

  • Estimating Prediction Error: It provides a more reliable estimate of a model’s prediction error on unseen data than the training error (Residual Sum of Squares), which is often overly optimistic.
  • Model Comparison: One can use it to compare the cross-validated prediction error of different models (such as a simple model vs a complex model with many predictors). The model with the lower cross-validated error is generally preferred.
  • Detecting Overfitting: If a model performs very well on the training data but yields a much higher cross-validated error, it is a strong sign of overfitting.
  • Evaluating Model Stability: By repeating cross-validation (or looking at the predictions per fold), one can assess how stable the model’s predictions are across different subsets of the data.
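
For model comparison, the data frame returned by CVlm() carries a cvpred column of cross-validated predictions, from which an overall cross-validated mean squared error can be computed. The sketch below assumes the DAAG package (and its houseprices data, which includes a bedrooms column) is installed:

```r
library(DAAG)  # provides CVlm() and the houseprices data

# Cross-validate two candidate models with the same seed (so the folds match)
out1 <- CVlm(data = houseprices, form.lm = formula(sale.price ~ area),
             m = 5, seed = 123, plotit = FALSE, printit = FALSE)
out2 <- CVlm(data = houseprices, form.lm = formula(sale.price ~ area + bedrooms),
             m = 5, seed = 123, plotit = FALSE, printit = FALSE)

# Overall cross-validated MSE for each model; lower is generally preferred
cv_mse1 <- mean((out1$sale.price - out1$cvpred)^2)
cv_mse2 <- mean((out2$sale.price - out2$cvpred)^2)
c(area_only = cv_mse1, area_plus_bedrooms = cv_mse2)
```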

Understanding the Visual Output

One of the most helpful features of the cv.lm() function is its ability to visualize the results, which makes it easier to spot patterns and potential problems.

The plotit options produce the following visualizations:

  • "Observed" (or TRUE): Plots the observed values (y-axis) against the cross-validated predicted values (x-axis). Points from different folds are marked with distinct symbols or colors, and a reference line (usually y = x) represents perfect prediction. For a good model, the points should cluster closely around this diagonal; systematic deviations, such as a curve, can indicate non-linearity or model misspecification, and the per-fold symbols help you see whether the model's performance is consistent across all data subsets.
  • "Residual": Plots the residuals (observed minus predicted) against the fitted values; for models with more than one predictor, the lines shown for each fold are approximations. This is essentially a cross-validated residual plot: ideally, the residuals scatter randomly around zero with no clear trend, while a pattern such as a funnel shape suggests non-constant variance (heteroscedasticity) or other violations of the regression assumptions.

To get a more concrete idea of how to use the function, here is a conceptual example (assuming the DAAG package is installed and loaded). Let us perform a 5-fold cross-validation on a model predicting sale.price from area.

CVlm(data = houseprices, form.lm = formula(sale.price ~ area), 
     m = 5, seed = 123, plotit = "Observed", printit = TRUE)

## Output
fold 1 
Observations in test set: 3 
                     9        12        15
area        694.000000 1366.0000 821.00000
cvpred      190.426899  380.6042 226.36815
sale.price  192.000000  274.0000 212.00000
CV residual   1.573101 -106.6042 -14.36815

Sum of squares = 11573.38    Mean square = 3857.79    n = 3 

fold 2 
Observations in test set: 3 
                   14         19        20
area        963.00000 790.000000 696.00000
cvpred      254.26927 216.149108 195.43642
sale.price  185.00000 221.500000 255.00000
CV residual -69.26927   5.350892  59.56358

Sum of squares = 8374.68    Mean square = 2791.56    n = 3 

fold 3 
Observations in test set: 3 
                   13         17        18
area         716.0000 1018.00000 887.00000
cvpred       216.4961  261.86662 242.18604
sale.price   112.7000  276.00000 260.00000
CV residual -103.7961   14.13338  17.81396

Sum of squares = 11290.73    Mean square = 3763.58    n = 3 

fold 4 
Observations in test set: 3 
                  10        21         22
area        905.0000 771.00000 1006.00000
cvpred      236.4021 210.77943  255.71471
sale.price  215.0000 260.00000  293.00000
CV residual -21.4021  49.22057   37.28529

Sum of squares = 4270.91    Mean square = 1423.64    n = 3 

fold 5 
Observations in test set: 3 
                    11        16      23
area        802.000000 714.00000 1191.00
cvpred      218.659431 207.12203  269.66
sale.price  215.000000 220.00000  375.00
CV residual  -3.659431  12.87797  105.34

Sum of squares = 11275.75    Mean square = 3758.58    n = 3 

Overall (Sum over all 5 folds) 
     ms 
3119.03
Figure: K-fold cross-validation using the cv.lm() function in R (observed vs cross-validated predictions)

One can request a residual plot instead of the “Observed” plot:

CVlm(data = houseprices, form.lm = formula(sale.price ~ area), 
     m = 5, seed = 123, plotit = "Residual", printit = TRUE)

## Output
The printed summary is identical to that of the previous run, since the same seed produces the same fold assignments; only the graph changes to a cross-validated residual plot.

Figure: cross-validated residual plot from the cv.lm() function in R

Important Considerations for using cv.lm() Function

  • Package Source: Be aware that cv.lm() exists in multiple packages (e.g., DAAG, lmvar), and their implementations and outputs can differ. The version in the DAAG package is the most common for basic linear regression.
  • For Other Model Types: If you need to perform cross-validation for other types of models, such as logistic regression (glm) or mixed models, consider using the more modern and flexible cv package with its generic cv() function.
  • Advanced Computational Methods: For large datasets, refitting the model for each fold can be slow. More advanced implementations, like those discussed in the cv package’s vignettes, use computational tricks (like the Woodbury matrix identity) to calculate cross-validated results more efficiently without completely refitting the model each time.
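
For such cases, the cv package's generic interface is worth a look. As a sketch (assuming the cv package is installed and that its generic cv() function accepts fitted lm objects), 5-fold cross-validation of an ordinary lm fit might look like:

```r
library(cv)  # assumed installed; provides a generic cv() for lm/glm objects

fit <- lm(mpg ~ wt + hp, data = mtcars)
cv(fit, k = 5, seed = 123)  # reports a cross-validated MSE-type criterion
```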


What is Vector Recycling in R?

When two vectors of different lengths are involved in an operation, the elements of the shorter vector are reused to complete the operation. This is called vector (or element) recycling.

Understanding Vector Recycling

In R, vector recycling is an automatic process where shorter vectors are recycled (repeated) to match the length of longer vectors during operations. This fundamental behavior makes R’s vectorized operations both powerful and potentially surprising for newcomers.

How Recycling Works

When one performs an operation between two vectors of different lengths:

  • The shorter vector is repeated until it matches the length of the longer vector
  • The operation is performed element-wise
  • If the length of the longer vector is not a multiple of the shorter vector's length, R will still recycle but will also issue a warning.

Vector Recycling Examples

The following are some useful examples of vector recycling.

Example 1: Vector Recycling for Simple Arithmetic

V1 <- c(1, 2, 3, 4, 5, 6)
V2 <- c(10, 20)
V1 + V2
## Output
[1] 11 22 13 24 15 26

In the above example, the shorter vector V2 is recycled to become c(10, 20, 10, 20, 10, 20). It can be understood as explained below:

# 1 + 10 = 11
# 2 + 20 = 22
# 3 + 10 = 13 (recycled: 10 again)
# 4 + 20 = 24 (recycled: 20 again)
# 5 + 10 = 15 (recycled: 10 again)
# 6 + 20 = 26 (recycled: 20 again)

Example 2: Single Element Recycling

A single element is recycled to match a longer vector. For example

V3 <- 1:5
S <- 2
V3 * S

## Output
[1]  2  4  6  8 10

The above code is equivalent to

c(1, 2, 3, 4, 5) * c(2, 2, 2, 2, 2)

Vector Recycling for Non-Multiple Lengths

When the length of one vector is not a multiple of the other's length, recycling still occurs but with a warning message. For example

V4 <- c(1, 2, 3, 4, 5)   # length 5
V5 <- c(10, 20, 30)      # length 3
V4 + V5

## Output
[1] 11 22 33 14 25
Warning message:
In V4 + V5 :
  longer object length is not a multiple of shorter object length

In the above example, the following recycling was performed:

# V5 was recycled as: c(10, 20, 30, 10, 20)
# 1 + 10 = 11
# 2 + 20 = 22
# 3 + 30 = 33
# 4 + 10 = 14 (recycled from the beginning: 10)
# 5 + 20 = 25 (recycled: 20)
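
To recycle explicitly and avoid the warning, extend the shorter vector to a matching length with rep() before the operation:

```r
V4 <- c(1, 2, 3, 4, 5)
V5 <- c(10, 20, 30)

# Explicitly extend V5 to the length of V4: c(10, 20, 30, 10, 20)
V5_full <- rep(V5, length.out = length(V4))

V4 + V5_full  # 11 22 33 14 25, with no warning
```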

Uses of Vector Recycling

Conditional vectorization

The vectorized ifelse() function relies on recycling: its yes and no values are recycled to the length of the condition vector. Consider the following example

V <- 1:10
cond <- 5
replace <- c(0, 1) # will be recycled

result <- ifelse(V > cond, replace[1], replace[2])
result

## Output
[1] 1 1 1 1 1 0 0 0 0 0

In the example above, elements of V greater than 5 become 0 and all other elements become 1; the scalar replacement values are recycled to the length of V.

Explicit Recycling

The rep() function generates sequences by explicit recycling, repeating a vector a specified number of times.

R1 <- rep(c("A", "B", "C"), times = 2)
R1
## Output
[1] "A" "B" "C" "A" "B" "C"

R2 <- rep(c("A", "B", "C"), each = 2)
R2
## Output
[1] "A" "A" "B" "B" "C" "C"

Best Practices and Caveats for Vector Recycling

Always Check Vector Lengths

vec1 <- 1:6   # example vectors with non-multiple lengths
vec2 <- 1:4

length_vec1 <- length(vec1)
length_vec2 <- length(vec2)

if (length_vec1 != length_vec2) {
  cat("Warning: Different lengths (", length_vec1, " vs ", length_vec2, ")\n")
  cat("Recycling will occur\n")
}

Common Pitfalls to Avoid

The following are some common pitfalls that need to be avoided.

# Pitfall 1: Unexpected results with differing lengths
x <- 1:6
y <- 1:3
z <- x + y  # Recycles silently (6 is a multiple of 3), but might not be intended

# Pitfall 2: Recycling inside data.frame()
df <- data.frame(a = 1:6, b = 1:3)  # b is recycled silently to length 6;
                                    # an error is raised only for non-multiples

# Pitfall 3: Matrix filling confusion
mat <- matrix(1:4, nrow = 2, ncol = 6)  # Recycles 1:4 to fill the 2 x 6 matrix

Summary

Vector recycling is a core R feature that

  • enables concise code for vectorized operations
  • makes scalar operations natural
  • facilitates pattern creation and data transformation
  • can cause unexpected results if not understood
  • may produce warnings with non-multiple lengths

Therefore, always be aware of vector lengths. When in doubt, make vectors the same length explicitly using the rep() function, or check lengths with the length() function.
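
One defensive option is a small wrapper that refuses to recycle at all. The helper below (strict_add is a hypothetical name, not a base R function) errors unless the lengths match exactly:

```r
# Hypothetical helper: add two vectors only when their lengths match exactly
strict_add <- function(a, b) {
  if (length(a) != length(b)) {
    stop("Lengths differ: ", length(a), " vs ", length(b))
  }
  a + b
}

strict_add(1:3, c(10, 20, 30))  # 11 22 33
# strict_add(1:6, 1:3)          # would stop() instead of recycling silently
```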


corrgram Function in R

The corrgram function in R from the corrgram package is a powerful tool for creating correlation matrix visualizations. It combines numerical correlation values with graphical representations to help identify patterns, relationships, and outliers in multivariate data.

corrgram Package Installation

The corrgram package first needs to be installed. The following commands install and load the corrgram package.

# Install and load the package
install.packages("corrgram")
library(corrgram)

# Load additional packages for examples
library(datasets)
library(corrplot)  # For comparison

One can also use the corrplot package for similar correlation visualizations.

Syntax of the corrgram Function

The general syntax of the corrgram function in the R Language is:

corrgram(x, order = FALSE, panel = panel.shade, lower.panel = panel, 
         upper.panel = panel, diag.panel = NULL, text.panel = textPanel,
         label.pos = c(0.5, 0.5), label.srt = 0, cex.labels = NULL,
         font.labels = 1, row1attop = TRUE, dir = "", gap = 0, abs = FALSE, ...)

corrgram Examples

The following example uses the mtcars dataset to draw a correlation matrix visualization. A numerical correlation matrix is also produced using the cor() function.

# Load mtcars dataset
data(mtcars)

# Basic corrgram
corrgram(mtcars, 
         main = "Correlation Matrix of mtcars Dataset",
         cex.main = 1.2)

# Calculate numerical correlations
cor_matrix <- cor(mtcars)
round(cor_matrix, 3)

## OUTPUT
        mpg    cyl   disp     hp   drat     wt   qsec     vs     am   gear   carb
mpg   1.000 -0.852 -0.848 -0.776  0.681 -0.868  0.419  0.664  0.600  0.480 -0.551
cyl  -0.852  1.000  0.902  0.832 -0.700  0.782 -0.591 -0.811 -0.523 -0.493  0.527
disp -0.848  0.902  1.000  0.791 -0.710  0.888 -0.434 -0.710 -0.591 -0.556  0.395
hp   -0.776  0.832  0.791  1.000 -0.449  0.659 -0.708 -0.723 -0.243 -0.126  0.750
drat  0.681 -0.700 -0.710 -0.449  1.000 -0.712  0.091  0.440  0.713  0.700 -0.091
wt   -0.868  0.782  0.888  0.659 -0.712  1.000 -0.175 -0.555 -0.692 -0.583  0.428
qsec  0.419 -0.591 -0.434 -0.708  0.091 -0.175  1.000  0.745 -0.230 -0.213 -0.656
vs    0.664 -0.811 -0.710 -0.723  0.440 -0.555  0.745  1.000  0.168  0.206 -0.570
am    0.600 -0.523 -0.591 -0.243  0.713 -0.692 -0.230  0.168  1.000  0.794  0.058
gear  0.480 -0.493 -0.556 -0.126  0.700 -0.583 -0.213  0.206  0.794  1.000  0.274
carb -0.551  0.527  0.395  0.750 -0.091  0.428 -0.656 -0.570  0.058  0.274  1.000

Note that the diagonal shows the variable names, while the off-diagonal panels show shaded cells whose color and intensity indicate the direction and strength of each correlation. With custom panel functions (shown below), the lower and upper triangles can instead display ellipses, pies, confidence intervals, or scatter plots.

Figure: basic corrgram of the mtcars dataset

The visualization and numerical results show the following:

  • Blue shading indicates positive correlations
  • Red shading indicates negative correlations
  • Lighter colors indicate weaker correlations
  • For example, mpg and wt show a strong negative correlation (-0.87)
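
The numerical matrix can also be mined programmatically. For instance, this base R snippet lists every variable pair in mtcars whose absolute correlation exceeds 0.8:

```r
data(mtcars)
cor_matrix <- cor(mtcars)

# Keep each pair once (upper triangle only), then filter by |r| > 0.8
high <- which(abs(cor_matrix) > 0.8 & upper.tri(cor_matrix), arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(cor_matrix)[high[, "row"]],
  var2 = colnames(cor_matrix)[high[, "col"]],
  r    = round(cor_matrix[high], 3)
)
strong_pairs  # includes, e.g., the mpg-wt pair (r = -0.868)
```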

Customizing Panel and Ordering

One can easily customize the panel and ordering. For example

# Custom panel functions
corrgram(mtcars, 
         order = TRUE,  # PCA ordering
         lower.panel = panel.pie,  # Pies in lower triangle
         upper.panel = panel.conf, # Confidence intervals in upper
         diag.panel = panel.density, # Density plots on diagonal
         main = "Customized Corrgram")

# Alternative with different panels
corrgram(mtcars,
         lower.panel = panel.shade,
         upper.panel = panel.pts,  # Scatter plots
         diag.panel = panel.minmax, # Min-max values
         cex.labels = 1.2)

  • PCA ordering (order = TRUE) groups highly correlated variables together
  • Pies show the proportion of correlation (filled portion = |r|)
  • Shading intensity indicates correlation strength
  • Scatter plots in the upper triangle show actual data relationships

Best Practices and Interpretation Guidelines

The following are best practices and interpretation guidelines when using corrgram function in R:

  1. Color Interpretation:
    • Blue = Positive correlation
    • Red = Negative correlation
    • Saturation intensity = Strength of correlation
    • White = No correlation
  2. Pattern Recognition:
    • Blocks of similar colors indicate variable clusters
    • Check for multicollinearity (high correlations among predictors)
    • Look for unexpected correlations that might indicate data issues
  3. Statistical Considerations:
    • Correlation does not imply causation
    • Check statistical significance (e.g., with cor.test()) before over-interpreting a coefficient, especially in small samples
  4. When to Use Different Panels:
    • panel.shade: Quick overview of correlation structure
    • panel.pie: Emphasize correlation magnitude
    • panel.ellipse: Show confidence and data spread
    • panel.pts: Identify outliers and nonlinear patterns

Summary of using corrgram Function in R

The corrgram function in R is an excellent tool for exploratory data analysis, providing both visual and numerical insights into correlation structures. The key takeaways are:

  1. Start with basic plots and add customization as needed
  2. Always complement visual analysis with numerical correlation values
  3. Consider statistical significance when interpreting patterns
  4. Use appropriate correlation methods for your data type
  5. Combine corrgram() with other EDA tools for comprehensive analysis

The corrgram function in R is particularly valuable in the early stages of data analysis, helping to identify relationships, potential problems, and directions for further investigation.