K-Fold Cross-Validation

The cv.lm() function in R is used to perform k-fold cross-validation on linear regression models. The function is primarily available in the DAAG R package, though similar functions exist in other R packages, such as lmvar and cv.

The main purpose of the cv.lm() function is to evaluate how well a regression model generalizes to an independent dataset, helping to assess predictive accuracy and detect overfitting. In this post, I will explain the use, application, and visualizations of the cv.lm() function.

K-Fold Cross-Validation using cv.lm() Function

Before diving into the cv.lm() function, it is helpful to understand the underlying mechanics. The cv.lm() function performs k-fold cross-validation by following these steps:

  1. Data Splitting: The dataset is randomly shuffled and split into $m$ (often also written $k$) equal-sized groups, called folds.
  2. Iterative Training and Testing: The model is trained and evaluated $m$ times. In each iteration:
    • A different fold is held out as the validation set.
    • The model is re-fitted using the data from the remaining $m-1$ folds (the training set).
    • This fitted model is then used to predict the outcomes for the observations in the held-out validation fold.
  3. Result Compilation: This process yields a set of cross-validated predictions (cvpred) for every observation in the original dataset, since each data point serves in the validation set exactly once.
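The three steps above can be sketched in a few lines of base R. This is an illustrative re-implementation, not the DAAG code; the data frame df and the model y ~ x are made up for the example.

```r
# Step 1: simulated data and a random assignment of rows to m folds
set.seed(29)
df <- data.frame(x = runif(30, 0, 10))
df$y <- 2 + 3 * df$x + rnorm(30)

m <- 3
folds <- sample(rep(1:m, length.out = nrow(df)))  # shuffled fold labels

# Step 2: hold out each fold in turn, re-fit on the rest, predict the fold
cvpred <- numeric(nrow(df))
for (i in 1:m) {
  test <- folds == i
  fit  <- lm(y ~ x, data = df[!test, ])          # fit on the other m-1 folds
  cvpred[test] <- predict(fit, newdata = df[test, ])
}

# Step 3: every observation now has exactly one cross-validated prediction
cv_mse <- mean((df$y - cvpred)^2)                # overall cross-validated mean square
```

Because each row lands in the validation set exactly once, cvpred covers the whole dataset, which is what cv.lm() reports fold by fold.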

cv.lm() Function Syntax and Important Arguments

The primary implementation of this function is CVlm() (or its alias cv.lm()) in the DAAG package.

CVlm(data, form.lm, m = 3, dots = FALSE, seed = 29, 
     plotit = c("Observed", "Residual"), 
     main = "Small symbols show cross-validation predicted values", 
     legend.pos = "topleft", printit = TRUE, ...)

The important arguments are:

  • data: The data frame containing the variables to be used in the model
  • form.lm: A formula, such as y ~ x1 + x2, which defines the linear regression model to be evaluated.
  • m: The number of folds to use. For example, $m=5$ performs 5-fold cross-validation. The default value is 3.
  • seed: An integer to set the random number generator seed. This is crucial for reproducibility, ensuring that the random assignment of data points to folds is the same each time you run the function.
  • plotit: Controls the graphical output. It can be set to "Observed" (the default), "Residual", or FALSE to suppress plotting.
  • printit: A logical value. If set to TRUE (the default), the function prints a detailed summary of the cross-validation results, including the fold number and cross-validated predictions.
  • …: Additional arguments to be passed to other functions, such as legend() for customizing the plot.

Practical Applications of cv.lm() Function

cv.lm() is a valuable function in the model-building process. Its primary applications are:

  • Estimating Prediction Error: It provides a more reliable estimate of a model’s prediction error on unseen data than the training error (Residual Sum of Squares), which is often overly optimistic.
  • Model Comparison: One can use it to compare the cross-validated prediction error of different models (such as a simple model vs a complex model with many predictors). The model with the lower cross-validated error is generally preferred.
  • Detecting Overfitting: If a model performs very well on the training data but yields a much higher cross-validated error, it is a strong sign of overfitting.
  • Evaluating Model Stability: By repeating cross-validation (or looking at the predictions per fold), one can assess how stable the model’s predictions are across different subsets of the data.
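The model-comparison workflow can be sketched in base R (using the built-in mtcars data rather than the DAAG example data). Here cv_mse is a hypothetical helper name that wraps the fold loop, so two candidate formulas can be scored on the same folds:

```r
# Compute the k-fold cross-validated MSE of a linear model formula
cv_mse <- function(form, data, m = 5, seed = 1) {
  set.seed(seed)
  folds <- sample(rep(1:m, length.out = nrow(data)))
  pred <- numeric(nrow(data))
  for (i in 1:m) {
    test <- folds == i
    fit  <- lm(form, data = data[!test, ])
    pred[test] <- predict(fit, newdata = data[test, ])
  }
  mean((data[[all.vars(form)[1]]] - pred)^2)   # response is the first variable
}

simple  <- cv_mse(mpg ~ wt,      mtcars)   # simpler candidate
complex <- cv_mse(mpg ~ wt + hp, mtcars)   # candidate with an extra predictor
# Prefer whichever model has the lower cross-validated error.
```

Because both models are scored on identical fold assignments (same seed), the comparison is fair.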

Understanding the Visual Output

One of the most helpful features of the cv.lm() function is its ability to visualize the results, which makes it easier to spot patterns and potential problems.

plotit = "Observed" (or TRUE)

  • Visualization: the observed values (y-axis) are plotted against the cross-validated predicted values (x-axis). Data points from different folds are usually marked with distinct symbols or colors, and a reference line (usually y = x) is added, representing perfect prediction.
  • Interpretation and use: for a good model, the points should cluster closely around the diagonal line. Systematic deviations from the line, such as a curve, can indicate non-linearity or model misspecification. The different colors/symbols for each fold help you see whether the model's performance is consistent across all data subsets.

plotit = "Residual"

  • Visualization: the residuals (observed minus predicted) are displayed against the fitted values. For models with more than one predictor, the lines shown for each fold are approximations.
  • Interpretation and use: this is essentially a cross-validated residual plot. Ideally, the residuals should be randomly scattered around zero with no clear trend; a pattern such as a funnel shape suggests non-constant variance. This plot helps in diagnosing violations of regression assumptions like heteroscedasticity.

To get a more concrete idea of how to use the function, here is an example (assuming the DAAG package is installed and loaded). Let us perform a 5-fold cross-validation on a model predicting sale.price from area in the houseprices data set.

CVlm(data = houseprices, form.lm = formula(sale.price ~ area), 
     m = 5, seed = 123, plotit = "Observed", printit = TRUE)

## Output
fold 1 
Observations in test set: 3 
                     9        12        15
area        694.000000 1366.0000 821.00000
cvpred      190.426899  380.6042 226.36815
sale.price  192.000000  274.0000 212.00000
CV residual   1.573101 -106.6042 -14.36815

Sum of squares = 11573.38    Mean square = 3857.79    n = 3 

fold 2 
Observations in test set: 3 
                   14         19        20
area        963.00000 790.000000 696.00000
cvpred      254.26927 216.149108 195.43642
sale.price  185.00000 221.500000 255.00000
CV residual -69.26927   5.350892  59.56358

Sum of squares = 8374.68    Mean square = 2791.56    n = 3 

fold 3 
Observations in test set: 3 
                   13         17        18
area         716.0000 1018.00000 887.00000
cvpred       216.4961  261.86662 242.18604
sale.price   112.7000  276.00000 260.00000
CV residual -103.7961   14.13338  17.81396

Sum of squares = 11290.73    Mean square = 3763.58    n = 3 

fold 4 
Observations in test set: 3 
                  10        21         22
area        905.0000 771.00000 1006.00000
cvpred      236.4021 210.77943  255.71471
sale.price  215.0000 260.00000  293.00000
CV residual -21.4021  49.22057   37.28529

Sum of squares = 4270.91    Mean square = 1423.64    n = 3 

fold 5 
Observations in test set: 3 
                    11        16      23
area        802.000000 714.00000 1191.00
cvpred      218.659431 207.12203  269.66
sale.price  215.000000 220.00000  375.00
CV residual  -3.659431  12.87797  105.34

Sum of squares = 11275.75    Mean square = 3758.58    n = 3 

Overall (Sum over all 5 folds) 
     ms 
3119.03
[Plot: K-fold cross-validation using the cv.lm() function in R — observed values against cross-validated predictions, marked by fold]

One can request a residual plot instead by setting plotit = "Residual".

CVlm(data = houseprices, form.lm = formula(sale.price ~ area), 
     m = 5, seed = 123, plotit = "Residual", printit = TRUE)

## Output
The printed summary is identical to the previous call, since the same seed produces the same fold assignment; only the plot changes, now showing the cross-validated residuals for each fold.

[Plot: K-fold cross-validation residual plot from the cv.lm() function in R]

Important Considerations for using cv.lm() Function

  • Package Source: Be aware that cv.lm() exists in multiple packages (e.g., DAAG, lmvar). Their implementations and outputs can differ. The version in the DAAG package is the most common for basic linear regression.
  • For Other Model Types: If you need to perform cross-validation for other types of models, such as logistic regression (glm) or mixed models, consider using the more modern and flexible cv package with its generic cv() function.
  • Advanced Computational Methods: For large datasets, refitting the model for each fold can be slow. More advanced implementations, like those discussed in the cv package’s vignettes, use computational tricks (like the Woodbury matrix identity) to calculate cross-validated results more efficiently without completely refitting the model each time.
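One such shortcut is exact for ordinary lm() models in the leave-one-out case: the i-th leave-one-out residual equals the ordinary residual divided by 1 - h_i, where h_i is the leverage (hat value), so no refitting is needed at all. A base-R sketch:

```r
# LOOCV for a linear model without refitting, via the PRESS identity:
# loo_residual_i = residual_i / (1 - h_i)
fit <- lm(mpg ~ wt + hp, data = mtcars)

loo_resid <- residuals(fit) / (1 - hatvalues(fit))
loo_mse   <- mean(loo_resid^2)   # exact leave-one-out mean squared error
```

This gives exactly the same residuals as fitting n separate models, each with one observation held out, at a fraction of the cost.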


corrgram Function in R

The corrgram function in R from the corrgram package is a powerful tool for creating correlation matrix visualizations. It combines numerical correlation values with graphical representations to help identify patterns, relationships, and outliers in multivariate data.

corrgram Package Installation

The corrgram package first needs to be installed. The following commands install and load the corrgram package.

# Install and load the package
install.packages("corrgram")
library(corrgram)

# Load additional packages for examples
library(datasets)
library(corrplot)  # For comparison

One can also use the corrplot package for correlation visualizations.

Syntax of the corrgram Function

The general syntax of the corrgram function in R is:

corrgram(x, order = FALSE, panel = panel.shade, lower.panel = panel, 
         upper.panel = panel, diag.panel = NULL, text.panel = textPanel,
         label.pos = c(0.5, 0.5), label.srt = 0, cex.labels = NULL,
         font.labels = 1, row1attop = TRUE, dir = "", gap = 0, abs = FALSE, ...)

corrgram Examples

The following example makes use of the mtcars data set to draw a correlation matrix visualization. A numerical correlation matrix is also produced by using the cor() function.

# Load mtcars dataset
data(mtcars)

# Basic corrgram
corrgram(mtcars, 
         main = "Correlation Matrix of mtcars Dataset",
         cex.main = 1.2)

# Calculate numerical correlations
cor_matrix <- cor(mtcars)
round(cor_matrix, 3)

## OUTPUT
        mpg    cyl   disp     hp   drat     wt   qsec     vs     am   gear   carb
mpg   1.000 -0.852 -0.848 -0.776  0.681 -0.868  0.419  0.664  0.600  0.480 -0.551
cyl  -0.852  1.000  0.902  0.832 -0.700  0.782 -0.591 -0.811 -0.523 -0.493  0.527
disp -0.848  0.902  1.000  0.791 -0.710  0.888 -0.434 -0.710 -0.591 -0.556  0.395
hp   -0.776  0.832  0.791  1.000 -0.449  0.659 -0.708 -0.723 -0.243 -0.126  0.750
drat  0.681 -0.700 -0.710 -0.449  1.000 -0.712  0.091  0.440  0.713  0.700 -0.091
wt   -0.868  0.782  0.888  0.659 -0.712  1.000 -0.175 -0.555 -0.692 -0.583  0.428
qsec  0.419 -0.591 -0.434 -0.708  0.091 -0.175  1.000  0.745 -0.230 -0.213 -0.656
vs    0.664 -0.811 -0.710 -0.723  0.440 -0.555  0.745  1.000  0.168  0.206 -0.570
am    0.600 -0.523 -0.591 -0.243  0.713 -0.692 -0.230  0.168  1.000  0.794  0.058
gear  0.480 -0.493 -0.556 -0.126  0.700 -0.583 -0.213  0.206  0.794  1.000  0.274
carb -0.551  0.527  0.395  0.750 -0.091  0.428 -0.656 -0.570  0.058  0.274  1.000

Note that the diagonal shows the variable names, the upper triangle shows colored squares with correlation coefficients, and the lower triangle shows colored ellipses/pies indicating the strength and direction of each correlation.

[Plot: corrgram function in R applied to the mtcars data set]

The visualization and numerical results show the following:

  • Dark blue/positive ellipses indicate strong positive correlations
  • Red/negative ellipses indicate strong negative correlations
  • Lighter colors indicate weaker correlations
  • For example, mpg and wt show a strong negative correlation (-0.87)

Customizing Panel and Ordering

One can easily customize the panel and ordering. For example

# Custom panel functions
corrgram(mtcars, 
         order = TRUE,  # PCA ordering
         lower.panel = panel.pie,  # Pies in lower triangle
         upper.panel = panel.conf, # Confidence intervals in upper
         diag.panel = panel.density, # Density plots on diagonal
         main = "Customized Corrgram")

# Alternative with different panels
corrgram(mtcars,
         lower.panel = panel.shade,
         upper.panel = panel.pts,  # Scatter plots
         diag.panel = panel.minmax, # Min-max values
         cex.labels = 1.2)

  • PCA ordering groups highly correlated variables together
  • Pies show the proportion of correlation (filled portion = |r|)
  • Shading intensity indicates correlation strength
  • Scatter plots in the upper triangle show actual data relationships
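To get a feel for what PCA ordering does, one can approximate it in base R by ordering variables by the angle of their loadings on the first two eigenvectors of the correlation matrix. This is a simplified sketch of the idea behind order = TRUE, not corrgram's actual code:

```r
# Eigen-decompose the correlation matrix and order variables by the
# angular position of their first two loadings
e   <- eigen(cor(mtcars))
ang <- atan2(e$vectors[, 2], e$vectors[, 1])
colnames(mtcars)[order(ang)]   # highly correlated variables tend to end up adjacent
```

Variables that load similarly on the leading components sit at similar angles, which is why blocks of correlated variables appear next to each other in an ordered corrgram.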

Best Practices and Interpretation Guidelines

The following are best practices and interpretation guidelines when using corrgram function in R:

  1. Color Interpretation:
    • Blue = Positive correlation
    • Red = Negative correlation
    • Saturation intensity = Strength of correlation
    • White = No correlation
  2. Pattern Recognition:
    • Blocks of similar colors indicate variable clusters
    • Check for multicollinearity (high correlations among predictors)
    • Look for unexpected correlations that might indicate data issues
  3. Statistical Considerations:
    • Correlation measures linear association only and does not imply causation
    • With small samples, even moderate correlations may not be statistically significant, so check p-values before over-interpreting the plot
  4. When to Use Different Panels:
    • panel.shade: Quick overview of correlation structure
    • panel.pie: Emphasize correlation magnitude
    • panel.ellipse: Show confidence and data spread
    • panel.pts: Identify outliers and nonlinear patterns
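To back up the statistical-significance point with numbers, base R's cor.test() complements the corrgram display. For example, for mpg and wt:

```r
# Test whether the mpg-wt correlation differs significantly from zero
ct <- cor.test(mtcars$mpg, mtcars$wt)
ct$estimate   # about -0.87, matching the corrgram panel
ct$p.value    # far below 0.05: the correlation is significant
```

Running cor.test() on the pairs that look interesting in the corrgram guards against over-interpreting weak, noisy correlations.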

Summary of using corrgram Function in R

The corrgram function in R is an excellent tool for exploratory data analysis, providing both visual and numerical insights into correlation structures. The key takeaways:

  1. Start with basic plots and add customization as needed
  2. Always complement visual analysis with numerical correlation values
  3. Consider statistical significance when interpreting patterns
  4. Use appropriate correlation methods for your data type
  5. Combine corrgram() with other EDA tools for comprehensive analysis

The corrgram function in R is particularly valuable in the early stages of data analysis, helping to identify relationships, potential problems, and directions for further investigation.

stepAIC in R: Shortcut to the Perfect Model

If you are tired of guessing which variables belong in your regression model, stepAIC() in R's MASS package is about to become your new best friend. This powerful function automates model selection using the Akaike Information Criterion (AIC), essentially providing a data-driven approach to building better predictive models without the headache.

Think of stepAIC as your personal statistical assistant that tries different combinations of variables, keeping what works and dropping what does not, all while balancing model complexity with explanatory power.

How Does stepAIC Actually Work? The Simple Explanation

The AIC Magic Behind the Scenes

AIC (Akaike Information Criterion) is like a “model score” that balances two things:

  1. How well your model fits the data (goodness of fit)
  2. How many variables it uses (model complexity)

Lower AIC score = Better model
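Concretely, AIC = -2 log L + 2p, where L is the maximized likelihood and p the number of estimated parameters. A quick base-R check that stats::AIC() follows this formula:

```r
# Recompute AIC by hand from the log-likelihood and parameter count
fit <- lm(mpg ~ wt + hp, data = mtcars)
ll  <- logLik(fit)

aic_manual <- -2 * as.numeric(ll) + 2 * attr(ll, "df")
aic_manual   # identical to AIC(fit)
```

The 2p term is the complexity penalty: adding a predictor must improve the log-likelihood by at least one unit to pay for itself.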

The job of stepAIC in R is to find the model with the lowest score.

The Three Search Strategies

  • Forward selection
  • Backward elimination
  • Both directions

# Forward selection: starts from a minimal model, adds helpful variables
# (requires a scope argument naming the largest model to consider)
stepAIC(null_model, scope = formula(full_model), direction = "forward")

# Backward elimination: starts with everything, removes unhelpful variables  
stepAIC(full_model, direction = "backward")

# Both directions: the most thorough approach (usually best)
stepAIC(full_model, direction = "both")

Installation and Basic Setup

# Install and load the MASS package
install.packages("MASS")
library(MASS)

# Load sample data
data(mtcars)

# Create a full model with all variables
full_model <- lm(mpg ~ ., data = mtcars)

Running Your First stepAIC

# Let stepAIC find the optimal model
best_model <- stepAIC(full_model, direction = "both")

# See what variables survived
summary(best_model)

[Output: stepAIC in R for best model selection with the MASS package]

Real Business Example: Predicting House Prices

# Imagine you are a real estate analyst with 20 potential predictors
# stepAIC helps you identify which actually matter:
# Square footage ✓
# Number of bedrooms ✓
# Distance to school ✗ (not significant)
# Pool ✗ (not worth the complexity)
# Result: Cleaner model, better predictions, clearer insights

Why Data Scientists Love stepAIC: Key Benefits

Automated Variable Selection: No more manual “try-and-check” loops. The stepAIC in R systematically tests combinations far more efficiently than manual searching.
Avoids Overfitting: By penalizing unnecessary complexity, stepAIC helps create models that generalize better to new data.
Interpretable Results: The final model typically includes only meaningful predictors, making it easier to explain to stakeholders.
Time-Saving Efficiency: What might take hours of manual testing can be completed in seconds.

Common Pitfalls and How to Avoid Them

Do not Blindly Trust the Output

The stepAIC in R is a tool, not an oracle. Always:

  • Check model assumptions
  • Consider domain knowledge
  • Validate with out-of-sample testing

The Multicollinearity Trap

The stepAIC in R does not automatically detect correlated predictors. Use VIF (Variance Inflation Factor) checks:

car::vif(best_model)  # Values > 5 indicate problems

See the mctest package for the detection of multicollinearity.
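If the car package is not available, VIF values can also be computed by hand from the definition VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the remaining predictors. Here vif_manual is a made-up helper name, shown as a base-R sketch:

```r
# Compute VIF for each predictor by regressing it on the others
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)   # VIF_j = 1 / (1 - R^2_j)
  })
}

vif_manual(mtcars, c("wt", "hp", "disp"))
```

Values above roughly 5 signal predictors so intertwined that stepAIC's choice among them can be unstable.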

Small Sample Warnings

With limited data, stepAIC can be unstable. Consider cross-validation for small datasets.

Setting Search Limits

# Control how hard stepAIC searches
stepAIC(model, 
        scope = list(lower = ~1, upper = full_model),
        k = 2,  # Penalty for complexity (default 2)
        steps = 1000)  # Maximum steps to prevent endless search

Combining with Other Techniques

# Use after regularization for best results
library(glmnet)
# First: LASSO for variable screening
# Then: stepAIC for final refinement

Note that

  • stepAIC: Selects or drops whole variables
  • LASSO: Can shrink coefficients to zero
  • Best practice: Use LASSO first, then stepAIC

Frequently Asked Questions

Is stepAIC cheating? Isn't this p-hacking?

When used properly (with validation), it is legitimate model building. When abused (testing dozens of models without correction), it becomes problematic. Always use cross-validation!

How many variables can stepAIC handle?

Practically, 30-40. Beyond that, consider dimensionality reduction first.

Can I use stepAIC for logistic regression?

Absolutely! It works with glm() models too:

logit_model <- glm(outcome ~ ., data = data, family = binomial)
best_logit <- stepAIC(logit_model)

Summary

stepAIC in R from the MASS package is one of those R programming tools that delivers tremendous value for minimal effort. While it will not replace statistical thinking, it eliminates the grunt work of model selection, letting you focus on interpretation and application.

Remember: The best model is not always the one with the highest R-squared or the lowest AIC: it is the one that solves your business problem most effectively. stepAIC gets you 80% of the way there; your expertise covers the final, crucial 20%.