K-Fold Cross-Validation

The cv.lm() function in R is used to perform k-fold cross-validation on linear regression models. It is provided by the DAAG R package, though similar functions exist in other R packages, such as lmvar and cv.

The main purpose of the cv.lm() function is to evaluate how well a regression model will generalize to an independent dataset, helping to assess predictive accuracy and detect overfitting. In this post, I will explain the use, application, and visualizations of the cv.lm() function.

K-Fold Cross-Validation using cv.lm() Function

Before diving into the cv.lm() function, it is helpful to understand the underlying mechanics. The cv.lm() function performs k-fold cross-validation by following these steps (sketched in code after the list):

  1. Data Splitting: The dataset is randomly shuffled and split into $m$ (commonly denoted $k$) roughly equal-sized groups, called folds.
  2. Iterative Training and Testing: The model is trained and evaluated $m$ times. In each iteration:
    • A different fold is held out as the validation set.
    • The model is re-fitted using the data from the remaining $m-1$ folds (the training set).
    • This fitted model is then used to predict the outcomes for the observations in the held-out validation fold.
  3. Result Compilation: This process yields a set of cross-validated predictions (cvpred) for every observation in the original dataset, since each data point serves in the validation set exactly once.
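
To make these mechanics concrete, here is a minimal base-R sketch of the same procedure. It assumes the houseprices data from the DAAG package (used again in the example later in this post); the fold count and formula are arbitrary choices for illustration.

library(DAAG)  # provides the houseprices example data

set.seed(29)
m <- 5
folds <- sample(rep(1:m, length.out = nrow(houseprices)))  # random fold labels

cv_pred <- numeric(nrow(houseprices))
for (i in 1:m) {
  test <- folds == i
  fit  <- lm(sale.price ~ area, data = houseprices[!test, ])    # refit on m - 1 folds
  cv_pred[test] <- predict(fit, newdata = houseprices[test, ])  # predict held-out fold
}

mean((houseprices$sale.price - cv_pred)^2)  # overall cross-validated mean square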

cv.lm() Function Syntax and Important Arguments

The primary implementation of this function is CVlm() (or its alias cv.lm()) in the DAAG package.

CVlm(data, form.lm, m = 3, dots = FALSE, seed = 29,
     plotit = c("Observed", "Residual"),
     main = NULL, legend.pos = "topleft",
     printit = TRUE, ...)

The important arguments are:

  • data: The data frame containing the variables to be used in the model.
  • form.lm: A formula, such as y ~ x1 + x2, which defines the linear regression model to be evaluated.
  • m: The number of folds to use. For example, $m=5$ performs 5-fold cross-validation. The default value is 3.
  • seed: An integer to set the random number generator seed. This is crucial for reproducibility, ensuring that the random assignment of data points to folds is the same each time you run the function.
  • plotit: Controls the graphical output. It can be set to “Observed” (the default), “Residual”, or FALSE to suppress plotting.
  • printit: A logical value. If set to TRUE (the default), the function prints a detailed summary of the cross-validation results, including the fold number and cross-validated predictions.
  • ...: Additional arguments to be passed to other functions, such as legend() for customizing the plot.

Practical Applications of cv.lm() Function

cv.lm() is a valuable function in the model-building process. Its primary applications are:

  • Estimating Prediction Error: It provides a more reliable estimate of a model’s prediction error on unseen data than the training error (Residual Sum of Squares), which is often overly optimistic.
  • Model Comparison: One can use it to compare the cross-validated prediction error of different models (such as a simple model vs. a complex model with many predictors); see the sketch after this list. The model with the lower cross-validated error is generally preferred.
  • Detecting Overfitting: If a model performs very well on the training data but yields a much higher cross-validated error, it is a strong sign of overfitting.
  • Evaluating Model Stability: By repeating cross-validation (or looking at the predictions per fold), one can assess how stable the model’s predictions are across different subsets of the data.
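
As an illustration of the model-comparison use, here is a hedged sketch that compares two candidate models by their cross-validated mean squared error. It assumes DAAG's houseprices data (which also has a bedrooms column) and relies on CVlm() returning the input data augmented with a cvpred column; check your installed DAAG version's documentation for the exact return value.

library(DAAG)

cv1 <- CVlm(data = houseprices, form.lm = sale.price ~ area,
            m = 5, seed = 123, plotit = FALSE, printit = FALSE)
cv2 <- CVlm(data = houseprices, form.lm = sale.price ~ area + bedrooms,
            m = 5, seed = 123, plotit = FALSE, printit = FALSE)

# Lower cross-validated MSE is preferred
mean((cv1$sale.price - cv1$cvpred)^2)
mean((cv2$sale.price - cv2$cvpred)^2)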

Understanding the Visual Output

One of the most helpful features of the cv.lm() function is its ability to visualize the results, making it easier to spot patterns and potential problems.

The plotit argument controls which visualization is produced:

  • "Observed" (or TRUE): This plot shows the observed values (on the y-axis) against the cross-validated predicted values (on the x-axis). Data points from different folds are usually marked with distinct symbols or colors, and a reference line (usually y = x) is added, representing perfect prediction. Interpretation and use: For a good model, the points should cluster closely around the diagonal line. Systematic deviations from the line, such as a curve, can indicate non-linearity or model misspecification. The different colors/symbols for each fold help you see whether the model's performance is consistent across all data subsets.
  • "Residual": This plot displays the residuals (observed minus predicted) against the fitted values. For models with more than one predictor, the lines shown for each fold are approximations. Interpretation and use: This is essentially a cross-validated residual plot. You can use it to check for patterns in the residuals. Ideally, the residuals should be randomly scattered around zero with no clear trend; a pattern such as a funnel shape suggests non-constant variance. This plot helps in diagnosing violations of regression assumptions like heteroscedasticity.

To get a more concrete idea of how to use the function, here is a worked example (assuming the DAAG package is installed and loaded). Let us perform a 5-fold cross-validation on a model predicting sale.price from area.

CVlm(data = houseprices, form.lm = formula(sale.price ~ area), 
     m = 5, seed = 123, plotit = "Observed", printit = TRUE)

## Output
fold 1 
Observations in test set: 3 
                     9        12        15
area        694.000000 1366.0000 821.00000
cvpred      190.426899  380.6042 226.36815
sale.price  192.000000  274.0000 212.00000
CV residual   1.573101 -106.6042 -14.36815

Sum of squares = 11573.38    Mean square = 3857.79    n = 3 

fold 2 
Observations in test set: 3 
                   14         19        20
area        963.00000 790.000000 696.00000
cvpred      254.26927 216.149108 195.43642
sale.price  185.00000 221.500000 255.00000
CV residual -69.26927   5.350892  59.56358

Sum of squares = 8374.68    Mean square = 2791.56    n = 3 

fold 3 
Observations in test set: 3 
                   13         17        18
area         716.0000 1018.00000 887.00000
cvpred       216.4961  261.86662 242.18604
sale.price   112.7000  276.00000 260.00000
CV residual -103.7961   14.13338  17.81396

Sum of squares = 11290.73    Mean square = 3763.58    n = 3 

fold 4 
Observations in test set: 3 
                  10        21         22
area        905.0000 771.00000 1006.00000
cvpred      236.4021 210.77943  255.71471
sale.price  215.0000 260.00000  293.00000
CV residual -21.4021  49.22057   37.28529

Sum of squares = 4270.91    Mean square = 1423.64    n = 3 

fold 5 
Observations in test set: 3 
                    11        16      23
area        802.000000 714.00000 1191.00
cvpred      218.659431 207.12203  269.66
sale.price  215.000000 220.00000  375.00
CV residual  -3.659431  12.87797  105.34

Sum of squares = 11275.75    Mean square = 3758.58    n = 3 

Overall (Sum over all 5 folds) 
     ms 
3119.03
[Figure: K-fold cross-validation using the cv.lm() function in R (observed values versus cross-validated predictions, by fold)]

One can request a residual plot instead of the default observed-versus-predicted plot by setting plotit = "Residual".

CVlm(data = houseprices, form.lm = formula(sale.price ~ area), 
     m = 5, seed = 123, plotit = "Residual", printit = TRUE)

The printed fold-by-fold output is identical to that of the previous run (the seed and folds are the same); only the plot changes.
[Figure: K-fold cross-validation using the cv.lm() function in R (cross-validated residuals versus fitted values, by fold)]

Important Considerations for using cv.lm() Function

  • Package Source: Be aware that cv.lm() exists in multiple packages (e.g., DAAG, lmvar). Their implementations and outputs can differ. The version in the DAAG package is the most common for basic linear regression.
  • For Other Model Types: If you need to perform cross-validation for other types of models, such as logistic regression (glm) or mixed models, consider using the more modern and flexible cv package with its generic cv() function.
  • Advanced Computational Methods: For large datasets, refitting the model for each fold can be slow. More advanced implementations, like those discussed in the cv package’s vignettes, use computational tricks (like the Woodbury matrix identity) to calculate cross-validated results more efficiently without completely refitting the model each time.

stepAIC in R: Shortcut to the Perfect Model

If you are tired of guessing which variables belong in your regression model, stepAIC from R's MASS package is about to become your new best friend. This powerful function automates model selection using the Akaike Information Criterion (AIC), essentially providing a data-driven approach to building better predictive models without the headache.

Think of stepAIC as your personal statistical assistant that tries different combinations of variables, keeping what works and dropping what does not, all while balancing model complexity with explanatory power.

How Does stepAIC Actually Work? The Simple Explanation

The AIC Magic Behind the Scenes

AIC (Akaike Information Criterion) is like a “model score” that balances two things:

  1. How well your model fits the data (goodness of fit)
  2. How many variables it uses (model complexity)

Lower AIC score = Better model
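
Formally, $AIC = 2k - 2\ln(\hat{L})$, where $k$ is the number of estimated parameters and $\hat{L}$ is the model's maximized likelihood: the $2k$ term is the complexity penalty that gets traded off against goodness of fit.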

The job of stepAIC in R is to find the model with the lowest AIC score.

The Three Search Strategies

  • Forward selection
  • Backward elimination
  • Both directions
# Forward selection: Starts simple, adds helpful variables
stepAIC(model, direction = "forward")

# Backward elimination: Starts with everything, removes useless variables  
stepAIC(model, direction = "backward")

# Both directions: The most thorough approach (usually best)
stepAIC(model, direction = "both")

Installation and Basic Setup

# Install and load the MASS package
install.packages("MASS")
library(MASS)

# Load sample data
data(mtcars)

# Create a full model with all variables
full_model <- lm(mpg ~ ., data = mtcars)

Running Your First stepAIC

# Let stepAIC find the optimal model
best_model <- stepAIC(full_model, direction = "both")

# See what variables survived
summary(best_model)
[Figure: stepAIC in R for best model selection (MASS package)]

Real Business Example: Predicting House Prices

# Imagine you are a real estate analyst with 20 potential predictors.
# stepAIC helps you identify which ones actually matter:
# Square footage ✓
# Number of bedrooms ✓
# Distance to school ✗ (not significant)
# Pool ✗ (not worth the complexity)
# Result: Cleaner model, better predictions, clearer insights

Why Data Scientists Love stepAIC: Key Benefits

  • Automated Variable Selection: No more manual "try-and-check" loops. stepAIC systematically tests combinations far more efficiently than manual searching.
  • Avoids Overfitting: By penalizing unnecessary complexity, stepAIC helps create models that generalize better to new data.
  • Interpretable Results: The final model typically includes only meaningful predictors, making it easier to explain to stakeholders.
  • Time-Saving Efficiency: What might take hours of manual testing can be completed in seconds.

Common Pitfalls and How to Avoid Them

Do not Blindly Trust the Output

stepAIC in R is a tool, not an oracle. Always:

  • Check model assumptions
  • Consider domain knowledge
  • Validate with out-of-sample testing

The Multicollinearity Trap

stepAIC in R does not automatically detect correlated predictors. Use VIF (Variance Inflation Factor) checks:

car::vif(best_model)  # Values > 5 indicate problems

See the mctest package for the detection of multicollinearity.

Small Sample Warnings

With limited data, stepAIC can be unstable. Consider cross-validation for small datasets.
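
A rough check is to cross-validate the selected model by hand. Below is a minimal sketch reusing best_model from the mtcars example above; note that, strictly speaking, the selection step should be repeated inside each fold, so this simple version can still be somewhat optimistic.

set.seed(1)
m <- 5
folds <- sample(rep(1:m, length.out = nrow(mtcars)))

cv_pred <- numeric(nrow(mtcars))
for (i in 1:m) {
  test <- folds == i
  fit  <- lm(formula(best_model), data = mtcars[!test, ])  # refit selected formula
  cv_pred[test] <- predict(fit, newdata = mtcars[test, ])
}

mean((mtcars$mpg - cv_pred)^2)  # cross-validated MSE of the selected model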

Setting Search Limits

# Control how hard stepAIC searches
stepAIC(model, 
        scope = list(lower = ~1, upper = full_model),
        k = 2,  # Penalty for complexity (default 2)
        steps = 1000)  # Maximum steps to prevent endless search

Combining with Other Techniques

# Use after regularization for best results
library(glmnet)
# First: LASSO for variable screening
# Then: stepAIC for final refinement
# (a runnable sketch follows the notes below)

Note that:

  • stepAIC: Selects/drops whole variables
  • LASSO: Can shrink coefficients to zero
  • Best practice: Use LASSO first, then stepAIC
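
Here is a runnable sketch of that workflow on the mtcars data from earlier. cv.glmnet() and the lambda.min rule are standard glmnet usage, but treat the screening details (the seed, choosing lambda.min over lambda.1se) as illustrative assumptions rather than a fixed recipe.

library(glmnet)
library(MASS)

# Step 1: LASSO for variable screening
x <- model.matrix(mpg ~ ., data = mtcars)[, -1]  # predictor matrix, intercept dropped
y <- mtcars$mpg
set.seed(1)
cv_fit <- cv.glmnet(x, y, alpha = 1)             # cross-validated LASSO

co   <- as.matrix(coef(cv_fit, s = "lambda.min"))
keep <- setdiff(rownames(co)[co != 0], "(Intercept)")  # surviving predictors

# Step 2: stepAIC for final refinement on the screened predictors
screened <- lm(reformulate(keep, response = "mpg"), data = mtcars)
final    <- stepAIC(screened, direction = "both", trace = FALSE)
summary(final)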

Frequently Asked Questions

Is stepAIC cheating? Isn't this p-hacking?

When used properly (with validation), it is legitimate model building. When abused (testing dozens of models without correction), it becomes problematic. Always use cross-validation!

How many variables can stepAIC handle?

Practically, around 30-40 predictors. Beyond that, consider dimensionality reduction first.

Can I use stepAIC for logistic regression?

Absolutely! It works with glm() models too:

logit_model <- glm(outcome ~ ., data = mydata, family = binomial)
best_logit <- stepAIC(logit_model)

Summary

stepAIC in R, from the MASS package, is one of those R programming tools that delivers tremendous value for minimal effort. While it will not replace statistical thinking, it eliminates the grunt work of model selection, letting you focus on interpretation and application.

Remember: the best model is not always the one with the highest R-squared or the lowest AIC; it is the one that solves your business problem most effectively. stepAIC gets you 80% of the way there; your expertise covers the final, crucial 20%.

The glm Function in R

Learn about the glm function in R with this comprehensive Q&A guide. Understand logistic regression, Poisson regression, syntax, families, key components, use cases, model diagnostics, and goodness of fit. It includes a practical example of logistic regression using the glm() function in R.

What is the glm function in the R language?

The glm (Generalized Linear Models) function in R is a powerful tool for fitting linear models to data where the response variable may have a non-normal distribution. It extends the capabilities of traditional linear regression to handle various types of response variables through the use of link functions and exponential family distributions.

Since the distribution of the response depends on the stimulus variables through a single linear function only, the same mechanism as was used for linear models can still be used to specify the linear part of a generalized model.

What is Logistic Regression?

Logistic regression is used to predict a binary outcome from a given set of predictor variables.

What is the Poisson Regression?

Poisson regression is used to predict an outcome variable that represents counts from a given set of predictor variables.

What is the general syntax of the glm function in R Language?

The general syntax of the glm() function in R is:

glm(formula, family = gaussian, data, weights, subset, na.action, start = NULL,
    etastart, mustart, offset, control = list(...), model = TRUE, method = "glm.fit",
    x = FALSE, y = TRUE, contrasts = NULL, ...)

What are families in R?

The class of Generalized Linear Models handled by facilities supplied in R includes Gaussian, Binomial, Poisson, Inverse Gaussian, and Gamma response distributions, and also quasi-likelihood models where the response distribution is not explicitly specified. In the latter case, the variance function must be specified as a function of the mean, but in other cases, this function is implied by the response distribution.
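
For example, a quasi-likelihood fit where only the mean-variance relationship and link are specified might look like the following sketch (mydata, y, and x are hypothetical placeholders):

# Quasi-likelihood model: variance proportional to mu^2, log link
fit <- glm(y ~ x, family = quasi(link = "log", variance = "mu^2"), data = mydata)
summary(fit)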

What are the key components of the glm function in R?

Formula

It specifies the relationship between variables, similar to lm(). For example,

y ~ x1 + x2 + x3  # main effects
y ~ x1*x2         # main effects plus interaction
y ~ .             # all other variables in the data as main effects

Family

It defines the error distribution and link function. Common families are:

  • gaussian(): Normal distribution (default)
  • binomial(): Logistic regression (binary outcomes)
  • poisson(): Poisson regression (count data)
  • Gamma(): Gamma regression
  • inverse.gaussian(): Inverse Gaussian distribution

Each family has a default link function and may support several (e.g., logit for binomial, log for Poisson).

What are the common use cases of the glm() function?

Logistic Regression (Binary Outcomes)

model <- glm(outcome ~ predictor1 + predictor2, family = binomial(link = "logit"),
             data = mydata)

Poisson Regression (Count Data)

model <- glm(count ~ treatment + offset(log(exposure)), family = poisson(link = "log"),
             data = count_data)

What statistics can be computed after fitting a glm() model?

After fitting a model, one can use:

summary(model)   # Detailed output including coefficients
coef(model)      # Model coefficients
confint(model)   # Confidence intervals
predict(model)   # Predicted values

What are model diagnostics and goodness-of-fit?

The following are built-in glm() model diagnostics and goodness-of-fit tools:

anova(model, test = "Chisq")  # Analysis of deviance
residuals(model)              # Various residual types available
plot(model)                   # Diagnostic plots

Give an example of logistic regression fitting using the glm() function.

Consider the mtcars data set, where am (transmission: 0 = automatic, 1 = manual) is the binary response variable.

# Fit model
data(mtcars)
model <- glm(am ~ hp + wt, family = binomial, data = mtcars)

# View results
summary(model)

# Predict probabilities
predict(model, type = "response")

# Plot
par(mfrow = c(2, 2))
plot(model)

Tips for the Effective Use of the glm() Function

  1. Always check model assumptions and diagnostics
  2. For binomial models, the response can be (see the sketch after this list):
    • A factor (first level = failure, others = success)
    • A numeric vector of 0/1 values
    • A two-column matrix of successes/failures
  3. Use drop1() or add1() for model selection
  4. Consider glm.nb() from the MASS package for overdispersed count data
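
As a sketch of the two-column response form from tip 2, here is a small binomial fit where each row records successes and failures out of a batch of trials (the data vectors are hypothetical):

# Binomial response as a two-column successes/failures matrix
successes <- c(8, 15, 22)
failures  <- c(12, 5, 3)
dose      <- c(1, 2, 3)

fit <- glm(cbind(successes, failures) ~ dose, family = binomial)
summary(fit)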

The glm() function is fundamental for many statistical analyses in R, providing flexibility to handle various types of response variables beyond normal distributions.
