Slides from New York Open Statistical Programming Meetup talk given on March 10, 2026
Slides from Joint Statistical Meeting (JSM), August, 2025
Poster from the National Council on Measurement in Education, April,
2026.
This poster uses the
ShinyPoster framework and is
interactive. The poster can be viewed using R with the following
command:
shiny::runGitHub(repo = 'clav', username = 'jbryer', subdir = 'poster', destdir = getwd())

The clav package provides utilities for conducting cluster (profile)
analysis with an emphasis on validating the stability of the
profiles both within a given data set and across data sets.
Unlike supervised models, where the true class is known and measured,
validating unsupervised models is challenging because there is no
known class to compare against. The approach implemented here compares
the cluster results across many random samples.
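To illustrate the core idea in base R only (this sketch does not use the clav API, and the iris data set here is just a stand-in): cluster two random halves of a data set and compare the resulting centers. Stable profiles should yield similar centers, up to a relabeling of the clusters.

```r
# Cluster two random halves and compare centers (illustration only).
set.seed(42)
idx <- sample(nrow(iris), nrow(iris) / 2)
fit1 <- stats::kmeans(iris[idx, 1:4], centers = 3, nstart = 10)
fit2 <- stats::kmeans(iris[-idx, 1:4], centers = 3, nstart = 10)
# Match each center from the first half to its nearest center in the second
# (cluster labels are arbitrary, so we align them by distance).
d <- as.matrix(dist(rbind(fit1$centers, fit2$centers)))[1:3, 4:6]
matching <- apply(d, 1, which.min)
d[cbind(1:3, matching)]  # small distances suggest stable profiles
```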
You can install the development version of clav like so:
remotes::install_github('jbryer/clav')

The following commands are useful for working with the package source locally.
# Prep the PISA data set. This will take a while to run the first time.
source('data-raw/data-prep-pisa-2015.R')
# Generate the package documentation
usethis::use_tidy_description()
devtools::document()
devtools::check_man()
# Install the package
devtools::install()
devtools::install(build_vignettes = TRUE)
# Run CRAN check
devtools::check(cran = TRUE)
# Build the pkgdown site
pkgdown::build_site()

This package is designed to work with nearly any clustering method.
However, there is no consistency in how clustering functions are
implemented with regard to parameters and returned values. The clav
package expects clustering functions to follow a format similar to the
kmeans function. Specifically, two parameters are passed:
- A data frame with the variables used to do the clustering.
- The number of clusters, k.
The parameters are passed unnamed so the order of the parameters
matters. Additionally, the returned object should have an element
cluster that provides the cluster membership. The length of this
vector should be equal to the number of rows in the data frame.
For example, to use hierarchical clustering with the clav package, we
need to implement a wrapper (note that the hclust2 function below is
included in the package).
hclust2 <- function(x, k, ...) {
  result <- stats::hclust(dist(x), ...)
  result$k <- k
  result$data <- x
  result$cluster <- stats::cutree(result, k = k)
  class(result) <- c('hclust2', class(result))
  return(result)
}

Note that the first parameter, x, is the data frame of variables used
for the clustering. The stats::hclust() function expects a
dissimilarity structure as its first parameter (the result of
stats::dist()). Since hclust determines cluster membership for all
n - 1 possibilities, we use the stats::cutree() function for the
desired k. Other parameters for hclust are passed using the ...
(dots) operator. Lastly, we add the cluster vector and modify the
class so that we could implement S3 methods if necessary.
Additionally, an S3 method for predict needs to be implemented for the
wrapper class so that cluster membership can be estimated for new
(out-of-bag) observations.
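A predict() method for hclust2 might look like the following. This is a minimal sketch of one reasonable choice, assigning each new observation to the cluster whose training-data center (mean) is nearest in Euclidean distance; the implementation included in the clav package may differ.

```r
# Sketch of predict() for hclust2 (an assumption, not necessarily the
# clav implementation): nearest-center assignment for new observations.
predict.hclust2 <- function(object, newdata, ...) {
  # k x p matrix of cluster means computed from the training data
  centers <- apply(object$data, 2, function(v) tapply(v, object$cluster, mean))
  k <- nrow(centers)
  n <- nrow(newdata)
  d <- as.matrix(dist(rbind(centers, as.matrix(newdata))))
  # Rows k+1..k+n are the new observations; columns 1..k are the centers.
  apply(d[(k + 1):(k + n), 1:k, drop = FALSE], 1, which.min)
}
```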
library(clav)
data(pisa2015, package = 'clav')
cluster_vars <- c('interest', 'enjoyment', 'motivation', 'efficacy', 'belonging')
outcome_vars <- c('science_score', 'principals')
pisa_usa <- pisa2015 |> dplyr::filter(country == 'UNITED STATES')

Finding the optimal number of clusters.
optimal <- optimal_clusters(pisa_usa[,cluster_vars], max_k = 5)
optimal
#> k wss silhoutte gap calinski_harabasz davies_bouldin rand_index
#> 1 1 9232 NA 0.83 NaN NaN NA
#> 2 2 6746 0.24 0.87 1708 1.6 0.50
#> 3 3 5869 0.20 0.86 1332 1.8 0.63
#> 4 4 5259 0.20 0.86 1160 1.7 0.74
#> 5 5 4658 0.20 0.86 1125 1.6 0.68
optimal_plots <- plot(optimal, ncol = 2)
names(optimal_plots)
#> [1] "davies_bouldin" "calinski" "wss" "silhouette"
#> [5] "gap" "rand"
# optimal_plots[['wss']]
optimal_plots

Validating cluster profiles using random samples of 50%. The out-of-bag (OOB) step uses the remaining 50% to predict cluster membership.
pisa_cv_random <- pisa_usa |>
  dplyr::select(dplyr::all_of(cluster_vars)) |>
  clav::cluster_validation(
    n_clusters = 3,
    sample_size = 0.5 * nrow(pisa_usa),
    replace = FALSE,
    n_samples = 100,
    seed = 42
  )
plot(pisa_cv_random)

Re-estimate the clusters using the OOB sample instead of predicting using the in-sample model.
pisa_cv_random2 <- pisa_usa |>
  dplyr::select(dplyr::all_of(cluster_vars)) |>
  clav::cluster_validation(
    n_clusters = 3,
    oob_predict_fun = function(fit, newdata) {
      newfit <- stats::kmeans(newdata, 3)
      newfit$cluster
    },
    sample_size = 0.5 * nrow(pisa_usa),
    replace = FALSE,
    n_samples = 100,
    seed = 42
  )
plot(pisa_cv_random2)

Bootstrap approach to validation.
pisa_cv_bootstrap <- pisa_usa |>
  dplyr::select(dplyr::all_of(cluster_vars)) |>
  clav::cluster_validation(
    n_clusters = 3,
    sample_size = nrow(pisa_usa),
    replace = TRUE,
    n_samples = 100,
    seed = 42
  )
plot(pisa_cv_bootstrap)

Using latent profile analysis for estimating clusters.
library(tidyLPA)
lpa <- pisa_usa |>
  dplyr::select(dplyr::all_of(cluster_vars)) |>
  tidyLPA::estimate_profiles(3)
# lpa_predict <- predict(lpa, pisa_usa[sample(nrow(pisa_usa), 100), cluster_vars])
# lpa_estimates <- get_estimates(lpa)
lpa_data <- get_data(lpa)
plot_profiles(lpa)
clav::profile_plot(pisa_usa[, cluster_vars],
                   clusters = lpa_data$Class,
                   df_dep = pisa_usa[, c('science_score')])
lpa_cv_random <- cluster_validation(
  pisa_usa[, cluster_vars],
  n_clusters = 3,
  cluster_fun = estimate_profiles,
  oob_predict_fun = function(fit, newdata) {
    newfit <- estimate_profiles(newdata, 3)
    get_data(newfit)$Class
  },
  sample_size = 0.5 * nrow(pisa_usa),
  replace = FALSE,
  n_samples = 50,
  seed = 42
)
plot(lpa_cv_random)

fit <- pisa_usa |>
  dplyr::select(dplyr::all_of(cluster_vars)) |>
  stats::kmeans(centers = 3)
clav::profile_plot(pisa_usa[, cluster_vars],
                   clusters = fit$cluster,
                   df_dep = pisa_usa[, outcome_vars],
                   center_band = 0.33,
                   cluster_order = cluster_vars)




