Slides from New York Open Statistical Programming Meetup talk given on March 10, 2026
Slides from Joint Statistical Meeting (JSM), August, 2025
Poster from the National Council on Measurement in Education, April,
2026.
This poster uses the
ShinyPoster framework and is
interactive. The poster can be viewed using R with the following
command:
shiny::runGitHub(repo = 'clav', username = 'jbryer', subdir = 'poster', destdir = getwd())

The clav package provides utilities for conducting cluster (profile)
analysis with an emphasis on validating the stability of the
profiles both within a given data set and across data sets.
Unlike supervised models, where the true class is known and measured,
validating unsupervised models is challenging because there is no
known class to compare against. The approach implemented here compares
the cluster results across many random samples.
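To illustrate the core idea in base R only (this sketch does not use the clav API, and the iris data set here is just a stand-in): cluster two random halves of a data set and compare the resulting centers. Stable profiles should yield similar centers, up to a relabeling of the clusters.

```r
# Cluster two random halves and compare centers (illustration only).
set.seed(42)
idx <- sample(nrow(iris), nrow(iris) / 2)
fit1 <- stats::kmeans(iris[idx, 1:4], centers = 3, nstart = 10)
fit2 <- stats::kmeans(iris[-idx, 1:4], centers = 3, nstart = 10)
# Match each center from the first half to its nearest center in the second
# (cluster labels are arbitrary, so we align them by distance).
d <- as.matrix(dist(rbind(fit1$centers, fit2$centers)))[1:3, 4:6]
matching <- apply(d, 1, which.min)
d[cbind(1:3, matching)]  # small distances suggest stable profiles
```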
You can install the development version of clav like so:
remotes::install_github('jbryer/clav')

The following commands are useful for working with the package source locally.
# Prep the PISA data set. This will take a while to run the first time.
source('data-raw/data-prep-pisa-2015.R')
# Generate the package documentation
usethis::use_tidy_description()
devtools::document()
devtools::check_man()
# Install the package
devtools::install()
devtools::install(build_vignettes = TRUE)
# Run CRAN check
devtools::check(cran = TRUE)
# Build the pkgdown site
pkgdown::build_site()

This package is designed to work with nearly any clustering method.
However, there is no consistency in how clustering functions are
implemented with regard to parameters and returned values. The clav
package expects clustering functions to follow a format similar to the
kmeans function. Specifically, two parameters are passed:
- A data frame with the variables used to do the clustering.
- The number of clusters, k.
The parameters are passed unnamed so the order of the parameters
matters. Additionally, the returned object should have an element
cluster that provides the cluster membership. The length of this
vector should be equal to the number of rows in the data frame.
For example, to use hierarchical clustering with the clav package, we
need to implement a wrapper (note that the hclust2 function below is
included in the package).
hclust2 <- function(x, k, ...) {
  result <- stats::hclust(dist(x), ...)
  result$k <- k
  result$data <- x
  result$cluster <- stats::cutree(result, k = k)
  class(result) <- c('hclust2', class(result))
  return(result)
}

Note that the first parameter, x, is the data frame of variables used
for the clustering. The stats::hclust() function expects a
dissimilarity structure as its first parameter (the result of
stats::dist()). Since hclust determines cluster membership for all
n - 1 possibilities, we use the stats::cutree() function for the
desired k. Other parameters for hclust are passed using the ...
(dots) operator. Lastly, we add the cluster vector and modify the
class so that we could implement S3 methods if necessary.
Additionally, an S3 method for predict needs to be implemented for the
wrapper class so that cluster membership can be estimated for new
(out-of-bag) observations.
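A predict() method for hclust2 might look like the following. This is a minimal sketch of one reasonable choice, assigning each new observation to the cluster whose training-data center (mean) is nearest in Euclidean distance; the implementation included in the clav package may differ.

```r
# Sketch of predict() for hclust2 (an assumption, not necessarily the
# clav implementation): nearest-center assignment for new observations.
predict.hclust2 <- function(object, newdata, ...) {
  # k x p matrix of cluster means computed from the training data
  centers <- apply(object$data, 2, function(v) tapply(v, object$cluster, mean))
  k <- nrow(centers)
  n <- nrow(newdata)
  d <- as.matrix(dist(rbind(centers, as.matrix(newdata))))
  # Rows k+1..k+n are the new observations; columns 1..k are the centers.
  apply(d[(k + 1):(k + n), 1:k, drop = FALSE], 1, which.min)
}
```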
library(clav)
data(pisa2015, package = 'clav')
cluster_vars <- c('interest', 'enjoyment', 'motivation', 'efficacy', 'belonging')
outcome_vars <- c('science_score', 'principals')
pisa_usa <- pisa2015 |> dplyr::filter(country == 'UNITED STATES')

Finding the optimal number of clusters.
optimal <- optimal_clusters(pisa_usa[,cluster_vars], max_k = 5)
optimal
#> k wss silhoutte gap calinski_harabasz davies_bouldin rand_index
#> 1 1 9232 NA 0.83 NaN NaN NA
#> 2 2 6746 0.24 0.87 1708 1.6 0.50
#> 3 3 5869 0.20 0.86 1332 1.8 0.63
#> 4 4 5259 0.20 0.86 1160 1.7 0.74
#> 5 5 4658 0.20 0.86 1125 1.6 0.68
optimal_plots <- plot(optimal, ncol = 2)
names(optimal_plots)
#> [1] "davies_bouldin" "calinski" "wss" "silhouette"
#> [5] "gap" "rand"
# optimal_plots[['wss']]
optimal_plots

Validating cluster profiles using random samples of 50%. The out-of-bag (OOB) step uses the remaining 50% to predict cluster membership.
pisa_cv_random <- pisa_usa |>
  dplyr::select(dplyr::all_of(cluster_vars)) |>
  clav::cluster_validation(
    n_clusters = 3,
    sample_size = 0.5 * nrow(pisa_usa),
    replace = FALSE,
    n_samples = 100,
    seed = 42
  )
plot(pisa_cv_random)

Re-estimate the clusters using the OOB sample instead of predicting using the in-sample model.
pisa_cv_random2 <- pisa_usa |>
  dplyr::select(dplyr::all_of(cluster_vars)) |>
  clav::cluster_validation(
    n_clusters = 3,
    oob_predict_fun = function(fit, newdata) {
      newfit <- stats::kmeans(newdata, 3)
      newfit$cluster
    },
    sample_size = 0.5 * nrow(pisa_usa),
    replace = FALSE,
    n_samples = 100,
    seed = 42
  )
plot(pisa_cv_random2)

Bootstrap approach to validation.
pisa_cv_bootstrap <- pisa_usa |>
  dplyr::select(dplyr::all_of(cluster_vars)) |>
  clav::cluster_validation(
    n_clusters = 3,
    sample_size = nrow(pisa_usa),
    replace = TRUE,
    n_samples = 100,
    seed = 42
  )
plot(pisa_cv_bootstrap)

Using latent profile analysis for estimating clusters.
library(tidyLPA)
lpa <- pisa_usa |>
  dplyr::select(dplyr::all_of(cluster_vars)) |>
  tidyLPA::estimate_profiles(3)
# lpa_predict <- predict(lpa, pisa_usa[sample(nrow(pisa_usa), 100), cluster_vars])
# lpa_estimates <- get_estimates(lpa)
lpa_data <- get_data(lpa)
plot_profiles(lpa)
clav::profile_plot(pisa_usa[, cluster_vars],
                   clusters = lpa_data$Class,
                   df_dep = pisa_usa[, c('science_score')])
lpa_cv_random <- cluster_validation(
  pisa_usa[, cluster_vars],
  n_clusters = 3,
  cluster_fun = estimate_profiles,
  oob_predict_fun = function(fit, newdata) {
    newfit <- estimate_profiles(newdata, 3)
    get_data(newfit)$Class
  },
  sample_size = 0.5 * nrow(pisa_usa),
  replace = FALSE,
  n_samples = 50,
  seed = 42
)
plot(lpa_cv_random)

fit <- pisa_usa |>
  dplyr::select(dplyr::all_of(cluster_vars)) |>
  stats::kmeans(centers = 3)
clav::profile_plot(pisa_usa[, cluster_vars],
                   clusters = fit$cluster,
                   df_dep = pisa_usa[, outcome_vars],
                   center_band = 0.33,
                   cluster_order = cluster_vars)




