What is cluster.stats()?
The cluster.stats() function from the fpc R package (Flexible Procedures for Clustering) is a powerful tool for cluster validation and evaluating clustering performance. cluster.stats() helps data scientists measure how good their clustering results are by calculating various cluster quality metrics.
Beyond validating a single solution, it can also compare the similarity of two cluster solutions using criteria such as the corrected Rand index.
Why use cluster.stats() for Cluster Analysis?
When you perform unsupervised machine learning with clustering algorithms like k-means clustering, hierarchical clustering, or DBSCAN clustering, you need to answer: “How good are my clusters?” That is where cluster.stats() comes in!
Key metrics provided by cluster.stats()
- Within-cluster distance: How compact are clusters?
- Between-cluster distance: How separated are clusters?
- Average silhouette width: Overall clustering quality (-1 to 1)
- Dunn index: Ratio of the smallest between-cluster distance to the largest within-cluster distance.
- Entropy: How evenly points are distributed across clusters.
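All of these metrics come back as elements of a named list, so you can inspect everything cluster.stats() computes with names(). A minimal sketch (the toy data below is illustrative only):

```r
# Sketch: list every statistic cluster.stats() returns
library(fpc)

set.seed(42)
# Two well-separated groups of 2-D points
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 4), ncol = 2))
cl <- kmeans(toy, centers = 2)$cluster

stats <- cluster.stats(dist(toy), cl)
print(names(stats))  # includes "avg.silwidth", "dunn", "entropy", ...
```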
What are the main inputs?
The general syntax for cluster.stats() is:
cluster.stats(d, clustering, alt.clustering = NULL, noisecluster = FALSE,
              silhouette = TRUE, G2 = FALSE, G3 = FALSE, ...)
- d: Distance matrix or dissimilarity object
- clustering: Integer vector of cluster assignments (1, 2, 3, …)
- alt.clustering: Alternative clustering for comparison (optional)
- noisecluster: Should the highest cluster number be treated as a noise class? (TRUE/FALSE)
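The noisecluster argument is handy with density-based methods. Note that cluster.stats() treats the *largest* cluster number as noise, while fpc's dbscan() labels noise points as 0, so a quick recode is needed. A hedged sketch (the data and parameters here are illustrative):

```r
# Sketch: validating a DBSCAN result with a noise class
library(fpc)

set.seed(1)
# Two dense blobs plus a few scattered background points
pts <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
             matrix(runif(20, min = -2, max = 5), ncol = 2))

db <- dbscan(pts, eps = 0.5, MinPts = 5)
cl <- db$cluster
has_noise <- any(cl == 0)
if (has_noise) cl[cl == 0] <- max(cl) + 1  # recode noise (0) to highest label

noise_stats <- cluster.stats(dist(pts), cl, noisecluster = has_noise)
cat("Average silhouette width:", noise_stats$avg.silwidth, "\n")
```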
Simple Example with K-means Clustering
# Install and load required packages
# uncomment if fpc package is not installed
# install.packages("fpc")
library(fpc)
library(cluster)
# Create sample data for clustering analysis
set.seed(123)
data <- matrix(c(rnorm(100, mean = 0),
                 rnorm(100, mean = 5),
                 rnorm(100, mean = 0),
                 rnorm(100, mean = 5)), ncol = 2)
# Perform k-means clustering (k=2 clusters)
kmeans_result <- kmeans(data, centers = 2)
# Calculate clustering statistics
clust_stats <- cluster.stats(dist(data), kmeans_result$cluster)
# View key metrics
cat("Average silhouette width:", clust_stats$avg.silwidth, "\n")
cat("Dunn index:", clust_stats$dunn, "\n")
cat("Within cluster sum of squares:", clust_stats$within.cluster.ss, "\n")
# Compare two different clusterings
kmeans_result_3 <- kmeans(data, centers = 3)
clust_stats_3 <- cluster.stats(dist(data), kmeans_result_3$cluster)
# Compare using corrected Rand index
comparison <- cluster.stats(dist(data),
                            kmeans_result$cluster,
                            kmeans_result_3$cluster)
cat("Corrected Rand index for comparison:", comparison$corrected.rand, "\n")
Real-World Application: Customer Segmentation
# Customer segmentation with clustering validation
# Using mtcars dataset as example
data(mtcars)
cars_scaled <- scale(mtcars) # Standardize data
# Perform hierarchical clustering
dist_matrix <- dist(cars_scaled)
hc <- hclust(dist_matrix, method = "ward.D2")
clusters <- cutree(hc, k = 3) # Cut tree to get 3 clusters
# Validate clustering results
validation_stats <- cluster.stats(dist_matrix, clusters)
# Print comprehensive validation report
print(paste("Silhouette score:", round(validation_stats$avg.silwidth, 3)))
print(paste("Dunn index:", round(validation_stats$dunn, 3)))
print(paste("Number of clusters:", validation_stats$cluster.number))
print(paste("Within sum of squares:", round(validation_stats$within.cluster.ss, 2)))
# Check if clustering is better than random
if(validation_stats$avg.silwidth > 0.5) {
  print("Good clustering structure detected!")
} else {
  print("Consider trying a different number of clusters or algorithm.")
}
Best Practices for Using cluster.stats()
The following are best practices for using the cluster.stats() function from the fpc package:
- Always validate your clustering results
- Instead of one metric, use multiple metrics
- Compare different algorithms for the same data
- Experiment with different numbers of clusters
- Interpret silhouette width:
  - > 0.7: Strong structure
  - 0.5-0.7: Reasonable structure
  - 0.25-0.5: Weak structure
  - < 0.25: No substantial structure
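To put these guidelines into practice, you can loop over several candidate values of k and keep the one with the highest average silhouette width. A sketch (the data and range of k are illustrative):

```r
# Sketch: choose k by comparing average silhouette widths
library(fpc)

set.seed(123)
# Two well-separated groups of 2-D points
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))
d <- dist(x)

sil <- sapply(2:6, function(k) {
  cl <- kmeans(x, centers = k, nstart = 10)$cluster
  cluster.stats(d, cl)$avg.silwidth
})
names(sil) <- 2:6
print(round(sil, 3))

best_k <- as.integer(names(which.max(sil)))
cat("Best k by silhouette:", best_k, "\n")
```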
Common Errors and Solutions
The following are some common errors and their solutions when using cluster.stats() in the fpc R package:
# Error 1: Invalid distance matrix
# Solution: Ensure you use dist() function
correct_stats <- cluster.stats(dist(your_data), clusters)
# Error 2: NA values in data
# Solution: Clean data first
clean_data <- na.omit(your_data)
# Error 3: Single cluster
# Solution: Check if clustering created only one group
if(length(unique(clusters)) == 1) {
  warning("Only one cluster detected. Try different parameters.")
}
- Error: NA/NaN/Inf in foreign function call: Check distance matrix for missing/infinite values
- clustering must be a vector: Ensure clustering is an integer vector, not a factor
- Dimension mismatch: Ensure length(clustering) == nrow(as.matrix(d))
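These checks can be bundled into a small defensive wrapper. The sketch below uses safe_cluster_stats, a name of our own invention (not part of fpc), to catch each failure mode before calling cluster.stats():

```r
# Sketch: validate inputs before calling cluster.stats()
library(fpc)

safe_cluster_stats <- function(d, clustering) {
  dm <- as.matrix(d)
  if (any(!is.finite(dm)))
    stop("Distance matrix contains NA/NaN/Inf values.")
  clustering <- as.integer(clustering)       # converts factors to integer codes
  if (length(clustering) != nrow(dm))
    stop("length(clustering) must equal nrow(as.matrix(d)).")
  if (length(unique(clustering)) < 2)
    stop("Only one cluster detected. Try different parameters.")
  cluster.stats(d, clustering)
}

set.seed(7)
x <- rbind(matrix(rnorm(40), ncol = 2),
           matrix(rnorm(40, mean = 4), ncol = 2))
ok <- safe_cluster_stats(dist(x), kmeans(x, centers = 2)$cluster)
cat("avg.silwidth:", round(ok$avg.silwidth, 3), "\n")
```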
Summary
The cluster.stats() function is essential for data science professionals working with clustering algorithms. By providing quantitative measures of clustering quality, it takes the guesswork out of cluster analysis and helps you make data-driven decisions about your unsupervised learning models.
Remember: Good clustering is not just about creating groups; it is about creating meaningful, well-separated, and compact groups that provide real business or research insights!
