In this post, we will dive deep into what pvclust() R package does, how to use it, and how to interpret its unique graphical output to tell a more compelling story with your data.
Table of Contents
Hierarchical clustering is a tool for exploring patterns in the data. It builds a tree-like structure (a dendrogram) that groups similar items. But a question that often lingers: How confident can we be in those clusters? Are they genuinely reflective of the underlying structure, or could they be random artifacts of the algorithm or data quirks?
This is where the pvclust R package comes to the rescue. It provides a powerful function that adds a layer of statistical rigor to your clustering analysis by calculating p-values for every cluster
What is pvclust() Function in R?
The pvclust() function in R is used to perform hierarchical clustering with statistical significance testing using bootstrap resampling. Unlike standard clustering methods, pvclust() provides p-values for clusters, helping analysts determine how reliable each cluster is. It belongs to the pvclust package in R.
Uncertainty in Hierarchical Clustering
Standard hierarchical clustering (hclust) always produce a dendrogram, regardless of whether real clusters exist. It does not tell which groupings are strongly supported by the data and which are tenuous. Traditional bootstrap methods can help, but they often provide biased p-values.
pvclust and Multiscale Bootstrap Resampling
The pvclust() function calculates two types of p-values for each cluster using a technique called multiscale bootstrap resampling:
- AU (Approximately Unbiased) p-value: Calculated through the multiscale bootstrap, the AU p-value is considered a more accurate, unbiased measure of how strongly the data supports a cluster. In most plots, it’s shown in red.
- BP (Bootstrap Probability) value: This is the standard, less sophisticated bootstrap probability. It tends to be biased but is still provided for comparison. It’s typically shown in green.
The core idea is to resample the data with various sample sizes (controlled by the r parameter), see how often the same clusters appear, and then fit a theoretical model to correct the bias in the p-values. A cluster with a high AU p-value (e.g., 95% or higher) is considered to be strongly supported by the data.
Getting Started: Installation of pvclust R package
First, you will need to install and load the pvclust R package. Open your R console and run:
install.packages("pvclust")
library(pvclust)Let us use a classic and simple dataset: iris. Note that pvclust clusters the columns of the data by default. Since we want to cluster the four flower characteristics, we need to transpose the data frame containing those columns.
# Load the iris data
data(iris)
# Use only the first 4 columns (the measurements) and transpose.
# The t() function swaps rows and columns.
iris_data <- t(iris[1:4])
# Perform the clustering with p-values
# nboot is set low (100) for speed in this example. For real analysis, use 1000 or more!
set.seed(123) # for reproducibility
result <- pvclust(iris_data, method.hclust = "average",
method.dist = "correlation", nboot = 100)
# Plot the result
plot(result)
## Output
Bootstrap (r = 0.5)... Done.
Bootstrap (r = 0.75)... Done.
Bootstrap (r = 1.0)... Done.
Bootstrap (r = 1.25)... Done.
Warning messages:
1: inappropriate distance matrices are omitted in computation: r = 0.5
2: inappropriate distance matrices are omitted in computation: r = 0.75
3: inappropriate distance matrices are omitted in computation: r = 1
4: inappropriate distance matrices are omitted in computation: r = 1.25 The first pvclust plot is shown above. Do not worry if it looks a bit messy with the iris data. Let us move to a better example.
Demonstrating Boston Housing Data
The pvclust R package documentation suggests using the Boston Housing dataset from the MASS package. This dataset is more suitable for demonstrating the power of pvclust.
# Load the Boston data from the MASS package
if(!require(MASS)) install.packages("MASS")
library(MASS)
data(Boston)
# Perform multiscale bootstrap resampling
# Again, nboot=100 is for demonstration. Use nboot=1000 for publication-ready results.
boston_pv <- pvclust(Boston, nboot = 100, parallel = FALSE)
# Plot the dendrogram with p-values
plot(boston_pv, main = "Clustering of Boston Housing Attributes with p-values")
## Output
Bootstrap (r = 0.5)... Done.
Bootstrap (r = 0.6)... Done.
Bootstrap (r = 0.7)... Done.
Bootstrap (r = 0.8)... Done.
Bootstrap (r = 0.9)... Done.
Bootstrap (r = 1.0)... Done.
Bootstrap (r = 1.1)... Done.
Bootstrap (r = 1.2)... Done.
Bootstrap (r = 1.3)... Done.
Bootstrap (r = 1.4)... Done.Deconstructing the pvclust Plot
The resulting plot is a standard dendrogram, but with crucial annotations:
- Red Numbers (AU p-values): At each branch (or node) of the tree, you will see a red number. This is the AU p-value for the cluster formed by that branch. A value of 0.98, for example, means that the cluster is supported with approximately 98% confidence.
- Green Numbers (BP values): For comparison, the BP values are often printed in green next to the AU values.
- Grey Numbers (Edge Numbers): These are identifiers for each cluster, useful for diagnostic plots.
Highlighting Significant Clusters
Looking at a tree full of numbers can be overwhelming. The pvclust package provides handy functions to automatically highlight the clusters that meet your significance threshold.
pvrect()Draws red rectangles around the strongly supported clusters.pvpick()Returns a list of the items within those significant clusters.
You can add these to your existing plot:
# Draw rectangles around clusters with AU p-value >= 0.95 pvrect(boston_pv, alpha = 0.95, pv = "au", border = 4) # border=4 makes blue rectangles # Get the list of members in significant clusters significant_clusters <- pvpick(boston_pv, alpha = 0.95, pv = "au") print(significant_clusters) ## Output $clusters $clusters[[1]] [1] "crim" "indus" "nox" "age" "rad" "tax" "ptratio" [8] "lstat" $clusters[[2]] [1] "zn" "rm" "dis" "black" "medv" $edges [1] 9 11
The pvrect2() function from the dendextend package offers even more flexibility for drawing these rectangles, allowing you to extend them all the way down to the labels.
Diagnostic Plots
How do you know if the multiscale bootstrap fitting was reliable? The msplot() function lets you visualize the curve fitting for specific clusters. This is an advanced but important step for validating your results.
To plot the diagnostics for a few clusters (identified by their edge numbers from the main plot):
# Example: plot diagnostic for edges 2, 4, 6, and 7 msplot(boston_pv, edges = c(2, 4, 6, 7))
Best Practices when using the package pvclust In R
nbootis key: Always use a sufficient number of bootstrap replications. The package authors recommendnboot = 1000a larger sample size for reliable results. Setting it to 100, which we did in some examples, is only for quick demos.- Set a seed: Use
set.seed()before runningpvclustto ensure your results are reproducible . - Interpret AU, not just BP: Focus your conclusions on the red AU p-values, as they are the statistically sound ones.
- Combine with
pvrect: Use highlighting functions to make your plots presentation-ready and easy to interpret.
The pvclust package transforms hierarchical clustering from a descriptive tool into an inferential one. By adding p-values to your dendrograms, you can move beyond mere description and start making confident, data-driven claims about the groupings in your data.


