Problem Statement
Hierarchical clustering builds a tree of clusters (dendrogram), allowing analysis at multiple granularities. Currently missing from aprender.
Use Cases:
- Taxonomy creation (biology, document organization)
- Customer segmentation with hierarchy
- Gene expression analysis
- No need to pre-specify number of clusters
Proposed Solution
Implement agglomerative (bottom-up) hierarchical clustering with EXTREME TDD.
Algorithm
Linkage Methods:
- Single: min distance between clusters
- Complete: max distance between clusters
- Average: mean distance between all pairs
- Ward: minimize within-cluster variance
Steps:
- Start: each point is own cluster
- Repeat until n_clusters reached:
- Find two closest clusters
- Merge them
- Update distance matrix
- Can cut dendrogram at any height
Implementation
Trait: UnsupervisedEstimator
API Design:
pub struct AgglomerativeClustering {
n_clusters: usize,
linkage: Linkage, // Single, Complete, Average, Ward
labels: Option<Vec<usize>>,
dendrogram: Option<Vec<Merge>>, // For visualization
}
pub enum Linkage {
Single,
Complete,
Average,
Ward,
}
impl UnsupervisedEstimator for AgglomerativeClustering {
type Labels = Vec<usize>;
fn fit(&mut self, x: &Matrix<f32>) -> Result<(), &'static str>;
fn predict(&self, x: &Matrix<f32>) -> Vec<usize>;
}
Success Criteria
- ✅ AgglomerativeClustering with 4 linkage methods
- ✅ UnsupervisedEstimator trait
- ✅ Dendrogram structure for visualization
- ✅ 15+ tests (one per linkage method + edge cases)
- ✅ Zero clippy warnings
- ✅ Example: examples/hierarchical_clustering.rs
Estimated Effort
Timeline: 3-4 days
Complexity: Medium-High (efficient distance matrix updates)
Problem Statement
Hierarchical clustering builds a tree of clusters (dendrogram), allowing analysis at multiple granularities. Currently missing from aprender.
Use Cases:
Proposed Solution
Implement agglomerative (bottom-up) hierarchical clustering with EXTREME TDD.
Algorithm
Linkage Methods:
Steps:
Implementation
Trait:
UnsupervisedEstimatorAPI Design:
Success Criteria
Estimated Effort
Timeline: 3-4 days
Complexity: Medium-High (efficient distance matrix updates)