Skip to content

Implement Hierarchical Clustering (Agglomerative) #15

@noahgift

Description

@noahgift

Problem Statement

Hierarchical clustering builds a tree of clusters (dendrogram), allowing analysis at multiple granularities. Currently missing from aprender.

Use Cases:

  • Taxonomy creation (biology, document organization)
  • Customer segmentation with hierarchy
  • Gene expression analysis
  • No need to pre-specify number of clusters

Proposed Solution

Implement agglomerative (bottom-up) hierarchical clustering with EXTREME TDD.

Algorithm

Linkage Methods:

  • Single: min distance between clusters
  • Complete: max distance between clusters
  • Average: mean distance between all pairs
  • Ward: minimize within-cluster variance

Steps:

  1. Start: each point is own cluster
  2. Repeat until n_clusters reached:
    • Find two closest clusters
    • Merge them
    • Update distance matrix
  3. Can cut dendrogram at any height

Implementation

Trait: UnsupervisedEstimator

API Design:

pub struct AgglomerativeClustering {
    n_clusters: usize,
    linkage: Linkage,  // Single, Complete, Average, Ward
    labels: Option<Vec<usize>>,
    dendrogram: Option<Vec<Merge>>,  // For visualization
}

pub enum Linkage {
    Single,
    Complete,
    Average,
    Ward,
}

impl UnsupervisedEstimator for AgglomerativeClustering {
    type Labels = Vec<usize>;
    fn fit(&mut self, x: &Matrix<f32>) -> Result<(), &'static str>;
    fn predict(&self, x: &Matrix<f32>) -> Vec<usize>;
}

Success Criteria

  • ✅ AgglomerativeClustering with 4 linkage methods
  • ✅ UnsupervisedEstimator trait
  • ✅ Dendrogram structure for visualization
  • ✅ 15+ tests (one per linkage method + edge cases)
  • ✅ Zero clippy warnings
  • ✅ Example: examples/hierarchical_clustering.rs

Estimated Effort

Timeline: 3-4 days
Complexity: Medium-High (efficient distance matrix updates)

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions