Skip to content

[v0.5.0] Implement Decision Tree Regression (CART) #29

@noahgift

Description

@noahgift

Overview

Implement Decision Tree Regressor using the CART (Classification and Regression Trees) algorithm for regression tasks.

Background

We currently have DecisionTreeClassifier for classification. This issue extends the decision tree implementation to support regression by predicting continuous values instead of discrete classes.

Implementation Details

Core Algorithm

  • Splitting criterion: Mean Squared Error (MSE) or Mean Absolute Error (MAE)
  • Prediction: Mean of target values in leaf nodes
  • Tree structure: Reuse existing TreeNode structure or extend for regression

API Design

pub struct DecisionTreeRegressor {
    max_depth: Option<usize>,
    min_samples_split: usize,
    min_samples_leaf: usize,
    criterion: RegressionCriterion, // MSE or MAE
    random_state: Option<u64>,
}

pub enum RegressionCriterion {
    MeanSquaredError,
    MeanAbsoluteError,
}

impl Estimator for DecisionTreeRegressor {
    fn fit(&mut self, x: &Matrix, y: &Vector) -> Result<(), &'static str>;
    fn predict(&self, x: &Matrix) -> Vector;
    fn score(&self, x: &Matrix, y: &Vector) -> f32; // R² score
}

Key Features

  • Configurable max_depth (prevent overfitting)
  • min_samples_split and min_samples_leaf (pruning parameters)
  • Support for MSE and MAE splitting criteria
  • Reproducible with random_state for tie-breaking
  • feature_importances() method

Acceptance Criteria (EXTREME TDD)

Tests (15+ comprehensive tests)

  • Constructor and basic fitting
  • Predict on simple linear data
  • Predict on non-linear data (quadratic, sinusoidal)
  • MSE criterion splitting
  • MAE criterion splitting
  • max_depth parameter limits tree depth
  • min_samples_split prevents oversplitting
  • min_samples_leaf ensures minimum leaf size
  • score() returns R² metric
  • Reproducibility with random_state
  • Error handling (predict before fit)
  • Multidimensional features (2D, 3D)
  • Feature importance calculation
  • Edge cases (single sample, constant target)
  • Comparison with LinearRegression on linear data

Examples

  • examples/decision_tree_regression.rs
    • Boston housing price prediction
    • Comparison with LinearRegression
    • max_depth effects visualization
    • Feature importance demonstration

Documentation

  • book/src/ml-fundamentals/decision-trees.md
    • Update theory chapter to cover regression trees
    • CART algorithm explanation
    • MSE vs MAE criterion comparison
  • book/src/examples/decision-tree-regression.md
    • Case study walkthrough
    • When to use trees vs linear models
    • Hyperparameter tuning guidance

Technical Design

MSE Criterion

For a split at value v on feature f:

MSE(split) = (N_left / N) * Var(y_left) + (N_right / N) * Var(y_right)

Choose split that minimizes MSE.

Prediction

For a new sample, traverse tree to leaf node and return:

prediction = mean(y_samples_in_leaf)

Feature Importance

importance(f) = Σ (weighted_samples * improvement_in_MSE) for all splits on feature f

Complexity Analysis

  • Training: O(n · d · n log n) - n samples, d features
  • Prediction: O(log n) average, O(n) worst case (unbalanced tree)
  • Space: O(n) for tree structure

Benefits

  1. Non-linear regression: Captures non-linear relationships without feature engineering
  2. Interpretability: Tree structure is human-readable
  3. Feature importance: Automatically identifies important features
  4. No assumptions: Works without assuming data distribution
  5. Foundation for RF: Required for Random Forest Regression

Testing Strategy

  • Unit tests for splitting criteria
  • Integration tests with known datasets
  • Property tests: predictions within range of training data
  • Comparison tests: matches sklearn on simple examples

Related Work

  • Existing: DecisionTreeClassifier (src/tree/mod.rs)
  • Future: RandomForestRegressor (Issue #TBD)
  • Depends on: Regression metrics (R², MSE) - already implemented

Priority

High - First step toward v0.5.0 "Regression Trees & Advanced Ensemble Methods"

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions