Skip to content

[v0.5.0] Implement Random Forest Regression #30

@noahgift

Description

@noahgift

Overview

Implement Random Forest Regressor to extend ensemble methods for regression tasks. This builds on the existing DecisionTreeRegressor (Issue #29) and RandomForestClassifier implementations.

Background

Random Forests are ensemble models that combine multiple decision trees to reduce overfitting and improve generalization. For regression, each tree predicts a continuous value, and the final prediction is the average across all trees.

Implementation Details

Core Algorithm

  • Bootstrap aggregating (Bagging): Train each tree on random sample with replacement
  • Feature subsampling: Each split considers random subset of features
  • Prediction: Average predictions from all trees
  • Tree structure: Uses DecisionTreeRegressor as base estimator

API Design

pub struct RandomForestRegressor {
    trees: Vec<DecisionTreeRegressor>,
    n_estimators: usize,
    max_depth: Option<usize>,
    min_samples_split: usize,
    min_samples_leaf: usize,
    max_features: MaxFeatures,
    random_state: Option<u64>,
    bootstrap: bool,
}

pub enum MaxFeatures {
    Sqrt,        // sqrt(n_features) - recommended default
    Log2,        // log2(n_features)
    Auto,        // Same as Sqrt for regression
    All,         // n_features (no subsampling)
    Fixed(usize), // Specific number
}

impl RandomForestRegressor {
    pub fn new() -> Self;
    pub fn with_n_estimators(mut self, n: usize) -> Self;
    pub fn with_max_depth(mut self, depth: usize) -> Self;
    pub fn with_min_samples_split(mut self, n: usize) -> Self;
    pub fn with_min_samples_leaf(mut self, n: usize) -> Self;
    pub fn with_max_features(mut self, mf: MaxFeatures) -> Self;
    pub fn with_random_state(mut self, seed: u64) -> Self;
    pub fn with_bootstrap(mut self, b: bool) -> Self;
    
    pub fn fit(&mut self, x: &Matrix, y: &Vector) -> Result<()>;
    pub fn predict(&self, x: &Matrix) -> Vector;
    pub fn score(&self, x: &Matrix, y: &Vector) -> f32; // R² score
}

Key Features

  • n_estimators: Number of trees (default: 100)
  • max_features: Features per split (default: sqrt(n_features))
  • bootstrap: Sample with replacement (default: true)
  • random_state: Reproducible random sampling
  • Inherits tree params: max_depth, min_samples_split, min_samples_leaf

Acceptance Criteria (EXTREME TDD)

Tests (15+ comprehensive tests)

  • Constructor and basic configuration
  • fit() and predict() on simple linear data
  • Predict on non-linear data (better than single tree)
  • n_estimators parameter (more trees → better predictions)
  • max_features parameter (sqrt vs all features)
  • bootstrap parameter (with/without replacement)
  • random_state reproducibility
  • score() returns R² metric
  • Comparison with single DecisionTreeRegressor (forest should be better)
  • Comparison with LinearRegression on non-linear data
  • Error handling (predict before fit, mismatched dimensions)
  • Multidimensional features (2D, 3D)
  • Edge cases (single tree, single sample)
  • Default trait implementation
  • Variance reduction vs single tree

Examples

  • examples/random_forest_regression.rs
    • Housing price prediction
    • Comparison with DecisionTreeRegressor
    • n_estimators effects (10, 50, 100 trees)
    • Overfitting comparison (RF vs single tree)
    • Feature importance demonstration (future)

Documentation

  • book/src/ml-fundamentals/ensemble-methods.md
    • Update with Random Forest regression section
    • Bootstrap aggregating explanation
    • Bias-variance tradeoff
    • Why averaging reduces variance
  • book/src/examples/random-forest-regression.md
    • Complete case study
    • When to use RF vs single tree
    • Hyperparameter tuning guide

Technical Design

Bootstrap Sampling

For each tree, sample n_train rows with replacement:

bootstrap_indices = random_choice(0..n_samples, size=n_samples, replace=true)
X_bootstrap = X[bootstrap_indices]
y_bootstrap = y[bootstrap_indices]

Feature Subsampling

At each split, randomly select max_features subset:

available_features = random_choice(0..n_features, size=max_features, replace=false)
best_split = find_best_split(X[:, available_features], y)

Prediction

Average predictions from all trees:

predictions = (1/n_trees) Σ tree_i.predict(X)

Variance Reduction

Var(avg of n predictions) = Var(single prediction) / n  (if independent)

Random sampling decorrelates trees → variance reduction

Complexity Analysis

  • Training: O(n_estimators · n · d · log n) where:
    • n_estimators = number of trees
    • n = samples
    • d = features
    • log n = average tree depth
  • Prediction: O(n_estimators · log n) average
  • Space: O(n_estimators · tree_size)

Benefits

  1. Reduced overfitting: Averaging decorrelated trees generalizes better
  2. Variance reduction: Lower prediction variance than single tree
  3. No hyperparameter tuning: Works well with defaults
  4. Feature importance: Aggregated across trees (future enhancement)
  5. Parallelizable: Trees train independently (future optimization)

Testing Strategy

  • Unit tests for bootstrap sampling logic
  • Integration tests with known datasets
  • Property tests: RF variance ≤ single tree variance
  • Comparison tests: matches sklearn on simple examples

Related Work

Priority

High - Core algorithm for v0.5.0 "Regression Trees & Advanced Ensemble Methods"

References

  • Breiman, L. (2001). "Random Forests". Machine Learning, 45(1), 5-32.
  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). "The Elements of Statistical Learning" (Chapter 15)

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions