Overview
Implement Random Forest Regressor to extend ensemble methods for regression tasks. This builds on the existing DecisionTreeRegressor (Issue #29) and RandomForestClassifier implementations.
Background
Random Forests are ensemble models that combine multiple decision trees to reduce overfitting and improve generalization. For regression, each tree predicts a continuous value, and the final prediction is the average across all trees.
Implementation Details
Core Algorithm
- Bootstrap aggregating (Bagging): Train each tree on random sample with replacement
- Feature subsampling: Each split considers random subset of features
- Prediction: Average predictions from all trees
- Tree structure: Uses DecisionTreeRegressor as base estimator
API Design
pub struct RandomForestRegressor {
trees: Vec<DecisionTreeRegressor>,
n_estimators: usize,
max_depth: Option<usize>,
min_samples_split: usize,
min_samples_leaf: usize,
max_features: MaxFeatures,
random_state: Option<u64>,
bootstrap: bool,
}
pub enum MaxFeatures {
Sqrt, // sqrt(n_features) - recommended default
Log2, // log2(n_features)
Auto, // Same as Sqrt for regression
All, // n_features (no subsampling)
Fixed(usize), // Specific number
}
impl RandomForestRegressor {
pub fn new() -> Self;
pub fn with_n_estimators(mut self, n: usize) -> Self;
pub fn with_max_depth(mut self, depth: usize) -> Self;
pub fn with_min_samples_split(mut self, n: usize) -> Self;
pub fn with_min_samples_leaf(mut self, n: usize) -> Self;
pub fn with_max_features(mut self, mf: MaxFeatures) -> Self;
pub fn with_random_state(mut self, seed: u64) -> Self;
pub fn with_bootstrap(mut self, b: bool) -> Self;
pub fn fit(&mut self, x: &Matrix, y: &Vector) -> Result<()>;
pub fn predict(&self, x: &Matrix) -> Vector;
pub fn score(&self, x: &Matrix, y: &Vector) -> f32; // R² score
}
Key Features
- n_estimators: Number of trees (default: 100)
- max_features: Features per split (default: sqrt(n_features))
- bootstrap: Sample with replacement (default: true)
- random_state: Reproducible random sampling
- Inherits tree params: max_depth, min_samples_split, min_samples_leaf
Acceptance Criteria (EXTREME TDD)
Tests (15+ comprehensive tests)
Examples
Documentation
Technical Design
Bootstrap Sampling
For each tree, sample n_train rows with replacement:
bootstrap_indices = random_choice(0..n_samples, size=n_samples, replace=true)
X_bootstrap = X[bootstrap_indices]
y_bootstrap = y[bootstrap_indices]
Feature Subsampling
At each split, randomly select max_features subset:
available_features = random_choice(0..n_features, size=max_features, replace=false)
best_split = find_best_split(X[:, available_features], y)
Prediction
Average predictions from all trees:
predictions = (1/n_trees) Σ tree_i.predict(X)
Variance Reduction
Var(avg of n predictions) = Var(single prediction) / n (if independent)
Random sampling decorrelates trees → variance reduction
Complexity Analysis
- Training: O(n_estimators · n · d · log n) where:
- n_estimators = number of trees
- n = samples
- d = features
- log n = average tree depth
- Prediction: O(n_estimators · log n) average
- Space: O(n_estimators · tree_size)
Benefits
- Reduced overfitting: Averaging decorrelated trees generalizes better
- Variance reduction: Lower prediction variance than single tree
- No hyperparameter tuning: Works well with defaults
- Feature importance: Aggregated across trees (future enhancement)
- Parallelizable: Trees train independently (future optimization)
Testing Strategy
- Unit tests for bootstrap sampling logic
- Integration tests with known datasets
- Property tests: RF variance ≤ single tree variance
- Comparison tests: matches sklearn on simple examples
Related Work
Priority
High - Core algorithm for v0.5.0 "Regression Trees & Advanced Ensemble Methods"
References
- Breiman, L. (2001). "Random Forests". Machine Learning, 45(1), 5-32.
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). "The Elements of Statistical Learning" (Chapter 15)
Overview
Implement Random Forest Regressor to extend ensemble methods for regression tasks. This builds on the existing DecisionTreeRegressor (Issue #29) and RandomForestClassifier implementations.
Background
Random Forests are ensemble models that combine multiple decision trees to reduce overfitting and improve generalization. For regression, each tree predicts a continuous value, and the final prediction is the average across all trees.
Implementation Details
Core Algorithm
API Design
Key Features
Acceptance Criteria (EXTREME TDD)
Tests (15+ comprehensive tests)
Examples
Documentation
Technical Design
Bootstrap Sampling
For each tree, sample n_train rows with replacement:
Feature Subsampling
At each split, randomly select max_features subset:
Prediction
Average predictions from all trees:
Variance Reduction
Random sampling decorrelates trees → variance reduction
Complexity Analysis
Benefits
Testing Strategy
Related Work
RandomForestClassifier(src/tree/mod.rs)DecisionTreeRegressor(src/tree/mod.rs, Issue [v0.5.0] Implement Decision Tree Regression (CART) #29)Priority
High - Core algorithm for v0.5.0 "Regression Trees & Advanced Ensemble Methods"
References