LinearRegression: Support for small sample sizes or graceful degradation #4

@noahgift

Problem

When training LinearRegression with fewer samples than features (underdetermined system), the model fails with:

Matrix is not positive definite

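This happens because the normal equations are solved with a Cholesky decomposition of X^T X, which requires that matrix to be positive definite. X^T X is n_features × n_features but has rank at most min(n_samples, n_features), so with n_samples < n_features it is singular and the decomposition fails. Note that n_samples ≥ n_features is necessary but not sufficient: X must also have full column rank.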
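
To make the failure concrete, here is a minimal std-only sketch that builds X^T X for a tiny case and attempts a textbook Cholesky factorization. The helper names `gram` and `cholesky` are hypothetical and are not aprender's internals; matrices are plain row-major slices.

```rust
/// Computes the Gram matrix X^T X for a row-major `x` with `n` rows and `p` columns.
fn gram(x: &[f64], n: usize, p: usize) -> Vec<f64> {
    let mut g = vec![0.0; p * p];
    for i in 0..p {
        for j in 0..p {
            for k in 0..n {
                g[i * p + j] += x[k * p + i] * x[k * p + j];
            }
        }
    }
    g
}

/// Attempts a Cholesky factorization A = L L^T of a symmetric `p x p` matrix.
/// Returns `None` as soon as a pivot is not strictly positive,
/// i.e. when A is not positive definite.
fn cholesky(a: &[f64], p: usize) -> Option<Vec<f64>> {
    let mut l = vec![0.0; p * p];
    for i in 0..p {
        for j in 0..=i {
            let mut sum = a[i * p + j];
            for k in 0..j {
                sum -= l[i * p + k] * l[j * p + k];
            }
            if i == j {
                if sum <= 0.0 {
                    return None; // not positive definite
                }
                l[i * p + j] = sum.sqrt();
            } else {
                l[i * p + j] = sum / l[j * p + j];
            }
        }
    }
    Some(l)
}
```

With one sample and two features, `cholesky(&gram(&[1.0, 2.0], 1, 2), 2)` hits a zero pivot and returns `None`. With real-valued data the pivot may instead come out as a tiny positive or negative number due to rounding, so a production check would likely compare against a tolerance rather than exactly zero.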

Current Behavior

let x = Matrix::from_vec(3, 18, features)?; // 3 samples, 18 features
let y = Vector::from_vec(vec![1.0, 0.0, 1.0]);
let mut model = LinearRegression::new();
model.fit(&x, &y)?; // ERROR: Matrix is not positive definite

Desired Behavior

Option 1: Ridge Regression (L2 Regularization)

Add regularization to make the system solvable:

let model = LinearRegression::new().with_regularization(0.01);
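
As a rough illustration of why this works: adding lambda to the diagonal makes X^T X + lambda I positive definite for any lambda > 0, so Cholesky succeeds even when n_samples < n_features. The sketch below is std-only and standalone; `ridge_fit` is a hypothetical helper (not the proposed `with_regularization` API), it operates on row-major slices, and it ignores the intercept term.

```rust
/// Hypothetical helper, not aprender's API: solves (X^T X + lambda I) w = X^T y
/// for a row-major `x` with `n` rows and `p` columns. No intercept is fitted.
fn ridge_fit(x: &[f64], y: &[f64], n: usize, p: usize, lambda: f64) -> Option<Vec<f64>> {
    // Build A = X^T X + lambda I and b = X^T y.
    let mut a = vec![0.0; p * p];
    let mut b = vec![0.0; p];
    for i in 0..p {
        for k in 0..n {
            b[i] += x[k * p + i] * y[k];
            for j in 0..p {
                a[i * p + j] += x[k * p + i] * x[k * p + j];
            }
        }
        a[i * p + i] += lambda;
    }
    // In-place Cholesky: the lower triangle of `a` is overwritten with L.
    for i in 0..p {
        for j in 0..=i {
            let mut sum = a[i * p + j];
            for k in 0..j {
                sum -= a[i * p + k] * a[j * p + k];
            }
            if i == j {
                if sum <= 0.0 {
                    return None; // not positive definite (e.g. lambda == 0)
                }
                a[i * p + j] = sum.sqrt();
            } else {
                a[i * p + j] = sum / a[j * p + j];
            }
        }
    }
    // Forward substitution: solve L z = b (z stored in `b`).
    for i in 0..p {
        for k in 0..i {
            b[i] -= a[i * p + k] * b[k];
        }
        b[i] /= a[i * p + i];
    }
    // Back substitution: solve L^T w = z (w stored in `b`).
    for i in (0..p).rev() {
        for k in (i + 1)..p {
            b[i] -= a[k * p + i] * b[k];
        }
        b[i] /= a[i * p + i];
    }
    Some(b)
}
```

For one sample `[1.0, 2.0]` with target `1.0`, `ridge_fit(.., 0.01)` returns weights close to `[1/5.01, 2/5.01]`, while `lambda = 0.0` reproduces the current failure and returns `None`.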

Option 2: Graceful Error Messages

Return a clear error explaining the constraint:

Err("LinearRegression requires n_samples >= n_features (got 3 samples, 18 features). Consider using Ridge regression or collecting more training data.")
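
One way this could look is a dedicated error variant plus an up-front shape check before the decomposition is attempted. The `FitError` type and `check_shape` function below are hypothetical names for illustration and are not part of aprender's current API.

```rust
use std::fmt;

/// Hypothetical error type; not part of aprender's current API.
#[derive(Debug, PartialEq)]
enum FitError {
    Underdetermined { n_samples: usize, n_features: usize },
}

impl fmt::Display for FitError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            FitError::Underdetermined { n_samples, n_features } => write!(
                f,
                "LinearRegression requires n_samples >= n_features \
                 (got {} samples, {} features). Consider using Ridge \
                 regression or collecting more training data.",
                n_samples, n_features
            ),
        }
    }
}

impl std::error::Error for FitError {}

/// Rejects underdetermined inputs before any factorization is attempted.
fn check_shape(n_samples: usize, n_features: usize) -> Result<(), FitError> {
    if n_samples < n_features {
        Err(FitError::Underdetermined { n_samples, n_features })
    } else {
        Ok(())
    }
}
```

A structured variant keeps the counts available to callers that want to react programmatically (for example, by switching to a regularized fit), while `Display` still produces the human-readable message.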

Option 3: Pseudo-inverse (SVD-based)

Use SVD-based Moore-Penrose pseudo-inverse instead of Cholesky decomposition for underdetermined systems.
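
For intuition, when X has full row rank the pseudo-inverse has the closed form X^T (X X^T)^-1, which gives the minimum-norm solution that fits the training samples exactly. The std-only sketch below uses that closed form with a Cholesky solve on the small n_samples × n_samples matrix; it is not an SVD and does not handle rank-deficient X, which is the case where an SVD-based implementation would be needed. `min_norm_fit` is a hypothetical name.

```rust
/// Hypothetical helper: minimum-norm solution w = X^T (X X^T)^-1 y for a wide,
/// full-row-rank `x` (row-major, `n` rows, `p` columns, n <= p).
/// Returns `None` if X X^T is not positive definite (X lacks full row rank).
fn min_norm_fit(x: &[f64], y: &[f64], n: usize, p: usize) -> Option<Vec<f64>> {
    // K = X X^T, an n x n matrix.
    let mut k_mat = vec![0.0; n * n];
    for i in 0..n {
        for j in 0..n {
            for c in 0..p {
                k_mat[i * n + j] += x[i * p + c] * x[j * p + c];
            }
        }
    }
    // In-place Cholesky: the lower triangle of `k_mat` is overwritten with L.
    for i in 0..n {
        for j in 0..=i {
            let mut sum = k_mat[i * n + j];
            for k in 0..j {
                sum -= k_mat[i * n + k] * k_mat[j * n + k];
            }
            if i == j {
                if sum <= 0.0 {
                    return None;
                }
                k_mat[i * n + j] = sum.sqrt();
            } else {
                k_mat[i * n + j] = sum / k_mat[j * n + j];
            }
        }
    }
    // Solve K alpha = y: forward substitution, then back substitution.
    let mut alpha = y.to_vec();
    for i in 0..n {
        for k in 0..i {
            alpha[i] -= k_mat[i * n + k] * alpha[k];
        }
        alpha[i] /= k_mat[i * n + i];
    }
    for i in (0..n).rev() {
        for k in (i + 1)..n {
            alpha[i] -= k_mat[k * n + i] * alpha[k];
        }
        alpha[i] /= k_mat[i * n + i];
    }
    // w = X^T alpha.
    let mut w = vec![0.0; p];
    for i in 0..n {
        for c in 0..p {
            w[c] += x[i * p + c] * alpha[i];
        }
    }
    Some(w)
}
```

Unlike the ridge approach, this needs no tuning parameter, but it interpolates the training data exactly and so may overfit noisy targets.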

Use Case

Real-world ML applications often start with small datasets and need graceful handling:

  • Early-stage training with limited data
  • Cross-validation with small folds
  • Incremental learning scenarios

Impact

This affects the migration of PMAT's mutation testing ML predictor from linfa to aprender. The predictor currently falls back to a statistical baseline whenever model training fails.

References

  • scikit-learn handles this via Ridge estimator with alpha parameter
  • linfa-linear has ridge parameter for regularization
  • Pure Rust implementation could use nalgebra's SVD for pseudo-inverse

Labels: enhancement
