FIX LinearRegression sample weight bug (numpy solver) #30030
antoinebaker wants to merge 5 commits into scikit-learn:main from
Conversation
Just out of curiosity, are NumPy and SciPy using different implementations of LAPACK? Is there a way to choose the right driver in SciPy that would give the same behaviour as NumPy's, or is the issue elsewhere?

In scipy the test unfortunately fails for all
So looking a bit more, it seems that NumPy uses the GELSD driver: https://github.com/numpy/numpy/blob/v2.0.0/numpy/linalg/umath_linalg.cpp#L4037
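For reference, the two entry points expose the driver choice differently; a minimal sketch comparing them on a small, arbitrary wide problem (shapes are illustrative, not taken from the PR):

```python
import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)
# Wide problem (n_features > n_samples), the regime discussed in this PR.
X = rng.standard_normal((5, 20))
y = rng.standard_normal(5)

# NumPy's lstsq always goes through the SVD-based GELSD LAPACK routine.
coef_np, *_ = np.linalg.lstsq(X, y, rcond=None)

# SciPy lets you pick the LAPACK driver explicitly ("gelsd" is its default).
coef_sp, *_ = linalg.lstsq(X, y, lapack_driver="gelsd")

print(np.allclose(coef_np, coef_sp))
```

Both calls return the minimum-norm solution of an underdetermined full-rank system here, so they should agree to numerical precision.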
Note that before merging such a change we would need to verify whether there is any speed impact on a few synthetic datasets with various shapes and sizes. But first we need to make all the tests pass consistently. Some of the tests fail because of a future warning for the

So maybe scipy is not buggy, but it's just our way of calling it with the default

Once settled on a choice of setting scipy

As explained in #28959.

EDIT: fixed a typo in the commit message used to trigger all random seeds tests.
`test_linear_regression_sample_weight_consistency`
Indeed @ogrisel, the cut-off ratio for the singular values

I guess we need the benchmark to decide between numpy and the various options for scipy.
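To illustrate what the cut-off does, a sketch with `scipy.linalg.lstsq`'s `cond` parameter (matrix and threshold are arbitrary): singular values below `cond * s_max` are treated as zero, which changes the effective rank and therefore the solution.

```python
import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 20))
y = rng.standard_normal(5)

# Default cutoff: only singular values near machine precision are zeroed out.
_, _, rank_default, sv = linalg.lstsq(X, y, cond=None)

# Aggressive cutoff: singular values below 0.5 * sv.max() are dropped.
_, _, rank_loose, _ = linalg.lstsq(X, y, cond=0.5)

print(rank_default, rank_loose, sv)
```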
The remaining failures are only for
Could you then please open a similar but concurrent draft PR that attempts to set the
```python
                )
            self.coef_ = np.vstack([out[0] for out in outs])
        else:
            self.coef_, _, self.rank_, self.singular_ = linalg.lstsq(X, y)
```
In case we keep scipy lstsq, it might be safe here to pass `check_finite=False` since we validate `X` and `y` — or do we consider that the centering could trigger inf and nan for some reason?
Seems like a good idea to improve the performance; I'll add the option in the benchmark. After validating `X`, `y` and checking `sample_weight` I think we are safe, except that `_check_sample_weight` with `ensure_non_negative=True` still allows all-zero weights, so we should probably raise an error in that case.
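A hypothetical guard in that spirit (the helper name is made up; scikit-learn's actual `_check_sample_weight` lives in `sklearn.utils.validation`):

```python
import numpy as np

def check_sample_weight_strict(sample_weight):
    """Hypothetical validator: reject negative and all-zero weight vectors.

    An ensure_non_negative-style check alone would still accept
    np.zeros(n), which makes the weighted problem degenerate.
    """
    sample_weight = np.asarray(sample_weight, dtype=float)
    if np.any(sample_weight < 0):
        raise ValueError("Negative sample weights are not allowed.")
    if not np.any(sample_weight > 0):
        raise ValueError("At least one sample weight must be strictly positive.")
    return sample_weight
```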
`test_linear_regression_sample_weight_consistency`
I reverted to the old test (with

@ogrisel was also speaking of a more extensive test suite (beyond just the sample weight) for
The more extensive test suite would be better tackled by reviving the stalled #25948 PR.
@antoinebaker could you please add a changelog entry targeting scikit-learn 1.6?
Please also open an alternative PR that does the minimal fix to make the test pass with

I suspect that
`test_linear_regression_sample_weight_consistency`
Oh sorry @ogrisel, I forgot to link to the concurrent PR #30040. All tests are passing with the scipy solver, with

Doing some naive benchmarks, it seems that scipy with
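A naive micro-benchmark in that spirit (a single arbitrary shape; not the benchmark actually used in the PR) might look like:

```python
import timeit

import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 200))
y = rng.standard_normal(1000)

# Compare NumPy's fixed GELSD path against SciPy's selectable drivers.
solvers = {
    "numpy (gelsd)": lambda: np.linalg.lstsq(X, y, rcond=None),
    "scipy gelsd": lambda: linalg.lstsq(X, y, lapack_driver="gelsd"),
    "scipy gelsy": lambda: linalg.lstsq(X, y, lapack_driver="gelsy"),
}
for label, solve in solvers.items():
    seconds = timeit.timeit(solve, number=3)
    print(f"{label}: {seconds:.3f}s")
```

A real benchmark would sweep several (n_samples, n_features) shapes, since the fastest driver typically depends on the tall/wide regime.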
`test_linear_regression_sample_weight_consistency`
Closing in favor of #30040.


Reference Issues/PRs
#29818 and #26164 revealed that LinearRegression was failing the sample weight consistency check (using weights should be equivalent to removing/repeating samples).
Related to #22947, #25948
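The property under test can be sketched directly with `numpy.linalg.lstsq` (a tall, full-rank problem with hand-picked integer weights; weight 0 means the sample is removed, weight k means it is repeated k times):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.standard_normal((10, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(10)
w = np.array([0, 1, 2, 1, 3, 0, 1, 2, 1, 1])  # integer sample weights

# Weighted least squares via sqrt-weight rescaling of the rows...
coef_weighted, *_ = np.linalg.lstsq(
    X * np.sqrt(w)[:, None], y * np.sqrt(w), rcond=None
)

# ...should match fitting on the dataset with row i repeated w[i] times.
coef_repeated, *_ = np.linalg.lstsq(
    np.repeat(X, w, axis=0), np.repeat(y, w), rcond=None
)

print(np.allclose(coef_weighted, coef_repeated))
```

In this full-column-rank case the least-squares solution is unique, so both fits agree; the bug discussed here arises in the wide case, where the minimum-norm solution and the centering interact.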
What does this implement/fix? Explain your changes.
The `scipy.linalg.lstsq` solver fails the sample weight consistency test for wide datasets (`n_features > n_samples`) after centering `X`, `y` (as done when `fit_intercept=True`).

Using the `numpy.linalg.lstsq` solver seems to fix the bug. `test_linear_regression_sample_weight_consistency` now uses `n_features > n_samples` to fail more consistently.