Fix linear svc handling sample weights under class_weight="balanced" #30057

jeremiedbb merged 26 commits into scikit-learn:main from
Conversation
Thanks for the PR. Please sync again with the main branch. Please also add a changelog entry for v1.6.
We need to review how compute_class_weight is used in scikit-learn. I suspect that this problem is not just for LinearSVC but for any model that accepts class_weight="balanced". As a result we should probably change compute_class_weight to accept an additional sample_weight parameter and implement the fix within compute_class_weight itself rather than in _fit_liblinear only.
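To make the proposal concrete, here is a minimal NumPy sketch of a "balanced" computation that replaces raw class counts with weighted counts (illustrative only — the function name and structure are assumptions, not scikit-learn's actual implementation):

```python
import numpy as np


def balanced_class_weight(y, sample_weight=None):
    """Sketch: 'balanced' class weights that account for sample_weight.

    Replaces raw counts np.bincount(y_ind) with weighted counts so that
    weighting a sample by k is equivalent to repeating it k times.
    """
    classes, y_ind = np.unique(y, return_inverse=True)
    if sample_weight is None:
        sample_weight = np.ones(len(y), dtype=np.float64)
    weighted_class_counts = np.bincount(y_ind, weights=sample_weight)
    n_effective_samples = weighted_class_counts.sum()
    # n_effective_samples / (n_classes * weighted count per class)
    return n_effective_samples / (len(classes) * weighted_class_counts)


y = np.array([0, 0, 0, 1])
# Weighting the single class-1 sample by 3 should match repeating it 3 times.
w = np.array([1.0, 1.0, 1.0, 3.0])
y_rep = np.array([0, 0, 0, 1, 1, 1])
print(balanced_class_weight(y, w))    # same result as the repeated dataset
print(balanced_class_weight(y_rep))
```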
I have merged it into the main branch and checked the check_sample_weight_equivalence test, with the following output. Perhaps I need to dig in a bit more; it seems like the fix hasn't fully solved the issue.
I also noticed other failures related to the liblinear solver used elsewhere. We need to investigate whether those are caused by fundamental limitations of the liblinear solver, an inappropriate expectation on the numerical precision of the solver on particularly challenging test data with many repeated data points (a "bug" in the sample_weight estimator check itself), or the way we (mis-)use liblinear.

So I think we can keep the xfail marker for this estimator and merge the fix from this PR independently of the resolution of the problem described in the previous paragraph.
sklearn/utils/class_weight.py (outdated diff excerpt):

```python
raise ValueError("classes should have valid labels that are in y")

recip_freq = len(y) / (len(le.classes_) * np.bincount(y_ind).astype(np.float64))
n_effective_samples = np.bincount(y_ind, weights=sample_weight).sum()
```
The result np.bincount(y_ind, weights=sample_weight) should be stored in a local variable to avoid computing it twice in a row.
ogrisel left a comment:
The CI is still failing: since the new parameter is added to a function that is part of the scikit-learn public API, it needs to be documented in the docstring. We should also use the .. versionadded:: 1.6 directive on this part of the docstring. Look for other examples of usage of versionadded in our codebase as reference.

And since we are adding a new parameter to a public function, we also need to document this part of the PR as an enhancement in a dedicated file entry under the doc/whats_new/upcoming_changes/sklearn.utils folder.

We also need a fix changelog entry for the actual fix for the sklearn.svm.LinearSVC estimator impacted by this change.
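As a reference for where the directive goes, here is a hypothetical numpydoc-style sketch (the function name and description are placeholders, not the actual scikit-learn docstring):

```python
def compute_class_weight_sketch(class_weight, *, classes, y, sample_weight=None):
    """Illustrative signature for a public function gaining a new parameter.

    Parameters
    ----------
    sample_weight : array-like of shape (n_samples,), default=None
        Array of weights that are assigned to individual samples.

        .. versionadded:: 1.6
    """
```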
Oh no, I think I did something wrong in my git pull origin --rebase, and it has all the changes of the past few days; will try to revert it.
…imator that is known to handle sample_weight properly
1f59cc6 to defba7b
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@snath-xoc since we merged #30149, we need to move the xfail markers accordingly. This explains the remaining conflicts on this PR.
Oh great, will move the xfails then, thanks for the pointer.
ogrisel left a comment:
assert_array_equal should be reserved for comparing arrays with non-continuous dtypes (e.g. integers, str, object...).

To compare floating-point arrays, we should almost always use assert_allclose (or keep assert_array_almost_equal in older code that already uses that function).

Similarly, to compare Python scalar float values, it's better to use assert a == pytest.approx(b) instead of assert a == b to avoid failures related to rounding errors.

Other than that and the other small suggestions below, LGTM.
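For illustration, here is that convention sketched with stdlib and NumPy tools (pytest.approx itself is only mentioned in a comment so the example stays dependency-free):

```python
import math

import numpy as np
from numpy.testing import assert_allclose, assert_array_equal

labels = np.array([0, 1, 2, 2])            # integer dtype: exact comparison is fine
assert_array_equal(labels, [0, 1, 2, 2])

weights = np.array([2.0 / 3.0, 2.0, 1.0])  # floats: use a tolerance-based check
assert_allclose(weights, [2.0 / 3.0, 2.0, 1.0])

# Exact equality on floats is brittle:
assert 0.1 + 0.2 != 0.3
# For scalar floats, prefer `assert a == pytest.approx(b)`; the stdlib
# equivalent used here is math.isclose.
assert math.isclose(0.1 + 0.2, 0.3)
```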
Test diff excerpt:

```python
assert_array_almost_equal(cw, [2.0 / 3, 2.0, 1.0])
assert len(cw_rep) == len(classes)

class_counts = np.bincount(y + 2, weights=sw)
```
Maybe rename this to class_counts_weighted and the other one to class_counts_repeated to make the assertions easier to follow? Similarly for cw and cw_rep.
doc/whats_new/upcoming_changes/sklearn.linear_model/30057.fix.rst
Maybe we could update the docstring to reflect the changes (scikit-learn/sklearn/utils/class_weight.py, lines 26 to 31 in c4e1819). Maybe in plain English, something like:
There are a few estimators supporting class_weight that compute class weights the same way:

- scikit-learn/sklearn/ensemble/_forest.py, line 879 in c4e1819
- scikit-learn/sklearn/linear_model/_stochastic_gradient.py, lines 617 to 620 in c4e1819
- Idem for scikit-learn/sklearn/svm/_base.py, line 748 in c4e1819
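The failure mode at those call sites is the same one this PR fixes: with sample_weight ignored, the balanced formula disagrees between a weighted dataset and its repeated-sample equivalent. A small NumPy demonstration (not scikit-learn code):

```python
import numpy as np


def unweighted_balanced(y):
    # The pre-fix formula: n_samples / (n_classes * np.bincount(y))
    classes, y_ind = np.unique(y, return_inverse=True)
    return len(y) / (len(classes) * np.bincount(y_ind))


y = np.array([0, 0, 0, 1])            # one class-1 sample, meant to carry weight 3...
y_rep = np.array([0, 0, 0, 1, 1, 1])  # ...vs. the same sample repeated 3 times

print(unweighted_balanced(y))      # class 1 gets weight 2.0
print(unweighted_balanced(y_rep))  # class 1 gets weight 1.0 -- they disagree
```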
Thank you for that @antoinebaker. I can start looking into these next week; I recall looking into BaseSVC as well, and the fix should be somewhat similar(-ish).
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Diff excerpt:

```python
raise ValueError("classes should have valid labels that are in y")

recip_freq = len(y) / (len(le.classes_) * np.bincount(y_ind).astype(np.float64))
weighted_class_counts = np.bincount(y_ind, weights=sample_weight)
```
We need to validate sample_weight first using _check_sample_weight.
@snath-xoc, this PR is almost ready for merge. This is the last thing. Something like:

```python
sample_weight = _check_sample_weight(sample_weight, y)
weighted_class_counts = np.bincount(y_ind, weights=sample_weight)
```
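For readers without the scikit-learn source at hand, a rough stand-in for what that validation step does (simplified and hypothetical — the real _check_sample_weight also handles dtypes, scalar inputs, and more):

```python
import numpy as np


def check_sample_weight_sketch(sample_weight, y):
    """Normalize None to an all-ones float array and validate the length."""
    if sample_weight is None:
        return np.ones(len(y), dtype=np.float64)
    sample_weight = np.asarray(sample_weight, dtype=np.float64)
    if sample_weight.shape != (len(y),):
        raise ValueError("sample_weight has an incompatible shape")
    return sample_weight


y = [0, 1, 1]
print(check_sample_weight_sketch(None, y))       # [1. 1. 1.]
print(check_sample_weight_sketch([2, 1, 1], y))  # [2. 1. 1.]
```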
Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
Fixes #30056 and cross-linked to meta-issue #16298
Under the class_weight='balanced' strategy, class weights are calculated as:

```python
n_samples / (n_classes * np.bincount(y))
```

Sample weights were previously not passed through under this strategy, leading to different class weights when calculated on weighted vs. repeated samples.
TO DO: