Fix linear svc handling sample weights under class_weight="balanced" #30057

jeremiedbb merged 26 commits into scikit-learn:main from
Conversation
Thanks for the PR. Please sync again with the main branch. Please also add a changelog entry for v1.6.
We need to review how compute_class_weight is used in scikit-learn. I suspect that this problem is not just for LinearSVC but for any model that accepts class_weight="balanced". As a result we should probably change compute_class_weight to accept an additional sample_weight parameter and implement the fix within compute_class_weight itself rather than in _fit_liblinear only.
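To make the proposal concrete, here is a minimal NumPy sketch of a "balanced" computation that replaces raw class counts with weighted counts (illustrative only — the function name and structure are assumptions, not scikit-learn's actual implementation):

```python
import numpy as np


def balanced_class_weight(y, sample_weight=None):
    """Sketch: 'balanced' class weights that account for sample_weight.

    Replaces raw counts np.bincount(y_ind) with weighted counts so that
    weighting a sample by k is equivalent to repeating it k times.
    """
    classes, y_ind = np.unique(y, return_inverse=True)
    if sample_weight is None:
        sample_weight = np.ones(len(y), dtype=np.float64)
    weighted_class_counts = np.bincount(y_ind, weights=sample_weight)
    n_effective_samples = weighted_class_counts.sum()
    # n_effective_samples / (n_classes * weighted count per class)
    return n_effective_samples / (len(classes) * weighted_class_counts)


y = np.array([0, 0, 0, 1])
# Weighting the single class-1 sample by 3 should match repeating it 3 times.
w = np.array([1.0, 1.0, 1.0, 3.0])
y_rep = np.array([0, 0, 0, 1, 1, 1])
print(balanced_class_weight(y, w))    # same result as the repeated dataset
print(balanced_class_weight(y_rep))
```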
I have merged it into the main branch and checked the check_sample_weight_equivalence test, with the following output. Perhaps I need to dig in a bit more; it seems like the fix hasn't fully solved the issue.
I also noticed other failures related to the liblinear solver used elsewhere. We need to investigate whether those are caused by fundamental limitations of the liblinear solver, an inappropriate expectation on the numerical precision of the solver on particularly challenging test data with many repeated data points (a "bug" in the sample_weight estimator check itself), or the way we (mis-)use liblinear.

So I think we can keep the xfail marker for this estimator and merge the fix from this PR independently of the resolution of the problem described in the previous paragraph.
sklearn/utils/class_weight.py (outdated diff excerpt):

```python
raise ValueError("classes should have valid labels that are in y")

recip_freq = len(y) / (len(le.classes_) * np.bincount(y_ind).astype(np.float64))
n_effective_samples = np.bincount(y_ind, weights=sample_weight).sum()
```
The result np.bincount(y_ind, weights=sample_weight) should be stored in a local variable to avoid computing it twice in a row.
ogrisel left a comment:
The CI is still failing: since the new parameter is added to a function that is part of the scikit-learn public API, it needs to be documented in the docstring. We should also use the .. versionadded:: 1.6 directive on this part of the docstring. Look for other examples of usage of versionadded in our codebase as reference.

And since we are adding a new parameter to a public function, we also need to document this part of the PR as an enhancement in a dedicated file entry under the doc/whats_new/upcoming_changes/sklearn.utils folder.

We also need a fix changelog entry for the actual fix for the sklearn.svm.LinearSVC estimator impacted by this change.
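As a reference for where the directive goes, here is a hypothetical numpydoc-style sketch (the function name and description are placeholders, not the actual scikit-learn docstring):

```python
def compute_class_weight_sketch(class_weight, *, classes, y, sample_weight=None):
    """Illustrative signature for a public function gaining a new parameter.

    Parameters
    ----------
    sample_weight : array-like of shape (n_samples,), default=None
        Array of weights that are assigned to individual samples.

        .. versionadded:: 1.6
    """
```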
Oh no, I think I did something wrong in my git pull origin --rebase, and it has all the changes of the past few days; will try to revert it.
…imator that is known to handle sample_weight properly
1f59cc6 to defba7b
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@snath-xoc since we merged #30149, we need to move the xfail markers accordingly. This explains the remaining conflicts on this PR.
Oh great, will move the xfails then, thanks for the pointer.
ogrisel left a comment:
assert_array_equal should be reserved for comparing arrays with non-continuous dtypes (e.g. integers, str, object...).

To compare floating-point arrays, we should almost always use assert_allclose (or keep assert_array_almost_equal in older code that already uses that function).

Similarly, to compare Python scalar float values, it's better to use assert a == pytest.approx(b) instead of assert a == b to avoid failures related to rounding errors.

Other than that and the other small suggestions below, LGTM.
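For illustration, here is that convention sketched with stdlib and NumPy tools (pytest.approx itself is only mentioned in a comment so the example stays dependency-free):

```python
import math

import numpy as np
from numpy.testing import assert_allclose, assert_array_equal

labels = np.array([0, 1, 2, 2])            # integer dtype: exact comparison is fine
assert_array_equal(labels, [0, 1, 2, 2])

weights = np.array([2.0 / 3.0, 2.0, 1.0])  # floats: use a tolerance-based check
assert_allclose(weights, [2.0 / 3.0, 2.0, 1.0])

# Exact equality on floats is brittle:
assert 0.1 + 0.2 != 0.3
# For scalar floats, prefer `assert a == pytest.approx(b)`; the stdlib
# equivalent used here is math.isclose.
assert math.isclose(0.1 + 0.2, 0.3)
```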
Test diff excerpt:

```python
assert_array_almost_equal(cw, [2.0 / 3, 2.0, 1.0])
assert len(cw_rep) == len(classes)

class_counts = np.bincount(y + 2, weights=sw)
```
Maybe rename this to class_counts_weighted and the other one to class_counts_repeated to make the assertions easier to follow? Similarly for cw and cw_rep.
doc/whats_new/upcoming_changes/sklearn.linear_model/30057.fix.rst
Maybe we could update the docstring to reflect the changes (scikit-learn/sklearn/utils/class_weight.py, lines 26 to 31 in c4e1819). Maybe in plain English, something like:
There are a few estimators supporting class_weight that compute class weights the same way:

- scikit-learn/sklearn/ensemble/_forest.py, line 879 in c4e1819
- scikit-learn/sklearn/linear_model/_stochastic_gradient.py, lines 617 to 620 in c4e1819
- Idem for scikit-learn/sklearn/svm/_base.py, line 748 in c4e1819
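The failure mode at those call sites is the same one this PR fixes: with sample_weight ignored, the balanced formula disagrees between a weighted dataset and its repeated-sample equivalent. A small NumPy demonstration (not scikit-learn code):

```python
import numpy as np


def unweighted_balanced(y):
    # The pre-fix formula: n_samples / (n_classes * np.bincount(y))
    classes, y_ind = np.unique(y, return_inverse=True)
    return len(y) / (len(classes) * np.bincount(y_ind))


y = np.array([0, 0, 0, 1])            # one class-1 sample, meant to carry weight 3...
y_rep = np.array([0, 0, 0, 1, 1, 1])  # ...vs. the same sample repeated 3 times

print(unweighted_balanced(y))      # class 1 gets weight 2.0
print(unweighted_balanced(y_rep))  # class 1 gets weight 1.0 -- they disagree
```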
Thank you for that @antoinebaker. I can start looking into these next week; I recall looking into BaseSVC as well, and the fix should be somewhat similar(-ish).
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Diff excerpt:

```python
raise ValueError("classes should have valid labels that are in y")

recip_freq = len(y) / (len(le.classes_) * np.bincount(y_ind).astype(np.float64))
weighted_class_counts = np.bincount(y_ind, weights=sample_weight)
```
We need to validate sample_weight first using _check_sample_weight.
@snath-xoc, this PR is almost ready for merge. This is the last thing. Something like:

```python
sample_weight = _check_sample_weight(sample_weight, y)
weighted_class_counts = np.bincount(y_ind, weights=sample_weight)
```
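For readers without the scikit-learn source at hand, a rough stand-in for what that validation step does (simplified and hypothetical — the real _check_sample_weight also handles dtypes, scalar inputs, and more):

```python
import numpy as np


def check_sample_weight_sketch(sample_weight, y):
    """Normalize None to an all-ones float array and validate the length."""
    if sample_weight is None:
        return np.ones(len(y), dtype=np.float64)
    sample_weight = np.asarray(sample_weight, dtype=np.float64)
    if sample_weight.shape != (len(y),):
        raise ValueError("sample_weight has an incompatible shape")
    return sample_weight


y = [0, 1, 1]
print(check_sample_weight_sketch(None, y))       # [1. 1. 1.]
print(check_sample_weight_sketch([2, 1, 1], y))  # [2. 1. 1.]
```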
Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
Fixes #30056 and cross-linked to meta-issue #16298
Under the class_weight='balanced' strategy, class weights are calculated as:

```python
n_samples / (n_classes * np.bincount(y))
```

Sample weights were previously not passed through under this strategy, leading to different class weights when calculated on weighted vs. repeated samples.
TO DO: