Added sample weight handling to BinMapper under HGBT#29641
adrinjalali merged 96 commits into scikit-learn:main from
Conversation
I have not analysed the test failures yet, but here is some early feedback.
For the case n_samples > subsample, we would need to conduct a statistical analysis to check that the repeated/reweighted equivalence holds for the binning procedure in expectation.
Once the tests pass for the deterministic case, we should conduct such an analysis (e.g. in a side notebook, not included in the repo, where we rerun the binning for many different values of random_state and then check that the mean bin edges match).
Please add a TODO item to the description of this PR so we do not forget about this.
sklearn/ensemble/_hist_gradient_boosting/tests/test_gradient_boosting.py
…oosting.py Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Statistical tests were conducted for n_samples > subsample (subsample=int(2e5)) using the following code:

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import kstest

from sklearn.ensemble._hist_gradient_boosting.binning import _BinMapper

BONFERRONI_CORRECTION = 1  # to adjust later according to the agreed test dimension

rng = np.random.RandomState(0)
n_samples = int(3e5)
X = rng.randint(0, 30, size=(n_samples, 3))

# Use random integers (including zero) as weights.
sw = rng.randint(0, 5, size=n_samples)
X_repeated = np.repeat(X, sw, axis=0)
assert len(X_repeated) > int(2e5)

bin_thresholds_weighted = []
bin_thresholds_repeated = []
for seed in range(100):
    est_weighted = _BinMapper(n_bins=6, random_state=seed).fit(X, sample_weight=sw)
    est_repeated = _BinMapper(n_bins=6, random_state=seed + 500).fit(
        X_repeated, sample_weight=None
    )
    bin_thresholds_weighted.append(est_weighted.bin_thresholds_)
    bin_thresholds_repeated.append(est_repeated.bin_thresholds_)

bin_thresholds_weighted = np.asarray(bin_thresholds_weighted)
bin_thresholds_repeated = np.asarray(bin_thresholds_repeated)

# One subplot per (feature, threshold) pair: compare the distributions of the
# weighted and repeated bin thresholds across seeds with a two-sample KS test.
fig, axs = plt.subplots(3, 4, figsize=(14, 12))
j = 0
for i, ax in enumerate(axs.flatten()):
    if i > 0 and i % 4 == 0:
        j += 1
    ax.hist(bin_thresholds_weighted[:, j, i % 4].flatten())
    ax.hist(bin_thresholds_repeated[:, j, i % 4].flatten(), alpha=0.5)
    pval = kstest(
        bin_thresholds_weighted[:, j, i % 4].flatten(),
        bin_thresholds_repeated[:, j, i % 4].flatten(),
    ).pvalue
    # A Bonferroni correction divides the significance level by the number of tests.
    if pval < 0.05 / BONFERRONI_CORRECTION:
        ax.set_title(f"p-value: {pval:.4f}, failed")
    else:
        ax.set_title(f"p-value: {pval:.4f}, passed")
```

The output is as follows:
The ARM lint test is still failing due to the test_find_binning_thresholds_small_regular_data assertion error, but I can't reproduce it locally.
ogrisel
left a comment
Modified the tests to use a larger number of samples; otherwise the expected results are not attained (i.e., the R² threshold etc. is not reached). I was expecting this to go the other way, so I'm not sure whether I have uncovered another bug.
This is annoying, but it's probably caused by the fact that we switched from np.percentile(..., method="linear") to np.percentile(..., method="averaged_inverted_cdf") in the sample_weight=None case. Assuming this change is really the cause and since it's necessary to fix the weight/repetition equivalence, we might consider this behavior change as a bug fix. It should be clearly documented as such in the changelog, possibly with a dedicated note in the "Changed models" section of the release notes to better warn the users about this change.
Another concern is that the _averaged_weighted_percentile implementation is currently very naive and will cause a performance regression whenever a user passes a non-None sample_weight. We might want to postpone the final merge of this PR until after the review and merge of #30945.
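For context, here is a minimal sketch (not code from the PR) of how the two `np.percentile` methods mentioned above can disagree on the same unweighted data, which is the kind of shift that can move bin edges and break previously tuned test thresholds:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

# "linear" interpolates between order statistics.
lin = np.percentile(x, 25, method="linear")

# "averaged_inverted_cdf" averages the two inverted-CDF conventions when the
# requested quantile falls exactly on a sample boundary, and is the estimator
# that generalizes naturally to repeated/weighted samples.
avg = np.percentile(x, 25, method="averaged_inverted_cdf")

print(lin, avg)  # 1.75 1.5 — the two estimators disagree on this input
```

Even this small discrepancy is enough to change fitted bin thresholds, which is consistent with the test failures described above.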
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@antoinebaker can you see if you can reproduce the
ogrisel
left a comment
The failure is probably caused by the fact that the test is not fully deterministic:
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@antoinebaker this is ready for another pass of review.
antoinebaker
left a comment
Thanks @snath-xoc, besides a few nitpicks, LGTM!
Co-authored-by: antoinebaker <antoinebaker@users.noreply.github.com>
ogrisel
left a comment
@snath-xoc Since this change only impacts estimators from the sklearn.ensemble submodule, the changelog entry should be moved to the doc/whats_new/upcoming_changes/sklearn.ensemble folder. The changed-models folder should be reserved for library-wide changes that impact estimators scattered across many modules.
Besides this and the following comment, LGTM.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
…ted_equivalence is high-powered to detect subtle bugs
ogrisel
left a comment
A few more refinements to better address the feedback in @lorentzenchr's review.
```python
# statistically equivalent to passing unit weights.
subset = rng.choice(
    X.shape[0], self.subsample, p=subsampling_probabilities, replace=True
)
```
I checked that the number of iterations in test_subsampled_weighted_vs_repeated_equivalence is large enough to detect the statistical discrepancy if we either set replace=False (easy) or with replace=sample_weight is not None (more subtle statistical bug).
When I use replace=sample_weight is not None on my side, test_subsampled_weighted_vs_repeated_equivalence fails as well, so it seems good enough at detecting the bug (from what I understand, we would expect replace=sample_weight is not None to fail, unless I misunderstood).
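As a side illustration (not code from the PR), the equivalence the test relies on can be seen with a plain sampling experiment: drawing with replacement using probabilities proportional to the weights matches uniform draws from the materialized repeated dataset, whereas sampling without replacement would not:

```python
import numpy as np

rng = np.random.default_rng(0)
values = np.arange(10.0)
sw = rng.integers(1, 5, size=10)  # integer sample weights

# Weighted subsampling with replacement, probabilities proportional to weights.
weighted = rng.choice(values, size=100_000, p=sw / sw.sum(), replace=True)

# Equivalent scheme: materialize the repetitions, then sample uniformly.
repeated = rng.choice(np.repeat(values, sw), size=100_000, replace=True)

# Both empirical means estimate the same weighted mean of `values`.
print(weighted.mean(), repeated.mean())
```

With replace=False the first scheme could draw each distinct row at most once regardless of its weight, which is exactly the statistical discrepancy the test is designed to catch.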
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@ogrisel this is ready, it seems; I suppose we are waiting for @lorentzenchr?
Since this fixes the case with sample weights, and makes the
PR: scikit-learn#29641 Issue: scikit-learn#29640 Base commit: 57aa064 Changed lines: 233

Fixes #29640, #27117
Towards #16298
Passes sample_weight through BinMapper and on to _find_binning_thresholds, where bin midpoints are calculated using a weighted percentile. Sample weights are also passed to the rng.choice() subsampling in BinMapper when the number of samples exceeds 2e5.
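A rough sketch of the weighted-percentile binning idea (the helper below is hypothetical and simplified; the actual implementation lives in scikit-learn's private utilities and differs in detail):

```python
import numpy as np

def weighted_percentile(values, weights, q):
    # Hypothetical inverted-CDF weighted percentile: return the smallest value
    # whose cumulative weight fraction reaches q / 100.
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cdf = np.cumsum(w) / np.sum(w)
    return v[np.searchsorted(cdf, q / 100.0)]

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
sw = rng.integers(1, 4, size=1000).astype(float)

n_bins = 6
# Interior percentiles only: n_bins - 1 thresholds separate n_bins bins.
percentiles = np.linspace(0, 100, n_bins + 1)[1:-1]
edges = np.array([weighted_percentile(x, sw, p) for p in percentiles])
```

With unit weights this reduces to an ordinary inverted-CDF percentile, which is what makes the weighted and repeated code paths comparable.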
NOTE: when n_bins is smaller than the number of discrete values, the best workaround was to set the bins at the midpoints. In the future it may be worth removing this altogether, though at the risk of getting an inhomogeneous array from weighted_percentile. We will need to agree on the best method of trimming.
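The midpoint workaround mentioned above can be sketched as follows (illustrative only, not the PR's code):

```python
import numpy as np

# For a feature with few distinct values, place thresholds at the midpoints
# between consecutive distinct values instead of at percentile estimates.
x = np.array([0, 0, 1, 1, 1, 3, 7, 7], dtype=float)
distinct = np.unique(x)
thresholds = (distinct[:-1] + distinct[1:]) / 2
print(thresholds)  # midpoints: 0.5, 2.0, 5.0
```

This keeps every distinct value cleanly inside its own bin, at the cost of diverging from the percentile-based edges used in the general case.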
Major changes proposed:
NOTE: this also allows HGBT to pass weighted tests for more than 256 samples (but still less than 2e6)
TO DO:
- Apply the same treatment to KBinsDiscretizer (#29906).
- Conduct an analysis over many different random_state values to check that the mean bin edges match.