Added sample weight handling to BinMapper under HGBT#29641
adrinjalali merged 96 commits into scikit-learn:main from
Conversation
I have not analysed the test failures yet, but here is some early feedback.
For the case n_samples > subsample, we would need to conduct a statistical analysis to check that the repeated/reweighted equivalence holds for the binning procedure in expectation.
Once the tests pass for the deterministic case, we should conduct such an analysis (e.g. in a side notebook, not included in the repo, where we rerun the binning for many different values of random_state and then check that the mean bin edges match).
Please add a TODO item to the description of this PR so we do not forget about this.
sklearn/ensemble/_hist_gradient_boosting/tests/test_gradient_boosting.py
…oosting.py Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Statistical tests were conducted for n_samples > subsample (subsample=int(2e5)) using the following code:

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import kstest

from sklearn.ensemble._hist_gradient_boosting.binning import _BinMapper

BONFERRONI_CORRECTION = 1  # to adjust later according to the agreed test dimension

rng = np.random.RandomState(0)
n_samples = int(3e5)
X = rng.randint(0, 30, size=(n_samples, 3))

# Use random integers (including zero) as weights.
sw = rng.randint(0, 5, size=n_samples)
X_repeated = np.repeat(X, sw, axis=0)
assert len(X_repeated) > int(2e5)

bin_thresholds_weighted = []
bin_thresholds_repeated = []
for seed in range(100):
    est_weighted = _BinMapper(n_bins=6, random_state=seed).fit(X, sample_weight=sw)
    est_repeated = _BinMapper(n_bins=6, random_state=seed + 500).fit(
        X_repeated, sample_weight=None
    )
    bin_thresholds_weighted.append(est_weighted.bin_thresholds_)
    bin_thresholds_repeated.append(est_repeated.bin_thresholds_)

bin_thresholds_weighted = np.asarray(bin_thresholds_weighted)
bin_thresholds_repeated = np.asarray(bin_thresholds_repeated)

# One subplot per (feature, threshold) pair: compare the distributions of the
# weighted and repeated bin thresholds across seeds with a two-sample KS test.
fig, axs = plt.subplots(3, 4, figsize=(14, 12))
j = 0
for i, ax in enumerate(axs.flatten()):
    if i > 0 and i % 4 == 0:
        j += 1
    ax.hist(bin_thresholds_weighted[:, j, i % 4].flatten())
    ax.hist(bin_thresholds_repeated[:, j, i % 4].flatten(), alpha=0.5)
    pval = kstest(
        bin_thresholds_weighted[:, j, i % 4].flatten(),
        bin_thresholds_repeated[:, j, i % 4].flatten(),
    ).pvalue
    # A Bonferroni correction divides the significance level by the number of tests.
    if pval < 0.05 / BONFERRONI_CORRECTION:
        ax.set_title(f"p-value: {pval:.4f}, failed")
    else:
        ax.set_title(f"p-value: {pval:.4f}, passed")
```

The output is as follows:
The ARM lint test is still failing due to the test_find_binning_thresholds_small_regular_data assertion error, but I can't reproduce it locally.
ogrisel
left a comment
Modified the tests to use a larger number of samples; otherwise the expected results are not attained (i.e., the R² threshold etc. is not reached). I was expecting this to go the other way, so I'm not sure whether I have uncovered another bug.
This is annoying, but it's probably caused by the fact that we switched from np.percentile(..., method="linear") to np.percentile(..., method="averaged_inverted_cdf") in the sample_weight=None case. Assuming this change is really the cause and since it's necessary to fix the weight/repetition equivalence, we might consider this behavior change as a bug fix. It should be clearly documented as such in the changelog, possibly with a dedicated note in the "Changed models" section of the release notes to better warn the users about this change.
Another concern is that the _averaged_weighted_percentile implementation is currently very naive and will cause a performance regression whenever a user passes a non-None sample_weight. We might want to postpone the final merge of this PR until after the review and merge of #30945.
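For context, here is a minimal sketch (not code from the PR) of how the two `np.percentile` methods mentioned above can disagree on the same unweighted data, which is the kind of shift that can move bin edges and break previously tuned test thresholds:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

# "linear" interpolates between order statistics.
lin = np.percentile(x, 25, method="linear")

# "averaged_inverted_cdf" averages the two inverted-CDF conventions when the
# requested quantile falls exactly on a sample boundary, and is the estimator
# that generalizes naturally to repeated/weighted samples.
avg = np.percentile(x, 25, method="averaged_inverted_cdf")

print(lin, avg)  # 1.75 1.5 — the two estimators disagree on this input
```

Even this small discrepancy is enough to change fitted bin thresholds, which is consistent with the test failures described above.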
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@antoinebaker can you see if you can reproduce the
ogrisel
left a comment
The failure is probably caused by the fact that the test is not fully deterministic:
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@antoinebaker this is ready for another pass of review.
antoinebaker
left a comment
Thanks @snath-xoc, besides a few nitpicks, LGTM!
Co-authored-by: antoinebaker <antoinebaker@users.noreply.github.com>
ogrisel
left a comment
@snath-xoc Since this change only impacts estimators from the sklearn.ensemble submodule, the changelog entry should be moved to the doc/whats_new/upcoming_changes/sklearn.ensemble folder. The changed-models folder should be reserved for library-wide changes that impact estimators scattered across many modules.
Besides this and the following comment, LGTM.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
…ted_equivalence is high-powered to detect subtle bugs
ogrisel
left a comment
A few more refinements to better address the feedback in @lorentzenchr's review.
```python
# statistically equivalent to passing unit weights.
subset = rng.choice(
    X.shape[0], self.subsample, p=subsampling_probabilities, replace=True
)
```
I checked that the number of iterations in test_subsampled_weighted_vs_repeated_equivalence is large enough to detect the statistical discrepancy if we either set replace=False (easy) or with replace=sample_weight is not None (more subtle statistical bug).
When I use replace=sample_weight is not None on my side, test_subsampled_weighted_vs_repeated_equivalence fails as well, so it seems good enough at detecting the bug (from what I understand, we would expect replace=sample_weight is not None to fail, unless I misunderstood).
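As a side illustration (not code from the PR), the equivalence the test relies on can be seen with a plain sampling experiment: drawing with replacement using probabilities proportional to the weights matches uniform draws from the materialized repeated dataset, whereas sampling without replacement would not:

```python
import numpy as np

rng = np.random.default_rng(0)
values = np.arange(10.0)
sw = rng.integers(1, 5, size=10)  # integer sample weights

# Weighted subsampling with replacement, probabilities proportional to weights.
weighted = rng.choice(values, size=100_000, p=sw / sw.sum(), replace=True)

# Equivalent scheme: materialize the repetitions, then sample uniformly.
repeated = rng.choice(np.repeat(values, sw), size=100_000, replace=True)

# Both empirical means estimate the same weighted mean of `values`.
print(weighted.mean(), repeated.mean())
```

With replace=False the first scheme could draw each distinct row at most once regardless of its weight, which is exactly the statistical discrepancy the test is designed to catch.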
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@ogrisel this is ready, it seems; I suppose we are waiting for @lorentzenchr?
Since this fixes the case with sample weights, and makes the
PR: scikit-learn#29641 Issue: scikit-learn#29640 Base commit: 57aa064 Changed lines: 233

Fixes #29640, #27117
Towards #16298
Passes sample_weight through BinMapper and on to _find_binning_thresholds, where bin midpoints are calculated using a weighted percentile. Sample weights are also passed to the rng.choice() subsampling in BinMapper when the number of samples exceeds 2e5.
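A rough sketch of the weighted-percentile binning idea (the helper below is hypothetical and simplified; the actual implementation lives in scikit-learn's private utilities and differs in detail):

```python
import numpy as np

def weighted_percentile(values, weights, q):
    # Hypothetical inverted-CDF weighted percentile: return the smallest value
    # whose cumulative weight fraction reaches q / 100.
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cdf = np.cumsum(w) / np.sum(w)
    return v[np.searchsorted(cdf, q / 100.0)]

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
sw = rng.integers(1, 4, size=1000).astype(float)

n_bins = 6
# Interior percentiles only: n_bins - 1 thresholds separate n_bins bins.
percentiles = np.linspace(0, 100, n_bins + 1)[1:-1]
edges = np.array([weighted_percentile(x, sw, p) for p in percentiles])
```

With unit weights this reduces to an ordinary inverted-CDF percentile, which is what makes the weighted and repeated code paths comparable.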
NOTE: when n_bins is smaller than the number of discrete values, the best workaround was to set the bins at the midpoints. In the future it may be worth removing this altogether, though at the risk of getting an inhomogeneous array from weighted_percentile. We will need to agree on the best method of trimming.
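The midpoint workaround mentioned above can be sketched as follows (illustrative only, not the PR's code):

```python
import numpy as np

# For a feature with few distinct values, place thresholds at the midpoints
# between consecutive distinct values instead of at percentile estimates.
x = np.array([0, 0, 1, 1, 1, 3, 7, 7], dtype=float)
distinct = np.unique(x)
thresholds = (distinct[:-1] + distinct[1:]) / 2
print(thresholds)  # midpoints: 0.5, 2.0, 5.0
```

This keeps every distinct value cleanly inside its own bin, at the cost of diverging from the percentile-based edges used in the general case.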
Major changes proposed:
NOTE: this also allows HGBT to pass weighted tests for more than 256 samples (but still less than 2e6)
TO DO:
- Apply the same treatment to KBinsDiscretizer (#29906).
- Conduct an analysis over many different random_state values to check that the mean bin edges match.