FIX Draw indices using sample_weight in Random Forests (#31529)
ogrisel merged 61 commits into scikit-learn:main
Conversation
I haven't the time to finish my review today, but this looks great: I tried running the notebook of https://github.com/snath-xoc/sample-weight-audit-nondet/ against this branch and I confirm the statistical tests pass for RandomForestClassifier/Regressor and ExtraTreesClassifier/Regressor.
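The repeated/weighted equivalence that the audit notebook checks can be illustrated with a small NumPy sketch (names are illustrative, not the audit tool's actual API): drawing indices with probabilities proportional to integer weights should match uniform draws from a dataset whose rows are each repeated `weight` times.

```python
import numpy as np

rng = np.random.default_rng(0)
sample_weight = np.array([3, 1, 2])  # integer weights
n_draws = 60_000

# Weighted draw: indices 0..2 with probability proportional to weight.
p = sample_weight / sample_weight.sum()
weighted = rng.choice(3, size=n_draws, p=p)

# Repeated draw: uniform over [0, 0, 0, 1, 2, 2], mapped back to 0..2.
repeated_rows = np.repeat(np.arange(3), sample_weight)
uniform = repeated_rows[rng.integers(0, len(repeated_rows), size=n_draws)]

# Both empirical frequencies converge to [1/2, 1/6, 1/3].
print(np.bincount(weighted, minlength=3) / n_draws)
print(np.bincount(uniform, minlength=3) / n_draws)
```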
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
adam2392
left a comment
Did a light pass. Mostly LGTM! Great work @antoinebaker
@ogrisel do you want to weigh in on the question "Not sure whether we should allow float to be >1 as well here, or deal with both float and int in a separate PR?" If we do change behaviour here, it may be worth mentioning it in the changelog. Note that you can only have one bullet point per changelog file, but you can have any number of nested bullet points.
I am fine with allowing float values > 1. EDIT: I changed my mind after re-reading the discussion, see below.
ogrisel
left a comment
I did another pass, and LGTM.
Actually, we probably need to add support for this as well. At the moment, when calling `RandomForestClassifier(max_samples=1.1).fit(X, y)` we get an error, but the following does not raise anymore: `RandomForestClassifier(max_samples=int(1.1 * X.shape[0])).fit(X, y)`. Which is weird, I agree. Let's make that behavior change consistent and explicitly documented and tested as part of this PR.
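For reference, the int/float semantics under discussion could be made consistent along these lines. This is a sketch only: the helper name and exact rounding are assumptions, not scikit-learn's actual implementation.

```python
def n_samples_bootstrap(n_samples, max_samples):
    """Hypothetical resolution of max_samples into a bootstrap size."""
    if max_samples is None:
        return n_samples
    if isinstance(max_samples, float):
        # Allowing floats > 1.0 keeps max_samples=1.1 consistent with
        # max_samples=int(1.1 * n_samples).
        return max(round(max_samples * n_samples), 1)
    return max_samples  # int: used as-is, may exceed n_samples

print(n_samples_bootstrap(100, None))  # 100
print(n_samples_bootstrap(100, 1.1))   # 110
print(n_samples_bootstrap(100, 110))   # 110
```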
Let's also mark as fixing #28507!
Thanks for the reviews @lucyleeow @adam2392 @ogrisel.
ogrisel
left a comment
Thanks @antoinebaker. LGTM besides the following nits.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@lucyleeow @adam2392 @cakedev0 ok for merge?
cakedev0
left a comment
I haven't reviewed the latest changes, but I read the discussions and commit messages and it looked good 👍
And for the rest, all good for me.
lucyleeow
left a comment
Nits and a question, but LGTM!
sklearn/ensemble/_bootstrap.py
Outdated
sample_weight : array of shape (n_samples,) or None
    Sample weights. The frequency semantics of :term:`sample_weight` is
    guaranteed when `max_samples` is a float or integer, but not when
    `max_samples` is None.
Can we add back the line about "the effective bootstrap size is no longer guaranteed to be equivalent."?
I put it in a dedicated Notes section, as it was a bit lengthy.
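The caveat in the quoted docstring can be illustrated with a small sketch (an assumption about why the equivalence breaks, consistent with the docstring text): when `max_samples=None` the bootstrap size defaults to `n_samples`, while the equivalent "repeated rows" dataset would be bootstrapped with `sum(sample_weight)` draws, so the effective bootstrap sizes differ.

```python
import numpy as np

sample_weight = np.array([3, 1, 2])

# max_samples=None: the bootstrap size defaults to the number of rows.
n_draws_weighted = len(sample_weight)

# Repeated dataset: row i duplicated sample_weight[i] times, then
# bootstrapped with one draw per (repeated) row.
n_draws_repeated = int(np.repeat(np.arange(3), sample_weight).shape[0])

print(n_draws_weighted, n_draws_repeated)  # 3 vs 6: sizes are not equivalent
```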
Merged! Thanks all for your work on getting this in!
…31529) Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org> Co-authored-by: Lucy Liu <jliu176@gmail.com> Co-authored-by: Arthur Lacote <arthur.lcte@gmail.com>




Part of #16298. Similar to #31414 (Bagging estimators) but for Forest estimators.
Also fixes #28507.
What does this implement/fix? Explain your changes.
When subsampling is activated (`bootstrap=True`), `sample_weight` is now used as probabilities to draw the indices. Forest estimators then pass the statistical repeated/weighted equivalence test.

Comments
This PR does not fix Forest estimators when `bootstrap=False` (no subsampling): `sample_weight` is still passed to the decision trees. Forest estimators then fail the statistical repeated/weighted equivalence test, because the individual trees also fail this test (probably because of tied splits in decision trees, #23728).
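The `bootstrap=True` fix described above can be sketched as follows (illustrative only, not the scikit-learn source; the helper name is made up): each tree's training indices are drawn with replacement using `sample_weight` as draw probabilities, rather than forwarding the weights to the tree builder.

```python
import numpy as np

def draw_bootstrap_indices(sample_weight, n_samples_bootstrap, rng):
    # Normalize weights into probabilities; zero-weight samples are
    # never drawn, matching the "repeated rows" semantics.
    p = sample_weight / sample_weight.sum()
    return rng.choice(len(sample_weight), size=n_samples_bootstrap, p=p)

rng = np.random.default_rng(0)
w = np.array([0.5, 2.0, 1.0, 0.0])
idx = draw_bootstrap_indices(w, 8, rng)
print(idx)  # 8 indices in {0, 1, 2}; index 3 (zero weight) never appears
```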
TODO
- `sample_weight=None` case
- `max_samples`, as done in FIX Draw indices using sample_weight in Bagging (#31414)
- `class_weight="balanced"`, as done in Fix linear svc handling sample weights under class_weight="balanced" (#30057)
- `class_weight="balanced_subsample"`