FIX Draw indices using sample_weight in Random Forests#31529

Merged
ogrisel merged 61 commits intoscikit-learn:mainfrom
antoinebaker:random_forest_sample_weight
Jan 16, 2026

Conversation

@antoinebaker
Contributor

@antoinebaker antoinebaker commented Jun 12, 2025

Part of #16298. Similar to #31414 (Bagging estimators) but for Forest estimators.

Also fixes #28507.

What does this implement/fix? Explain your changes.

When subsampling is activated (bootstrap=True), sample_weight is now used as probabilities to draw the bootstrap indices. Forest estimators then pass the statistical repeated/weighted equivalence test.
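A minimal sketch of the idea (the helper name is hypothetical; the actual implementation lives in the forest fitting code): the weights are normalized into sampling probabilities and the bootstrap indices are drawn from them, instead of passing the weights down to the tree fit.

```python
import numpy as np

def draw_bootstrap_indices(sample_weight, n_draws, rng):
    # Normalize weights into sampling probabilities and draw the
    # bootstrap indices from them.
    p = sample_weight / sample_weight.sum()
    return rng.choice(len(sample_weight), size=n_draws, replace=True, p=p)

rng = np.random.default_rng(0)
sw = np.array([1.0, 2.0, 1.0, 0.0])
idx = draw_bootstrap_indices(sw, n_draws=8, rng=rng)
# a zero-weight sample can never be drawn
assert 3 not in idx
```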

Comments

This PR does not fix Forest estimators when bootstrap=False (no subsampling). In that case, sample_weight is still passed directly to the decision trees, and Forest estimators fail the statistical repeated/weighted equivalence test because the individual trees also fail it (probably because of tied splits in decision trees, #23728).

TODO

@github-actions

github-actions bot commented Jun 12, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: d47066f. Link to the linter CI: here

@antoinebaker
Contributor Author

The forest estimators now pass the statistical repeated/weighted equivalence test.
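The repeated/weighted equivalence property referred to throughout this thread: fitting with an integer sample_weight should behave like fitting on a dataset where each row is repeated sample_weight times. A minimal illustration with a weighted mean (the forest tests check the same property statistically over many seeds):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
sw = np.array([2.0, 1.0, 3.0])  # integer-valued weights

# weighted "fit"
weighted_mean = np.average(x, weights=sw)

# equivalent "repeated" fit: row i appears sw[i] times
x_rep = np.repeat(x, sw.astype(int))
repeated_mean = x_rep.mean()

assert np.isclose(weighted_mean, repeated_mean)
```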

@antoinebaker
Contributor Author

Relative (float) max_samples, with the new meaning of drawing max_samples * sw_sum indices as done in #31414, also passes the statistical repeated/weighted equivalence test.
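A sketch of the new float semantics (helper name hypothetical, exact rounding in the PR may differ): the number of draws scales with the total weight sw_sum rather than with n_samples, which is what makes repeating a row and doubling its weight equivalent.

```python
import numpy as np

def n_bootstrap_draws(sample_weight, max_samples):
    # Float max_samples is interpreted relative to the total weight,
    # not the number of rows.
    sw_sum = float(np.sum(sample_weight))
    return int(round(max_samples * sw_sum))

sw = np.array([1.0, 2.0, 1.0])  # 3 rows, total weight 4
assert n_bootstrap_draws(sw, max_samples=1.0) == 4
assert n_bootstrap_draws(sw, max_samples=0.5) == 2
```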

@antoinebaker
Contributor Author

The class_weight="balanced" option, which now takes sample_weight into account as in #30057, also passes the statistical repeated/weighted equivalence test.
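A hedged sketch of what "balanced" weights that account for sample_weight look like (the helper name is hypothetical): each class is reweighted by the inverse of its weighted frequency, the weighted analogue of scikit-learn's n_samples / (n_classes * bincount(y)) formula.

```python
import numpy as np

def balanced_class_weight(y, sample_weight):
    # "balanced" weights computed from *weighted* class frequencies:
    # sw_sum / (n_classes * weighted_count[c]).
    classes, inv = np.unique(y, return_inverse=True)
    weighted_counts = np.bincount(inv, weights=sample_weight)
    return sample_weight.sum() / (len(classes) * weighted_counts)

y = np.array([0, 0, 1, 1])
sw = np.array([3.0, 1.0, 1.0, 1.0])
w = balanced_class_weight(y, sw)
# class 0: weighted count 4, class 1: 2, total weight 6
assert np.allclose(w, [6 / 8, 6 / 4])
```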

@antoinebaker
Contributor Author

The class_weight="balanced_subsample" option also passes. In that case, sample_weight is used to draw the indices; the class weights are then computed on the bootstrapped sample for every grown tree and passed as sample_weight to the tree fit.
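The per-tree flow described above can be sketched as follows (hypothetical helper, pure numpy): draw indices with probability proportional to sample_weight, then balance the classes on the bootstrapped sample.

```python
import numpy as np

def per_tree_sample_weight(y, sample_weight, rng):
    # 1) draw bootstrap indices with probability proportional to
    #    sample_weight,
    # 2) compute balanced class weights on the bootstrapped sample,
    # 3) return per-sample weights to pass to the tree fit.
    n = len(y)
    p = sample_weight / sample_weight.sum()
    idx = rng.choice(n, size=n, replace=True, p=p)
    classes, inv = np.unique(y[idx], return_inverse=True)
    counts = np.bincount(inv)
    class_w = len(idx) / (len(classes) * counts)
    return idx, class_w[inv]

rng = np.random.default_rng(0)
y = np.array([0, 0, 0, 1])
idx, tree_sw = per_tree_sample_weight(y, np.ones(4), rng)
# within the bootstrap, each class ends up with equal total weight
sums = np.array([tree_sw[y[idx] == c].sum() for c in np.unique(y[idx])])
assert np.allclose(sums, sums[0])
```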

@antoinebaker antoinebaker marked this pull request as ready for review June 27, 2025 10:14
Member

@ogrisel ogrisel left a comment


I haven't had the time to finish my review today, but this looks great: I tried running the notebook of https://github.com/snath-xoc/sample-weight-audit-nondet/ against this branch, and I confirm the statistical tests pass for RandomForestClassifier/Regressor and ExtraTreesClassifier/Regressor.

Member

@ogrisel ogrisel left a comment


More feedback.

antoinebaker and others added 2 commits July 8, 2025 17:10
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Member

@adam2392 adam2392 left a comment


Did a light pass. Mostly LGTM! Great work @antoinebaker

@lucyleeow
Member

@ogrisel do you want to weigh in on the max_sample behaviour: #28507

Not sure whether we should allow float to be >1 as well here, or deal with both float and int in a separate PR?

If we do change behaviour here, it may be worth mentioning it in the changelog. Note that you can only have one bullet point per changelog file, but you can have any number of nested bullet points.

@adrinjalali adrinjalali moved this from In progress to In progress - High Priority in Labs Jan 6, 2026
@ogrisel
Member

ogrisel commented Jan 6, 2026

I am fine with allowing max_samples > 1.0 (when float) but let's do that in a separate PR to make it easier to review (with a dedicated changelog entry and tests).

EDIT: I changed my mind after re-reading the discussion, see below.

Member

@ogrisel ogrisel left a comment


I did another pass, and LGTM.

@ogrisel
Member

ogrisel commented Jan 6, 2026

Actually, we probably need to add support for max_samples > 1.0 when float as part of this PR, to preserve consistency with max_samples > n_samples when passed as an integer.

At the moment, when calling:

RandomForestClassifier(max_samples=1.1).fit(X, y)

we get:

InvalidParameterError: The 'max_samples' parameter of RandomForestClassifier must be None, a float in the range (0.0, 1.0] or an int in the range [1, inf). Got 1.1 instead.

but the following does not raise anymore:

RandomForestClassifier(max_samples=int(1.1 * X.shape[0])).fit(X, y)

Which is weird, I agree. Let's make that behavior change consistent, and explicitly document and test it as part of this PR.

@lucyleeow
Member

Let's also mark as fixing #28507 !

@antoinebaker
Contributor Author

Thanks for the reviews @lucyleeow @adam2392 @ogrisel. max_samples > 1.0 is now supported and tested in test_max_samples_geq_one.

Member

@ogrisel ogrisel left a comment


Thanks @antoinebaker. LGTM besides the following nits.

antoinebaker and others added 3 commits January 12, 2026 15:52
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@ogrisel
Member

ogrisel commented Jan 14, 2026

@lucyleeow @adam2392 @cakedev0 ok for merge?

Contributor

@cakedev0 cakedev0 left a comment


I haven't reviewed the latest changes, but I read the discussions and commit messages and it looks good 👍

And for the rest, all good for me.

Member

@lucyleeow lucyleeow left a comment


Nits and a question, but LGTM!

sample_weight : array of shape (n_samples,) or None
Sample weights. The frequency semantics of :term:`sample_weight` is
guaranteed when `max_samples` is a float or integer, but not when
`max_samples` is None.
Member


Can we add back the line about "the effective bootstrap size is no longer guaranteed to be equivalent."?

Contributor Author


I put it in a dedicated Notes section, as it was getting a bit lengthy.

antoinebaker and others added 3 commits January 16, 2026 09:37
@ogrisel ogrisel merged commit ce1b377 into scikit-learn:main Jan 16, 2026
38 checks passed
@github-project-automation github-project-automation bot moved this from In progress - High Priority to Done in Labs Jan 16, 2026
@ogrisel
Member

ogrisel commented Jan 16, 2026

Merged! Thanks all for your work on getting this in!

dschult pushed a commit to dschult/scikit-learn that referenced this pull request Jan 25, 2026
…31529)

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Lucy Liu <jliu176@gmail.com>
Co-authored-by: Arthur Lacote <arthur.lcte@gmail.com>
TejasAnalyst pushed a commit to TejasAnalyst/scikit-learn that referenced this pull request Feb 10, 2026
…31529)

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Lucy Liu <jliu176@gmail.com>
Co-authored-by: Arthur Lacote <arthur.lcte@gmail.com>


Development

Successfully merging this pull request may close these issues.

Allow RandomForest* and ExtraTrees* to have a higher max_samples than 1.0 when bootstrap=True

7 participants