Skip to content

ENH Adding estimators_samples_ attribute to forest models#26736

Merged
glemaitre merged 51 commits intoscikit-learn:mainfrom
neurodata:est_samples
Nov 3, 2023
Merged

ENH Adding estimators_samples_ attribute to forest models#26736
glemaitre merged 51 commits intoscikit-learn:mainfrom
neurodata:est_samples

Conversation

@adam2392
Copy link
Copy Markdown
Member

@adam2392 adam2392 commented Jun 30, 2023

Reference Issues/PRs

Fixes: #26716

What does this implement/fix? Explain your changes.

  • Adds the estimators_samples_ property to the BaseForest class, which allows any forest-method to recompute the indices of the training samples per tree in the forest

Any other comments?

Note this is essentially very similar to Bagging* except we only care about the samples considered rather than samples and feature indices.

This is useful for example for potentially allowing the trees to get closer to enabling things like i) honesty and ii) analysis of the trained samples used for the tree.

Signed-off-by: Adam Li <adam2392@gmail.com>
@github-actions
Copy link
Copy Markdown

github-actions bot commented Jun 30, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: ab3be64. Link to the linter CI: here

adam2392 added 3 commits June 30, 2023 12:14
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
@adam2392 adam2392 marked this pull request as ready for review June 30, 2023 19:21
@adam2392
Copy link
Copy Markdown
Member Author

Kay this is ready for review!

Copy link
Copy Markdown
Contributor

@OmarManzoor OmarManzoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @adam2392!

@adam2392 adam2392 requested a review from OmarManzoor August 11, 2023 14:15
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
@adam2392 adam2392 requested a review from OmarManzoor August 11, 2023 20:36
Copy link
Copy Markdown
Contributor

@OmarManzoor OmarManzoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates @adam2392. A few additional comments.

adam2392 and others added 3 commits August 15, 2023 10:45
Co-authored-by: Omar Salman <omar.salman@arbisoft.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
@ogrisel
Copy link
Copy Markdown
Member

ogrisel commented Oct 20, 2023

This attribute is already public on Bagging* models. I don't really see the point in implementing something private that we don't use internally in scikit-learn.

The goal of this attribute is to make it easier to inspect the result of the learning process via public API.

+1 for exposing it as a public attribute.

@ogrisel
Copy link
Copy Markdown
Member

ogrisel commented Oct 20, 2023

@adam2392 maybe you can clarify what kind of use you make of this attribute?

@adam2392
Copy link
Copy Markdown
Member Author

@adam2392 maybe you can clarify what kind of use you make of this attribute?

@ogrisel I plan on using this attribute for inspecting the training/testing samples used in fitting each individual tree. Right now, this is really hard to do as the fitting the forest process is entirely internal.

More broadly, this would allow someone to implement "honest trees" (ref: #19710) very easily.

  • fit each tree/forest
  • for each tree, get out-of-bag samples using the estimators_indices_ property and use those to re-fit the leaves
  • each tree is now "honest"

adam2392 and others added 2 commits October 24, 2023 09:16
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@adam2392
Copy link
Copy Markdown
Member Author

Hi I resolved the few docstring comments, but left #26736 (comment) and #26736 (comment) as those seem like they would require changing the behavior in Bagging* too.

Happy to do so though if you think it's warranted.

@adam2392 adam2392 requested a review from ogrisel October 24, 2023 13:19
@adam2392
Copy link
Copy Markdown
Member Author

Hi @ogrisel just wanted to gently follow-up here to prevent it getting lost.

Is there anything related to the current feature that you think should be still changed?

And a follow-up, do you think the deprecation and renaming of the estimators_samples_ -> training_samples_ API is warranted for a new GH issue for both Bagging* and Forest*?

@glemaitre glemaitre self-requested a review November 3, 2023 13:44
@glemaitre glemaitre changed the title [ENH] Adding estimators_samples_ for forest models ENH Adding estimators_samples_ attribute to forest models Nov 3, 2023
Copy link
Copy Markdown
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM on my side.

Copy link
Copy Markdown
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM on my side.

adam2392 and others added 3 commits November 3, 2023 12:40
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>

assert isinstance(estimator_samples, list)
assert len(estimators_samples) == len(estimators)
assert estimators_samples[0].dtype == np.int32
Copy link
Copy Markdown
Member Author

@adam2392 adam2392 Nov 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch @glemaitre . One part of the if statement led to it being np.int64 instead of np.int32

@adam2392
Copy link
Copy Markdown
Member Author

adam2392 commented Nov 3, 2023

Otherwise LGTM on my side.

Thanks for the review! I've addressed the comments left by you. Lmk if there's anything I missed.

@glemaitre glemaitre self-requested a review November 3, 2023 18:29
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Copy link
Copy Markdown
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I will merge once the doc build passed.

@glemaitre glemaitre merged commit 3737909 into scikit-learn:main Nov 3, 2023
@glemaitre
Copy link
Copy Markdown
Member

The failures were transient due to GitHub.

REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023
…rn#26736)

Signed-off-by: Adam Li <adam2392@gmail.com>
Co-authored-by: Omar Salman <omar.salman@arbisoft.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

module:ensemble Waiting for Second Reviewer First reviewer is done, need a second one!

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Adding estimators_samples_ property to ensemble of tree methods such as RandomForest and ExtraTrees

5 participants