[WIP] Balanced Random Forest by massich · Pull Request #8732 · scikit-learn/scikit-learn

massich · 2017-04-12T13:00:07Z

Reference Issue

Fixes #8607

What does this implement/fix? Explain your changes.

This PR takes over #5181 (and #8728).

Tasks to be performed

Balancing
- tree level
- [ ] node level

Complement the original BRF benchmark with the highly imbalanced datasets present in scikit-learn-contrib/imbalanced-learn.
Replace the balanced=True flag that triggers the BRF.

MechCoder · 2017-04-20T02:26:25Z

Can you provide a summary of what exactly is left to do in the PR description? Thanks!

potash · 2017-05-17T20:58:07Z

@massich check out my branch feature/balanced-random-forest-api. The changes are:

Following the discussion among @glemaitre, @arjoly and @amueller in "bootstrapping based on sample weights in random forests" (#8607), I removed the ad-hoc support for multi-output balanced random forests; attempting it now raises an error.
Added unit tests for the two BRF helper methods in test_balanced_random_forest.py. It wasn't obvious to me which of the existing test files they belong in, so feel free to move them.
Changed the API to class_weight="balanced_bootstrap", as discussed in "bootstrapping based on sample weights in random forests" (#8607).
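For readers unfamiliar with the idea, a balanced bootstrap can be sketched as follows. This is only an illustration under assumptions, not the code in this branch: the helper name `balanced_bootstrap_indices` is hypothetical, and the sketch assumes that each tree draws, with replacement, as many rows from every class as the minority class contains.

```python
import random
from collections import Counter, defaultdict


def balanced_bootstrap_indices(y, rng):
    """Hypothetical helper: draw a class-balanced bootstrap sample of row indices."""
    # Group row indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(y):
        by_class[label].append(index)

    # Assumption: every class contributes as many draws as the minority class has rows.
    n_per_class = min(len(indices) for indices in by_class.values())

    sample = []
    for label in sorted(by_class):
        # Sampling with replacement, as in an ordinary bootstrap.
        sample.extend(rng.choices(by_class[label], k=n_per_class))
    return sample


# Toy labels: 95 majority rows and 5 minority rows.
y = [0] * 95 + [1] * 5
rng = random.Random(0)
indices = balanced_bootstrap_indices(y, rng)
print(Counter(y[i] for i in indices))
```

With the toy labels above, each class contributes five indices to the sample, so a tree trained on those rows would see a balanced class distribution.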

Please let me know what is left to get this merged.

massich · 2017-05-18T08:57:12Z

@potash I am benchmarking the estimator here. My idea for the benchmark is:

Using sklearn datasets:
- Create a synthetic dataset and go from balanced to highly unbalanced to see when BRF is beneficial
- Repeat the experiment with the breast cancer dataset in scikit-learn.
Using imbalanced-learn:
- Test against their selection of imbalanced datasets
Using openML:
- Explore some imbalanced datasets
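The first step of the plan, going from a balanced to a highly imbalanced synthetic dataset, could be sketched roughly as below. The function name `imbalance_sweep`, the minority fractions, and the dataset sizes are all assumptions chosen for illustration; only `make_classification` and its `weights` parameter come from scikit-learn.

```python
from collections import Counter

from sklearn.datasets import make_classification


def imbalance_sweep(minority_fractions, n_samples=2000, random_state=0):
    """Hypothetical helper: one binary dataset per requested minority fraction."""
    datasets = {}
    for frac in minority_fractions:
        X, y = make_classification(
            n_samples=n_samples,
            n_features=10,
            n_informative=5,
            weights=[1.0 - frac, frac],  # class proportions: majority, minority
            flip_y=0.0,  # no label noise, so the proportions stay close to the request
            random_state=random_state,
        )
        datasets[frac] = (X, y)
    return datasets


# From balanced (50%) to highly imbalanced (1% minority).
sweep = imbalance_sweep([0.5, 0.1, 0.01])
for frac, (X, y) in sweep.items():
    print(frac, Counter(y.tolist()))
```

Each dataset in the sweep can then be fed to the estimators under comparison, which makes it possible to see at which imbalance level the balanced variant starts to help.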

potash · 2017-05-18T16:45:14Z

Sounds good. You'll want to merge feature/balanced-random-forest-api so you can work off the new api (class_weight="balanced_bootstrap") and merge brf-example as it's been updated there too. Let me know if I can help with the examples.

amueller · 2017-05-18T18:27:30Z

There are some benchmarks here on a real dataset, and also a silly implementation of the feature using imblearn: https://github.com/amueller/applied_ml_spring_2017/blob/master/slides/aml-15-resampling-imbalanced-data.ipynb
You can see around Out[83] that this method is doing much better than any of the others.

geneorama · 2017-11-21T22:25:11Z

Hello there, is it possible to get an update on this? We're using this model in production (https://github.com/Chicago/lead-model), and as we prepare to go live it would be very helpful for deployment if this branch were in the standard scikit-learn library.

Thanks for all the great work here!

Also, let us know if there's something we can do to move this forward.

amueller · 2017-11-21T22:36:41Z

This needs tests, documentation and examples. I'm a big fan of this method, so I'd be happy to see this moved forward. @massich are you still working on it? Would you like some help?
I liked using the mammography dataset: https://www.openml.org/d/310, see #9908 for a loader ;)

glemaitre · 2017-11-21T22:42:35Z

In the meantime, we have the BalancedBaggingClassifier, which can be turned into a balanced random forest by setting max_features='auto', if I am not wrong.

amueller · 2017-11-21T22:47:48Z

@glemaitre I believe you are right.

massich · 2017-11-22T13:27:17Z

Actually, it completely stalled. I did not even finish the benchmark. I was playing with openml but I didn't finish it. It has been sitting for 6 months.

We should definitely revive it.

chkoar · 2018-01-05T14:29:21Z

@massich what is the current status of this PR? Do you need a hand? According to a previous comment of @amueller this PR needs love, tests, documentation and examples, right?

jnothman · 2018-02-21T21:38:35Z

IMO it would be good if you helped complete this, @chkoar

chkoar · 2018-02-21T22:01:36Z

@jnothman That was the intention. If it is not picked up by anyone else, I will give it a try in a couple of weeks. @massich has already given me write access to his repos.

potash · 2018-02-21T22:04:04Z

@chkoar let me know if there's anything I (original author of the feature) can do to help. Would be very happy to see this merged.

chkoar · 2019-02-18T16:45:03Z

@potash ok, thanks. Let's hope that it will be merged during the upcoming sprint.

jnothman · 2019-02-19T00:46:11Z

I think you should expect a little less. But let's hope it will be a lot closer to merge after the sprint.

massich · 2019-02-24T18:58:59Z

closing in favor of #13227. Thx @chkoar for taking over.

potash and others added 17 commits April 11, 2017 17:55

initial balanced commit

027207c

fix default value, comment encoding

495f4f5

remove debug

8a07510

cache balance_data

514c1c6

subsetting data is more efficient

8be7077

fix sample_weight when balanced

14f2789

fix sample_weights

46527e1

balanced random forest example

8e5d5d0

Raise error for balanced multi-output

5c7cefe

remove brf example

7f8bb88

refactor

07c61f8

multi-output brf

ddae208

fix flake8

e01667e

xrange -> range

9dd2907

fix flex8

6fd01c7

fix comment line >90 characters

3766acd

fix flex8

7c4054b

massich mentioned this pull request Apr 12, 2017

[WIP] Balanced Random Forest #8728

Closed

raghavrv added the Sprint label Jun 3, 2017

raghavrv self-requested a review June 28, 2017 13:05

glemaitre mentioned this pull request Nov 27, 2017

DOC add comments regarding to make a balanced random forest from a BalancedBaggingClassifier scikit-learn-contrib/imbalanced-learn#372

Closed

amueller mentioned this pull request Dec 12, 2017

Integration and test cases for RandomForest subsampling #9645

Closed

jnothman mentioned this pull request Feb 21, 2018

RandomForest: Different subsampling per tree for faster training #10668

Closed

chkoar mentioned this pull request Feb 22, 2019

Balanced Random Forest #13227

Closed

massich closed this Feb 24, 2019

10 participants

massich commented Apr 12, 2017 •

edited

Loading

potash commented May 17, 2017 •

edited

Loading

amueller commented May 18, 2017 •

edited

Loading

massich commented Feb 24, 2019 •

edited

Loading