[MRG+2] Add class_weight support to the forests & trees #3961
GaelVaroquaux merged 10 commits into scikit-learn:master
Conversation
sklearn/ensemble/forest.py
Outdated
Further to my comment about multiplying the outputs' weights for multi-output, this is what I was thinking for replacing #L403-412:
```python
cw = []
for k in range(self.n_outputs_):
    # Per-output class weights, expanded to one weight per sample
    cw_part = compute_class_weight(self.class_weight,
                                   self.classes_[k],
                                   y_org[:, k])
    cw_part = cw_part[np.searchsorted(self.classes_[k],
                                      y_org[:, k])]
    cw.append(cw_part)
# Combine the outputs by multiplying each sample's weights together
cw = np.prod(cw, axis=0, dtype=np.float64)
```
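For example, a sample whose per-output weights come out as 2.0 and 0.5 ends up with a combined weight of 2.0 * 0.5 = 1.0 after the `np.prod` step.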
Another option would be to perform this action at the bootstrap level for 'auto' class_weights in the _parallel_build_trees method. However, this would make 'auto' act differently from user-defined class_weights, as a user-defined dict would be applied the same way regardless of the bootstrap sample.
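For concreteness, a minimal sketch of that bootstrap-level alternative (my illustration, not code from the PR: the `bootstrap_class_weights` helper and its `indices` argument are hypothetical, and the positional `compute_class_weight('auto', ...)` call follows the era-specific style of the snippet above; modern scikit-learn spells this `class_weight='balanced'` with keyword arguments):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

def bootstrap_class_weights(y, indices, classes):
    # Re-derive 'auto' weights from the bootstrap sample itself, so each
    # tree is balanced against its own resampled class frequencies.
    y_boot = y[indices]
    cw = compute_class_weight('auto', classes, y_boot)  # old-style call, as above
    return cw[np.searchsorted(classes, y_boot)]
```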
Added support for multi-output through the presets or through a list of dicts for user-defined weights. Rather than deciding whether to implement the "auto" weights at the full dataset level or within the bootstrapping, I added another option so the user can choose between the two. These presets, as with the user-defined dicts, are multiplied together for multi-output in this implementation.
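A hedged usage sketch of the list-of-dicts form on toy data, to make the multiplication behaviour concrete:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[0.0], [1.0], [2.0], [3.0]])
Y = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # two output columns

clf = RandomForestClassifier(
    n_estimators=10,
    class_weight=[{0: 1.0, 1: 5.0},    # weights for output 0
                  {0: 1.0, 1: 2.0}])   # weights for output 1
clf.fit(X, Y)
# The sample with Y == [1, 1] effectively carries weight 5.0 * 2.0 = 10.0.
```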
Anyone have any thoughts on this so far?
sklearn/ensemble/forest.py
Outdated
Shouldn't this be implemented in the tree? If that gets the auto class weight it can compute it itself, right?
Thanks for your contribution.

I will check it out. The "auto" and user-defined weights could perhaps be computed in the tree, but that would require re-computing the same weights for every tree, which seems like redundant work. I'll undertake a bit of an investigation all the same.
Hi Trevor! First of all, sorry for the lack of feedback on our side. It seems like all tree-huggers are quite busy with other things. Regarding the implementation of class weights, there are basically two different approaches, as I explained some time ago in #2129: either the approach that you chose, exploiting sample weights only, or the one using class weights as priors in the computation of the impurity criteria. It should be checked, but I am not sure both are exactly equivalent. Anyhow, the sample-weight-based implementation is certainly the simplest to implement. So for the record, I am +1 for implementing this algorithm rather than modifying all criteria.
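As a sketch of why the two approaches are closely related (my illustration, not from the thread): weighting every sample of class k by w_k replaces the class count n_k with w_k * n_k inside the node impurity, which is exactly what a class-prior-weighted criterion would use. For Gini:

```python
import numpy as np

def weighted_gini(class_counts, class_weights):
    # Weighted class frequencies p_k proportional to w_k * n_k -- the same
    # quantities a class-prior-weighted criterion would use directly.
    p = np.asarray(class_weights, dtype=float) * np.asarray(class_counts)
    p /= p.sum()
    return 1.0 - np.sum(p ** 2)
```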
I agree. In general, whatever option we have in forests, we also have in single trees. This should be the case as well for this feature, I believe. Note that from an implementation point of view, redundant work can be avoided easily, e.g. by having the forest force class_weight=None in the trees that are built while passing the correctly balanced sample weights.
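A minimal sketch of that division of labour, under the same assumptions as above (hypothetical helper name; old-style positional `compute_class_weight` call):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_class_weight

def fit_trees_with_shared_weights(X, y, n_trees=10):
    classes = np.unique(y)
    cw = compute_class_weight('auto', classes, y)     # computed once, not per tree
    sample_weight = cw[np.searchsorted(classes, y)]   # expanded to per-sample weights
    trees = []
    for i in range(n_trees):
        # Each tree is told class_weight=None and handed the pre-balanced
        # sample weights instead, so no weight computation is repeated.
        tree = DecisionTreeClassifier(class_weight=None, random_state=i)
        trees.append(tree.fit(X, y, sample_weight=sample_weight))
    return trees
```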
Hey Gilles, thanks for the input, and no worries on delays; I understand that everyone here is pretty heavily loaded. I'm going to take a look through the tree code all the same. I think we're on the same page, but the redundant work I mentioned was just that the same class weights would be re-computed for every tree. Anyhow, I'll probably check back in after my lovely 20-hour commute back home to Australia this evening :)
Added
So I think keeping the weight computation at the forest level is the way to go.
Great, thanks @glouppe! I'm working on validation of the class_weight inputs now.
I have no strong opinion on this; at the very least we should mention what happens in the docstring. Consistency with SGDClassifier is a good thing, though.
Is this still a work in progress? I can start reviewing the PR if you feel this is more or less ready. |
That would be great @glouppe, thanks. Added tests for the new class_weight options. I warn for the string presets when used with warm_start, since the fitted data may then differ from the full dataset.
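A hedged sketch of the flavour of equivalence test being discussed (toy data; 'auto' per this PR's naming, later renamed 'balanced'):

```python
import numpy as np
from numpy.testing import assert_array_almost_equal
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 0, 1, 1])

# The 'auto' preset should behave like an explicitly supplied dict
# holding the same per-class weights.
cw = compute_class_weight('auto', np.array([0, 1]), y)
clf_auto = RandomForestClassifier(class_weight='auto', random_state=0).fit(X, y)
clf_dict = RandomForestClassifier(class_weight={0: cw[0], 1: cw[1]},
                                  random_state=0).fit(X, y)
assert_array_almost_equal(clf_auto.predict_proba(X), clf_dict.predict_proba(X))
```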
Maybe the test could also be reworked. The test passed for RBF SVMs too, I think.

I wrote some of the class_weight tests, but I feel these are very hard to do. There are some examples with carefully crafted datasets; maybe those should be used instead?
Actually, this test seems to be fine to me. As long as there is any regularization, it should work. I guess the bootstrapping alone doesn't help enough as there are very few trees in the ensemble by default.
Yup, and there are only 2 features in this dataset, so the randomized feature selection doesn't have much of a chance to generalize. Happy to make any mods that are deemed necessary, though. I admit this is a bit of a hack to get a passing grade from Travis, though the modules do get exercised a bit harder by my new Iris tests in tree and forest.
sklearn/ensemble/forest.py
Outdated
Can you be explicit and call the variable y_original?
Besides the small cosmit, I am +1 for merge. Tests are thorough. Could you also add an entry in the whats_new file? Thanks for your work @trevorstephens! This is a very helpful addition :) (and sorry for the lack of responsiveness these days...)
Thanks @glouppe! Requested changes are made. Second reviewer, anyone (while I figure out how to get around a lovely 11th-hour merge conflict)...?
I was not concerned about performance, but about code duplication.
Ah. I made a few comments about this a couple of weeks back: #3961 (comment). TL;DR: it may be possible, but it would result in identical calculations for every tree, whereas this way we only do it once.
Never mind then, 👍 from me.
Great, thanks! Any way to kick off Travis again?
I restarted the failed job.
@GaelVaroquaux thanks!
sklearn/ensemble/forest.py
Outdated
Could we have a more explicit name for what "cw" means (for the variable name and the function call)?
@GaelVaroquaux, thanks for looking it over! Let me know what you think of the latest.
@amueller @glouppe @GaelVaroquaux & others: I've been thinking about the naming of the second, per-estimator preset. Sorry to bring this up after flipping from [WIP], but it seems important to decide on before merge given the other estimators in the module.
@trevorstephens: that's a very good comment. Maybe I would use 'subsample', which I find is less confusing than 'estimator'.
sklearn/ensemble/forest.py
Outdated
The 'np.copy' doesn't seem necessary above.
Conflicts: doc/whats_new.rst sklearn/ensemble/forest.py sklearn/tree/tree.py
Thanks for the comments @GaelVaroquaux. Renamed the variable and the function call as suggested.
LGTM. Two 👍 and my review. I think that this can be merged. Merging. Thanks!
Thanks! Cheers!
This PR adds support for specifying class_weight in the forest classifier constructors.

What should happen when class_weight is used with the warm_start option?

Any other comments?
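For reference, a hedged summary of the accepted values discussed in this thread ('subsample' follows the renaming suggested above; later scikit-learn releases renamed these presets to 'balanced' and 'balanced_subsample'):

```python
from sklearn.ensemble import RandomForestClassifier

RandomForestClassifier(class_weight=None)            # default: uniform weights
RandomForestClassifier(class_weight='auto')          # weights from the full dataset
RandomForestClassifier(class_weight='subsample')     # 'auto' applied per bootstrap sample
RandomForestClassifier(class_weight={0: 1, 1: 10})   # user-defined dict
```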