[MRG+1] Use more natural class_weight="auto" heuristic#4347
GaelVaroquaux merged 4 commits into scikit-learn:master from
Conversation
This is not true in master.
|
Hey @amueller & @ogrisel, as I mentioned on the Gitters, the change (while I do not dispute it being much more intuitive) could break reproducible results between scikit-learn versions. I ran a couple of quick toy-dataset tests and there are some differences in results between implementations. I think some comment about the effect in the what's new would be a good idea. Logit: old version vs. new version; forests: old version vs. new version (outputs omitted). |
|
@trevorstephens do you think the whatsnew entry that I added is not sufficient? This will definitely change results, but as I remarked elsewhere, if the user grid-searches C they can recover equivalent models. |
|
Sorry to have been silent (was at a conference). Yes, if the user grid-searches C in any of the linear models they can theoretically get the same optimized parameters as before, depending on how many C values they search over. Specifically, if w_0, ..., w_M are the old class weights and w'_0, ..., w'_M are the new class weights, such that w'_m = lambda * w_m where lambda is the constant described above, then the optimized parameters of the model with class weights w_0, ..., w_M and an inverse regularization parameter C will be identical to the optimized parameters of the model with class weights w'_0, ..., w'_M and an inverse regularization parameter C / lambda. So, without grid-searching over C (or setting C to whatever value was used previously divided by lambda), this change to class weights will produce different results.
So I think there are two options: leave "auto" as it is and just be much clearer about what it does (even acknowledging that maybe it's not the most obvious thing to do, but that it differs only in the "effective" number of data points), or change it. If we go the latter route, we also need to give a much clearer explanation in the documentation of what "auto" does and how it relates to the value of C, but we should also make very clear somewhere (and I don't know where) exactly how the change will affect people's results with a fixed C value and how they should set C to recreate their old results. (I think this is important, not least because changes to default parameters can be very confusing and easy to overlook/miss. I ran into this last week with the change to the default min_df value in CountVectorizer() -- ugh.)
Anyway, all that said, I'd like to hold off on changing anything for another 24 hours or so, for two reasons: 1) I have no idea how this change will affect random forests or other (nonlinear) classifiers. Someone should investigate this before changing anything, and should really do so by writing down the math, in addition to plugging in values and seeing what happens. 2) I'd like to find out what other weighting schemes (if any) other people (e.g., Hal, John, etc.) can think of, just to check that there aren't other, even more natural, options that we're overlooking. |
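The C-rescaling argument above can be sketched numerically with plain NumPy. This is a toy weighted, L2-penalized logistic objective; the data, weights, and lambda below are made-up illustration values, not from the PR. Scaling the class weights by lambda while dividing C by lambda multiplies the whole objective by lambda, so the minimizer is unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)

def objective(theta, X, y, class_weight, C):
    """Weighted, L2-penalized logistic objective:
    sum_i w[y_i] * log(1 + exp(-y_i * x_i @ theta)) + theta @ theta / (2 * C)
    """
    margins = y * (X @ theta)
    losses = np.log1p(np.exp(-margins))
    sample_w = np.where(y == 1, class_weight[1], class_weight[-1])
    return np.sum(sample_w * losses) + theta @ theta / (2.0 * C)

X = rng.normal(size=(20, 3))
y = np.where(rng.random(20) < 0.7, 1, -1)

old_w = {1: 1.0, -1: 2.5}   # hypothetical "old" class weights
lam = 1.7                   # the constant lambda from the comment above
new_w = {1: lam * old_w[1], -1: lam * old_w[-1]}
C = 0.5

# For ANY theta: obj(theta; lam*w, C/lam) == lam * obj(theta; w, C),
# so both objectives are minimized by the same theta.
theta = rng.normal(size=3)
a = objective(theta, X, y, old_w, C)
b = objective(theta, X, y, new_w, C / lam)
print(np.isclose(b, lam * a))  # True
```

The identity holds pointwise, which is why a grid search over C (or dividing the old C by lambda) can recover the previous solution exactly.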
|
Thanks for your feedback. We do document all changes in the what's new. We could also make the |
|
@amueller @hannawallach ... I guess I just meant that if people had hard-coded a 'good' set of parameters, through grid search or heuristics, their models might change a bit. Some comment about this in the what's new is what I was suggesting. As for forests and trees, |
|
I'm slightly surprised that the tree output changes. Have you checked whether the tree itself changes or only the probability estimates? |
|
@trevorstephens what was the data you compared the trees on? |
|
Never mind, I reproduced the tree things here: #4366
I'm not super happy about behavior changes, as they can surprise users and we can't really assume users read whatsnew.rst. |
Was in the first code block of the post, just a rather small unbalanced dataset. |
|
Personally, I'd prefer option number 2 (deprecate "auto" and introduce "balance" or some similar name) just so that it's very clear to everyone that there's been a change. I'm open to being convinced otherwise though :-) |
|
If we go with number 2, does anyone have opinions on whether we should have the old tests for |
|
I'd prefer the 2nd, deprecate, option as well. From a Kaggler POV, I want to be able to reproduce results wherever possible, for verification or just for blending runs. Plus 'balance' or something seems more informative than 'auto', IMO. As requested, looking into the two forests, there's definitely some strange stuff going on. The first tree in the ensemble even diverges, albeit only slightly -- not necessarily in Gini or sample counts, possibly not even in the outcome, but you can see a different variable split at the far-left node. Other trees I've looked at further down the ensemble diverge even more than this one, with differing Gini values down the tree. Using master vs. the new version (tree visualizations omitted). |
|
Beyond Kaggle, which is just an elaborate game, people won't trust
scikit-learn if the behavior changes between versions. I think we should
strive to go through a deprecation cycle for this too.
|
@GaelVaroquaux just to confirm: a deprecation cycle meaning a name change and eventual removal of the old name? |
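For context, such a deprecation cycle is commonly implemented by accepting the old value for a release or two while emitting a DeprecationWarning. A minimal sketch with a hypothetical helper name (not scikit-learn's actual code):

```python
import warnings

def resolve_class_weight(class_weight):
    # Hypothetical helper: map the deprecated "auto" value to "balanced",
    # warning so users have time to migrate before the old name is removed.
    if class_weight == "auto":
        warnings.warn(
            "class_weight='auto' is deprecated; use "
            "class_weight='balanced' instead.",
            DeprecationWarning,
        )
        return "balanced"
    return class_weight

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    resolved = resolve_class_weight("auto")
print(resolved)     # balanced
print(len(caught))  # 1
```

After the announced removal release, the "auto" branch would be replaced by a hard error.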
|
Ok, will head on over there. Sorry to pollute the waters. |
|
This has the deprecations and additional parameter now. Any reviews? |
|
Well, I don't see a different way, so maybe it will be "balanced_subsample". |
|
Ok should be good now. |
sklearn/utils/estimator_checks.py
To make this test easier to read I would do:
y = np.array([1, 1, 1, -1, -1])
n_samples = len(y)
n_classes = len(np.unique(y))
class_weight = {
1: n_samples / (np.sum(y == 1) * n_classes),
-1: n_samples / (np.sum(y == -1) * n_classes),
}
|
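For what it's worth, the invariant behind this suggested weighting can be checked with a short NumPy snippet (an illustrative check, not part of the PR): each class's weighted sample count comes out to n_samples / n_classes, regardless of the imbalance.

```python
import numpy as np

y = np.array([1, 1, 1, -1, -1])
n_samples = len(y)
n_classes = len(np.unique(y))

# The weights from the review suggestion above.
class_weight = {
    1: n_samples / (np.sum(y == 1) * n_classes),
    -1: n_samples / (np.sum(y == -1) * n_classes),
}

# Each class's weighted count is the same: n_samples / n_classes.
for cls, w in class_weight.items():
    print(cls, w * np.sum(y == cls))  # both ~ 2.5 (= 5 / 2)
```

So the majority class (three samples) and the minority class (two samples) contribute equal weighted mass, which is the "effective number of data points" invariance discussed earlier in the thread.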
Besides minor comments, LGTM. |
doc/whats_new.rst
There should be a bit more detail here: at least explain that class_weight="auto" has been deprecated in favor of class_weight="balanced", which provides invariance to class imbalance introduced by re-sampling with replacement for linear models.
Maybe also mention class_weight='balanced_subsample' for forest models.
|
Aside from the small comments by @ogrisel and myself, 👍 for merge. Thank you. |
|
Have you seen that the tests are failing? |
|
fixed. |
|
OK. Merging! |
|
thanks :) 🍻 |
rebase on top of scikit-learn#4347
improve error message
update error msg


Fix for #4324.
No test yet. Travis passes, so we don't have very strict tests for this ^^ (apart from the manual reimplementation in the test).