[MRG+1] fix logistic regression class weights#5008
amueller merged 1 commit into scikit-learn:master from …
Conversation
Why not also test multiclass OvR with 3 imbalanced classes?
I think I understand: the equivalent class_weight dict for 3 classes with OvR is no longer trivial to compute in that case.
Exactly; at least we cannot directly use compute_class_weight.
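To illustrate why the equivalent dict is non-trivial, here is a minimal sketch (the 3-class imbalanced dataset is an assumption): the multiclass "balanced" weights differ from the "balanced" weights computed on each one-vs-rest subproblem.

```python
# Sketch: multiclass "balanced" weights vs. per-subproblem OvR weights.
# The imbalanced 3-class labels below are an assumption for illustration.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 10 + [1] * 30 + [2] * 60)
classes = np.unique(y)

# Multiclass view: n_samples / (n_classes * bincount)
w_multi = compute_class_weight("balanced", classes=classes, y=y)

# OvR view: "balanced" weights computed on each binarized subproblem
w_ovr = []
for c in classes:
    y_bin = (y == c).astype(int)
    w = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_bin)
    w_ovr.append(w[1])  # weight of the positive class in this subproblem

print(w_multi)  # multiclass weights
print(w_ovr)    # per-subproblem weights: different numbers
```

The two views disagree, so there is no single dict that reproduces the OvR "balanced" behaviour via compute_class_weight on the full label vector.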
I don't know if this is principled or not, but we should enforce consistency throughout the OvR classifiers. @hannawallach @amueller any comment on this issue?
I don't know which way is preferable, but we could change LogisticRegression to behave in the same way, for consistency. That said, this PR fixes a small bug about …
Sorry I don't understand this.
It's been more than a year but the rationale was something like this.
Liblinear does not solve for the true multinomial loss and does an OvR instead; sklearn now raises an error if the loss is multinomial and the solver is liblinear.
In an OvR approach, for the other solvers, if the class weights are provided in dict format, they are converted to sample weights and then used. But liblinear does not support sample weights right now (#5274). However, once that PR gets merged, we can do away with this.
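A minimal sketch of the conversion described above (the helper name is hypothetical; this is conceptually what the non-liblinear solvers do with a user-provided dict, not sklearn's exact code):

```python
# Sketch (helper name is hypothetical): fold a class_weight dict into
# per-sample weights, which is conceptually what the non-liblinear
# solvers do with a user-provided dict.
import numpy as np

def class_weight_to_sample_weight(class_weight, y):
    """Give every sample the weight of its class (default 1.0)."""
    sample_weight = np.ones(len(y))
    for cls, w in class_weight.items():
        sample_weight[y == cls] = w
    return sample_weight

y = np.array([0, 0, 0, 1])
sw = class_weight_to_sample_weight({0: 0.5, 1: 2.0}, y)
print(sw)  # each sample carries its class's weight
```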
But it does support class_weight, right? It's odd to support balanced but not dicts.
Here is the problem:

- _fit_liblinear(loss='logistic_regression') does not handle sample_weight, but handles class_weight directly, and only for binary problems (!).
- The SAG, LBFGS and Newton-CG solvers handle class_weight by folding the weights into sample_weight.

That is why we have a different handling of class_weight for liblinear. There are then 3 cases with solver='liblinear':

- multinomial case: it raises an error in _check_solver_option.
- binary case: the class_weight dictionary is reconstructed with keys -1 and 1, probably in order to match the API of _fit_liblinear.
- OvR case with n_classes > 2: it raises an error (@MechCoder). The reason is that the other solvers apply the class weight to each sample, but this is not possible in _fit_liblinear since it does not handle sample_weight yet. With the 'balanced' case, the class weights are computed after the one-vs-rest binarization, so they can be used in the same way by all solvers, liblinear included.
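The binary-case key remapping can be sketched like this (the labels and weights are illustrative, and the exact internal code may differ):

```python
# Sketch (illustrative, not sklearn's exact code): remap a binary
# class_weight dict onto the {-1, 1} keys expected when calling the
# liblinear fitting helper.
import numpy as np

classes = np.array([2, 5])           # the two original labels, sorted
class_weight = {2: 0.3, 5: 1.7}      # user-provided dict

# keys -1 and 1 replace the original labels, preserving their order
remapped = {-1: class_weight[classes[0]], 1: class_weight[classes[1]]}
print(remapped)
```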
Conclusion: we probably just need to merge #5274 to be consistent across all solvers in LogisticRegression.
The inconsistency for class_weight='balanced' in the OvR case (inconsistency between SGDClassifier, LinearSVC and LogisticRegression) is another topic (see #5008 (comment))
I think #5274 is ready, but it only adds sample weights in LogisticRegression and not in the other classes that use liblinear.
I was about to summarize it, but thanks @TomDLT for doing it. That was the issue. Once #5274 is merged, we can get rid of this check.
Whether or not this is the best way to do things is of course a different issue.
I added the test from #5420, which tests the bug detailed in #5415.

This PR tests that these two classifiers are equivalent:

    class_weights = compute_class_weight("balanced", np.unique(y), y)
    clf1 = LogisticRegression(multi_class="multinomial", class_weight="balanced")
    clf2 = LogisticRegression(multi_class="multinomial", class_weight=class_weights)

[EDIT] This PR also tests that these two classifiers are equivalent (fix #5450):

    class_weights = compute_class_weight("balanced", np.unique(y), y)
    clf1 = LogisticRegressionCV(class_weight="balanced")
    clf2 = LogisticRegressionCV(class_weight=class_weights)

That is why we can remove this line.
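The equivalence above can be checked directly. A minimal sketch follows; the dataset and solver settings are assumptions, and `multi_class` is omitted because recent scikit-learn versions already use the multinomial loss by default with lbfgs:

```python
# Sketch: class_weight="balanced" should give the same fit as passing
# the explicitly computed weights. Dataset is an assumption.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

X, y = make_classification(n_samples=200, n_classes=3, n_informative=4,
                           weights=[0.1, 0.3, 0.6], random_state=0)
classes = np.unique(y)
cw = dict(zip(classes, compute_class_weight("balanced", classes=classes, y=y)))

clf1 = LogisticRegression(class_weight="balanced", max_iter=2000).fit(X, y)
clf2 = LogisticRegression(class_weight=cw, max_iter=2000).fit(X, y)

# identical weights should yield identical coefficients
np.testing.assert_allclose(clf1.coef_, clf2.coef_)
```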
sklearn/linear_model/logistic.py (outdated)
Is there any internal comment convention for things that are deprecated? # TODO: looks horrible, but some kind of tag could be nice.
A comment with the version is most helpful. # TODO is fine. I grep for the version, so 0.19 in this case, I guess.
Besides my comment/question this seems pretty solid. +1
Can you remove the line in common tests that excludes …?
Are you talking about this line? It has been removed.
Yes. Sorry, I overlooked that.
Sorry if I'm being stupid, but I still don't understand why this is. Why can liblinear do "balanced" but not a dict? Maybe this is outside the scope of this fix, but I feel the logic here is hard to follow. Some of the class weight handling is done here, and other parts further down in the if multi_class == 'ovr' branch.
This is indeed not clear; I am trying to understand it.
I think this is related to the different handling of class_weight with multi_class='ovr'
It's kind of unrelated to what you are fixing, I'm not sure if we should address it in this PR or not. It just seems odd.
Indeed, this is not related to this PR
ok LGTM
+1 for merge
backporting
@TomDLT can you please add a whatsnew for 0.17?
I also:
- "auto" (deprecated) into "balanced"
- y_bin and Y_bin