[MRG + 2] FIX LogisticRegressionCV to correctly handle string labels#5874
[MRG + 2] FIX LogisticRegressionCV to correctly handle string labels#5874raghavrv merged 12 commits intoscikit-learn:masterfrom
Conversation
7a54392 to
41777f3
Compare
There was a problem hiding this comment.
Surely self._enc must be used somewhere?
There was a problem hiding this comment.
It was added here - 67585f6, but not used anywhere... I am unable to understand why though...
|
Instead of this can we encode If you did this you would need to reconstruct the Also you can remove the |
|
Would also be a good time to clean up some parts of the code, especially the encoding logic, dtype checks etc |
|
hm this is still a bug, right? |
sklearn/linear_model/logistic.py
Outdated
|
|
||
| # To deal with object dtypes, we need to convert into an array of floats. | ||
| y_test = check_array(y_test, dtype=np.float64, ensure_2d=False) | ||
| y_test = check_array(LabelEncoder().fit_transform(y_test), |
There was a problem hiding this comment.
Surely y_test needs to use the same encoder as was used for training.
3e85d50 to
c61373d
Compare
|
@MechCoder @jnothman @amueller Could you take a look at this now? |
1a7889b to
9097471
Compare
c3d207e to
0b7129a
Compare
|
y = LabelEncoder().fit_transform(y) |
| y = np.array(y) - 1 | ||
| # Test for string labels | ||
| lr = LogisticRegression(solver='lbfgs', multi_class='multinomial') | ||
| lr_cv = LogisticRegression(solver='lbfgs', multi_class='multinomial') |
There was a problem hiding this comment.
you probably wanted to use LogisticRegressionCV here
sklearn/linear_model/logistic.py
Outdated
| check_consistent_length(X, y) | ||
| if y.dtype.kind in ('S', 'U'): | ||
| # Encode for string labels | ||
| self._label_encoder = LabelEncoder().fit(y) |
There was a problem hiding this comment.
Why can't you do encoding in all cases and assume that LabelEncoder handles efficiency issues?
There was a problem hiding this comment.
thanks for the comment I did so in the recent commit...
sklearn/linear_model/logistic.py
Outdated
| for new_key, old_key in zip(new_keys, old_keys): | ||
| self.class_weight[new_key] = self.class_weight[old_key] | ||
| else: | ||
| self.classes_enc_ = self.classes_ |
There was a problem hiding this comment.
Does this work when classes are [1,2,3], not [0,1,2]?
There was a problem hiding this comment.
I'm encoding for all cases now like you suggested. And there is a test for this usecase.
0b7129a to
14699b7
Compare
|
Arghh. This is tougher to solve than I thought. After encoding the string labels at the fit of |
14699b7 to
934493c
Compare
|
Now it should pass all the tests... Gentle ping @TomDLT @MechCoder for reviews too... |
|
Your variables names are quite confusing: what about: or even remove |
|
Otherwise LGTM |
b0de8f3 to
4961d97
Compare
|
Merging once CIs pass... |
|
Thanks @jnothman @amueller @TomDLT @MechCoder for the reviews! |
|
thanks for the pr :) |
…cikit-learn#5874) * TST if LogisticRegressionCV handles string labels properly * TST Add a test with class_weight dict * ENH Encode y and class_weight dict * Better variable names * TYPO casses --> classes * FIX Use dict comprehension; classes_labels --> classes * Revert dict comprehension (for Python 2.6 compat) * MNT reorder validation to improve clarity * Add whatsnew entry
|
We seem to be getting test failures from this on PRs, such as at https://ci.appveyor.com/project/sklearn-ci/scikit-learn/build/1.0.10296/job/qol1vrsk30ycsopl |
|
Which PR is this? I saw a similar error but because of an incorrect updation of master. (Which got resolved after fixing that...) |
|
#7838 The commit history On 21 November 2016 at 09:33, Raghav RV notifications@github.com wrote:
|
|
Okay. Thanks for the notice. I'll look into it tomorrow... |
…cikit-learn#5874) * TST if LogisticRegressionCV handles string labels properly * TST Add a test with class_weight dict * ENH Encode y and class_weight dict * Better variable names * TYPO casses --> classes * FIX Use dict comprehension; classes_labels --> classes * Revert dict comprehension (for Python 2.6 compat) * MNT reorder validation to improve clarity * Add whatsnew entry
…cikit-learn#5874) * TST if LogisticRegressionCV handles string labels properly * TST Add a test with class_weight dict * ENH Encode y and class_weight dict * Better variable names * TYPO casses --> classes * FIX Use dict comprehension; classes_labels --> classes * Revert dict comprehension (for Python 2.6 compat) * MNT reorder validation to improve clarity * Add whatsnew entry
* tag '0.18.1': (144 commits) skip tree-test on 32bit do the warning test as we do it in other places. Replase assert_equal by assert_almost_equal in cosine test version bump 0.18.1 fix merge conflict mess in whatsnew add the python2.6 warning to 0.18.1 fix learning_curve test that I messed up in cherry-picking the "reentrant cv" PR. sync whatsnew with master [MRG] TST Ensure __dict__ is unmodified by predict, transform, etc (scikit-learn#7553) FIX scikit-learn#6420: Cloning decision tree estimators breaks criterion objects (scikit-learn#7680) Add whats new entry for scikit-learn#6282 (scikit-learn#7629) [MGR + 2] fix selectFdr bug (scikit-learn#7490) fixed whatsnew cherry-pick mess (somewhat) [MRG + 2] FIX LogisticRegressionCV to correctly handle string labels (scikit-learn#5874) [MRG + 2] Fixed parameter setting in SelectFromModel (scikit-learn#7764) [MRG+2] DOC adding separate `fit()` methods (and docstrings) for DecisionTreeClassifier and DecisionTreeRegressor (scikit-learn#7824) Fix docstring typo (scikit-learn#7844) n_features --> n_components [MRG + 1] DOC adding :user: role to whats_new (scikit-learn#7818) [MRG+1] label binarizer not used consistently in CalibratedClassifierCV (scikit-learn#7799) DOC : fix docstring of AIC/BIC in GMM ...
* releases: (144 commits) skip tree-test on 32bit do the warning test as we do it in other places. Replase assert_equal by assert_almost_equal in cosine test version bump 0.18.1 fix merge conflict mess in whatsnew add the python2.6 warning to 0.18.1 fix learning_curve test that I messed up in cherry-picking the "reentrant cv" PR. sync whatsnew with master [MRG] TST Ensure __dict__ is unmodified by predict, transform, etc (scikit-learn#7553) FIX scikit-learn#6420: Cloning decision tree estimators breaks criterion objects (scikit-learn#7680) Add whats new entry for scikit-learn#6282 (scikit-learn#7629) [MGR + 2] fix selectFdr bug (scikit-learn#7490) fixed whatsnew cherry-pick mess (somewhat) [MRG + 2] FIX LogisticRegressionCV to correctly handle string labels (scikit-learn#5874) [MRG + 2] Fixed parameter setting in SelectFromModel (scikit-learn#7764) [MRG+2] DOC adding separate `fit()` methods (and docstrings) for DecisionTreeClassifier and DecisionTreeRegressor (scikit-learn#7824) Fix docstring typo (scikit-learn#7844) n_features --> n_components [MRG + 1] DOC adding :user: role to whats_new (scikit-learn#7818) [MRG+1] label binarizer not used consistently in CalibratedClassifierCV (scikit-learn#7799) DOC : fix docstring of AIC/BIC in GMM ... Conflicts: removed sklearn/externals/joblib/__init__.py sklearn/externals/joblib/_parallel_backends.py sklearn/externals/joblib/testing.py
* dfsg: (144 commits) skip tree-test on 32bit do the warning test as we do it in other places. Replase assert_equal by assert_almost_equal in cosine test version bump 0.18.1 fix merge conflict mess in whatsnew add the python2.6 warning to 0.18.1 fix learning_curve test that I messed up in cherry-picking the "reentrant cv" PR. sync whatsnew with master [MRG] TST Ensure __dict__ is unmodified by predict, transform, etc (scikit-learn#7553) FIX scikit-learn#6420: Cloning decision tree estimators breaks criterion objects (scikit-learn#7680) Add whats new entry for scikit-learn#6282 (scikit-learn#7629) [MGR + 2] fix selectFdr bug (scikit-learn#7490) fixed whatsnew cherry-pick mess (somewhat) [MRG + 2] FIX LogisticRegressionCV to correctly handle string labels (scikit-learn#5874) [MRG + 2] Fixed parameter setting in SelectFromModel (scikit-learn#7764) [MRG+2] DOC adding separate `fit()` methods (and docstrings) for DecisionTreeClassifier and DecisionTreeRegressor (scikit-learn#7824) Fix docstring typo (scikit-learn#7844) n_features --> n_components [MRG + 1] DOC adding :user: role to whats_new (scikit-learn#7818) [MRG+1] label binarizer not used consistently in CalibratedClassifierCV (scikit-learn#7799) DOC : fix docstring of AIC/BIC in GMM ...
…cikit-learn#5874) * TST if LogisticRegressionCV handles string labels properly * TST Add a test with class_weight dict * ENH Encode y and class_weight dict * Better variable names * TYPO casses --> classes * FIX Use dict comprehension; classes_labels --> classes * Revert dict comprehension (for Python 2.6 compat) * MNT reorder validation to improve clarity * Add whatsnew entry
…cikit-learn#5874) * TST if LogisticRegressionCV handles string labels properly * TST Add a test with class_weight dict * ENH Encode y and class_weight dict * Better variable names * TYPO casses --> classes * FIX Use dict comprehension; classes_labels --> classes * Revert dict comprehension (for Python 2.6 compat) * MNT reorder validation to improve clarity * Add whatsnew entry
…cikit-learn#5874) * TST if LogisticRegressionCV handles string labels properly * TST Add a test with class_weight dict * ENH Encode y and class_weight dict * Better variable names * TYPO casses --> classes * FIX Use dict comprehension; classes_labels --> classes * Revert dict comprehension (for Python 2.6 compat) * MNT reorder validation to improve clarity * Add whatsnew entry
Fixes #5868
In master -
In this branch -
@agramfort @GaelVaroquaux