[MRG+1] Fix DummyClassifier bug with putting arrays into lists. #10926
nsorros wants to merge 8 commits into scikit-learn:master
Conversation
Tests failing.
@jnothman some tests, not all, were complaining about the pandas import. Any idea why? I removed it and replaced the array with a matrix representation of a dataframe, which replicates what was creating the problem. Fingers crossed 🤞
We don't require Pandas to run scikit-learn, nor do we have it installed in
all testing instances. Tests that require Pandas must be skipped if Pandas
cannot be imported. pytest.importorskip should help.
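As a sketch of that suggestion (the test name and DataFrame contents here are illustrative, not taken from this PR):

```python
import pytest

def test_most_frequent_strategy_with_dataframe():
    # Skip this test at collection time when pandas is not installed,
    # instead of failing on a module-level import of pandas.
    pd = pytest.importorskip("pandas")
    y = pd.DataFrame([1, 2, 1, 1])
    assert y.shape == (4, 1)
```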
@jnothman Can I get some thoughts on this PR? The pandas dependency is now removed, so it should be good to go.
-        self.output_2d_ = y.ndim == 2
+        self.output_2d_ = (y.ndim == 2) and (y.shape[1] > 1)
...
+        if not self.output_2d_:
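A minimal sketch of what the changed check does for a single-column target (the values are illustrative):

```python
import numpy as np

y = np.array([[1], [2], [1], [1]])  # shape (4, 1): 2d, but really one output

old_output_2d = y.ndim == 2                          # old check: True
new_output_2d = (y.ndim == 2) and (y.shape[1] > 1)   # new check: False

print(old_output_2d, new_output_2d)  # True False
```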
Rather than resorting to custom code, can you not use check_X_y as I mentioned in the associated issue?
As far as I understood it, the problem with using check_X_y is that, since multi_output needs to be true in this case, check_X_y will not transform y to be 1d. We can set the flag to false if the output is not 2d, but this ends up as the same code as above. If I am missing something, let me know.
I am not 100% sure (and don't have time to dig into it right now), but it feels like you are using custom validation code where check_X_y could be a better fit. Some arguments of check_X_y may need to be set properly for this use case.
The advantage of using a check_* function is that it is thoroughly used and tested across the code base, so fixes to check_* will benefit you as well. With custom code it is quite easy to introduce bugs.
Maybe what would make this easier to review is if you could summarize why check_X_y and/or check_array is not possible. Bonus points if you can add snippets, because that sometimes makes it clearer than words.
Here is a snippet that demonstrates what I was saying. The problem is that, since we accept multi-output y, check_X_y does not reduce the dimension to 1d in the special case where the second dimension is just 1, i.e. a column vector of shape (n, 1).
import numpy as np
from sklearn.utils.validation import check_X_y

X = np.array([[0], [0], [0], [0]])
y_1d = np.array([1, 2, 1, 1])
y_2d = np.array([[1], [2], [1], [1]])

output_2d_ = (y_2d.ndim == 2)
X_new, y_new = check_X_y(X, y_2d, multi_output=output_2d_)
print(y_2d.ndim)  # 2
print(y_new.ndim)  # 2: check_X_y keeps y 2d when multi_output=True

This is why, to get around the special case, I updated the definition of output_2d_ to (y.ndim == 2) and (y.shape[1] > 1).
The same solution to the problem exists in other parts of the code, for example in multi_layer_perceptron:

if y.ndim == 2 and y.shape[1] == 1:
    y = column_or_1d(y, warn=True)

This bug is mostly fixed in other classifiers but persists in the dummy one. One alternative would be to add a function in validation.py that checks whether y is multi-output, and to replace the repeated custom code with that to get the reusability benefits you are describing.
OK, thanks for the details! Maybe we can use warn=True in column_or_1d so that there is a warning, similarly to what is done in other places?
Side-comment: to link to code inside GitHub like that, you can use this nice feature.
👍 Makes sense. I am uploading a new version with warn=True. Is there anything else that needs to be done for this to be ready to merge?
sklearn/tests/test_dummy.py (Outdated)

        clf.class_prior_.reshape((1, -1)) > 0.5)
...
def test_most_frequent_and_prior_strategy_with_pandas_dataframe():
Good stuff, this is a non-regression test indeed since this fails in master. I think you should rename it though, since it does not use a pandas dataframe at all.
sklearn/tests/test_dummy.py (Outdated)

def test_most_frequent_and_prior_strategy_with_pandas_dataframe():
    X = [[0], [0], [0], [0]]  # ignored
I am not sure what you mean by this comment (maybe that X is irrelevant for this issue). In any case, I think just remove the comment.
sklearn/tests/test_dummy.py (Outdated)

def test_most_frequent_and_prior_strategy_with_pandas_dataframe():
    X = [[0], [0], [0], [0]]  # ignored
    y = [[1], [2], [1], [1]]
So you copied this test from the function above and just added brackets around y so that it is 2d. Maybe something simpler would be to compare the predictions for y_1d and y_2d.
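A sketch of that suggestion, assuming the fixed behaviour where a single-column y yields the same predictions as its 1d counterpart:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

X = np.array([[0], [0], [0], [0]])
y_1d = np.array([1, 2, 1, 1])
y_2d = y_1d.reshape(-1, 1)  # same labels, as a single 2d column

# With the fix, both fits should produce identical predictions.
pred_1d = DummyClassifier(strategy="most_frequent").fit(X, y_1d).predict(X)
pred_2d = DummyClassifier(strategy="most_frequent").fit(X, y_2d).predict(X)
np.testing.assert_array_equal(pred_1d, pred_2d)
```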
LGTM, maybe someone with more intimate knowledge of the edge cases in check_array can have a look and confirm this is the behaviour we want.
jnothman
left a comment
I agree with not using check_X_y, but for different reasons: we should avoid placing constraints on X.
I don't quite understand why a column input means predictions should be 1d. Is that consistent with elsewhere in the library?
Please add an entry to the change log at
Added a what's new entry. Do I need to mention the code reviewers as well? Is there anything else I need to do before this is ready to merge?
Sorry for the slow reply. Did you answer my question above before my what's new comment?
@jnothman could you elaborate a bit more on what you mean? The input is not necessarily 1 column.
I understood this PR would predict shape (n_samples,) if fit to y of shape (n_samples, 1). But I may have misunderstood.
@jnothman can you remember what exactly the convention here is? I don't think we have that written down, except maybe in the common tests?
No, I'm not sure, and I'm not available to investigate.
@jnothman You got it right. I confused input to mean X. So if input y is of shape (n_samples, 1), which is 2d, it is reduced to (n_samples,), which is 1d. @amueller not sure what the convention is, but this happens in other parts of the codebase as well. One way this problem arises is by passing a pandas dataframe as y, because scikit-learn understands that to be 2d. Other classifiers solve the problem in a similar fashion, for example.
Also see #9169 :-/ I considered this a bug. I have to double check though, it's been a while.

We should make sure to fix this one way or another.
    :class:`mixture.BayesianGaussianMixture`. :issue:`10740` by :user:`Erich
    Schubert <kno10>` and :user:`Guillaume Lemaitre <glemaitre>`.

    - Fixed a bug in :class:`dummy.DummyClassifier` where 1d dimensional y with
Could you please move this entry to v0.21.rst @nsorros?
Could you please also rebase on master @nsorros?
    self.output_2d_ = (y.ndim == 2) and (y.shape[1] > 1)
...
    if not self.output_2d_:
        y = column_or_1d(y, warn=True)
This is reversed right after, when y is reshaped to (-1, 1). At least as far as the added tests here go, this if and column_or_1d are not needed, and the change to output_2d_ alone fixes the issue. Shouldn't the same fix apply to DummyRegressor?
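A small numpy-only sketch of the point above: raveling a single-column y and then reshaping to (-1, 1) is a round trip back to the original shape.

```python
import numpy as np

y = np.array([[1], [2], [1], [1]])    # shape (4, 1)
y_flat = np.ravel(y)                  # what column_or_1d effectively does: (4,)
y_back = np.reshape(y_flat, (-1, 1))  # reshaped right after: (4, 1) again

assert y_back.shape == y.shape
np.testing.assert_array_equal(y_back, y)
```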
What does this implement/fix? Explain your changes.
It fixes an error thrown by DummyClassifier when passing a dataframe y, as explained in more depth in the issue. I enhance the check for output_2d_ to test whether the 2nd dimension is 1, in which case I use column_or_1d to transform the output.
Any other comments?
This is my first PR, so any help and comments are greatly appreciated.
Fixes #10786