
[MRG+1] Fix DummyClassifier bug with putting arrays into lists.#10926

Closed
nsorros wants to merge 8 commits into scikit-learn:master from nsorros:fix-dummy_classifier_bug

Conversation

@nsorros

@nsorros nsorros commented Apr 6, 2018

What does this implement/fix? Explain your changes.

It fixes an error thrown by DummyClassifier when a pandas DataFrame is passed as y, as explained in more depth in the issue. I enhance the check for output_2d_ to test whether the second dimension is 1, in which case I use column_or_1d to transform the output.
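The enhanced check can be sketched as a standalone snippet (a minimal sketch of the approach, not the exact diff; column_or_1d is scikit-learn's existing helper):

```python
import numpy as np
from sklearn.utils import column_or_1d

# y as it arrives from a one-column DataFrame: shape (4, 1)
y = np.array([[1], [2], [1], [1]])

# Treat y as truly multi-output only when it has more than one column
output_2d = (y.ndim == 2) and (y.shape[1] > 1)

if not output_2d:
    # Collapse the (n_samples, 1) column vector to (n_samples,)
    y = column_or_1d(y, warn=True)

print(y.shape)  # (4,)
```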

Any other comments?

This is my first PR so any help and comments are greatly appreciated.

Fixes #10786

@nsorros nsorros force-pushed the fix-dummy_classifier_bug branch 2 times, most recently from dee1089 to 0012560 Compare April 9, 2018 08:57
@nsorros nsorros changed the title [WIP] Fix DummyClassifier bug with putting arrays into lists. Fixes #10786 [MRG] Fix DummyClassifier bug with putting arrays into lists. Fixes #10786 Apr 9, 2018
@jnothman
Member

Tests failing.
Please state "fixes #xxx" in the description, not the title, so that GitHub will link it and automatically close the issue when the PR is merged.

@nsorros nsorros changed the title [MRG] Fix DummyClassifier bug with putting arrays into lists. Fixes #10786 [MRG] Fix DummyClassifier bug with putting arrays into lists. Apr 10, 2018
@nsorros
Author

nsorros commented Apr 10, 2018

@jnothman some tests, but not all, were complaining about the pandas import. Any idea why? I removed it and replaced the array with a matrix representation of a dataframe, which replicates what was creating the problem. Fingers crossed 🤞

@jnothman
Member

jnothman commented Apr 10, 2018 via email

@nsorros
Author

nsorros commented Apr 13, 2018

@jnothman Can I get some thoughts on this PR? The pandas dependency is now removed so it should be good to go.

Member

@lesteve lesteve left a comment

Some comments

self.output_2d_ = y.ndim == 2
self.output_2d_ = (y.ndim == 2) and (y.shape[1] > 1)

if not self.output_2d_:
Member

Rather than resorting to custom code, can you not use check_X_y as I mentioned in the associated issue?

Author

As far as I understood it, the problem with using check_X_y is that since multi_output needs to be True in this case, check_X_y will not transform y to be 1d. We can set the flag to False if the output is not 2d, but this ends up being the same code as above. If I am missing something, let me know.

Member

@lesteve lesteve Apr 19, 2018

I am not 100% sure (and don't have time to dig into it right now) but it feels like you are using custom validation code where check_X_y could be a better fit. Some arguments of check_X_y may need to be set properly for this use case.

The advantage of using a check_* function is that it is thoroughly exercised and tested across the code base, and fixes to check_* functions will benefit this code as well. With custom code it is quite easy to introduce bugs.

Maybe what would make it easier to review is if you could summarize why check_X_y and/or check_array is not possible. Bonus points if you can add snippets because that sometimes makes it clearer than words.

Author

@nsorros nsorros Apr 24, 2018

Here is the snippet that demonstrates what I was saying. The problem is that since we accept multi-output, check_X_y does not reduce the dimension to 1d in the special case where the second dimension is just 1, i.e. a column vector, e.g. shape=(x, 1).

```python
import numpy as np
from sklearn.utils.validation import check_X_y

X = np.array([[0], [0], [0], [0]])
y_1d = np.array([1, 2, 1, 1])
y_2d = np.array([[1], [2], [1], [1]])

output_2d_ = y_2d.ndim == 2

X_new, y_new = check_X_y(X, y_2d, multi_output=output_2d_)
print(y_2d.ndim)  # 2
print(y_new.ndim)  # 2: y is not collapsed to 1d when multi_output=True
```

This is why, to get around the special case, I updated the definition of output_2d_ to (y.ndim == 2) and (y.shape[1] > 1).

Author

The same solution to the problem exists in other parts of the code, for example in multi_layer_perceptron:

```python
if y.ndim == 2 and y.shape[1] == 1:
    y = column_or_1d(y, warn=True)
```

This bug is mostly fixed in other classifiers but persists in the dummy one. One alternative would be to add a function in validation.py that checks whether y is multi-output, and replace the repeated custom code with it to achieve the reusability benefits you are describing.
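Such a helper could look roughly like the following sketch (the name _check_multioutput_y is hypothetical, invented here for illustration; it is not an existing function in validation.py):

```python
import numpy as np
from sklearn.utils import column_or_1d


def _check_multioutput_y(y, warn=True):
    """Hypothetical helper: return (y, output_2d).

    Collapses a column vector of shape (n_samples, 1) to (n_samples,)
    and reports whether y is genuinely multi-output.
    """
    y = np.asarray(y)
    output_2d = y.ndim == 2 and y.shape[1] > 1
    if y.ndim == 2 and not output_2d:
        y = column_or_1d(y, warn=warn)
    return y, output_2d


y_col, flag_col = _check_multioutput_y([[1], [2], [1], [1]])
y_multi, flag_multi = _check_multioutput_y([[1, 0], [2, 1], [1, 0], [1, 1]])
print(y_col.shape, flag_col)      # (4,) False
print(y_multi.shape, flag_multi)  # (4, 2) True
```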

Member

OK, thanks for the details! Maybe we can use warn=True in column_or_1d so that there is a warning, similar to what is done in other places?

Side-comment: to link to code inside GitHub like that, you can use this nice feature.

Author

👍 Makes sense. I am uploading a new version with warn=True. Is there anything else that needs to be done for this to be ready to merge?

clf.class_prior_.reshape((1, -1)) > 0.5)


def test_most_frequent_and_prior_strategy_with_pandas_dataframe():
Member

Good stuff, this is indeed a non-regression test since it fails on master. I think you should rename it though, since it does not use a pandas DataFrame at all.



def test_most_frequent_and_prior_strategy_with_pandas_dataframe():
X = [[0], [0], [0], [0]] # ignored
Member

I am not sure what you mean by this comment (maybe that X is irrelevant for this issue). In any case, I think you should just remove the comment.


def test_most_frequent_and_prior_strategy_with_pandas_dataframe():
X = [[0], [0], [0], [0]] # ignored
y = [[1], [2], [1], [1]]
Member

So you copied this test from the function above and just added brackets around y so that it is 2d. Maybe something simpler would be to compare the predictions for y_1d and y_2d.
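The suggested comparison might look something like this (a sketch of the reviewer's idea, not the test actually added in the PR; it assumes a scikit-learn version where the bug is fixed):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

X = [[0], [0], [0], [0]]
y_1d = [1, 2, 1, 1]
y_2d = [[1], [2], [1], [1]]

# Fit on the 1d and 2d-column versions of the same y and compare predictions
pred_1d = DummyClassifier(strategy="most_frequent").fit(X, y_1d).predict(X)
pred_2d = DummyClassifier(strategy="most_frequent").fit(X, y_2d).predict(X)

np.testing.assert_array_equal(np.ravel(pred_1d), np.ravel(pred_2d))
```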

@lesteve lesteve changed the title [MRG] Fix DummyClassifier bug with putting arrays into lists. [MRG+1] Fix DummyClassifier bug with putting arrays into lists. May 2, 2018
@lesteve
Member

lesteve commented May 2, 2018

LGTM, maybe someone with more intimate knowledge of edge cases in check_array can have a look and confirm this is the behaviour we want.

Member

@jnothman jnothman left a comment

I agree with not using check_X_y but for different reasons: we should avoid placing constraints on X.

I don't quite understand why a column input means predictions should be 1d. Is that consistent with elsewhere in the library?

@jnothman
Member

jnothman commented May 2, 2018

Please add an entry to the change log at doc/whats_new/v0.20.rst. Like the other entries there, please reference this pull request with :issue: and credit yourself (and other contributors if applicable) with :user:.

@nsorros nsorros force-pushed the fix-dummy_classifier_bug branch from 4561bcc to c6cf782 Compare May 17, 2018 08:19
@nsorros nsorros force-pushed the fix-dummy_classifier_bug branch from c6cf782 to aa8cc9f Compare May 17, 2018 08:30
@nsorros
Author

nsorros commented May 17, 2018

Added a what's new entry. Do I need to mention the code reviewers as well? Is there anything else I need to do before being ready to merge?

@jnothman
Member

Sorry for the slow reply. Did you answer my question above before my what's new comment?

@nsorros
Author

nsorros commented May 21, 2018

@jnothman could you elaborate a bit more on what you mean? The input is not necessarily one column.

@jnothman
Member

I understood this PR would predict shape (n_samples,) if fit to y of shape (n_samples, 1). But I may have misunderstood

@amueller
Member

@jnothman can you remember what exactly the convention here is? I don't think we have that written down, except maybe in the common tests?

@jnothman
Member

jnothman commented May 21, 2018 via email

@nsorros
Author

nsorros commented May 22, 2018

@jnothman You got it right, I confused "input" to mean X. So if the input y is of shape (n_samples, 1), which is 2d, it is reduced to (n_samples,), which is 1d.

@amueller I am not sure what the convention is, but this happens in other parts of the codebase as well. One way this problem arises is by passing a pandas DataFrame as y, because scikit-learn understands that to be 2d. Other classifiers solve the problem in a similar fashion, for example.
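For illustration (a sketch assuming pandas is installed), this is the shape difference in question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"label": [1, 2, 1, 1]})

# A one-column DataFrame converts to a 2d array ...
print(np.asarray(df).shape)           # (4, 1)

# ... while the corresponding Series converts to a 1d array
print(np.asarray(df["label"]).shape)  # (4,)
```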

@amueller
Member

Also see #9169 :-/ I considered this a bug. I have to double check though, it's been a while.

@jnothman
Member

jnothman commented Mar 6, 2019

We should make sure to fix this one way or another

@jnothman jnothman added the Bug label Mar 6, 2019
@jnothman jnothman added this to the 0.21 milestone Mar 6, 2019
:class:`mixture.BayesianGaussianMixture`. :issue:`10740` by :user:`Erich
Schubert <kno10>` and :user:`Guillaume Lemaitre <glemaitre>`.

- Fixed a bug in :class:`dummy.DummyClassifier` where 1d dimensional y with
Member

could you please move this entry to v0.21.rst @nsorros ?

@adrinjalali adrinjalali self-assigned this Mar 25, 2019
@adrinjalali
Member

Could you please also rebase on master @nsorros ?

self.output_2d_ = (y.ndim == 2) and (y.shape[1] > 1)

if not self.output_2d_:
y = column_or_1d(y, warn=True)
Member

This is reversed right after, when y is reshaped to (-1, 1). At least as far as the added tests go, this if and column_or_1d are not needed, and the change to output_2d_ alone fixes the issue. Shouldn't the same fix apply to DummyRegressor?



Successfully merging this pull request may close these issues.

DummyClassifier bug with putting arrays into lists

5 participants