[MRG+1] Fix DummyClassifier bug with putting arrays into lists. #10926
nsorros wants to merge 8 commits into scikit-learn:master
Conversation
Tests failing.
@jnothman some tests, not all, were complaining about the pandas import. Any idea why? I removed it and replaced the array with a matrix representation of a dataframe, which replicates what was creating the problem. Fingers crossed 🤞
We don't require Pandas to run scikit-learn, nor do we have it installed in
all testing instances. Tests that require Pandas must be skipped if Pandas
cannot be imported. pytest.importorskip should help.
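As a sketch of that suggestion (the test name and DataFrame contents here are illustrative, not taken from this PR):

```python
import pytest

def test_most_frequent_strategy_with_dataframe():
    # Skip this test at collection time when pandas is not installed,
    # instead of failing on a module-level import of pandas.
    pd = pytest.importorskip("pandas")
    y = pd.DataFrame([1, 2, 1, 1])
    assert y.shape == (4, 1)
```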
@jnothman Can I get some thoughts on this PR? The pandas dependency is now removed, so it should be good to go.
-        self.output_2d_ = y.ndim == 2
+        self.output_2d_ = (y.ndim == 2) and (y.shape[1] > 1)
...
+        if not self.output_2d_:
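A minimal sketch of what the changed check does for a single-column target (the values are illustrative):

```python
import numpy as np

y = np.array([[1], [2], [1], [1]])  # shape (4, 1): 2d, but really one output

old_output_2d = y.ndim == 2                          # old check: True
new_output_2d = (y.ndim == 2) and (y.shape[1] > 1)   # new check: False

print(old_output_2d, new_output_2d)  # True False
```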
Rather than resorting to custom code, can you not use check_X_y as I mentioned in the associated issue?
As far as I understood it, the problem with using check_X_y is that, since multi_output needs to be true in this case, check_X_y will not transform y to be 1d. We can set the flag to false if the output is not 2d, but this ends up as the same code as above. If I am missing something, let me know.
I am not 100% sure (and don't have time to dig into it right now), but it feels like you are using custom validation code where check_X_y could be a better fit. Some arguments of check_X_y may need to be set properly for this use case.
The advantage of using a check_* function is that it is thoroughly used and tested across the code base, so fixes to check_* will benefit you as well. With custom code it is quite easy to introduce bugs.
Maybe what would make this easier to review is if you could summarize why check_X_y and/or check_array is not possible. Bonus points if you can add snippets, because that sometimes makes it clearer than words.
Here is a snippet that demonstrates what I was saying. The problem is that, since we accept multi-output y, check_X_y does not reduce the dimension to 1d in the special case where the second dimension is just 1, i.e. a column vector of shape (n, 1).
import numpy as np
from sklearn.utils.validation import check_X_y

X = np.array([[0], [0], [0], [0]])
y_1d = np.array([1, 2, 1, 1])
y_2d = np.array([[1], [2], [1], [1]])

output_2d_ = (y_2d.ndim == 2)
X_new, y_new = check_X_y(X, y_2d, multi_output=output_2d_)
print(y_2d.ndim)  # 2
print(y_new.ndim)  # 2: check_X_y keeps y 2d when multi_output=True

This is why, to get around the special case, I updated the definition of output_2d_ to (y.ndim == 2) and (y.shape[1] > 1).
The same solution to the problem exists in other parts of the code, for example in multi_layer_perceptron:

if y.ndim == 2 and y.shape[1] == 1:
    y = column_or_1d(y, warn=True)

This bug is mostly fixed in other classifiers but persists in the dummy one. One alternative would be to add a function in validation.py that checks whether y is multi-output, and to replace the repeated custom code with that to get the reusability benefits you are describing.
OK, thanks for the details! Maybe we can use warn=True in column_or_1d so that there is a warning, similarly to what is done in other places?
Side-comment: to link to code inside GitHub like that, you can use this nice feature.
👍 Makes sense. I am uploading a new version with warn=True. Is there anything else that needs to be done for this to be ready to merge?
sklearn/tests/test_dummy.py (Outdated)

        clf.class_prior_.reshape((1, -1)) > 0.5)
...
def test_most_frequent_and_prior_strategy_with_pandas_dataframe():
Good stuff, this is a non-regression test indeed since this fails in master. I think you should rename it though, since it does not use a pandas dataframe at all.
sklearn/tests/test_dummy.py (Outdated)

def test_most_frequent_and_prior_strategy_with_pandas_dataframe():
    X = [[0], [0], [0], [0]]  # ignored
I am not sure what you mean by this comment (maybe that X is irrelevant for this issue). In any case, I think just remove the comment.
sklearn/tests/test_dummy.py (Outdated)

def test_most_frequent_and_prior_strategy_with_pandas_dataframe():
    X = [[0], [0], [0], [0]]  # ignored
    y = [[1], [2], [1], [1]]
So you copied this test from the function above and just added brackets around y so that it is 2d. Maybe something simpler would be to compare the predictions for y_1d and y_2d.
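A sketch of that suggestion, assuming the fixed behaviour where a single-column y yields the same predictions as its 1d counterpart:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

X = np.array([[0], [0], [0], [0]])
y_1d = np.array([1, 2, 1, 1])
y_2d = y_1d.reshape(-1, 1)  # same labels, as a single 2d column

# With the fix, both fits should produce identical predictions.
pred_1d = DummyClassifier(strategy="most_frequent").fit(X, y_1d).predict(X)
pred_2d = DummyClassifier(strategy="most_frequent").fit(X, y_2d).predict(X)
np.testing.assert_array_equal(pred_1d, pred_2d)
```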
LGTM, maybe someone with more intimate knowledge of the edge cases in check_array can have a look and confirm this is the behaviour we want.
jnothman
left a comment
I agree with not using check_X_y, but for different reasons: we should avoid placing constraints on X.
I don't quite understand why a column input means predictions should be 1d. Is that consistent with elsewhere in the library?
Please add an entry to the change log at
Added a what's new entry. Do I need to mention the code reviewers as well? Is there anything else I need to do before this is ready to merge?
Sorry for the slow reply. Did you answer my question above before my what's new comment?
@jnothman could you elaborate a bit more on what you mean? The input is not necessarily 1 column.
I understood this PR would predict shape (n_samples,) if fit to y of shape (n_samples, 1). But I may have misunderstood.
@jnothman can you remember what exactly the convention here is? I don't think we have that written down, except maybe in the common tests?
No, I'm not sure, and I'm not available to investigate.
@jnothman You got it right. I confused input to mean X. So if input y is of shape (n_samples, 1), which is 2d, it is reduced to (n_samples,), which is 1d. @amueller not sure what the convention is, but this happens in other parts of the codebase as well. One way this problem arises is by passing a pandas dataframe as y, because scikit-learn understands that to be 2d. Other classifiers solve the problem in a similar fashion, for example.
Also see #9169 :-/ I considered this a bug. I have to double check though, it's been a while.

We should make sure to fix this one way or another.
    :class:`mixture.BayesianGaussianMixture`. :issue:`10740` by :user:`Erich
    Schubert <kno10>` and :user:`Guillaume Lemaitre <glemaitre>`.

    - Fixed a bug in :class:`dummy.DummyClassifier` where 1d dimensional y with
Could you please move this entry to v0.21.rst @nsorros?
Could you please also rebase on master @nsorros?
    self.output_2d_ = (y.ndim == 2) and (y.shape[1] > 1)
...
    if not self.output_2d_:
        y = column_or_1d(y, warn=True)
This is reversed right after, when y is reshaped to (-1, 1). At least as far as the added tests here go, this if and column_or_1d are not needed, and the change to output_2d_ alone fixes the issue. Shouldn't the same fix apply to DummyRegressor?
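A small numpy-only sketch of the point above: raveling a single-column y and then reshaping to (-1, 1) is a round trip back to the original shape.

```python
import numpy as np

y = np.array([[1], [2], [1], [1]])    # shape (4, 1)
y_flat = np.ravel(y)                  # what column_or_1d effectively does: (4,)
y_back = np.reshape(y_flat, (-1, 1))  # reshaped right after: (4, 1) again

assert y_back.shape == y.shape
np.testing.assert_array_equal(y_back, y)
```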
What does this implement/fix? Explain your changes.
It fixes an error thrown by DummyClassifier when passing a dataframe y, as explained in more depth in the issue. I enhance the check for output_2d_ to test whether the 2nd dimension is 1, in which case I use column_or_1d to transform the output.
Any other comments?
This is my first PR, so any help and comments are greatly appreciated.
Fixes #10786