[MRG+1] Fixes return_X_y should be available on more dataset loaders/fetchers (#10734) #10774
qinhanmin2014 merged 21 commits into scikit-learn:master
Conversation
I think it is applicable to lfw. We are not stopping the user from getting the bunch, just making it easier to get the primary tuple of data. And versionadded should be 0.20.

@jnothman thanks - will add to lfw and change versionadded to 0.20.

@jnothman I added return_X_y to lfw and fixed a couple of PEP 8 issues I had missed.
jnothman left a comment:
This is still marked WIP. Is there something else you intend to do before considering this sufficient for review and merge?
    assert_equal(bunch.data.dtype, np.float64)

    # test return_X_y option
    X_y_tuple = datasets.fetch_20newsgroups_vectorized(subset='test',return_X_y=True)
sklearn/datasets/tests/test_lfw.py (outdated)

    # test return_X_y option
    X_y_tuple = fetch_lfw_people(data_home=SCIKIT_LEARN_DATA, resize=None,
        slice_=None, color=True,

This indent should be consistent with the argument on the previous line.
@jnothman thanks for the heads-up on those two issues and on marking as MRG. I believe this is now ready for review and merge.

flake8 says you have left some lines longer than 79 characters.
jnothman left a comment:
The tests are not run by our CI, I think. Have you run them?
It might be nice to have this as a common test for datasets. Otherwise LGTM.
@jnothman I have run the tests and they are passing. I just committed a fix for those too-long lines flagged by flake8. A common test for datasets - meaning a CI test that runs sklearn/datasets/tests?

No, by a common test I mean avoiding the repetition of code in the current tests, and instead looping over fetchers to confirm that they return the right format when return_X_y is used.
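One way to read this suggestion is a single parametrized loop over the dataset functions. The sketch below uses the lightweight load_* functions so it runs without downloading anything; the fetcher list and test name are illustrative, not the PR's final code, but the PR applies the same check to the fetchers (fetch_kddcup99, fetch_lfw_people, etc.):

```python
from sklearn import datasets

# Lightweight loaders stand in for the fetchers here so the sketch runs
# offline; each already supports the return_X_y keyword.
LOADERS = [
    datasets.load_iris,
    datasets.load_digits,
    datasets.load_diabetes,
]


def test_return_X_y_common():
    for load in LOADERS:
        bunch = load()
        X, y = load(return_X_y=True)
        assert isinstance((X, y), tuple)
        assert X.shape == bunch.data.shape
        assert y.shape == bunch.target.shape
```

The dataset-specific setup stays in the list; the shared assertions live in one place.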
…ay to get around memory issue. (scikit-learn#10734)
@jnothman "avoiding the repetition of code in the current tests" - Ah, I see what you're saying. Certainly could do that. The question, I guess, is whether it makes more sense to keep every dataset's tests in its own test_* file or to test cross-dataset functionality in one place. I'm happy to do that if you think it's a good idea. I made a small change to test_rcv1 where I think the test was running out of memory attempting to test the entire array returned. The codecov/patch marked as failing at https://codecov.io/gh/scikit-learn/scikit-learn/compare/ccbf9975fcf1676f6ac4f311e388529d3a3c4d3f...7dcadcb12a74b4b871c1f4d976564992c25ce30a - is that indicating the previous diff does not hit a large enough test coverage percentage?
The codecov failure simply says that the new test code is not being run, which is to be expected.
A common test would live in datasets/tests/test_common.py.
@jnothman looking at the datasets/tests/test_common.py idea for return_X_y. Perhaps I'm misinterpreting your idea, but I think there'd still wind up being duplicated code from moving the relevant pieces of the test_*.py files into tests/test_common.py while looping over the limited set of datasets that accept the return_X_y parameter (rcv1, lfw, 20_newsgroups, kddcup99, and various fetches from base.py).
In sum - while the actual tests of the X_y_tuples are the same, the fetching involved differs by dataset. Moving that to test_common.py would lead to code duplication of that part of the test logic. While I agree that it would be nice to capture the repetitive parts of this return_X_y test logic, it feels to me like it would just move the duplication rather than remove it.
If I've misunderstood or mischaracterized your proposal, please let me know. If you feel test_common.py is the best way to go, I can certainly implement it that way. Thanks for your thoughts, and thanks for the hand-holding as I get acclimated to the codebase and the contributing flow.
Okay. Could also just use a helper called check_return_X_y, called by each dataset's test.
@jnothman great, yes, I will do that - refactor the return_X_y test part, called by each test, into tests/test_common.py.

thanks!
…parameter into common test function.
@jnothman I refactored as suggested into test_common.py - both the files I had updated (kddcup99, lfw, etc.) and test_base.py now use the new test function.
jnothman left a comment:
Thanks! You have a flake8 failure.
    def check_return_X_y(bunch, X_y_tuple):
        assert_true(isinstance(X_y_tuple, tuple))

We are trying to phase out these assertion functions. Please use a bare assert.
    def check_return_X_y(bunch, X_y_tuple):
        assert_true(isinstance(X_y_tuple, tuple))
        assert_array_equal(X_y_tuple[0].shape, bunch.data.shape)

Shapes should not be arrays. Just use assert ==.
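The point, as I read it: ndarray.shape is a plain tuple, so a direct equality assert is clearer than assert_array_equal, which is meant for array contents. An illustrative snippet:

```python
import numpy as np

a = np.zeros((3, 2))
b = np.ones((3, 2))

# Shapes are tuples, not arrays; compare them directly with ==.
assert a.shape == b.shape
assert a.shape == (3, 2)
```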
    assert_equal(data.target.shape, (9571,))

    X_y_tuple = fetch_kddcup99('smtp', return_X_y=True)
    bunch = fetch_kddcup99('smtp')

Don't we already have a bunch above? I think it's unclear to the reader why we get this again.
    from sklearn.utils.testing import assert_true

    def check_return_X_y(bunch, X_y_tuple):

I had imagined this would take a partial function and pass in return_X_y=True. You have still left a lot of duplicated idiom in the test functions.
By "a lot of duplicated idiom" do you mean the setting up of the bunches and fetching of the X_y_tuples to compare? Were you thinking of something like this?

Each of the dataset test files (test_20news.py, test_lfw.py, etc.) sets up the appropriate fetch partial:

    from functools import partial

    # test return_X_y option
    fetch_func = partial(datasets.fetch_20newsgroups_vectorized,
                         subset='test')
    check_return_X_y(bunch, fetch_func)

During the check_return_X_y call, the partial is called, passing return_X_y=True (test_common.py):

    def check_return_X_y(bunch, fetch_func_partial):
        X_y_tuple = fetch_func_partial(return_X_y=True)
        assert isinstance(X_y_tuple, tuple)
        assert X_y_tuple[0].shape == bunch.data.shape
        assert X_y_tuple[1].shape == bunch.target.shape

Wouldn't this run into problems if additional arguments are ever added to those fetch functions after what is currently the last parameter, return_X_y?
No, I don't see how it would run into problems if args are passed by name.
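A quick sketch of why keyword passing is robust. Here fetch_example and its signature are hypothetical stand-ins, not sklearn code: even if a new parameter is later added near return_X_y, the keyword call through the partial still binds correctly.

```python
from functools import partial


# Hypothetical stand-in for a dataset fetcher, not sklearn code.
# Imagine `normalize` was added after the signature originally ended
# with return_X_y.
def fetch_example(subset='train', return_X_y=False, normalize=False):
    data, target = [[0.0, 1.0]], [1]
    if return_X_y:
        return data, target
    return {'data': data, 'target': target}


fetch_func = partial(fetch_example, subset='test')

# Passing return_X_y by name binds to the right parameter regardless of
# any parameters added elsewhere in the signature.
X, y = fetch_func(return_X_y=True)
```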
OK, is that what you're going for as far as the usage of the partial, then? If so, I'll do that tonight.

I think so, but only if it seems to make things neater.
…e partial fetch function (scikit-learn#10734).
@jnothman OK, I've updated each to use the partial function version. How does this look now?
    assert_true(isinstance(X_y_tuple, tuple))
    assert_array_equal(X_y_tuple[0], bunch.data)
    assert_array_equal(X_y_tuple[1], bunch.target)

    check_return_X_y(res, partial(load_diabetes))

Why do we need partial here? (And in other places in this file?)
These test_base.py test functions pass in the partial for the same reason the partial is passed in from the other test_* dataset files like test_20news.py - so that check_return_X_y can call that same dataset fetch function with the additional return_X_y=True parameter and perform the same standard X_y_tuple checks. Doing it this way keeps the interface to check_return_X_y uniform among the dataset test_* files, even if these particular fetch functions are relatively simple compared to those in files like test_20news.py.

I don't want to argue over such a minor question (though I still prefer to remove partial if we don't need it), but I think you should try to make the whole file consistent (e.g., you're not using partial in test_load_digits).

@qinhanmin2014 Ahh, good catch - I had missed that one. Thanks.

I've pushed a fix for that.
    assert_equal(bunch.target.shape[0], 7532)
    assert_equal(bunch.data.dtype, np.float64)

    # test return_X_y option

How about the training set?

I thought only one subset's return_X_y check was needed, since fetch_20newsgroups_vectorized subsets the fetched dataset before return_X_y is checked.
@ccatalfo Seems that some functions are not included here (e.g., …)
@qinhanmin2014 As far as adding return_X_y to the last few datasets like california_housing it looked to me as if there were more than just the X and y to return. For example
so it didn't seem to make sense to return just the I'm not sure which other datasets, beyond that one, might benefit from return_X_y? Maybe covtype.py? So shall I add return_X_y tests to these two?
|
|
yes, it's useful even without feature names.
|
…igits test to check_return_X_y
I've added return_X_y to covtype.
…ng datasets test file (scikit-learn#10374)
…alifornia_housing dataset (scikit-learn#10734).
Also added return_X_y to california_housing. In doing that, I added a new test_california_housing.py test file (I didn't see an existing test file for california_housing.py) with a return_X_y check.
qinhanmin2014 left a comment:
LGTM. I can't fully understand the codecov failure, but I think it's irrelevant and I'll trust jnothman's comment. Thanks @ccatalfo :)
Thanks for all the suggestions and comments! |
|
thank you for your work and clear communication!
|
Reference Issues/PRs
Fixes #10734 by implementing return_X_y for kddcup99, twenty_newsgroups, rcv1, and lfw datasets.
What does this implement/fix? Explain your changes.
This replicates the return_X_y parameter that was added to datasets/base.py.
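For reference, the pattern this PR extends, shown here with load_iris (which already supported return_X_y via datasets/base.py); the fetchers in this PR gain the same keyword:

```python
from sklearn.datasets import load_iris

# Default: a Bunch object carrying data, target, and metadata.
bunch = load_iris()

# With return_X_y=True the loader returns just the (data, target) tuple.
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)  # (150, 4) (150,)
```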
Any other comments?
I did not add return_X_y to some of the other datasets, as it seemed to make less sense for those.