[MRG+1] Fixes return_X_y should be available on more dataset loaders/fetchers (#10734) #10774
qinhanmin2014 merged 21 commits into scikit-learn:master
Conversation
I think it is applicable to lfw. We are not stopping the user from getting the bunch, just making it easier to get the primary tuple of data. And versionadded should be 0.20.

@jnothman thanks - will add to lfw and change versionadded to 0.20.

@jnothman I added return_X_y to lfw and fixed a couple of PEP 8 issues I had missed.
jnothman left a comment:
This is still marked WIP. Is there something else you intend to do before considering this sufficient for review and merge?
    assert_equal(bunch.data.dtype, np.float64)

    # test return_X_y option
    X_y_tuple = datasets.fetch_20newsgroups_vectorized(subset='test',return_X_y=True)
sklearn/datasets/tests/test_lfw.py (outdated)

    # test return_X_y option
    X_y_tuple = fetch_lfw_people(data_home=SCIKIT_LEARN_DATA, resize=None,
        slice_=None, color=True,

This indent should be consistent with the argument on the previous line.
@jnothman thanks for the heads-up on those two issues and on marking as MRG. I believe this is now ready for review and merge.

flake8 says you have left some lines longer than 79 characters.
jnothman left a comment:
The tests are not run by our CI, I think. Have you run them?
It might be nice to have this as a common test for datasets. Otherwise LGTM.
@jnothman I have run the tests and they are passing. I just committed a fix for those too-long lines flagged by flake8. A common test for datasets - meaning a CI test that runs sklearn/datasets/tests?

No, by a common test I mean avoiding the repetition of code in the current tests, and instead looping over fetchers to confirm that they return the right format when return_X_y is used.
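One way to read this suggestion is a single parametrized loop over the dataset functions. The sketch below uses the lightweight load_* functions so it runs without downloading anything; the fetcher list and test name are illustrative, not the PR's final code, but the PR applies the same check to the fetchers (fetch_kddcup99, fetch_lfw_people, etc.):

```python
from sklearn import datasets

# Lightweight loaders stand in for the fetchers here so the sketch runs
# offline; each already supports the return_X_y keyword.
LOADERS = [
    datasets.load_iris,
    datasets.load_digits,
    datasets.load_diabetes,
]


def test_return_X_y_common():
    for load in LOADERS:
        bunch = load()
        X, y = load(return_X_y=True)
        assert isinstance((X, y), tuple)
        assert X.shape == bunch.data.shape
        assert y.shape == bunch.target.shape
```

The dataset-specific setup stays in the list; the shared assertions live in one place.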
…ay to get around memory issue. (scikit-learn#10734)
@jnothman "avoiding the repetition of code in the current tests" - Ah, I see what you're saying. Certainly could do that. The question, I guess, is whether it makes more sense to keep every dataset's tests in its own test_* file or to test cross-dataset functionality in one place. I'm happy to do that if you think it's a good idea. I made a small change to test_rcv1 where I think the test was running out of memory attempting to test the entire array returned. The codecov/patch marked as failing at https://codecov.io/gh/scikit-learn/scikit-learn/compare/ccbf9975fcf1676f6ac4f311e388529d3a3c4d3f...7dcadcb12a74b4b871c1f4d976564992c25ce30a - is that indicating the previous diff does not hit a large enough test coverage percentage?
The codecov failure simply says that the new test code is not being run, which is to be expected.
A common test would live in datasets/tests/test_common.py.
@jnothman looking at the datasets/tests/test_common.py idea for return_X_y. Perhaps I'm misinterpreting your idea, but I think there'd still wind up being duplicated code from moving the relevant pieces of the test_*.py files into tests/test_common.py while looping over the limited set of datasets that accept the return_X_y parameter (rcv1, lfw, 20_newsgroups, kddcup99, and various fetches from base.py).
In sum - while the actual tests of the X_y_tuples are the same, the fetching involved differs by dataset. Moving that to test_common.py would lead to code duplication of that part of the test logic. While I agree that it would be nice to capture the repetitive parts of this return_X_y test logic, it feels to me like it would just move the duplication rather than remove it.
If I've misunderstood or mischaracterized your proposal, please let me know. If you feel test_common.py is the best way to go, I can certainly implement it that way. Thanks for your thoughts, and thanks for the hand-holding as I get acclimated to the codebase and the contributing flow.
Okay. Could also just use a helper called check_return_X_y, called by each dataset's test.
@jnothman great, yes, I will do that - refactor the return_X_y test part, called by each test, into tests/test_common.py.

thanks!
…parameter into common test function.
@jnothman I refactored as suggested into test_common.py - both the files I had updated (kddcup99, lfw, etc.) and test_base.py now use the new test function.
jnothman left a comment:
Thanks! You have a flake8 failure.
    def check_return_X_y(bunch, X_y_tuple):
        assert_true(isinstance(X_y_tuple, tuple))

We are trying to phase out these assertion functions. Please use a bare assert.
    def check_return_X_y(bunch, X_y_tuple):
        assert_true(isinstance(X_y_tuple, tuple))
        assert_array_equal(X_y_tuple[0].shape, bunch.data.shape)

Shapes should not be arrays. Just use assert ==.
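The point, as I read it: ndarray.shape is a plain tuple, so a direct equality assert is clearer than assert_array_equal, which is meant for array contents. An illustrative snippet:

```python
import numpy as np

a = np.zeros((3, 2))
b = np.ones((3, 2))

# Shapes are tuples, not arrays; compare them directly with ==.
assert a.shape == b.shape
assert a.shape == (3, 2)
```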
    assert_equal(data.target.shape, (9571,))

    X_y_tuple = fetch_kddcup99('smtp', return_X_y=True)
    bunch = fetch_kddcup99('smtp')

Don't we already have a bunch above? I think it's unclear to the reader why we get this again.
    from sklearn.utils.testing import assert_true

    def check_return_X_y(bunch, X_y_tuple):

I had imagined this would take a partial function and pass in return_X_y=True. You have still left a lot of duplicated idiom in the test functions.
By "a lot of duplicated idiom" do you mean the setting up of the bunches and fetching of the X_y_tuples to compare? Were you thinking of something like this?

Each of the dataset test files (test_20news.py, test_lfw.py, etc.) sets up the appropriate fetch partial:

    from functools import partial

    # test return_X_y option
    fetch_func = partial(datasets.fetch_20newsgroups_vectorized,
                         subset='test')
    check_return_X_y(bunch, fetch_func)

During the check_return_X_y call, the partial is called, passing return_X_y=True (test_common.py):

    def check_return_X_y(bunch, fetch_func_partial):
        X_y_tuple = fetch_func_partial(return_X_y=True)
        assert isinstance(X_y_tuple, tuple)
        assert X_y_tuple[0].shape == bunch.data.shape
        assert X_y_tuple[1].shape == bunch.target.shape

Wouldn't this run into problems if additional arguments are ever added to those fetch functions after what is currently the last parameter, return_X_y?
No, I don't see how it would run into problems if args are passed by name.
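A quick sketch of why keyword passing is robust. Here fetch_example and its signature are hypothetical stand-ins, not sklearn code: even if a new parameter is later added near return_X_y, the keyword call through the partial still binds correctly.

```python
from functools import partial


# Hypothetical stand-in for a dataset fetcher, not sklearn code.
# Imagine `normalize` was added after the signature originally ended
# with return_X_y.
def fetch_example(subset='train', return_X_y=False, normalize=False):
    data, target = [[0.0, 1.0]], [1]
    if return_X_y:
        return data, target
    return {'data': data, 'target': target}


fetch_func = partial(fetch_example, subset='test')

# Passing return_X_y by name binds to the right parameter regardless of
# any parameters added elsewhere in the signature.
X, y = fetch_func(return_X_y=True)
```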
OK, is that what you're going for as far as the usage of the partial, then? If so, I'll do that tonight.

I think so, but only if it seems to make things neater.
…e partial fetch function (scikit-learn#10734).
@jnothman OK, I've updated each to use the partial function version. How does this look now?
    assert_true(isinstance(X_y_tuple, tuple))
    assert_array_equal(X_y_tuple[0], bunch.data)
    assert_array_equal(X_y_tuple[1], bunch.target)

    check_return_X_y(res, partial(load_diabetes))

Why do we need partial here? (And in other places in this file?)
These test_base.py test functions pass in the partial for the same reason the partial is passed in from the other test_* dataset files like test_20news.py - so that check_return_X_y can call that same dataset fetch function with the additional return_X_y=True parameter and perform the same standard X_y_tuple checks. Doing it this way keeps the interface to check_return_X_y uniform among the dataset test_* files, even if these particular fetch functions are relatively simple compared to those in files like test_20news.py.

I don't want to argue over such a minor question (though I still prefer to remove partial if we don't need it), but I think you should try to make the whole file consistent (e.g., you're not using partial in test_load_digits).

@qinhanmin2014 Ahh, good catch - I had missed that one. Thanks.

I've pushed a fix for that.
    assert_equal(bunch.target.shape[0], 7532)
    assert_equal(bunch.data.dtype, np.float64)

    # test return_X_y option

How about the training set?

I thought only one subset's return_X_y check was needed, since fetch_20newsgroups_vectorized subsets the fetched dataset before return_X_y is checked.
@ccatalfo Seems that some functions are not included here (e.g., …)
@qinhanmin2014 As far as adding return_X_y to the last few datasets like california_housing it looked to me as if there were more than just the X and y to return. For example
so it didn't seem to make sense to return just the I'm not sure which other datasets, beyond that one, might benefit from return_X_y? Maybe covtype.py? So shall I add return_X_y tests to these two?
|
|
yes, it's useful even without feature names.
|
…igits test to check_return_X_y
I've added return_X_y to covtype.
…ng datasets test file (scikit-learn#10374)
…alifornia_housing dataset (scikit-learn#10734).
Also added return_X_y to california_housing. In doing that, I added a new test_california_housing.py test file (I didn't see an existing test file for california_housing.py) with a return_X_y check.
qinhanmin2014 left a comment:
LGTM. I can't fully understand the codecov failure, but I think it's irrelevant and I'll trust jnothman's comment. Thanks @ccatalfo :)
Thanks for all the suggestions and comments! |
|
thank you for your work and clear communication!
|
Reference Issues/PRs
Fixes #10734 by implementing return_X_y for kddcup99, twenty_newsgroups, rcv1, and lfw datasets.
What does this implement/fix? Explain your changes.
This replicates the return_X_y parameter that was added to datasets/base.py.
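For reference, the pattern this PR extends, shown here with load_iris (which already supported return_X_y via datasets/base.py); the fetchers in this PR gain the same keyword:

```python
from sklearn.datasets import load_iris

# Default: a Bunch object carrying data, target, and metadata.
bunch = load_iris()

# With return_X_y=True the loader returns just the (data, target) tuple.
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)  # (150, 4) (150,)
```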
Any other comments?
I did not add return_X_y to some of the other datasets, as it seemed to make less sense for those.