ENH add as_frame functionality for toy datasets#15980
ENH add as_frame functionality for toy datasets#15980thomasjpfan merged 35 commits intoscikit-learn:masterfrom
Conversation
Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
…es, load_linnerud, load_boston Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
|
Hello @wconnell: good to see this progressing. This is the linting error I see here: Running flake8 on the diff in the range 2287d2d22..4a4915566 (12 commit(s)):
--------------------------------------------------------------------------------
sklearn/datasets/tests/test_common.py:13:80: E501 line too long (100 > 79 characters)
expected_msg = ('{} with as_frame=True requires pandas'.format(fetch_func_partial.__name__))
|
Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
Co-authored-by: @reganconnell <reganconnell@berkeley.edu>
glemaitre
left a comment
There was a problem hiding this comment.
I made a first pass. A couple of comments regarding the documentation.
I am a bit surprised that we always return DataFrame for the target while the original API in fetch_openml was to return a Series if we have a single column.
This change was introduced in fetch_california_housing. I will open an issue to discuss what is the desired output. We can wait for a consensus before to change anything on this side.
|
@wconnel do you plan to continue this PR? |
|
@cmarmo Yes, still planning to continue. Sounds good, thanks for the update! |
ogrisel
left a comment
There was a problem hiding this comment.
This is looking good but here are some (very redundant :) comments:
ogrisel
left a comment
There was a problem hiding this comment.
Couple more nitpicks but otherwise LGTM. Please feel free to merge the current master into this branch to get the coverage report to be up to date.
thomasjpfan
left a comment
There was a problem hiding this comment.
Thank you for the PR @wconnell !
|
I do not understand why the codecov reports states that the check_as_frame utility function is never executed. Furthermore the diff does not show the full changes of this PR: Anyway I am still +1 for merge. |
ogrisel
left a comment
There was a problem hiding this comment.
Actually codecov is right: those tests are always skipped because check_pandas_dependency_message skips the test if pandas is installed and check_as_frame skips the test if pandas is not installed.
The tests for the toy datasets that now support as_frame need to be refactored into 2 sets of tests, those who run when pandas is installed and the the others.
|
You can run the test with Details |
ogrisel
left a comment
There was a problem hiding this comment.
I splitted the as_frame tests into to groups (when pandas is installed vs when pandas is missing). Now all the lines are covered and codecov is happy.
I also added an entry to document the new feature in what's new. LGTM on my end.
@glemaitre any more review / comment?
|
Thanks for taking care of that, learned quite a bit from your help. Hope to contribute more in the future! |
|
@ogrisel Are you able to remove the "Waiting for reviewer" label? Thank you. |
Reference Issues/PRs
Handles part of #10733
Related to #13902, #15486 and #15950
Work from WiMLDS SF sprint that was waiting for #15486 to close, thanks for the helpful work @gitsteph and @reshamas. Co-authored with @reganconnell.
What does this implement/fix? Explain your changes.
Adds
as_framefunctionality to all toy datasets includingload_wine(), load_iris(), load_breast_cancer(), load_diabetes(), load_linnerud(), load_boston(), load_digits().Any other comments?
Documentation conventions for
fetch_california_housing()from #15950 were followed, although the doc structure describing attributes of the returned toy dataBunchand fetched dataBunchare different so needs review to enhance consistency. In relation,Bunch.targetdataframe column name defaults to "target" when unspecified.Previously there was inconsistency in what
Bunch.target_namesrepresents, sometimes returning the classes of a target variable, sometimes the target variable name(s). I specified this attribute only for classification datasets, so it returns the classes of a variable.Some tests are generalized from those in
sklearn/datasets/tests/test_california_housing.pyand may be able to supersede those for future use.