Skip to content

[MRG+1] Added estimator checks for pandas object#12218

Merged
rth merged 22 commits intoscikit-learn:masterfrom
mc4229:sklearn/add_dataframe_estimator_test
Feb 1, 2020
Merged

[MRG+1] Added estimator checks for pandas object#12218
rth merged 22 commits intoscikit-learn:masterfrom
mc4229:sklearn/add_dataframe_estimator_test

Conversation

@mc4229
Copy link
Copy Markdown
Contributor

@mc4229 mc4229 commented Sep 29, 2018

Reference Issues/PRs

Fixes #7528

What does this implement/fix? Explain your changes.

Added estimator checks for pandas object.

Any other comments?

@amueller
Copy link
Copy Markdown
Member

Looks good. Though actually now I'm thinking maybe it might be even possible to extend the NotAnArray test and just parametrize that so we don't have to copy that much code?

@mc4229
Copy link
Copy Markdown
Contributor Author

mc4229 commented Sep 29, 2018

@amueller That makes sense. I will take a look!

@sergulaydore
Copy link
Copy Markdown
Contributor

Hello @mc4229 ,

Thank you for participating in the WiMLDS/scikit sprint. We would love to merge all the PRs that were submitted. It would be great if you could follow up on the work that you started! For the PR you submitted, would you please update and re-submit? Please include #wimlds in your PR conversation.

Any questions:

  • see workflow for reference
  • ask on this PR conversation or the issue tracker
  • ask on wimlds gitter with a reference to this PR

cc: @reshamas

@reshamas
Copy link
Copy Markdown
Member

@mc4229
Will you be completing this PR?

@reshamas
Copy link
Copy Markdown
Member

Looks good. Though actually now I'm thinking maybe it might be even possible to extend the NotAnArray test and just parametrize that so we don't have to copy that much code?

@amueller can you provide more context for this? I am not sure how to proceed. Can you point me to a reference?

@mc4229
Copy link
Copy Markdown
Contributor Author

mc4229 commented Dec 19, 2018

@reshamas I plan to continue working on the PR. Actually I have already made the changes to parametrize the test and avoid copying code in my latest commit. However, currently I am facing two issues with the PR:

  1. Somehow after the sprint event I could not run pytest locally (pytest give me some error messages), so I has not been able to test my latest changes;
  2. The code are causing failures in continuous-integration builds and I have not been able to find the root causes yet. One issue is that, even with my previous changes that did pass my local pytest, the continuous-integration builds also failed.

@reshamas
Copy link
Copy Markdown
Member

@mc4229
Can you share the error from pytest?
Go ahead and push your updates. We can take a look and see what is going on.

@reshamas
Copy link
Copy Markdown
Member

@mc4229 Our goal is to have all outstanding PRs from sprint merged by December 29, which is 3 months post-sprint.

@mc4229
Copy link
Copy Markdown
Contributor Author

mc4229 commented Dec 19, 2018

@reshamas I have already pushed my changes (the latest commit "Extend original tests to avoid copying code").
For pytest, I got the error "ImportError: No module named 'sklearn.__check_build._check_build'". Previously I could run pytest with no problem and I don't know why this issue occured after a while. I searched for several solutions online and none of those worked for me. I will probably just rebuild the entire environment to see whether this error can go away.

@mc4229
Copy link
Copy Markdown
Contributor Author

mc4229 commented Dec 19, 2018

The Traceback info of the pytest error is as the following:

Traceback (most recent call last):
  File "C:\Users\Menghan\Anaconda3\envs\sklearndev\lib\site-packages\_pytest\con
fig\__init__.py", line 381, in _getconftestmodules
    return self._path2confmods[path]
KeyError: local('D:\\GitDir\\scikit-learn\\sklearn')

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "C:\Users\Menghan\Anaconda3\envs\sklearndev\lib\site-packages\_pytest\con
fig\__init__.py", line 412, in _importconftest
    return self._conftestpath2mod[conftestpath]
KeyError: local('D:\\GitDir\\scikit-learn\\conftest.py')

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "D:\GitDir\scikit-learn\sklearn\__check_build\__init__.py", line 44, in <
module>
    from ._check_build import check_build  # noqa
ModuleNotFoundError: No module named 'sklearn.__check_build._check_build'

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "C:\Users\Menghan\Anaconda3\envs\sklearndev\lib\site-packages\_pytest\con
fig\__init__.py", line 418, in _importconftest
    mod = conftestpath.pyimport()
  File "C:\Users\Menghan\Anaconda3\envs\sklearndev\lib\site-packages\py\_path\lo
cal.py", line 668, in pyimport
    __import__(modname)
  File "C:\Users\Menghan\Anaconda3\envs\sklearndev\lib\site-packages\_pytest\ass
ertion\rewrite.py", line 290, in load_module
    six.exec_(co, mod.__dict__)
  File "D:\GitDir\scikit-learn\conftest.py", line 14, in <module>
    from sklearn.utils.fixes import PY3_OR_LATER
  File "D:\GitDir\scikit-learn\sklearn\__init__.py", line 63, in <module>
    from . import __check_build
  File "D:\GitDir\scikit-learn\sklearn\__check_build\__init__.py", line 46, in <
module>
    raise_build_error(e)
  File "D:\GitDir\scikit-learn\sklearn\__check_build\__init__.py", line 41, in r
aise_build_error
    %s""" % (e, local_dir, ''.join(dir_content).strip(), msg))
ImportError: No module named 'sklearn.__check_build._check_build'
___________________________________________________________________________
Contents of D:\GitDir\scikit-learn\sklearn\__check_build:
setup.py                  _check_build.c            _check_build.cp36-win_amd64.
pyd
_check_build.pyx          __init__.py               __pycache__
___________________________________________________________________________
It seems that scikit-learn has not been built correctly.

If you have installed scikit-learn from source, please do not forget
to build the package before using it: run `python setup.py install` or
`make` in the source directory.

If you have used an installer, please check that it is suited for your
Python version, your operating system and your platform.
ERROR: could not load D:\GitDir\scikit-learn\conftest.py

@mc4229
Copy link
Copy Markdown
Contributor Author

mc4229 commented Dec 22, 2018

I rebuilt my environment and was able to run pytest. My code changes should affect sklearn/tests/test_common.py. I ran pytest on that file and the test succeeded. However, I have not figured out how to deal with the errors in the continuous-integration tests. For the AppVeyor build, I got the error "This problem is unconstrained". For Travis CI Builds, I got the error "No module named pandas". Could anyone suggest how I can fix those errors? @amueller @reshamas

@reshamas
Copy link
Copy Markdown
Member

@mc4229 to confirm, you are running in your virtual environment?

@mc4229
Copy link
Copy Markdown
Contributor Author

mc4229 commented Dec 23, 2018

@reshamas I ran pytest in my virtual environment. I got the continuous integration errors in the online checks.

@reshamas
Copy link
Copy Markdown
Member

reshamas commented Jan 5, 2019

@NicolasHug @adrinjalali
Are you able to assist with this? The pytest sklearn are failing with the error: ModuleNotFoundError: No module named 'pandas'

y_ = NotAnArray(np.asarray(y))
X_ = NotAnArray(np.asarray(X))
else:
import pandas as pd
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't always have pandas in our test environments. Pandas is not officially a dependency and therefore we make sure that the code base works without it. As a result all tests which use pandas should use it with pd = pytest.importorskip('pandas') instead. pytest will simply skip the test if pandas is not installed.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A search of the sklearn folder showed this syntax for importing pandas, but all the files were "test_*.py" files. Is it correct to add this syntax to the file in this PR, or should it be going somewhere else?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a really good point @reshamas. It may be a good idea to catch the ImportError here and if obj_type is PandasDataframe and import fails, to give a ValueError or something, since that basically means the input is not compatible with the environment. Then when the tests want to call this function, they should skip the tests that would give this error using pytest.importorskip, I guess.

I'm really not sure which exception is more apt to raise though!

Copy link
Copy Markdown
Member

@reshamas reshamas Jan 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be something like this:

import pytest
from pytest import importorskip
    if obj_type not in ["NotAnArray", 'PandasDataframe']:
        raise ValueError("Data type {0} not supported".format(obj_type))

    if obj_type == "NotAnArray":
        y_ = NotAnArray(np.asarray(y))
        X_ = NotAnArray(np.asarray(X))
    elif obj_type == "PandasDataframe":
        try:  
            import pandas as pd
            pd = pytest.importorskip("pandas")
            y_ = np.asarray(y)
            if y_.ndim == 1:
                y_ = pd.Series(y_)
            else:
                y_ = pd.DataFrame(y_)
            X_ = pd.DataFrame(np.asarray(X))
        except ImportError:
            raise SkipTest("Skipping test, pandas not installed")

OR

except ImportError:
    assert_raise_message(ValueError, "Skipping test, pandas not installed ")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More like the latter option, but maybe say "Cannot run a pandas related test when pandas is not installed." or something. Cause we're not really skipping here, were' raising an error.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@reshamas @adrinjalali Thanks for your suggestions. I will make the changes accordingly.

Copy link
Copy Markdown
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, in fact, the NotAnArray check is sufficient... Although there are cases where pandas may be treated specially (e.g. for iloc).

We should also generally have a test that the check is working in test_estimator_checks. There we could test that NotAnArray support and Pandas support was identical??

@mc4229
Copy link
Copy Markdown
Contributor Author

mc4229 commented Feb 7, 2019

We should also generally have a test that the check is working in test_estimator_checks. There we could test that NotAnArray support and Pandas support was identical??

My understanding is that, all estimator checks are picked up by check_estimator() in estimator_checks.py, and then test_estimator_checks.py import check_estimator() to further test all estimators. @jnothman if you think it would be better to add the pandas related checks in test_estimator_checks.py, could you point me to a location where I can add those tests?

@psorianom
Copy link
Copy Markdown
Contributor

Hey @mc4229, currently being at the sprint, I would like to take over this issue but if you are still working on this please tell me so I don't step on your work. Thanks!

@reshamas
Copy link
Copy Markdown
Member

Hello @psorianom
Please take over this PR. This is from a WiMLDS I organized back on Sept 29, 2018. It's been nearly 5 months since the event. Since there has been no significant activity on this PR after multiple follow-up attempts on my part, according to the updated definition in the contributing guidelines, it is open. It has been delayed for too long. Please do take over and complete. Thanks!

@mc4229 mc4229 changed the title [WIP] Added estimator checks for pandas object [MRG] Added estimator checks for pandas object Feb 26, 2019
@mc4229
Copy link
Copy Markdown
Contributor Author

mc4229 commented Feb 26, 2019

Hi @jnothman @GaelVaroquaux could you help review this PR? Thanks!

# test classifiers can handle non-array data
yield check_classifier_data_not_an_array
# test classifiers can handle non-array data and pandas objects
yield check_classifier_two_data_types
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two_data_types is not clear (especially as dtype = data type is something else entirely). not_an_array is still acceptable.

Copy link
Copy Markdown
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me!

@jnothman mentioned above the need for a test:

We should also generally have a test that the check is working in test_estimator_checks. There we could test that NotAnArray support and Pandas support was identical??

But I am not very familiar with those test? What would exactly need to be tested there? The goal is to make a small dummy esimator that would fail the check (by not properly supporting pandas) and then ensure the check indeed fails?

@jnothman
Copy link
Copy Markdown
Member

jnothman commented Feb 27, 2019 via email

@mc4229
Copy link
Copy Markdown
Contributor Author

mc4229 commented Feb 27, 2019

I will try to create a dummy estimator for this case.

@mc4229
Copy link
Copy Markdown
Contributor Author

mc4229 commented Apr 7, 2019

@jnothman @jorisvandenbossche Sorry it took me a while to follow up on this PR. Could you review my changes and let me know whether they make sense?

@cmarmo
Copy link
Copy Markdown
Contributor

cmarmo commented Jan 14, 2020

Hi @mc4229 , are you still interested in finishing this? If you could find some time to resolve conflicts, I think you deserve we push a bit more for a review. If not, I totally understand, I will put your PR in the Sprint pool again so that your work is not lost. Thanks for your patience!

@mc4229
Copy link
Copy Markdown
Contributor Author

mc4229 commented Jan 29, 2020

@cmarmo Yes I am still interested in finishing this! I will find some time to resolve the conflicts.

@cmarmo
Copy link
Copy Markdown
Contributor

cmarmo commented Jan 31, 2020

@adrinjalali , @rth , @glemaitre, @jnothman this PR probably deserves some attention for the 2-years perseverance of @mc4229 . Thanks!

@GaelVaroquaux GaelVaroquaux changed the title [MRG] Added estimator checks for pandas object [MRG+1] Added estimator checks for pandas object Jan 31, 2020
Copy link
Copy Markdown
Member

@GaelVaroquaux GaelVaroquaux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to do the most boring review ever: this looks good to me, +1 for merge.

This is a very useful test. Thank you @mc4229 for implementing it, and sticking around.

Thank you @cmarmo for finding it, and reminding us about it.

Copy link
Copy Markdown
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mc4229! This is indeed very useful.

@rth rth merged commit 5234efa into scikit-learn:master Feb 1, 2020
@mc4229
Copy link
Copy Markdown
Contributor Author

mc4229 commented Feb 1, 2020

Thanks everyone!

@mc4229 mc4229 deleted the sklearn/add_dataframe_estimator_test branch February 1, 2020 16:26
thomasjpfan pushed a commit to thomasjpfan/scikit-learn that referenced this pull request Feb 22, 2020
panpiort8 pushed a commit to panpiort8/scikit-learn that referenced this pull request Mar 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add dataframe test to common tests.