[MRG+1] Added estimator checks for pandas object by mc4229 · Pull Request #12218 · scikit-learn/scikit-learn

mc4229 · 2018-09-29T19:46:44Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Added estimator checks for pandas object.

Any other comments?

amueller · 2018-09-29T21:36:32Z

Looks good. Though actually now I'm thinking maybe it might be even possible to extend the NotAnArray test and just parametrize that so we don't have to copy that much code?
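
A minimal sketch of what "parametrizing" the existing check might look like, assuming a single check body that takes an `obj_type` argument. The names `check_data_container`, `yield_container_checks`, `ListWrapper` and the `CONVERTERS` registry are hypothetical stand-ins for illustration, not scikit-learn's actual helpers:

```python
from functools import partial


class ListWrapper:
    """Hypothetical stand-in for a non-array container such as NotAnArray."""

    def __init__(self, data):
        self.data = list(data)


# Hypothetical registry: one converter per container type under test.
CONVERTERS = {
    "NotAnArray": ListWrapper,
    "Tuple": tuple,
}


def check_data_container(fit, X, y, obj_type):
    """Run the same check body regardless of which container wraps the data."""
    if obj_type not in CONVERTERS:
        raise ValueError("Data type {0} not supported".format(obj_type))
    convert = CONVERTERS[obj_type]
    return fit(convert(X), convert(y))


def yield_container_checks():
    """Yield one check per container type instead of copying the check body."""
    for obj_type in CONVERTERS:
        yield partial(check_data_container, obj_type=obj_type)
```

Adding another container then only means adding one entry to the registry; the check body itself is not duplicated.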

mc4229 · 2018-09-29T22:55:24Z

@amueller That makes sense. I will take a look!

sergulaydore · 2018-11-11T16:39:50Z

Hello @mc4229 ,

Thank you for participating in the WiMLDS/scikit sprint. We would love to merge all the PRs that were submitted. It would be great if you could follow up on the work that you started! For the PR you submitted, would you please update and re-submit? Please include #wimlds in your PR conversation.

Any questions:

see workflow for reference
ask on this PR conversation or the issue tracker
ask on wimlds gitter with a reference to this PR

cc: @reshamas

reshamas · 2018-12-16T15:52:37Z

@mc4229
Will you be completing this PR?

reshamas · 2018-12-18T23:10:12Z

> Looks good. Though actually now I'm thinking maybe it might be even possible to extend the NotAnArray test and just parametrize that so we don't have to copy that much code?

@amueller can you provide more context for this? I am not sure how to proceed. Can you point me to a reference?

mc4229 · 2018-12-19T00:30:42Z

@reshamas I plan to continue working on the PR. Actually I have already made the changes to parametrize the test and avoid copying code in my latest commit. However, currently I am facing two issues with the PR:

After the sprint event I could no longer run pytest locally (pytest gives me error messages), so I have not been able to test my latest changes;
The code is causing failures in the continuous-integration builds and I have not been able to find the root cause yet. Even my previous changes, which did pass pytest locally, failed in the continuous-integration builds.

reshamas · 2018-12-19T00:34:26Z

@mc4229
Can you share the error from pytest?
Go ahead and push your updates. We can take a look and see what is going on.

reshamas · 2018-12-19T00:37:04Z

@mc4229 Our goal is to have all outstanding PRs from sprint merged by December 29, which is 3 months post-sprint.

mc4229 · 2018-12-19T02:18:18Z

@reshamas I have already pushed my changes (the latest commit "Extend original tests to avoid copying code").
For pytest, I got the error "ImportError: No module named 'sklearn.__check_build._check_build'". Previously I could run pytest with no problem, and I don't know why this issue started occurring. I searched for several solutions online and none of them worked for me. I will probably just rebuild the entire environment to see whether the error goes away.

mc4229 · 2018-12-19T02:22:38Z

The Traceback info of the pytest error is as the following:

Traceback (most recent call last):
  File "C:\Users\Menghan\Anaconda3\envs\sklearndev\lib\site-packages\_pytest\config\__init__.py", line 381, in _getconftestmodules
    return self._path2confmods[path]
KeyError: local('D:\\GitDir\\scikit-learn\\sklearn')

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "C:\Users\Menghan\Anaconda3\envs\sklearndev\lib\site-packages\_pytest\config\__init__.py", line 412, in _importconftest
    return self._conftestpath2mod[conftestpath]
KeyError: local('D:\\GitDir\\scikit-learn\\conftest.py')

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "D:\GitDir\scikit-learn\sklearn\__check_build\__init__.py", line 44, in <module>
    from ._check_build import check_build  # noqa
ModuleNotFoundError: No module named 'sklearn.__check_build._check_build'

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "C:\Users\Menghan\Anaconda3\envs\sklearndev\lib\site-packages\_pytest\config\__init__.py", line 418, in _importconftest
    mod = conftestpath.pyimport()
  File "C:\Users\Menghan\Anaconda3\envs\sklearndev\lib\site-packages\py\_path\local.py", line 668, in pyimport
    __import__(modname)
  File "C:\Users\Menghan\Anaconda3\envs\sklearndev\lib\site-packages\_pytest\assertion\rewrite.py", line 290, in load_module
    six.exec_(co, mod.__dict__)
  File "D:\GitDir\scikit-learn\conftest.py", line 14, in <module>
    from sklearn.utils.fixes import PY3_OR_LATER
  File "D:\GitDir\scikit-learn\sklearn\__init__.py", line 63, in <module>
    from . import __check_build
  File "D:\GitDir\scikit-learn\sklearn\__check_build\__init__.py", line 46, in <module>
    raise_build_error(e)
  File "D:\GitDir\scikit-learn\sklearn\__check_build\__init__.py", line 41, in raise_build_error
    %s""" % (e, local_dir, ''.join(dir_content).strip(), msg))
ImportError: No module named 'sklearn.__check_build._check_build'
___________________________________________________________________________
Contents of D:\GitDir\scikit-learn\sklearn\__check_build:
setup.py                  _check_build.c            _check_build.cp36-win_amd64.pyd
_check_build.pyx          __init__.py               __pycache__
___________________________________________________________________________
It seems that scikit-learn has not been built correctly.

If you have installed scikit-learn from source, please do not forget
to build the package before using it: run `python setup.py install` or
`make` in the source directory.

If you have used an installer, please check that it is suited for your
Python version, your operating system and your platform.
ERROR: could not load D:\GitDir\scikit-learn\conftest.py
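
The directory listing shows a compiled `_check_build.cp36-win_amd64.pyd`, yet the import still fails. One possible cause (an assumption, not confirmed anywhere in this thread) is that the extension was built for a different Python version or platform than the interpreter now running pytest. A small standard-library sketch to check that; `is_importable_extension` is a hypothetical helper name:

```python
from importlib.machinery import EXTENSION_SUFFIXES


def is_importable_extension(filename, module_name):
    """Return True if `filename` is a file name this interpreter would look
    for when importing the compiled extension module `module_name`."""
    return any(filename == module_name + suffix for suffix in EXTENSION_SUFFIXES)


# File name taken from the directory listing in the traceback.
built_file = "_check_build.cp36-win_amd64.pyd"

# Suffixes the running interpreter accepts for compiled extensions.
print(EXTENSION_SUFFIXES)
print(is_importable_extension(built_file, "_check_build"))
```

If the second line printed is `False` in the environment that runs pytest, rebuilding the package with that interpreter would be the thing to try.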

mc4229 · 2018-12-22T21:08:22Z

I rebuilt my environment and was able to run pytest. My code changes should affect sklearn/tests/test_common.py. I ran pytest on that file and the test succeeded. However, I have not figured out how to deal with the errors in the continuous-integration tests. For the AppVeyor build, I got the error "This problem is unconstrained". For Travis CI Builds, I got the error "No module named pandas". Could anyone suggest how I can fix those errors? @amueller @reshamas

reshamas · 2018-12-23T01:51:30Z

@mc4229 to confirm, you are running in your virtual environment?

mc4229 · 2018-12-23T19:20:03Z

@reshamas I ran pytest in my virtual environment. I got the continuous integration errors in the online checks.

reshamas · 2019-01-05T02:20:48Z

@NicolasHug @adrinjalali
Are you able to assist with this? The sklearn pytest runs are failing with the error: ModuleNotFoundError: No module named 'pandas'

adrinjalali · 2019-01-05T10:16:10Z

sklearn/utils/estimator_checks.py

+        y_ = NotAnArray(np.asarray(y))
+        X_ = NotAnArray(np.asarray(X))
+    else:
+        import pandas as pd


We don't always have pandas in our test environments. Pandas is not officially a dependency and therefore we make sure that the code base works without it. As a result all tests which use pandas should use it with pd = pytest.importorskip('pandas') instead. pytest will simply skip the test if pandas is not installed.
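
For reference, a minimal sketch of the `pytest.importorskip` pattern in a test module. The test name and the DataFrame contents are made up for illustration:

```python
import pytest


def test_dataframe_shape():
    # Skips this test (rather than failing it) when pandas is not installed.
    pd = pytest.importorskip("pandas")
    df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
    assert df.shape == (3, 2)
```

Because the import happens inside the test function, collecting the module never fails in an environment without pandas.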

reshamas · 2019-01-13

A search of the sklearn folder showed this syntax for importing pandas, but only in "test_*.py" files. Is it correct to add this syntax to the file in this PR, or should it go somewhere else?

adrinjalali · 2019-01-13

That's a really good point @reshamas. It may be a good idea to catch the ImportError here and, if obj_type is PandasDataframe and the import fails, raise a ValueError or something, since that basically means the input is not compatible with the environment. Then the tests that call this function should skip the cases that would give this error using pytest.importorskip, I guess.

I'm really not sure which exception is more apt to raise though!

reshamas · 2019-01-14

Would it be something like this:

    import pytest
    from pytest import importorskip

    if obj_type not in ["NotAnArray", 'PandasDataframe']:
        raise ValueError("Data type {0} not supported".format(obj_type))
    if obj_type == "NotAnArray":
        y_ = NotAnArray(np.asarray(y))
        X_ = NotAnArray(np.asarray(X))
    elif obj_type == "PandasDataframe":
        try:
            import pandas as pd
            pd = pytest.importorskip("pandas")
            y_ = np.asarray(y)
            if y_.ndim == 1:
                y_ = pd.Series(y_)
            else:
                y_ = pd.DataFrame(y_)
            X_ = pd.DataFrame(np.asarray(X))
        except ImportError:
            raise SkipTest("Skipping test, pandas not installed")

OR

        except ImportError:
            assert_raise_message(ValueError, "Skipping test, pandas not installed ")

adrinjalali · 2019-01-16

More like the latter option, but maybe say "Cannot run a pandas related test when pandas is not installed." or something, because we're not really skipping here, we're raising an error.

mc4229 · 2019-01-25

@reshamas @adrinjalali Thanks for your suggestions. I will make the changes accordingly.
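
A rough sketch of the direction discussed in this thread: the helper raises when pandas is requested but missing, and skipping is left to the calling tests. It is not the code that was merged. `wrap_data` is a hypothetical helper name, and the `NotAnArray` class here is a simplified stand-in for the real one:

```python
class NotAnArray:
    """Simplified stand-in: holds the data and exposes it only through
    the __array__ protocol."""

    def __init__(self, data):
        self.data = data

    def __array__(self, dtype=None, copy=None):
        import numpy as np  # imported lazily; only needed on conversion
        return np.asarray(self.data, dtype=dtype)


def wrap_data(X, y, obj_type):
    """Wrap X and y in the container named by obj_type (hypothetical helper)."""
    if obj_type not in ("NotAnArray", "PandasDataframe"):
        raise ValueError("Data type {0} not supported".format(obj_type))
    if obj_type == "NotAnArray":
        return NotAnArray(X), NotAnArray(y)
    try:
        import pandas as pd
    except ImportError as exc:
        # Raise instead of skipping: the caller asked for a container
        # that this environment cannot provide.
        raise ValueError(
            "Cannot run a pandas related test when pandas is not installed."
        ) from exc
    import numpy as np  # pandas depends on numpy, so it is available here
    y_arr = np.asarray(y)
    # 1-D targets become a Series, 2-D targets a DataFrame.
    y_ = pd.Series(y_arr) if y_arr.ndim == 1 else pd.DataFrame(y_arr)
    return pd.DataFrame(np.asarray(X)), y_
```

A test that needs the pandas branch would then guard itself with `pytest.importorskip("pandas")` before calling the helper.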

jnothman

Maybe, in fact, the NotAnArray check is sufficient... Although there are cases where pandas may be treated specially (e.g. for iloc).

We should also generally have a test that the check is working in test_estimator_checks. There we could test that NotAnArray support and Pandas support was identical??

mc4229 · 2019-02-07T01:13:01Z

> We should also generally have a test that the check is working in test_estimator_checks. There we could test that NotAnArray support and Pandas support was identical??

My understanding is that all estimator checks are picked up by check_estimator() in estimator_checks.py, and test_estimator_checks.py then imports check_estimator() to further test all estimators. @jnothman if you think it would be better to add the pandas related checks in test_estimator_checks.py, could you point me to a location where I can add those tests?

psorianom · 2019-02-25T09:58:52Z

Hey @mc4229, currently being at the sprint, I would like to take over this issue but if you are still working on this please tell me so I don't step on your work. Thanks!

reshamas · 2019-02-25T11:33:47Z

Hello @psorianom
Please take over this PR. This is from a WiMLDS sprint I organized back on Sept 29, 2018. It's been nearly 5 months since the event. Since there has been no significant activity on this PR after multiple follow-up attempts on my part, according to the updated definition in the contributing guidelines, it is open. It has been delayed for too long. Please do take over and complete it. Thanks!

mc4229 · 2019-02-26T03:57:38Z

Hi @jnothman @GaelVaroquaux could you help review this PR? Thanks!

jnothman · 2019-02-26T17:17:59Z

sklearn/utils/estimator_checks.py

-    # test classifiers can handle non-array data
-    yield check_classifier_data_not_an_array
+    # test classifiers can handle non-array data and pandas objects
+    yield check_classifier_two_data_types


two_data_types is not clear (especially as dtype = data type is something else entirely). not_an_array is still acceptable.

jorisvandenbossche

This looks good to me!

@jnothman mentioned above the need for a test:

> We should also generally have a test that the check is working in test_estimator_checks. There we could test that NotAnArray support and Pandas support was identical??

But I am not very familiar with those tests. What exactly would need to be tested there? Is the goal to make a small dummy estimator that would fail the check (by not properly supporting pandas) and then ensure the check indeed fails?

jnothman · 2019-02-27T15:01:22Z

Yes, making a dummy estimator that fails the check would be the goal. Is this too much?
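
A toy, standard-library-only sketch of that idea: a dummy estimator that rejects wrapped input, plus an assertion that the check really fails for it. All names here (`Wrapped`, `ListOnlyEstimator`, `PermissiveEstimator`, `check_accepts_wrapped_data`, `check_fails`) are hypothetical and not part of scikit-learn:

```python
class Wrapped:
    """Hypothetical non-array container, standing in for NotAnArray or a DataFrame."""

    def __init__(self, data):
        self.data = data


class ListOnlyEstimator:
    """Dummy estimator that deliberately rejects anything but plain lists."""

    def fit(self, X, y):
        if not isinstance(X, list):
            raise TypeError("X must be a list, got {0}".format(type(X).__name__))
        return self


class PermissiveEstimator:
    """Dummy estimator that unwraps the container before fitting."""

    def fit(self, X, y):
        data = X.data if isinstance(X, Wrapped) else X
        self.n_samples_ = len(data)
        return self


def check_accepts_wrapped_data(estimator):
    """Toy check: fitting on wrapped data must not raise."""
    X, y = [[0.0], [1.0]], [0, 1]
    estimator.fit(Wrapped(X), Wrapped(y))


def check_fails(check, estimator):
    """Return True if running `check` on `estimator` raises any exception."""
    try:
        check(estimator)
    except Exception:
        return True
    return False
```

The test for the check then asserts `check_fails(...)` for the deliberately broken estimator and the opposite for a well-behaved one.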

mc4229 · 2019-02-27T23:59:12Z

I will try to create a dummy estimator for this case.

mc4229 · 2019-04-07T02:17:48Z

@jnothman @jorisvandenbossche Sorry it took me a while to follow up on this PR. Could you review my changes and let me know whether they make sense?

cmarmo · 2020-01-14T09:10:43Z

Hi @mc4229, are you still interested in finishing this? If you could find some time to resolve the conflicts, I think you deserve that we push a bit more for a review. If not, I totally understand; I will put your PR in the Sprint pool again so that your work is not lost. Thanks for your patience!

mc4229 · 2020-01-29T22:50:49Z

@cmarmo Yes I am still interested in finishing this! I will find some time to resolve the conflicts.

cmarmo · 2020-01-31T18:48:55Z

@adrinjalali , @rth , @glemaitre, @jnothman this PR probably deserves some attention for the 2-years perseverance of @mc4229 . Thanks!

GaelVaroquaux

I'm going to do the most boring review ever: this looks good to me, +1 for merge.

This is a very useful test. Thank you @mc4229 for implementing it, and sticking around.

Thank you @cmarmo for finding it, and reminding us about it.

rth

Thanks @mc4229! This is indeed very useful.

mc4229 · 2020-02-01T16:24:40Z

Thanks everyone!

mc4229 added 3 commits September 29, 2018 14:15

Added estimator check for pandas dataframe

5886319

Added pandas check for regressors and classifiers

8a51dd0

fixed minor errors

da73ba6

mc4229 mentioned this pull request Sep 29, 2018

Add dataframe test to common tests. #7528

Closed

mc4229 added 2 commits September 29, 2018 16:00

Modified SkipTest message for check_estimators_pandas_dataframe

3d97c10

fixed dimensionality error

d01480f

learnFlat mentioned this pull request Oct 3, 2018

[WIP] Add dataframe test to common tests. issue #7528 #12254

Closed

Extend original tests to avoid copying code

9bf7d46

adrinjalali reviewed Jan 5, 2019

Menghan Chen and others added 3 commits February 5, 2019 19:04

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

524a6d0

…into sklearn/add_dataframe_estimator_test

Handling pandas ImportError

00ac9fc

Fixed flake8 error

95d9435

jnothman reviewed Feb 6, 2019

mc4229 changed the title ~~[WIP] Added estimator checks for pandas object~~ [MRG] Added estimator checks for pandas object Feb 26, 2019

jnothman reviewed Feb 26, 2019

mc4229 added 2 commits February 26, 2019 21:01

Enhanced comments and function names.

e118896

Fixed flake8

90b707c

jorisvandenbossche reviewed Feb 27, 2019

mc4229 and others added 6 commits April 6, 2019 18:53

Added tests for pandas checks

90f5387

Merge branch 'master' into sklearn/add_dataframe_estimator_test

faf2416

Not import pytest

3de976f

Merge branch 'sklearn/add_dataframe_estimator_test' of https://github…

221d0aa

….com/mc4229/scikit-learn into sklearn/add_dataframe_estimator_test

Fix flake8

c3df4c3

Fixed flake8

bcf2adf

amueller added the Waiting for Reviewer label Aug 6, 2019

mc4229 added 2 commits January 29, 2020 20:29

Merge branch 'master' into sklearn/add_dataframe_estimator_test

d7ec95e

fix merging errors

b38c536

GaelVaroquaux changed the title ~~[MRG] Added estimator checks for pandas object~~ [MRG+1] Added estimator checks for pandas object Jan 31, 2020

GaelVaroquaux approved these changes Jan 31, 2020

rth approved these changes Feb 1, 2020

rth merged commit 5234efa into scikit-learn:master Feb 1, 2020

cmarmo removed the Waiting for Reviewer label Feb 1, 2020

mc4229 deleted the sklearn/add_dataframe_estimator_test branch February 1, 2020 16:26

thomasjpfan pushed a commit to thomasjpfan/scikit-learn that referenced this pull request Feb 22, 2020

TST Add estimator check for pandas objects (scikit-learn#12218)

5267b51

panpiort8 pushed a commit to panpiort8/scikit-learn that referenced this pull request Mar 3, 2020

TST Add estimator check for pandas objects (scikit-learn#12218)

03b4eb6
