[MRG+1] Added estimator checks for pandas object #12218
rth merged 22 commits into scikit-learn:master
Conversation
|
Looks good. Though now I'm thinking it might even be possible to extend the NotAnArray test and just parametrize it, so we don't have to copy that much code?
|
@amueller That makes sense. I will take a look! |
|
Hello @mc4229, Thank you for participating in the WiMLDS/scikit sprint. We would love to merge all the PRs that were submitted. It would be great if you could follow up on the work that you started! For the PR you submitted, would you please update and re-submit? Please include #wimlds in your PR conversation. If you have any questions, cc: @reshamas
|
@mc4229 |
@amueller can you provide more context for this? I am not sure how to proceed. Can you point me to a reference? |
|
@reshamas I plan to continue working on the PR. Actually I have already made the changes to parametrize the test and avoid copying code in my latest commit. However, currently I am facing two issues with the PR:
|
|
@mc4229 |
|
@mc4229 Our goal is to have all outstanding PRs from sprint merged by December 29, which is 3 months post-sprint. |
|
@reshamas I have already pushed my changes (the latest commit "Extend original tests to avoid copying code"). |
|
The traceback from the pytest error is as follows:
|
I rebuilt my environment and was able to run pytest. My code changes should affect sklearn/tests/test_common.py; I ran pytest on that file and the tests passed. However, I have not figured out how to deal with the errors in the continuous-integration checks. For the AppVeyor build, I got the error "This problem is unconstrained". For the Travis CI builds, I got the error "No module named pandas". Could anyone suggest how I can fix these errors? @amueller @reshamas
|
@mc4229 to confirm, you are running in your virtual environment? |
|
@reshamas I ran pytest in my virtual environment. I got the continuous integration errors in the online checks. |
|
@NicolasHug @adrinjalali |
sklearn/utils/estimator_checks.py
Outdated
    y_ = NotAnArray(np.asarray(y))
    X_ = NotAnArray(np.asarray(X))
else:
    import pandas as pd
We don't always have pandas in our test environments: pandas is not an official dependency, so we make sure the code base works without it. As a result, all tests that use pandas should import it with pd = pytest.importorskip('pandas'); pytest will simply skip the test if pandas is not installed.
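As an untested illustration of that pattern (the test name and data below are hypothetical, not from this PR), a pandas-dependent test would look like:

```python
import pytest

def test_fit_accepts_dataframe():
    # importorskip returns the module when pandas is available and
    # marks the test as skipped (not failed) when it is not.
    pd = pytest.importorskip("pandas")
    df = pd.DataFrame({"a": [1, 2, 3]})
    assert df.shape == (3, 1)
```

This keeps CI green on environments built without pandas, while still exercising the pandas path wherever it is installed.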
A search of the sklearn folder shows this syntax for importing pandas, but only in test_*.py files. Is it correct to add this syntax to the file in this PR, or should it go somewhere else?
That's a really good point @reshamas. It may be a good idea to catch the ImportError here: if obj_type is PandasDataframe and the import fails, raise a ValueError or something, since that basically means the input is not compatible with the environment. Then, when the tests call this function, they should skip the cases that would raise this error using pytest.importorskip, I guess.
I'm really not sure which exception is most apt to raise, though!
Would it be something like this?

    if obj_type not in ["NotAnArray", "PandasDataframe"]:
        raise ValueError("Data type {0} not supported".format(obj_type))
    if obj_type == "NotAnArray":
        y_ = NotAnArray(np.asarray(y))
        X_ = NotAnArray(np.asarray(X))
    elif obj_type == "PandasDataframe":
        try:
            import pandas as pd
            y_ = np.asarray(y)
            if y_.ndim == 1:
                y_ = pd.Series(y_)
            else:
                y_ = pd.DataFrame(y_)
            X_ = pd.DataFrame(np.asarray(X))
        except ImportError:
            raise SkipTest("Skipping test, pandas not installed")

OR

        except ImportError:
            raise ValueError("Skipping test, pandas not installed")
More like the latter option, but maybe say "Cannot run a pandas related test when pandas is not installed." or something, because we're not really skipping here; we're raising an error.
@reshamas @adrinjalali Thanks for your suggestions. I will make the changes accordingly.
…into sklearn/add_dataframe_estimator_test
jnothman
left a comment
Maybe, in fact, the NotAnArray check is sufficient... Although there are cases where pandas may be treated specially (e.g. for iloc).
We should also generally have a test in test_estimator_checks that the check is working. There we could test that NotAnArray support and pandas support are identical?
My understanding is that all estimator checks are picked up by
|
Hey @mc4229, currently being at the sprint, I would like to take over this issue but if you are still working on this please tell me so I don't step on your work. Thanks! |
|
Hello @psorianom |
|
Hi @jnothman @GaelVaroquaux could you help review this PR? Thanks! |
sklearn/utils/estimator_checks.py
Outdated
    # test classifiers can handle non-array data
    yield check_classifier_data_not_an_array
    # test classifiers can handle non-array data and pandas objects
    yield check_classifier_two_data_types
two_data_types is not clear (especially as dtype = data type is something else entirely). not_an_array is still acceptable.
jorisvandenbossche
left a comment
This looks good to me!
@jnothman mentioned above the need for a test:
We should also generally have a test that the check is working in test_estimator_checks. There we could test that NotAnArray support and Pandas support was identical??
But I am not very familiar with those tests. What exactly would need to be tested there? Is the goal to make a small dummy estimator that fails the check (by not properly supporting pandas) and then ensure the check indeed fails?
|
Yes, making a dummy estimator that fails the check would be the goal. Is this too much?
|
|
I will try to create a dummy estimator for this case. |
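For reference, a rough sketch of what such a dummy estimator and check could look like. Every name here (`ArraysOnlyEstimator`, `check_estimator_accepts_dataframe`) is hypothetical, invented to illustrate the idea rather than taken from this PR:

```python
import numpy as np

class ArraysOnlyEstimator:
    """Deliberately broken estimator: it rejects anything that is
    not already an ndarray, so a pandas-input check should fail."""
    def fit(self, X, y=None):
        if not isinstance(X, np.ndarray):
            raise TypeError(
                "X must be an ndarray, got %s" % type(X).__name__)
        self.fitted_ = True
        return self

def check_estimator_accepts_dataframe(estimator):
    # Hypothetical check: a conforming estimator must accept a
    # DataFrame/Series pair without raising.
    import pytest
    pd = pytest.importorskip("pandas")
    X = pd.DataFrame(np.arange(6).reshape(3, 2))
    y = pd.Series([0, 1, 0])
    estimator.fit(X, y)
```

A meta-test in test_estimator_checks would then assert, for example with pytest.raises(TypeError), that running the check on ArraysOnlyEstimator fails while a well-behaved estimator passes.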
….com/mc4229/scikit-learn into sklearn/add_dataframe_estimator_test
|
@jnothman @jorisvandenbossche Sorry it took me a while to follow up on this PR. Could you review my changes and let me know whether they make sense? |
|
Hi @mc4229, are you still interested in finishing this? If you could find some time to resolve the conflicts, I think you deserve that we push a bit more for a review. If not, I totally understand; I will put your PR in the sprint pool again so that your work is not lost. Thanks for your patience!
|
@cmarmo Yes I am still interested in finishing this! I will find some time to resolve the conflicts. |
|
@adrinjalali, @rth, @glemaitre, @jnothman this PR probably deserves some attention, given the two years of perseverance from @mc4229. Thanks!
|
Thanks everyone! |
Reference Issues/PRs
Fixes #7528
What does this implement/fix? Explain your changes.
Added estimator checks for pandas object.
Any other comments?