MAINT make sure to test encoders in common tests#26859
MAINT make sure to test encoders in common tests#26859thomasjpfan merged 4 commits intoscikit-learn:mainfrom
Conversation
sklearn/utils/estimator_checks.py
Outdated
| estimator.transform(X) | ||
|
|
||
| with raises(Exception, match="Unknown label type", may_pass=True): | ||
| with raises(Exception, match="(?i)unknown", may_pass=True): |
sklearn/utils/estimator_checks.py
Outdated
| if "string" not in tags["X_types"]: | ||
| X[0, 0] = {"foo": "bar"} | ||
| msg = "argument must be a string.* number" | ||
| msg = "string.* number" |
There was a problem hiding this comment.
here as well, we're basically loosening up the tests here, I rather fix the estimator.
There was a problem hiding this comment.
I revert back with a slight change in the match. I added a comment to inform from where the error is raised.
sklearn/utils/estimator_checks.py
Outdated
| # TargetEncoder accepts binary and continuous values. We therefore force | ||
| # y to be binary also. |
There was a problem hiding this comment.
can that be through an estimator tag rather than hard coding here? making encoders pass common tests shouldn't be fixing common tests to accept encoders lol
There was a problem hiding this comment.
Yep, but this is a really weird thing: a transformer that uses y and can be classification and regression but only binary classification. This is just a beast.
However, I special case with another comment because we should be dropping this issue when merging support for multiclass: #26674
There was a problem hiding this comment.
#26674 should modify the code such that the test is passing. I assume I can do the trick of @thomasjpfan using binary_only and you will remove it in your PR.
adrinjalali
left a comment
There was a problem hiding this comment.
My issue with this PR as it stands is that it's another example of how we're not third party developer friendly. Our tags are not sufficient for our own estimator, and I think we should be fixing the tags instead of special casing. It's not wild to imagine an estimator which only handles regression and binary-only classification.
sklearn/utils/estimator_checks.py
Outdated
| if estimator.__class__.__name__ == "TargetEncoder": | ||
| # TargetEncoder is a special case where a transformer uses `y` but only accept | ||
| # binary classification and regression targets. | ||
| # TODO: remove this special case when multiclass support is added to | ||
| # TargetEncoder. |
There was a problem hiding this comment.
so the issue is our tags are not sufficient. We should have one tag which would be output, and can support a list of values, and in this case it would be regression and binary.
There was a problem hiding this comment.
The short term solution is to give TargetEncoder the "binary_only" tag which is true for classification and good enough to run the common test.
sklearn/utils/estimator_checks.py
Outdated
| if estimator.__class__.__name__ == "TargetEncoder": | ||
| # TargetEncoder is a special case where a transformer uses `y` but only accept | ||
| # binary classification and regression targets. | ||
| # TODO: remove this special case when multiclass support is added to | ||
| # TargetEncoder. |
There was a problem hiding this comment.
The short term solution is to give TargetEncoder the "binary_only" tag which is true for classification and good enough to run the common test.
| def _more_tags(self): | ||
| return {"requires_y": True} |
There was a problem hiding this comment.
Should this tag be removed? TargetEncoder does require_y in fit and fit_transform.
|
Should we also make sure they're actually tested? 😁 I can't find where the change is to make the tests go from not-tested or xfail to tested-now. |
if |
Just found out that our encoders are not tested via the common test.
I added the necessary tags and solve a couple of issues.