TST Add a test for meta-estimators with non tabular data #19755

thomasjpfan merged 13 commits into scikit-learn:main
Conversation
thomasjpfan
left a comment
Thank you for working on this @jeremiedbb !
I am +1 on enforcing check_meta_estimators_validation for our meta-estimators, but it may be too much of a requirement for third-party meta-estimators.

Unless I'm wrong, I only enabled this check for our meta-estimators. Is there something I didn't catch?

This PR looks okay. I was concerned with adding a "public function" like check_meta_estimators_validation.

I see, thanks for the clarification. I'll move everything to test_common only.
ogrisel
left a comment
LGTM. I think it's best to have meta-estimators delegate input validation by default. We might still want exceptions to this rule on a case-by-case basis, but they should be motivated.

Once this PR is merged, one should open an issue in the tracker with a TODO list of meta-estimators to update. Some meta-estimators might require some code change.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
sklearn/tests/test_metaestimators.py (outdated)

    if "param_grid" in sig or "param_distributions" in sig:
        # SearchCV estimators
        yield Estimator(estimator, param_grid)
This raises a UserWarning for RandomizedSearchCV:

    UserWarning: The total space of parameters 2 is smaller than n_iter=10.
    Running 2 iterations. For exhaustive searches, use GridSearchCV.

I think this is okay to ignore with an ignore_warnings(category=UserWarning) decorator on test_meta_estimators_delegate_data_validation.
Done, but I set n_iter=2 for RandomizedSearchCV instead of ignoring the warnings for the whole test.
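For illustration, here is a minimal sketch of that fix (the data and parameter grid below are made up, not taken from the PR): with n_iter equal to the size of the sampled grid, RandomizedSearchCV does not emit the warning quoted above.

```python
import warnings

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

X = np.random.RandomState(0).rand(20, 3)
y = np.random.RandomState(1).rand(20)
param_distributions = {"alpha": [0.1, 1.0]}  # a total space of only 2 candidates

# n_iter=2 matches the grid size, so the "total space of parameters ...
# is smaller than n_iter" UserWarning is not raised during fit.
with warnings.catch_warnings():
    warnings.simplefilter("error", UserWarning)  # fail loudly if it warns
    RandomizedSearchCV(Ridge(), param_distributions, n_iter=2, cv=2).fit(X, y)
```

With the default n_iter=10, the same fit would trigger the warning, which is why the test pins n_iter instead of silencing UserWarning globally.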
Thinking about this more, I think there are some benefits to having the meta-estimator validate.
…eremiedbb/scikit-learn into common-test-pipeline-in-meta-estimator
I agree that there are benefits to validating in the meta-estimator. Based on our previous discussions, I think the main goal of not doing any validation is to take a step towards being able to pass estimators that can process cupy arrays, dask arrays, etc. However, as you pointed out, we could check the inner estimator's tags to decide whether we can safely do some partial (or complete) validation.

Yeah, I think the benefit of supporting more array types outweighs the advantages I mentioned in #19755 (comment).
thomasjpfan
left a comment
Minor comments, otherwise LGTM
sklearn/tests/test_metaestimators.py (outdated)

        "base_estimator" or "estimators".
        """
        for _, Estimator in sorted(all_estimators()):
            sig = list(signature(Estimator).parameters)
Small nit:

    -    sig = list(signature(Estimator).parameters)
    +    sig = set(signature(Estimator).parameters)
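As a quick illustration of why a set suffices here (GridSearchCV is just an example estimator, not the test's code): the variable is only ever used for membership tests, which a set answers in O(1) without relying on parameter order.

```python
from inspect import signature

from sklearn.model_selection import GridSearchCV

# The test only asks questions like `"param_grid" in sig`, so a set of
# parameter names is all that is needed.
sig = set(signature(GridSearchCV).parameters)
print("param_grid" in sig)  # True
print("estimator" in sig)   # True
```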
sklearn/tests/test_metaestimators.py (outdated)

        sig = list(signature(Estimator).parameters)

        if "estimator" in sig or "base_estimator" in sig:
            if issubclass(Estimator, RegressorMixin):
Was there an issue with using is_regressor here?

    -    if issubclass(Estimator, RegressorMixin):
    +    if is_regressor(Estimator):
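For context, is_regressor checks the estimator type rather than the class hierarchy. A minimal check, shown on instances (at the time of this PR it also accepted classes, though newer scikit-learn versions may warn about that):

```python
from sklearn.base import is_regressor
from sklearn.linear_model import LogisticRegression, Ridge

# is_regressor is the idiomatic scikit-learn way to ask "is this a
# regressor?", instead of testing issubclass(..., RegressorMixin).
print(is_regressor(Ridge()))               # True
print(is_regressor(LogisticRegression()))  # False
```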
sklearn/tests/test_metaestimators.py (outdated)

        elif "estimators" in sig:
            # stacking, voting
            if issubclass(Estimator, RegressorMixin):
Same here:

    -    if issubclass(Estimator, RegressorMixin):
    +    if is_regressor(Estimator):
…n#19755) Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Meta-estimators should delegate data validation to their inner estimator(s), which is currently not the case (only the SearchCV estimators already delegate).
This PR proposes to introduce a new common test for meta-estimators only, to check that they work with non tabular data as long as the inner estimator does (a pipeline with text preprocessing followed by ridge or logreg for instance).
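To make the intent concrete, here is a minimal sketch of the scenario the common test covers (the data and hyper-parameters are illustrative, not the actual test code): a SearchCV meta-estimator fit on raw text works because input validation is left to the inner pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Non tabular input: a list of raw strings, not a 2d numeric array.
X = ["good movie", "bad movie", "great film", "awful film"] * 5
y = [1, 0, 1, 0] * 5

inner = make_pipeline(TfidfVectorizer(), LogisticRegression())
search = GridSearchCV(inner, {"logisticregression__C": [0.1, 1.0]}, cv=2)

# This only works if GridSearchCV delegates input validation to the
# pipeline, whose first step knows how to consume raw text.
search.fit(X, y)
print(search.best_params_)
```

If the meta-estimator instead called check_array on X itself, the fit would fail before the vectorizer ever saw the strings.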
This is also related to n_features_in_, since meta-estimators should delegate setting n_features_in_ to their inner estimator, when applicable (see #19333). There's a long blacklist for now, from which meta-estimators should be removed as they are fixed.
CC @ogrisel