TST Add a test for meta-estimators with non tabular data #19755

thomasjpfan merged 13 commits into scikit-learn:main
Conversation
thomasjpfan
left a comment
Thank you for working on this @jeremiedbb !
I am +1 on enforcing check_meta_estimators_validation for our meta-estimators, but it may be too much of a requirement for third-party meta-estimators.

Unless I'm wrong, I only enabled this check for our meta-estimators. Is there something I didn't catch?

This PR looks okay. I was concerned with adding a "public function" like check_meta_estimators_validation.

I see, thanks for the clarification. I'll move everything to test_common only.
ogrisel
left a comment
LGTM. I think it's best to have meta-estimators delegate input validation by default. We might still want exceptions to this rule on a case-by-case basis, but they should be motivated.

Once this PR is merged, one should open an issue in the tracker with a TODO list of meta-estimators to update. Some meta-estimators might require some code change.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
sklearn/tests/test_metaestimators.py (outdated)

    if "param_grid" in sig or "param_distributions" in sig:
        # SearchCV estimators
        yield Estimator(estimator, param_grid)
This raises a UserWarning for RandomizedSearchCV:

    UserWarning: The total space of parameters 2 is smaller than n_iter=10.
    Running 2 iterations. For exhaustive searches, use GridSearchCV.

I think this is okay to ignore with an ignore_warnings(category=UserWarning) decorator on test_meta_estimators_delegate_data_validation.
Done, but I set n_iter=2 for RandomizedSearchCV instead of ignoring the warnings for the whole test.
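For illustration, here is a minimal sketch of that fix (the data and parameter grid below are made up, not taken from the PR): with n_iter equal to the size of the sampled grid, RandomizedSearchCV does not emit the warning quoted above.

```python
import warnings

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

X = np.random.RandomState(0).rand(20, 3)
y = np.random.RandomState(1).rand(20)
param_distributions = {"alpha": [0.1, 1.0]}  # a total space of only 2 candidates

# n_iter=2 matches the grid size, so the "total space of parameters ...
# is smaller than n_iter" UserWarning is not raised during fit.
with warnings.catch_warnings():
    warnings.simplefilter("error", UserWarning)  # fail loudly if it warns
    RandomizedSearchCV(Ridge(), param_distributions, n_iter=2, cv=2).fit(X, y)
```

With the default n_iter=10, the same fit would trigger the warning, which is why the test pins n_iter instead of silencing UserWarning globally.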
Thinking about this more, I think there are some benefits to having the meta-estimator validate.
…eremiedbb/scikit-learn into common-test-pipeline-in-meta-estimator
I agree that there are benefits to validating in the meta-estimator. Based on our previous discussions, I think the main goal of not doing any validation is to take a step towards being able to pass estimators that can process cupy arrays, dask arrays, etc. However, as you pointed out, we could check the inner estimator's tags to decide whether we can safely do some partial (or complete) validation.

Yeah, I think the benefit of supporting more array types outweighs the advantages I mentioned in #19755 (comment).
thomasjpfan
left a comment
Minor comments, otherwise LGTM
sklearn/tests/test_metaestimators.py (outdated)

        "base_estimator" or "estimators".
        """
        for _, Estimator in sorted(all_estimators()):
            sig = list(signature(Estimator).parameters)
Small nit:

    -    sig = list(signature(Estimator).parameters)
    +    sig = set(signature(Estimator).parameters)
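As a quick illustration of why a set suffices here (GridSearchCV is just an example estimator, not the test's code): the variable is only ever used for membership tests, which a set answers in O(1) without relying on parameter order.

```python
from inspect import signature

from sklearn.model_selection import GridSearchCV

# The test only asks questions like `"param_grid" in sig`, so a set of
# parameter names is all that is needed.
sig = set(signature(GridSearchCV).parameters)
print("param_grid" in sig)  # True
print("estimator" in sig)   # True
```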
sklearn/tests/test_metaestimators.py (outdated)

        sig = list(signature(Estimator).parameters)

        if "estimator" in sig or "base_estimator" in sig:
            if issubclass(Estimator, RegressorMixin):
Was there an issue with using is_regressor here?

    -    if issubclass(Estimator, RegressorMixin):
    +    if is_regressor(Estimator):
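For context, is_regressor checks the estimator type rather than the class hierarchy. A minimal check, shown on instances (at the time of this PR it also accepted classes, though newer scikit-learn versions may warn about that):

```python
from sklearn.base import is_regressor
from sklearn.linear_model import LogisticRegression, Ridge

# is_regressor is the idiomatic scikit-learn way to ask "is this a
# regressor?", instead of testing issubclass(..., RegressorMixin).
print(is_regressor(Ridge()))               # True
print(is_regressor(LogisticRegression()))  # False
```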
sklearn/tests/test_metaestimators.py (outdated)

        elif "estimators" in sig:
            # stacking, voting
            if issubclass(Estimator, RegressorMixin):
Same here:

    -    if issubclass(Estimator, RegressorMixin):
    +    if is_regressor(Estimator):
…n#19755) Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Meta-estimators should delegate data validation to their inner estimator(s), which is currently not the case (only the SearchCV estimators already delegate).
This PR proposes to introduce a new common test for meta-estimators only, to check that they work with non tabular data as long as the inner estimator does (a pipeline with text preprocessing followed by ridge or logreg for instance).
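To make the intent concrete, here is a minimal sketch of the scenario the common test covers (the data and hyper-parameters are illustrative, not the actual test code): a SearchCV meta-estimator fit on raw text works because input validation is left to the inner pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Non tabular input: a list of raw strings, not a 2d numeric array.
X = ["good movie", "bad movie", "great film", "awful film"] * 5
y = [1, 0, 1, 0] * 5

inner = make_pipeline(TfidfVectorizer(), LogisticRegression())
search = GridSearchCV(inner, {"logisticregression__C": [0.1, 1.0]}, cv=2)

# This only works if GridSearchCV delegates input validation to the
# pipeline, whose first step knows how to consume raw text.
search.fit(X, y)
print(search.best_params_)
```

If the meta-estimator instead called check_array on X itself, the fit would fail before the vectorizer ever saw the strings.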
This is also related to n_features_in_, since meta-estimators should delegate setting n_features_in_ to their inner estimator, when applicable (see #19333). There's a long blacklist for now, from which meta-estimators should be removed as they are fixed.
CC @ogrisel