FIX Check and correct the input_tags.sparse flag#30187
FIX Check and correct the input_tags.sparse flag#30187glemaitre merged 35 commits intoscikit-learn:mainfrom
Conversation
ogrisel
left a comment
There was a problem hiding this comment.
Thanks for the PR. Any reason to decide to introduce a new check_estimator_sparse_tagestimator check instead of making the existing check_estimator_sparse_array tag-aware?
Since estimator checks are already costly to run, I would rather avoid proliferation and redundancy when possible.
I tried to make Actually the new test is done with the assumption that the Perhaps a cleaner alternative as suggested by @glemaitre would be that
|
I am a bit afraid about the breaking change induced by entirely changing the meaning of the +1 for updating your PR to implement what you suggest above and see what it practically entails. |
glemaitre
left a comment
There was a problem hiding this comment.
This a first round of review just looking at the test. I'll check now the changes in the different estimator to know if I'm getting surprise by what I'll see.
doc/whats_new/upcoming_changes/sklearn.utils/30187.enhancement.rst
Outdated
Show resolved
Hide resolved
glemaitre
left a comment
There was a problem hiding this comment.
OK I think that I made a pass on all estimators. I think that sometimes, we can move the definition in the base class.
Co-authored-by: Guillaume Lemaitre <guillaume@probabl.ai>
|
@antoinebaker do not hesitate to ping me when you want another review. |
|
I followed @glemaitre suggestion for setting the We now define As a lesser of two evils the new definition will have false negatives (Pipelines accepting sparse data wrongly tagged |
glemaitre
left a comment
There was a problem hiding this comment.
There a small conflict to solve.
Otherwise, it looks good to me. I'm fine with the current policy for the Pipeline here.
|
|
||
| def __sklearn_tags__(self): | ||
| tags = super().__sklearn_tags__() | ||
| tags.input_tags.sparse = False |
There was a problem hiding this comment.
I assume that it's because you set it to True on the base class. I wonder if we should try to always set this only to non default values. It would mean that we'd have to leave it to the default value on the base class and set it to True on most of its child classes here. What do you think @glemaitre ?
| # WARNING: the sparse tag can be incorrect. | ||
| # Some Pipelines accepting sparse data are wrongly tagged sparse=False. | ||
| # For example Pipeline([PCA(), estimator]) accepts sparse data | ||
| # even if the estimator doesn't as PCA outputs a dense array. | ||
| tags.input_tags.sparse = all( | ||
| get_tags(step).input_tags.sparse | ||
| for name, step in self.steps | ||
| if step != "passthrough" | ||
| ) |
There was a problem hiding this comment.
This makes me wonder if we aren't expecting too much from the sparse tag. Maybe we should narrow its scope to the estimator level, regardless of whether or not its sub-estimator(s), if any, do support sparse input. In that case sparse means "this estimator can handle sparse input but its inner estimator(s) (if it's a meta-estimator) might not in which case it's undefined behavior". What do you think @glemaitre and @adrinjalali ?
There was a problem hiding this comment.
As discussed irl with Guillaume, an other option is to change sparse to accept True, False and Maybe (or Undefined, name tbd).
| def __sklearn_tags__(self): | ||
| tags = super().__sklearn_tags__() | ||
| try: | ||
| tags.input_tags.sparse = all( | ||
| get_tags(trans).input_tags.sparse | ||
| for name, trans in self.transformer_list | ||
| if trans not in {"passthrough", "drop"} | ||
| ) | ||
| except Exception: | ||
| # If `transformer_list` does not comply with our API (list of tuples) | ||
| # then it will fail. In this case, we assume that `sparse` is False | ||
| # but the parameter validation will raise an error during `fit`. | ||
| pass # pragma: no cover | ||
| return tags |
There was a problem hiding this comment.
my previous comment is also motivated by this pattern that shows-up a few times. We're trying to inspect the full compute graph ahead of time which is something scikit-learn is really not good at by design.
|
Also, I think we should include it in 1.6.1 because it fixes tags that we just made public |
Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
jeremiedbb
left a comment
There was a problem hiding this comment.
LGTM.
I think there are still things to improve with the sparse tag but it can be done in a separate PR. I'm in favor of merging this one as is to not delay the 1.6.1 release further.
Yep it is indeed better than what we had. We can later improve and discussed having a ternary option instead of the binary one. |
|
Thanks @antoinebaker. It was a big one ;) |
Co-authored-by: Guillaume Lemaitre <guillaume@probabl.ai> Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
Co-authored-by: Guillaume Lemaitre <guillaume@probabl.ai> Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
ItsIronOxide
left a comment
There was a problem hiding this comment.
Any help or guidance on this issue is very much appreciated.
| def __sklearn_tags__(self): | ||
| tags = super().__sklearn_tags__() | ||
| tags.input_tags.sparse = True | ||
| tags.input_tags.sparse = not self.positive |
There was a problem hiding this comment.
My team changed to scikit-learn v1.6.1 this week. We had v1.5.1 before. Our code crashes in this exact line with the error "Unexpected <class 'AttributeError'>. 'LinearRegression' object has no attribute 'positive'".
We cannot deploy in production because of this. I am desperate enough to come here to ask for help. I do not understand why it would complain that the attribute does not exist given that we were using v1.5.1 before and the attribute has existed for 4 years now.
There was a problem hiding this comment.
@ItsIronOxide feel free to open a new issue with a reproducer, and ping me. Happy to have a look and help out.
Reference Issues/PRs
Fixes #30139
What does this implement/fix? Explain your changes.
The test
check_estimator_sparse_tagis added.It checks that the
input_tags.sparsetag of an estimator is consistent:input_tags.sparse=Truethe estimator can be fitted on sparse datainput_tags.sparse=Falsethe estimator must raise a error when fitted on sparse dataThe
input_tags.sparseflag was edited for a lot of estimators.