FIX Check and correct the input_tags.sparse flag by antoinebaker · Pull Request #30187 · scikit-learn/scikit-learn

antoinebaker · 2024-10-31T16:58:20Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

The test check_estimator_sparse_tag is added.
It checks that the input_tags.sparse tag of an estimator is consistent:

if input_tags.sparse=True, the estimator can be fitted on sparse data
if input_tags.sparse=False, the estimator must raise an error when fitted on sparse data

The input_tags.sparse flag was corrected for many estimators.
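
The consistency rule above can be sketched with toy stand-ins rather than scikit-learn itself. Everything named here (`ToySparse`, `SparseFriendly`, `DenseOnly`, `MisTagged`, `check_sparse_tag_consistent`, the `sparse_tag` attribute) is hypothetical and only illustrates the idea; the real check in sklearn/utils/estimator_checks.py works on actual estimators and sparse containers and may differ in detail.

```python
class ToySparse:
    """Stand-in for a sparse matrix (hypothetical, not scipy.sparse)."""

    def __init__(self, rows):
        self.rows = rows


class SparseFriendly:
    sparse_tag = True  # plays the role of input_tags.sparse

    def fit(self, X):
        return self


class DenseOnly:
    sparse_tag = False

    def fit(self, X):
        if isinstance(X, ToySparse):
            raise TypeError("Sparse data was passed, but dense data is required.")
        return self


class MisTagged(DenseOnly):
    sparse_tag = True  # wrong on purpose: fit still rejects sparse input


def check_sparse_tag_consistent(estimator, X_sparse):
    """Return True if the tag matches what fit does on sparse data."""
    try:
        estimator.fit(X_sparse)
        fitted = True
    except (TypeError, ValueError):
        fitted = False
    # tag=True must go with a successful fit, tag=False with an error
    return fitted == estimator.sparse_tag


X = ToySparse(rows=[{0: 1.0}, {2: 3.5}])
results = {
    cls.__name__: check_sparse_tag_consistent(cls(), X)
    for cls in (SparseFriendly, DenseOnly, MisTagged)
}
print(results)
```

Only the deliberately mis-tagged toy estimator is reported as inconsistent, which is the kind of mismatch the new check is meant to surface.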

github-actions · 2024-10-31T16:59:41Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: b54a408. Link to the linter CI: here

ogrisel

Thanks for the PR. Is there a reason to introduce a new check_estimator_sparse_tag estimator check instead of making the existing check_estimator_sparse_array tag-aware?

Since estimator checks are already costly to run, I would rather avoid proliferation and redundancy when possible.

doc/whats_new/upcoming_changes/sklearn.utils/30187.api.rst

antoinebaker · 2024-11-04T15:46:28Z

Thanks for the PR. Is there a reason to introduce a new check_estimator_sparse_tag estimator check instead of making the existing check_estimator_sparse_array tag-aware?

Since estimator checks are already costly to run, I would rather avoid proliferation and redundancy when possible.

I tried to make check_estimator_sparse_array tag-aware, but I didn't manage to combine the old behavior (testing various sparse formats and, if the estimator doesn't support a given format, checking that it fails gracefully with the appropriate error) with what the new test does (checking that the sparse tag is True exactly when fitting on sparse data does not raise the error).

Actually, the new test assumes that the sparse flag is equivalent to accepting CSR input. The CSR format seems to be a kind of default: if an estimator accepts sparse input, the accepted formats always include CSR, although it may reject other sparse formats. I didn't encounter a counterexample among the estimators.

Perhaps a cleaner alternative as suggested by @glemaitre would be that

the sparse tag is the list of supported sparse formats instead of a boolean
make validate_data rely on the sparse estimator tag instead of the accept_sparse argument.

ogrisel · 2024-11-05T15:13:55Z

Perhaps a cleaner alternative as suggested by @glemaitre would be that
the sparse tag is the list of supported sparse formats instead of a boolean
make validate_data rely on the sparse estimator tag instead of the accept_sparse argument.

I am a bit afraid of the breaking change induced by entirely changing the meaning of the .input_tags.sparse tag. But if that many estimators needed a fix in the first place, it probably means that nobody was actually relying on that tag anywhere within or outside the scikit-learn code base, so we are free to make the changes we want.

+1 for updating your PR to implement what you suggest above and see what it practically entails.

glemaitre

This is a first round of review, looking only at the test. I'll now check the changes in the different estimators to see whether anything surprises me.

doc/whats_new/upcoming_changes/sklearn.utils/30187.enhancement.rst

doc/whats_new/upcoming_changes/changed-models/30187.fix.rst

sklearn/utils/estimator_checks.py

sklearn/utils/tests/test_estimator_checks.py

glemaitre

OK, I think I have made a pass over all estimators. I think that in some cases we can move the definition into the base class.

sklearn/cluster/_affinity_propagation.py

sklearn/decomposition/_incremental_pca.py

sklearn/decomposition/_pca.py

sklearn/ensemble/_gb.py

sklearn/ensemble/_weight_boosting.py

sklearn/svm/_classes.py

sklearn/tree/_classes.py

Co-authored-by: Guillaume Lemaitre <guillaume@probabl.ai>

glemaitre · 2024-11-18T08:45:32Z

@antoinebaker do not hesitate to ping me when you want another review.

antoinebaker · 2024-11-26T17:19:47Z

I followed @glemaitre's suggestion for setting the sparse tag in Pipeline.

We now define sparse = all steps accept sparse, instead of the previous definition, sparse = first non-passthrough step accepts sparse. Both are incorrect, but determining the correct flag in general seems a nightmare, as some transformers may change the sparse nature of the data.

As the lesser of two evils, the new definition will have false negatives (Pipelines accepting sparse data wrongly tagged sparse=False), while the old one had false positives (Pipelines rejecting sparse data wrongly tagged sparse=True).
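
The trade-off between the two definitions can be illustrated with a toy model. This is a sketch under simplifying assumptions, not the Pipeline implementation: `ToyStep`, its `accepts_sparse` and `outputs_sparse` fields, and the three helper functions are hypothetical, and "passthrough" steps are ignored.

```python
from dataclasses import dataclass


@dataclass
class ToyStep:
    """Hypothetical step: what it accepts and what it hands to the next step."""

    name: str
    accepts_sparse: bool
    outputs_sparse: bool = False


def tag_first_step(steps):
    # Previous definition: look only at the first step.
    return steps[0].accepts_sparse


def tag_all_steps(steps):
    # New definition: every step must accept sparse input.
    return all(step.accepts_sparse for step in steps)


def really_accepts_sparse(steps):
    # Ground truth in the toy model: push sparse data through step by step.
    is_sparse = True
    for step in steps:
        if is_sparse and not step.accepts_sparse:
            return False
        is_sparse = is_sparse and step.outputs_sparse
    return True


# First step densifies the data, so the dense-only final step never sees sparse input.
densifying = [
    ToyStep("pca_like", accepts_sparse=True, outputs_sparse=False),
    ToyStep("dense_only", accepts_sparse=False),
]
# First step keeps the data sparse, so the dense-only final step fails.
sparse_preserving = [
    ToyStep("scaler_like", accepts_sparse=True, outputs_sparse=True),
    ToyStep("dense_only", accepts_sparse=False),
]

for label, steps in [("densifying", densifying), ("sparse_preserving", sparse_preserving)]:
    print(label, really_accepts_sparse(steps), tag_first_step(steps), tag_all_steps(steps))
```

In this toy model the "all steps" rule is wrong only on the densifying pipeline (a false negative), while the "first step" rule is wrong only on the sparse-preserving one (a false positive).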

sklearn/utils/estimator_checks.py

glemaitre

There is a small conflict to resolve.

Otherwise, it looks good to me. I'm fine with the current policy for the Pipeline here.

sklearn/ensemble/_forest.py

jeremiedbb · 2024-12-11T17:41:26Z

sklearn/linear_model/_coordinate_descent.py


    def __sklearn_tags__(self):
        tags = super().__sklearn_tags__()
+        tags.input_tags.sparse = False


I assume that it's because you set it to True on the base class. I wonder if we should try to always set this tag only when it takes a non-default value. It would mean that we'd have to leave it at the default value on the base class and set it to True on most of its child classes here. What do you think @glemaitre?
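
The "only set non-default values" convention from this comment can be sketched with toy classes. `ToyInputTags`, `ToyTags` and the three estimator classes are hypothetical stand-ins, not scikit-learn's own tag dataclasses, and the sketch assumes the default value of sparse is False.

```python
from dataclasses import dataclass, field


@dataclass
class ToyInputTags:
    sparse: bool = False  # assumed default


@dataclass
class ToyTags:
    input_tags: ToyInputTags = field(default_factory=ToyInputTags)


class ToyLinearBase:
    def __sklearn_tags__(self):
        # The base class leaves sparse at its default.
        return ToyTags()


class ToySparseModel(ToyLinearBase):
    def __sklearn_tags__(self):
        tags = super().__sklearn_tags__()
        tags.input_tags.sparse = True  # non-default, so set explicitly
        return tags


class ToyMultiTaskModel(ToyLinearBase):
    # Dense only: the default already says so, no override needed.
    pass


for cls in (ToySparseModel, ToyMultiTaskModel):
    print(cls.__name__, cls().__sklearn_tags__().input_tags.sparse)
```

With this layout no class ever has to reset the tag back to False; the cost is one explicit override on each child class that does support sparse input.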

sklearn/linear_model/_ransac.py

sklearn/linear_model/_ridge.py

jeremiedbb · 2024-12-12T12:45:52Z

sklearn/pipeline.py

+            # WARNING: the sparse tag can be incorrect.
+            # Some Pipelines accepting sparse data are wrongly tagged sparse=False.
+            # For example Pipeline([PCA(), estimator]) accepts sparse data
+            # even if the estimator doesn't as PCA outputs a dense array.
+            tags.input_tags.sparse = all(
+                get_tags(step).input_tags.sparse
+                for name, step in self.steps
+                if step != "passthrough"
+            )


This makes me wonder if we aren't expecting too much from the sparse tag. Maybe we should narrow its scope to the estimator level, regardless of whether or not its sub-estimator(s), if any, do support sparse input. In that case sparse means "this estimator can handle sparse input but its inner estimator(s) (if it's a meta-estimator) might not in which case it's undefined behavior". What do you think @glemaitre and @adrinjalali ?

As discussed in person with Guillaume, another option is to change sparse to accept True, False and Maybe (or Undefined, name tbd).
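
A three-valued tag along these lines might look like the following sketch. Nothing here exists in scikit-learn: the `SparseSupport` enum, its member names and the `combine_pipeline` rule are all assumptions made for illustration, and the rule is only one possible choice.

```python
from enum import Enum


class SparseSupport(Enum):
    YES = "yes"
    NO = "no"
    MAYBE = "maybe"


def combine_pipeline(step_tags):
    """Toy rule for a pipeline-like meta-estimator (hypothetical)."""
    if not step_tags:
        return SparseSupport.MAYBE
    # The first step always receives the raw input, so a NO there is definite.
    if step_tags[0] is SparseSupport.NO:
        return SparseSupport.NO
    # If every step accepts sparse input, nothing can reject it downstream.
    if all(tag is SparseSupport.YES for tag in step_tags):
        return SparseSupport.YES
    # Otherwise it depends on whether earlier steps densify the data.
    return SparseSupport.MAYBE


YES, NO = SparseSupport.YES, SparseSupport.NO
print(combine_pipeline([YES, YES]))
print(combine_pipeline([NO, YES]))
print(combine_pipeline([YES, NO]))
```

The third case is the one a boolean tag cannot express: a dense-only step placed after a step that may or may not densify its output.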

jeremiedbb · 2024-12-12T12:49:28Z

sklearn/pipeline.py

+    def __sklearn_tags__(self):
+        tags = super().__sklearn_tags__()
+        try:
+            tags.input_tags.sparse = all(
+                get_tags(trans).input_tags.sparse
+                for name, trans in self.transformer_list
+                if trans not in {"passthrough", "drop"}
+            )
+        except Exception:
+            # If `transformer_list` does not comply with our API (list of tuples)
+            # then it will fail. In this case, we assume that `sparse` is False
+            # but the parameter validation will raise an error during `fit`.
+            pass  # pragma: no cover
+        return tags


My previous comment is also motivated by this pattern, which shows up a few times. We're trying to inspect the full compute graph ahead of time, which is something scikit-learn is really not good at by design.

sklearn/semi_supervised/_self_training.py

jeremiedbb · 2024-12-12T13:19:31Z

Also, I think we should include it in 1.6.1 because it fixes tags that we just made public.

Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>

jeremiedbb

LGTM.

I think there are still things to improve with the sparse tag but it can be done in a separate PR. I'm in favor of merging this one as is to not delay the 1.6.1 release further.

glemaitre · 2025-01-02T12:06:14Z

I think there are still things to improve with the sparse tag but it can be done in a separate PR. I'm in favor of merging this one as is to not delay the 1.6.1 release further.

Yep, it is indeed better than what we had. We can improve it later and discuss having a ternary option instead of the binary one.

glemaitre · 2025-01-02T12:06:31Z

Thanks @antoinebaker. It was a big one ;)

Co-authored-by: Guillaume Lemaitre <guillaume@probabl.ai> Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>

ItsIronOxide

Any help or guidance on this issue is very much appreciated.

ItsIronOxide · 2025-01-31T14:44:26Z

sklearn/linear_model/_base.py

    def __sklearn_tags__(self):
        tags = super().__sklearn_tags__()
-        tags.input_tags.sparse = True
+        tags.input_tags.sparse = not self.positive


My team changed to scikit-learn v1.6.1 this week. We had v1.5.1 before. Our code crashes in this exact line with the error "Unexpected <class 'AttributeError'>. 'LinearRegression' object has no attribute 'positive'".

We cannot deploy in production because of this. I am desperate enough to come here to ask for help. I do not understand why it would complain that the attribute does not exist given that we were using v1.5.1 before and the attribute has existed for 4 years now.

@ItsIronOxide feel free to open a new issue with a reproducer, and ping me. Happy to have a look and help out.
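
The report above does not say how the failing estimator was constructed, so the actual cause is not known from this thread. As a generic illustration only, the sketch below shows one way a tag that reads `self.positive` can raise this kind of AttributeError: a subclass whose `__init__` never sets the attribute. `ToyRegressor`, `CustomRegressor` and both `sparse_tag` methods are hypothetical names, not scikit-learn code.

```python
class ToyRegressor:
    def __init__(self, positive=False):
        self.positive = positive

    def sparse_tag(self):
        # Mirrors the shape of `tags.input_tags.sparse = not self.positive`.
        return not self.positive

    def sparse_tag_defensive(self):
        # Falls back to a default when the attribute was never set.
        return not getattr(self, "positive", False)


class CustomRegressor(ToyRegressor):
    def __init__(self, alpha=1.0):
        # Does not call super().__init__(), so `positive` is never set.
        self.alpha = alpha


model = CustomRegressor()
try:
    model.sparse_tag()
    error = None
except AttributeError as exc:
    error = str(exc)

print(error)
print(model.sparse_tag_defensive())
```

A minimal reproducer of this kind, built from the real estimator, is what the reply above asks for in a new issue.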

add input_tags.sparse and test

a896f2e

antoinebaker changed the title ~~add input_tags.sparse and test~~ FIX Check and correct the input_tags.sparse flag Oct 31, 2024

antoinebaker added 2 commits November 4, 2024 10:19

fix LinearRegression tag

b0e605d

changelog

6c72527

antoinebaker marked this pull request as ready for review November 4, 2024 12:57

Merge branch 'main' into fix_input_sparse_tag

22b5a6b

ogrisel reviewed Nov 4, 2024

doc/whats_new/upcoming_changes/sklearn.utils/30187.api.rst

changelog

c306593

antoinebaker added 3 commits November 14, 2024 16:49

fix column transformer tag

6aadc95

change error message

7979fa9

changelog

6d7c2b1

glemaitre self-requested a review November 15, 2024 14:03

Merge remote-tracking branch 'upstream/main' into fix_input_sparse_tag

f73913f

glemaitre reviewed Nov 15, 2024

antoinebaker and others added 7 commits November 15, 2024 18:19

fix passthrough sparse tag

7005797

fix SelfTrainingClassifier

6962aa9

Apply suggestions from code review

2668fea

Co-authored-by: Guillaume Lemaitre <guillaume@probabl.ai>

add suggestions

705c414

fix multitask

5178539

black formatting

e5a4458

Merge branch 'main' into fix_input_sparse_tag

64ddb62

antoinebaker mentioned this pull request Nov 18, 2024

Check sample weight equivalence on sparse data #30137

Merged

1 task

antoinebaker added 2 commits November 18, 2024 11:56

add meta test

d6f277f

add feature union

0bcc765

change pipeline tag

a9fb7d7

ogrisel mentioned this pull request Nov 29, 2024

API drop Tags.regressor_tags.multi_label #30373

Merged

antoinebaker added 3 commits November 29, 2024 16:16

tag RobustScaler

e85f94a

tag RANSAC

8f2f3db

multi_output in LinearModelCV

3ac38e1

antoinebaker commented Dec 2, 2024

sklearn/utils/estimator_checks.py

glemaitre approved these changes Dec 8, 2024

antoinebaker and others added 5 commits December 9, 2024 10:32

Merge remote-tracking branch 'upstream/main' into fix_input_sparse_tag

42d815e

raise from exception

425b473

no cover

17ccb72

test raise inappropriate error

8429e8f

Merge branch 'main' into fix_input_sparse_tag

87e5211

jeremiedbb reviewed Dec 12, 2024

jeremiedbb added this to the 1.6.1 milestone Dec 12, 2024

antoinebaker and others added 3 commits December 12, 2024 17:46

Apply suggestions from code review

97d700d

Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>

suggestions from code review

66c3bd9

Merge remote-tracking branch 'upstream/main' into pr/antoinebaker/30187

b54a408

jeremiedbb approved these changes Jan 2, 2025

glemaitre merged commit 446adff into scikit-learn:main Jan 2, 2025

stanmart mentioned this pull request Jan 3, 2025

Daily run failure: Unit tests Quantco/glum#892

Closed

antoinebaker mentioned this pull request Jan 8, 2025

Filtering on the sparse tag to yield checks #30608

Merged

jeremiedbb added a commit that referenced this pull request Jan 9, 2025

FIX Check and correct the input_tags.sparse flag (#30187)

761e10c

Co-authored-by: Guillaume Lemaitre <guillaume@probabl.ai> Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>

aazuspan mentioned this pull request Jan 16, 2025

Fix failing tests caused by inherited sparse tag in scikit-learn==1.6.1 lemma-osu/sknnr#83

Closed

ItsIronOxide reviewed Jan 31, 2025

ItsIronOxide mentioned this pull request Jan 31, 2025

Unexpected <class 'AttributeError'>. 'LinearRegression' object has no attribute 'positive #30744

Closed
