ENH Support pipelines in CalibratedClassifierCV #17546
glemaitre merged 27 commits into scikit-learn:master from
Conversation
glemaitre
left a comment
A couple of thoughts to investigate.
sklearn/calibration.py
Outdated
```python
                         force_all_finite=False, allow_nd=True)
X, y = indexable(X, y)
le = LabelBinarizer().fit(y)
self.classes_ = le.classes_
```
I am a bit curious about these lines. In case the model has been prefitted, would it not be best to use prefitted_model.classes_ instead? Meaning that we would only run this part when the model is not prefitted.
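A minimal sketch of the idea (prefitted_model is an illustrative name, not code from the PR): when the estimator is prefitted, its classes_ attribute can be reused instead of re-deriving the classes from the calibration y — which matters if the calibration set happens to miss a class the model was trained on.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelBinarizer

X = [[0.0], [1.0], [2.0], [3.0]]
y = ["a", "a", "b", "b"]

# The model fitted ahead of time, as in the cv='prefit' case.
prefitted_model = LogisticRegression().fit(X, y)

# Re-deriving the classes from y (what the diff above does) ...
le = LabelBinarizer().fit(y)
# ... agrees with the prefitted model only when y covers every class:
assert list(le.classes_) == list(prefitted_model.classes_)
```

If the calibration y lacked class "b", le.classes_ would contain only "a" while prefitted_model.classes_ would still contain both, hence the suggestion to reuse the latter.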
Hi @glemaitre and @lucyleeow, there is a PR related to #8710 that has been waiting for review for a year and is still alive. It isn't clear to me how those two PRs interact, but may I suggest having a look at #13060 before moving on with this one? Thanks!
They will be dissociated (even if we are going to have some git conflicts). This one is about internal
So most probably the assert on
@glemaitre I'm not sure about how I addressed this:
Not sure if this is a good approach, so happy to change.
ping @glemaitre (the red is codecov)
glemaitre
left a comment
We will need an entry in what's new as well
@glemaitre I amended it such that the attributes [...]. I think we could/should add a test that checks for when [...]
I realise why the line [...] was red. I think it should [...], but I have fixed it here; let me know if you want me to separate it into its own PR.
I've also added a test for the cv iterator and a test for the default [...]
thomasjpfan
left a comment
Thank you for the PR @lucyleeow !
sklearn/calibration.py
Outdated
```python
if isinstance(self.base_estimator, Pipeline):
    estimator = self.base_estimator[-1]
else:
    estimator = self.base_estimator
check_is_fitted(estimator)
self.n_features_in_ = estimator.n_features_in_
self.classes_ = estimator.classes_
```
I can't find the discussion about this.
It seems weird that CalibratedClassifierCV takes the n_features_in_ from the final step. If the first step of the Pipeline was a feature selector, then CalibratedClassifierCV would not have the correct n_features_in_?
Also, for third-party estimators in a pipeline, if they do not have n_features_in_ or classes_ this would fail. I would prefer being slightly more lenient:
```python
with suppress(AttributeError):
    self.n_features_in_ = base_estimator.n_features_in_
with suppress(AttributeError):
    self.classes_ = base_estimator.classes_
```
You are right about n_features_in_: it should be the value of the first step and not the last one.
Regarding classes_, I would be a bit more conservative. Our estimator_check explicitly checks for classes_, so this is a kind of contract: if you write a scikit-learn classifier, you should have this attribute.
And for n_features_in_, we can be lenient in this case because we still don't impose anything yet.
Yes, you're right. I got mixed up.
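A small runnable illustration of the point above, assuming a Pipeline whose first step is a feature selector: the first step sees the raw input width, while the final estimator only sees the selected columns, so taking n_features_in_ from the last step gives the wrong answer.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = make_classification(n_features=20, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(k=5)),   # keeps only 5 of the 20 features
    ("clf", LogisticRegression()),
]).fit(X, y)

# The first step sees the raw 20 input features ...
assert pipe[0].n_features_in_ == 20
# ... but the final estimator only sees the 5 selected ones.
assert pipe[-1].n_features_in_ == 5
```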
Regarding classes_, I would be a bit more conservative. Our estimator_check explicitly checks for classes_, so this is a kind of contract: if you write a scikit-learn classifier, you should have this attribute.
As in, don't suppress the exception when assigning this attribute?
Also out of interest, which of:

```python
if hasattr(calib_clf, "n_features_in_"):
    self.n_features_in_ = base_estimator.n_features_in_
```

and

```python
with suppress(AttributeError):
    self.n_features_in_ = base_estimator.n_features_in_
```

would be preferred here?
I used the first so I don't need to import suppress, but happy to change.
I am actually not sure which is more explicit. I am not used to suppress, but I find it elegant and it should be more Pythonic. Go ahead with it.
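For reference, the two alternatives behave identically when the attribute is missing; a minimal, hypothetical sketch (Wrapper, HasIt, and LacksIt are illustrative names):

```python
from contextlib import suppress


class Wrapper:
    def copy_attr_hasattr(self, source):
        # Explicit LBYL check, no extra import needed:
        if hasattr(source, "n_features_in_"):
            self.n_features_in_ = source.n_features_in_

    def copy_attr_suppress(self, source):
        # Equivalent EAFP style: the assignment is skipped
        # when the attribute access raises AttributeError.
        with suppress(AttributeError):
            self.n_features_in_ = source.n_features_in_


class HasIt:
    n_features_in_ = 7


class LacksIt:
    pass


a, b = Wrapper(), Wrapper()
a.copy_attr_hasattr(HasIt())
b.copy_attr_suppress(LacksIt())
assert a.n_features_in_ == 7
assert not hasattr(b, "n_features_in_")
```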
WDYT @thomasjpfan ?
I agree with being more strict with classes_ and lenient with n_features_in_.
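The agreed-upon policy can be sketched as follows (a hypothetical helper with illustrative names, not the actual scikit-learn code): classes_ is part of the classifier contract, so a missing attribute should raise, while n_features_in_ is not enforced yet, so a missing attribute is silently skipped.

```python
from contextlib import suppress


def copy_fitted_attributes(calibrated, estimator):
    # Strict: raises AttributeError if the estimator lacks classes_.
    calibrated.classes_ = estimator.classes_
    # Lenient: set n_features_in_ only when the estimator exposes it.
    with suppress(AttributeError):
        calibrated.n_features_in_ = estimator.n_features_in_


class _Calibrated:
    pass


class _Clf:
    classes_ = [0, 1]  # a classifier must expose classes_; no n_features_in_


c = _Calibrated()
copy_fitted_attributes(c, _Clf())
assert c.classes_ == [0, 1]
assert not hasattr(c, "n_features_in_")
```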
Thanks @thomasjpfan and @glemaitre, I think I've made the suggested changes.
thomasjpfan
left a comment
LGTM Thank you @lucyleeow !
Thanks @lucyleeow
Thank you!
@lucyleeow I've tried using this (pipeline base estimator, cv='prefit') and now the .fit() call is working, but I ran into a similar issue when I tried to predict, since there is also a data validation step in predict_proba. So I can fit an estimator, but I don't see how to use it. What am I missing here?
@odedbd Please open a thread with your question on GitHub Discussions: https://github.com/scikit-learn/scikit-learn/discussions. Commenting on a merged PR will not attract enough traffic to get an answer, and your thread will be useful for the entire community since this is a usage question. If it leads to a missing feature, we can create an associated issue and PR.
Reference Issues/PRs
Towards #8710 (this is the first/main issue but other issues are mentioned in the comments)
What does this implement/fix? Explain your changes.
Don't _validate_data if a prefit estimator is used in CalibratedClassifierCV.
Includes test.
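A usage sketch of what this change enables, assuming a scikit-learn release from around this PR (where cv='prefit' is still supported; later releases deprecate it): fit a pipeline on training data, then calibrate the prefit pipeline on held-out data.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = make_classification(n_features=20, random_state=0)
X_train, X_calib, y_train, y_calib = train_test_split(X, y, random_state=0)

# Fit the full pipeline first, then calibrate it on held-out data.
pipe = Pipeline([
    ("select", SelectKBest(k=5)),
    ("clf", LogisticRegression()),
]).fit(X_train, y_train)

# With this PR, a prefit pipeline can be passed directly.
calibrated = CalibratedClassifierCV(pipe, cv="prefit").fit(X_calib, y_calib)
proba = calibrated.predict_proba(X_calib)
assert proba.shape == (len(X_calib), 2)
```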
Any other comments?