ENH Support pipelines in CalibratedClassifierCV #17546
glemaitre merged 27 commits into scikit-learn:master from
Conversation
glemaitre
left a comment
A couple of thoughts to investigate.
sklearn/calibration.py
Outdated
```python
                         force_all_finite=False, allow_nd=True)
X, y = indexable(X, y)
le = LabelBinarizer().fit(y)
self.classes_ = le.classes_
```
I am a bit curious about these lines. In case the model has been prefitted, would it not be best to use prefitted_model.classes_ instead? Meaning that we would only run this part when the model is not prefitted.
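A minimal sketch of the idea (prefitted_model is an illustrative name, not code from the PR): when the estimator is prefitted, its classes_ attribute can be reused instead of re-deriving the classes from the calibration y — which matters if the calibration set happens to miss a class the model was trained on.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelBinarizer

X = [[0.0], [1.0], [2.0], [3.0]]
y = ["a", "a", "b", "b"]

# The model fitted ahead of time, as in the cv='prefit' case.
prefitted_model = LogisticRegression().fit(X, y)

# Re-deriving the classes from y (what the diff above does) ...
le = LabelBinarizer().fit(y)
# ... agrees with the prefitted model only when y covers every class:
assert list(le.classes_) == list(prefitted_model.classes_)
```

If the calibration y lacked class "b", le.classes_ would contain only "a" while prefitted_model.classes_ would still contain both, hence the suggestion to reuse the latter.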
Hi @glemaitre and @lucyleeow, there is a PR related to #8710 that has been waiting for review for a year and is still alive. It isn't clear to me how those two PRs interact, but may I suggest having a look at #13060 before moving on with this one? Thanks!
They will be dissociated (even if we are going to have some git conflicts). This one is about internal
So most probably the assert on
@glemaitre I'm not sure about how I addressed this:
Not sure if this is a good approach, so happy to change.
ping @glemaitre (the red is codecov)
glemaitre
left a comment
We will need an entry in what's new as well
@glemaitre I amended it such that the attributes [...]. I think we could/should add a test that checks for when [...]
I realise why the line [...] was red. I think it should [...], but I have fixed it here; let me know if you want me to separate it into its own PR.
I've also added a test for the cv iterator and a test for the default [...]
thomasjpfan
left a comment
Thank you for the PR @lucyleeow !
sklearn/calibration.py
Outdated
```python
if isinstance(self.base_estimator, Pipeline):
    estimator = self.base_estimator[-1]
else:
    estimator = self.base_estimator
check_is_fitted(estimator)
self.n_features_in_ = estimator.n_features_in_
self.classes_ = estimator.classes_
```
I can't find the discussion about this.
It seems weird that CalibratedClassifierCV takes the n_features_in_ from the final step. If the first step of the Pipeline was a feature selector, then CalibratedClassifierCV would not have the correct n_features_in_?
Also, for third-party estimators in a pipeline, if they do not have n_features_in_ or classes_ this would fail. I would prefer being slightly more lenient:
```python
with suppress(AttributeError):
    self.n_features_in_ = base_estimator.n_features_in_
with suppress(AttributeError):
    self.classes_ = base_estimator.classes_
```
You are right about n_features_in_: it should be the value of the first step and not the last one.
Regarding classes_, I would be a bit more conservative. Our estimator_check explicitly checks for classes_, so this is a kind of contract: if you write a scikit-learn classifier, you should have this attribute.
And for n_features_in_, we can be lenient in this case because we still don't impose anything yet.
Yes, you're right. I got mixed up.
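A small runnable illustration of the point above, assuming a Pipeline whose first step is a feature selector: the first step sees the raw input width, while the final estimator only sees the selected columns, so taking n_features_in_ from the last step gives the wrong answer.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = make_classification(n_features=20, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(k=5)),   # keeps only 5 of the 20 features
    ("clf", LogisticRegression()),
]).fit(X, y)

# The first step sees the raw 20 input features ...
assert pipe[0].n_features_in_ == 20
# ... but the final estimator only sees the 5 selected ones.
assert pipe[-1].n_features_in_ == 5
```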
Regarding classes_, I would be a bit more conservative. Our estimator_check explicitly checks for classes_, so this is a kind of contract: if you write a scikit-learn classifier, you should have this attribute.
As in, don't suppress the exception when assigning this attribute?
Also out of interest, which of:

```python
if hasattr(calib_clf, "n_features_in_"):
    self.n_features_in_ = base_estimator.n_features_in_
```

and

```python
with suppress(AttributeError):
    self.n_features_in_ = base_estimator.n_features_in_
```

would be preferred here?
I used the first so I don't need to import suppress, but happy to change.
I am actually not sure which is more explicit. I am not used to suppress, but I find it elegant and it should be more Pythonic. Go ahead with it.
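For reference, the two alternatives behave identically when the attribute is missing; a minimal, hypothetical sketch (Wrapper, HasIt, and LacksIt are illustrative names):

```python
from contextlib import suppress


class Wrapper:
    def copy_attr_hasattr(self, source):
        # Explicit LBYL check, no extra import needed:
        if hasattr(source, "n_features_in_"):
            self.n_features_in_ = source.n_features_in_

    def copy_attr_suppress(self, source):
        # Equivalent EAFP style: the assignment is skipped
        # when the attribute access raises AttributeError.
        with suppress(AttributeError):
            self.n_features_in_ = source.n_features_in_


class HasIt:
    n_features_in_ = 7


class LacksIt:
    pass


a, b = Wrapper(), Wrapper()
a.copy_attr_hasattr(HasIt())
b.copy_attr_suppress(LacksIt())
assert a.n_features_in_ == 7
assert not hasattr(b, "n_features_in_")
```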
WDYT @thomasjpfan ?
I agree with being more strict with classes_ and lenient with n_features_in_.
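The agreed-upon policy can be sketched as follows (a hypothetical helper with illustrative names, not the actual scikit-learn code): classes_ is part of the classifier contract, so a missing attribute should raise, while n_features_in_ is not enforced yet, so a missing attribute is silently skipped.

```python
from contextlib import suppress


def copy_fitted_attributes(calibrated, estimator):
    # Strict: raises AttributeError if the estimator lacks classes_.
    calibrated.classes_ = estimator.classes_
    # Lenient: set n_features_in_ only when the estimator exposes it.
    with suppress(AttributeError):
        calibrated.n_features_in_ = estimator.n_features_in_


class _Calibrated:
    pass


class _Clf:
    classes_ = [0, 1]  # a classifier must expose classes_; no n_features_in_


c = _Calibrated()
copy_fitted_attributes(c, _Clf())
assert c.classes_ == [0, 1]
assert not hasattr(c, "n_features_in_")
```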
Thanks @thomasjpfan and @glemaitre, I think I've made the suggested changes.
thomasjpfan
left a comment
LGTM Thank you @lucyleeow !
Thanks @lucyleeow
Thank you!
@lucyleeow I've tried using this (pipeline base estimator, cv='prefit') and now the .fit() call is working, but I ran into a similar issue when I tried to predict, since there is also a data validation step in predict_proba. So I can fit an estimator, but I don't see how to use it. What am I missing here?
@odedbd Please open a thread with your question on GitHub Discussions: https://github.com/scikit-learn/scikit-learn/discussions. Commenting on a merged PR will not attract enough traffic to get an answer, and your thread will be useful for the entire community since this is a usage question. If it leads to a missing feature, we can create an associated issue and PR.
Reference Issues/PRs
Towards #8710 (this is the first/main issue but other issues are mentioned in the comments)
What does this implement/fix? Explain your changes.
Don't _validate_data if a prefit estimator is used in CalibratedClassifierCV.
Includes test.
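A usage sketch of what this change enables, assuming a scikit-learn release from around this PR (where cv='prefit' is still supported; later releases deprecate it): fit a pipeline on training data, then calibrate the prefit pipeline on held-out data.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = make_classification(n_features=20, random_state=0)
X_train, X_calib, y_train, y_calib = train_test_split(X, y, random_state=0)

# Fit the full pipeline first, then calibrate it on held-out data.
pipe = Pipeline([
    ("select", SelectKBest(k=5)),
    ("clf", LogisticRegression()),
]).fit(X_train, y_train)

# With this PR, a prefit pipeline can be passed directly.
calibrated = CalibratedClassifierCV(pipe, cv="prefit").fit(X_calib, y_calib)
proba = calibrated.predict_proba(X_calib)
assert proba.shape == (len(X_calib), 2)
```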
Any other comments?