ENH Add libsvm-like calibration procedure to CalibratedClassifierCV #17856
ogrisel merged 52 commits into scikit-learn:master
Conversation
@NicolasHug, this is probably not going to be a problem but I thought I would mention it. We use
NicolasHug
left a comment
Thanks @lucyleeow , made a first pass.
I think the refactoring makes sense, WDYT?
I haven't looked in detail at the changes related to the label encoders though. Are these needed here or can they wait for another PR?
sklearn/calibration.py
Outdated
    # When binary, proba of clf_fitted.classes_[1]
    # output but `pos_class_indices` = 0

It seems that the grammar is incorrect?
I amended to:
    # When binary, df consists of predictions for
    # clf_fitted.classes_[1] but `pos_class_indices` = 0
not sure if this is clearer
sklearn/calibration.py
Outdated
    if base_estimator_method == "decision_function":
        if df.ndim == 1:
            df = df[:, np.newaxis]
    else:
        if len(self.label_encoder_.classes_) == 2:
            df = df[:, 1:]
The duplication is a bit unfortunate; there should be a way to merge this logic with that of `_get_predictions`.
Maybe by extracting this into a _reshape_df utility?
Or maybe we can even let _get_predictions call cross_val_predict somehow, so that we can also merge the method checks above.
Also is df = df[:, 1:] just equivalent to df = df[:, 1]?
> Or maybe we can even let `_get_predictions` call `cross_val_predict` somehow, so that we can also merge the method checks above.

This is what I was trying to do before - but `_get_predictions` is used within `_fit_calibrator`, which is within the cv for loop. This is why I wanted to put the cv for loop within `_get_predictions` but it got complicated.

I can add a `_reshape_df` function but I have a little dislike for the `df` naming as it is not always `decision_function`. (I also first think df=dataframe but maybe that is just me.) WDYT?
(of course if this naming is used elsewhere, happy to keep)
Also I can't think of an elegant way to add the _reshape_df function as we get the predictions of the classifier in _get_predictions but we use cross_val_predict to get the predictions in the case above...
Yes I was confused by df too. We can rename it as predictions or just preds.
> Also I can't think of an elegant way to add the `_reshape_df` [...]
Would it help to also add a _get_method helper, like
    def _get_method(clf):
        # return decision_function (the actual method) or predict_proba or raise error?
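A minimal sketch of what such a helper might look like, based on the `hasattr` checks quoted later in this thread (illustrative only, not the PR's final code):

```python
def _get_method(clf):
    """Return the classifier's scoring method, preferring decision_function.

    Sketch of the helper discussed above; the actual implementation in the
    PR may differ.
    """
    if hasattr(clf, "decision_function"):
        return clf.decision_function
    if hasattr(clf, "predict_proba"):
        return clf.predict_proba
    raise RuntimeError(
        "The classifier should implement decision_function or predict_proba."
    )
```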
Looking into it, _fit_calibrator returns a _CalibratedClassifierPipeline instance for each cv fold. For ensemble=False, this won't work (for how _fit_calibrator works atm) - we want to fit once with all the predictions. We would need to take _fit_calibrator out of the cv for loop.
> Looking into it, `_fit_calibrator` returns a `_CalibratedClassifierPipeline` instance for each cv fold
Only when ensemble=True, right?
There's no CV loop when ensemble=False, the "loop" is in cross_val_predict and we only need to call _fit_calibrator once. Maybe I misunderstood?
This is difficult over text! I may be confused too.
At the moment we have:
    class CalibratedClassifierCV():
        def fit():
            if self.cv == "prefit":
                # `_fit_calibrator` using all X and y
                # append one `_CalibratedClassifierPipeline` instance to `self.calibrated_classifiers_`
            else:
                if ensemble:
                    for train, test in cv.split(X, y):
                        # fit classifier
                        # call `_fit_calibrator`, producing a `_CalibratedClassifierPipeline` instance
                        # append the `_CalibratedClassifierPipeline` instance to `self.calibrated_classifiers_`,
                        # ending up with `n_cv` `_CalibratedClassifierPipeline` instances
                else:
                    # Use `cross_val_predict` to get all `preds`
                    # call `_fit_calibrator` using `preds` and y
                    # append one `_CalibratedClassifierPipeline` instance to `self.calibrated_classifiers_`

We want to get rid of the last else as there is duplication of code there. With the new `_get_prediction_method` we have:
    class CalibratedClassifierCV():
        def fit():
            if self.cv == "prefit":
                # `_fit_calibrator` using all X and y
                # append one `_CalibratedClassifierPipeline` instance to `self.calibrated_classifiers_`
            else:
                for train, test in cv.split(X, y):
                    # fit classifier
                    # call `_fit_calibrator`, which first calls `_get_predictions`.
                    # In case of `ensemble=False`, we use `partial(cross_val_predict())` (this works fine)
                    # `_fit_calibrator` also fits calibrators and produces a `_CalibratedClassifierPipeline` instance
                    # Only one `_CalibratedClassifierPipeline` is required when `ensemble=False`
                    # (i.e. shouldn't be in the for loop)
                    # when `ensemble=True`, `n_cv` `_CalibratedClassifierPipeline` instances are needed
                    # (i.e., should be in the for loop)
Sorry if I wasn't clear! I don't think we can move away from the first version (with the last else). The duplication of code that I wanted to avoid was:

- getting the right method

      if hasattr(base_estimator, "decision_function"):
          base_estimator_method = "decision_function"
      elif hasattr(base_estimator, "predict_proba"):
          base_estimator_method = "predict_proba"

- reshaping the predictions

      if base_estimator_method == "decision_function":
          if preds.ndim == 1:
              preds = preds[:, np.newaxis]
      else:
          if len(self.label_encoder_.classes_) == 2:
              preds = preds[:, 1]

Right now, these 2 things happen both in `CalibratedClassifierCV.fit` (for `ensemble=False`) and in `_get_predictions` (for `ensemble=True`).
So to expand on my previous suggestion:
    def get_prediction_method(clf, return_string=False):
        # return clf.decision_function or clf.predict_proba or their corresponding string
        # we need strings when using cross_val_predict

    def _get_predictions(X, pred_func):
        # pred_func is either a method of the clf, or a partial cross_val_predict
        preds = pred_func(X)
        # reshape preds here
        return preds

    class CalibratedClassifierCV():
        def fit():
            if self.cv == "prefit":
                # `_fit_calibrator` using all X and y
                # append one `_CalibratedClassifierPipeline` instance to `self.calibrated_classifiers_`
            else:
                if ensemble:
                    for train, test in cv.split(X, y):
                        # fit classifier on train
                        preds = _get_predictions(test, get_prediction_method(clf, return_string=False))
                        # call `_fit_calibrator`, producing a `_CalibratedClassifierPipeline` instance
                        # append the `_CalibratedClassifierPipeline` instance to `self.calibrated_classifiers_`,
                        # ending up with `n_cv` `_CalibratedClassifierPipeline` instances
                else:
                    preds = _get_predictions(X, partial(cross_val_predict, ..., method=get_prediction_method(clf, return_string=True)))
                    # call `_fit_calibrator` using `preds` and y
                    # append one `_CalibratedClassifierPipeline` instance to `self.calibrated_classifiers_`

I think we might need to modify `_fit_calibrator` to only accept `preds` and not call `_get_predictions`.
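To make the pattern above concrete, here is a small, self-contained sketch (illustrative only, not the PR's actual code) in which `pred_func` is either a bound method of an already-fitted classifier or a `partial` around `cross_val_predict`, and a single helper does the reshaping:

```python
from functools import partial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, random_state=0)
clf = LinearSVC().fit(X, y)


def _get_predictions(X, pred_func, n_classes):
    # Call pred_func and reshape: 1d decision values become a column vector,
    # binary predict_proba output keeps only the positive-class column.
    preds = pred_func(X)
    if preds.ndim == 1:
        preds = preds[:, np.newaxis]
    elif n_classes == 2:
        preds = preds[:, 1:]
    return preds


# ensemble=True style: predictions come from the fitted classifier itself
preds_fitted = _get_predictions(X, clf.decision_function, n_classes=2)

# ensemble=False style: cross-validated predictions via a partial
cv_pred_func = partial(cross_val_predict, LinearSVC(), y=y, cv=5,
                       method="decision_function")
preds_cv = _get_predictions(X, cv_pred_func, n_classes=2)

print(preds_fitted.shape, preds_cv.shape)  # (200, 1) (200, 1)
```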
FYI:

> Also is df = df[:, 1:] just equivalent to df = df[:, 1]?

It seems that it is not the same: you get 2d output with `df[:, 1:]`.
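A quick numpy check of this point (illustrative, not from the PR):

```python
import numpy as np

df = np.arange(6).reshape(3, 2)
print(df[:, 1].shape)   # (3,)   -> 1d
print(df[:, 1:].shape)  # (3, 1) -> 2d
```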
There wasn't a big change to this. Previously we passed the classes to [sklearn/calibration.py, lines 180 to 181 in fd23727] and then instantiated a new one at [sklearn/calibration.py, lines 325 to 329 in fd23727] (note: I just changed it so we do not pass the classes but the fitted [...])

I've made the changes. I like that [...]. I do think that it is a bit of duplication to have both `_check_classifier_response_method` and `get_prediction_method`; [...] would do the trick. But maybe that is a separate issue.
NicolasHug
left a comment
Thanks @lucyleeow , made another pass.
I think the use of partial is appropriate and the refactoring looks good to me. We'll need tests for the new ensemble parameter ;)
> I do think that it is a bit of duplication to have both `_check_classifier_response_method` and `get_prediction_method` [...]
Yes we can consider having a unified utility, but indeed this would be better done in a separate PR.
ping @NicolasHug @ogrisel
ogrisel
left a comment
This looks very good to me. Thank you very much @lucyleeow for your patience.
Here are some fixes / small suggestions for further improvements but otherwise it seems ready for second review and merge.
Please add a new entry to doc/whats_new/0.24.rst.
Thanks for the review @ogrisel, suggestions added!
glemaitre
left a comment
It looks good. Just a couple of nitpicks.
Merged! Thank you very much @lucyleeow!
        sparse.csr_matrix(X_test))]:
        cal_clf = CalibratedClassifierCV(
-           clf, method=method, cv=2, ensemble=ensemble
+           clf, method=method, cv=5, ensemble=ensemble

I used cv=5 whenever we assess the quality of the calibration, because cv=2 trains models on a much smaller dataset and can degrade the base model performance, hence making the test quite brittle.
Reference Issues/PRs

closes #16145
Follows from PR #16167
The first step was to refactor CalibratedClassifierCV, which was done in #17803. This is a branch off of #17803. (will close #17803)

What does this implement/fix? Explain your changes.

Add an `ensemble` option to `CalibratedClassifierCV`. `ensemble=True` would be the default and would be the current implementation. `ensemble=False` would be the same as the libsvm implementation (used when `probability=True`):

When `cv` is not `'prefit'`:

Note `ensemble=False` is the procedure described in the original Platt paper (see #16145 (comment))

Any other comments?
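For reference, a minimal usage sketch of the new option (assuming scikit-learn >= 0.24, where `ensemble` was introduced; parameter and attribute names below follow the released API):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, random_state=0)

# ensemble=True (default, current behaviour): one (classifier, calibrator)
# pair is fitted per CV fold and their predictions are averaged.
ens = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5,
                             ensemble=True).fit(X, y)

# ensemble=False (libsvm-like): unbiased predictions are obtained with
# cross_val_predict, a single calibrator is fitted on them, and the base
# classifier is refit on all of the data.
single = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5,
                                ensemble=False).fit(X, y)

print(len(ens.calibrated_classifiers_), len(single.calibrated_classifiers_))  # 5 1
```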