[MRG] Implement calibration loss metrics #11096
samronsin wants to merge 97 commits into scikit-learn:main from
Conversation
jnothman
left a comment
You will also need to update:
- doc/modules/classes.rst
- doc/modules/model_evaluation.rst
- sklearn/metrics/tests/test_common.py
jnothman
left a comment
Should this be available as a scorer for cross validation? Should it be available for CalibratedClassifierCV?
You don't need to test sample weights, as the common tests will cover that.

I'm not sure it makes sense to have this as a ready-made scorer, so I might want to hold off on that. It should probably be mentioned in the calibration docs and examples.
Can you maybe also add https://www.math.ucdavis.edu/~saito/data/roc/ferri-class-perf-metrics.pdf to the references, and maybe include some of the discussion there about this loss? In particular, this is one of two calibration losses they discuss, so maybe we should be more specific with the name?

Actually, I did not implement any of the losses from the paper you mention @amueller, as I take non-overlapping bins instead of a sliding window as in CalB.

I ended up implementing the CalB loss suggested by @amueller, but did not provide support for sample weights.
jnothman
left a comment
Sorry for again not being in a situation to review the main content...
doc/modules/model_evaluation.rst
Outdated
'balanced_accuracy'   :func:`metrics.balanced_accuracy_score` for binary targets
'average_precision'   :func:`metrics.average_precision_score`
'brier_score_loss'    :func:`metrics.brier_score_loss`
'calibration_loss'    :func:`metrics.calibration_loss`
To respect the convention that higher is better, we should maybe call it
neg_calibration_error
Thoughts, anyone?
Yes, the grid search tools in scikit-learn will always maximize the scoring criterion. If the criterion is derived from a loss to minimize, we instead maximize the negative criterion. It's important to name it accordingly so that the column names and column values of pd.DataFrame(grid_search.cv_results_) are consistent.
We might also want to provide standard aliases neg_max_calibration_error and neg_expected_calibration_error to set the norm parameter to make it possible to grid search for the ECE or MCE metrics traditionally used in the literature.
I have not followed recent development, but this is related to this discussion: #17704
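A minimal sketch of what such aliases could look like, assuming the calibration_error(y_true, y_prob, norm=...) signature discussed in this PR (the function is not part of released scikit-learn; the alias names follow the neg_* convention above):

from sklearn.metrics import make_scorer

# greater_is_better=False makes the scorer negate the loss, so grid search
# (which always maximizes) effectively minimizes the calibration error.
# needs_proba=True asks the scorer to call predict_proba on the estimator.
neg_expected_calibration_error = make_scorer(
    calibration_error, greater_is_better=False, needs_proba=True, norm="l1"
)
neg_max_calibration_error = make_scorer(
    calibration_error, greater_is_better=False, needs_proba=True, norm="max"
)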
glemaitre
left a comment
These are just some comments regarding the code itself.
I will read more about the score itself and check whether we are missing anything in the tests.
In particular, I see that we don't test anything when it is used as a scorer; I am not sure we have a common test there.
if pos_label is None:
    pos_label = y_true.max()
y_true = np.array(y_true == pos_label, int)
If the score is not symmetric, I would make pos_label consistent with the scorers that raise an error with string labels.
To be more specific, we should carry out this check:
sklearn/metrics/_ranking.py, lines 553 to 572 (at e89157e)
We could make a small refactoring indeed. So pos_label would be in line with:

    pos_label : int or str, default=None
        The label of the positive class. When pos_label=None, if y_true is in
        {-1, 1} or {0, 1}, pos_label is set to 1, otherwise an error will be raised.
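For illustration, a rough sketch of that validation, adapted from the check in sklearn/metrics/_ranking.py referenced above (the helper name _check_pos_label and the exact message are hypothetical, not the final code):

import numpy as np

def _check_pos_label(pos_label, y_true):
    # Hypothetical helper mirroring the _ranking.py pattern quoted above.
    classes = np.unique(y_true)
    if pos_label is None:
        # Only infer pos_label=1 for the conventional {0, 1} / {-1, 1} encodings.
        if not (np.array_equal(classes, [0, 1])
                or np.array_equal(classes, [-1, 1])
                or np.array_equal(classes, [0])
                or np.array_equal(classes, [-1])
                or np.array_equal(classes, [1])):
            raise ValueError(
                f"y_true takes value in {classes.tolist()} and pos_label is "
                "not specified: either make y_true take value in {0, 1} or "
                "{-1, 1} or pass pos_label explicitly."
            )
        pos_label = 1
    return pos_label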
raise ValueError("y_prob has values outside of [0, 1] range")

labels = np.unique(y_true)
if len(labels) > 2:
Here I think that we should use sklearn.utils.multiclass.type_of_target for consistency:

y_type = type_of_target(y_true)
if y_type != "binary":
    raise ValueError(...)

y_pred = np.array([0.25, 0.25, 0.25, 0.25] + [0.75, 0.75, 0.75, 0.75])
sample_weight = np.array([1, 1, 1, 1] + [3, 3, 3, 3])
assert_almost_equal(
We are using the pattern
assert calibration_error(...) == pytest.approx(0.1875)
instead of assert_almost_equal, since we introduced pytest.
    0.2165063)

def test_calibration_error_raises():
We can parametrize this test and use pytest:

@pytest.mark.parametrize(
    "y_pred",
    [np.array([0.25, 0.25, 0.25, 0.25] + [0.75, 0.75, 0.75]),
     np.array([0.25, 0.25, 0.25, 0.25] + [0.75, 0.75, 0.75, 1.75]),
     ...],
)
def test_calibration_error_raises(y_pred):
    ...
    with pytest.raises(ValueError, match=err_msg):
        calibration_error(y_true, y_pred)
If the err_msg is changing, we can include it in the parametrization as well.
Sorry to add more work, but would it be possible to add a quick note to clarify that the calibration loss metric we implement here is not the same as the one referred to above, and maybe how they relate? (ref: #18051 (comment))

Thanks for this work! In terms of implementation, can't we re-use the calculation of
Format fixes
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
The :func:`calibration_error` function computes the expected and maximum
calibration errors as defined in [1]_ for binary classes.
Would the general ECE for multiclass problems be more desirable, as used in Guo 2017 (Ref 1)? This implementation seems targeted at binary classification. I believe you could use it as-is for multiclass if you pass in y_pred as the maximum predicted probability and y_true as 1 if the predicted class equals the true class and 0 otherwise. But I would think it more desirable to just have the multiclass logic under the hood.
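As a concrete illustration of that reduction (a sketch only; calibration_error is the binary function proposed in this PR, and the values are made up):

import numpy as np

proba = np.array([[0.7, 0.2, 0.1],
                  [0.3, 0.4, 0.3],
                  [0.1, 0.1, 0.8]])   # predicted class probabilities
y_true = np.array([0, 2, 2])          # true class labels

confidence = proba.max(axis=1)                            # top-class probability
correct = (proba.argmax(axis=1) == y_true).astype(int)    # 1 if prediction is right
ece = calibration_error(correct, confidence, norm="l1")   # top-label (multiclass) ECE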
        sample_weight[i_start:i_end])
    / delta_count[i])
if norm == "l2" and reduce_bias:
    delta_debias = (
Is this implementation of the bias term correct? I believe it should be (avg_pred_true[i] * (avg_pred_true[i] - 1) * delta_count[i]) / (count * (delta_count[i] - 1)), going off the definition here. The 1/count doesn't get distributed properly, I think. This will produce NaNs if there's only one element in a bin, so I'm not sure if this was changed for numerical stability reasons. The authors of the paper mention this in their implementation, so it seems the "undefined if bin size is 1" issue is expected.
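A small numeric sketch of the bin-wise term as stated in this comment (the variable names mirror the diff above: avg_pred_true is the mean of y_true per bin, delta_count the bin sizes, count the total sample count; the values are made up):

import numpy as np

avg_pred_true = np.array([0.2, 0.7])   # mean y_true per bin
delta_count = np.array([5, 4])         # samples per bin
count = delta_count.sum()              # total number of samples

delta_debias = (
    avg_pred_true * (avg_pred_true - 1) * delta_count
    / (count * (delta_count - 1))
)
# Division by zero here when a bin holds a single sample (delta_count == 1),
# which is the "undefined if bin size is 1" issue mentioned above.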
We should have a look at https://cran.r-project.org/web/packages/reliabilitydiag/index.html and the accompanying paper https://doi.org/10.1073/pnas.2016191118. First, let us check whether this PR corresponds to their miscalibration measure (MCB). Second, they propose a binning that might help us.

Folks, I've made a separate implementation in #22233 which has some specific benefits (e.g. it is binless). What do you think?

I would much prefer a more general solution for calibration scores via a general score decomposition, as proposed in #23767.
norm : {'l1', 'l2', 'max'}, default='l2'
    Norm method. The l1-norm is the Expected Calibration Error (ECE),
    and the max-norm corresponds to the Maximum Calibration Error (MCE).
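For reference, with the samples partitioned into bins $B_i$ out of $N$ total, the standard definitions from the literature (e.g. Guo et al. 2017, cited in this thread) are

$$\mathrm{ECE} = \sum_i \frac{|B_i|}{N}\,\bigl|\operatorname{acc}(B_i) - \operatorname{conf}(B_i)\bigr|, \qquad \mathrm{MCE} = \max_i \bigl|\operatorname{acc}(B_i) - \operatorname{conf}(B_i)\bigr|$$

where $\operatorname{acc}(B_i)$ is the fraction of positives in bin $B_i$ and $\operatorname{conf}(B_i)$ the mean predicted probability in that bin.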
strategy : {'uniform', 'quantile'}, default='uniform'
    Strategy used to define the widths of the bins.

    uniform
Suggested change: uniform → - 'uniform':
Fix format to meet the sklearn docstring format, like here.
    uniform
        All bins have identical widths.
    quantile
Suggested change: quantile → - 'quantile':
Fix format to meet the sklearn docstring format, like here.
    positive class.

reduce_bias : bool, default=True
    Add debiasing term as in Verified Uncertainty Calibration, A. Kumar.
When are we going to see the calibration loss released?
See discussion on issue #10883.
Closes #18268.
This PR implements calibration losses for binary classifiers.
It also updates the documentation about calibration, fixing in particular inaccurate references to the Brier score.
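A usage sketch of the API this PR proposes (the signature is assumed from the docstring fragments above; calibration_error is not available in released scikit-learn):

import numpy as np
from sklearn.metrics import calibration_error  # proposed in this PR

y_true = np.array([0, 1, 0, 1, 1, 0, 1, 1])
y_prob = np.array([0.1, 0.9, 0.3, 0.8, 0.6, 0.2, 0.7, 0.95])

ece = calibration_error(y_true, y_prob, norm="l1")    # expected calibration error
mce = calibration_error(y_true, y_prob, norm="max")   # maximum calibration error
l2 = calibration_error(y_true, y_prob, norm="l2", reduce_bias=True)  # debiased l2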