[MRG] Implement calibration loss metrics #11096
samronsin wants to merge 97 commits into scikit-learn:main from
Conversation
jnothman
left a comment
You will also need to update:
- doc/modules/classes.rst
- doc/modules/model_evaluation.rst
- sklearn/metrics/tests/test_common.py
jnothman
left a comment
Should this be available as a scorer for cross validation? Should it be available for CalibratedClassifierCV?
You don't need to test sample weights, as the common tests will cover that.

I'm not sure it makes sense to have this as a ready-made scorer, so I might want to hold off on that. It should probably be mentioned in the calibration docs and examples.
Can you maybe also add https://www.math.ucdavis.edu/~saito/data/roc/ferri-class-perf-metrics.pdf to the references, and maybe include some of the discussion there about this loss? In particular, this is one of two calibration losses they discuss, so maybe we should be more specific with the name?

Actually, I did not implement any of the losses from the paper you mention @amueller, as I take non-overlapping bins instead of a sliding window as in CalB.

I ended up implementing the CalB loss suggested by @amueller, but did not provide support for sample weights.
jnothman
left a comment
Sorry for again not being in a situation to review the main content...
doc/modules/model_evaluation.rst
Outdated
'balanced_accuracy'   :func:`metrics.balanced_accuracy_score` for binary targets
'average_precision'   :func:`metrics.average_precision_score`
'brier_score_loss'    :func:`metrics.brier_score_loss`
'calibration_loss'    :func:`metrics.calibration_loss`
To respect the convention that higher is better, we should maybe call it
neg_calibration_error
Thoughts, anyone?
Yes, the grid search tools in scikit-learn will always maximize the scoring criterion. If the criterion is derived from a loss to minimize, we instead maximize the negative criterion. It's important to name it accordingly so that the column names and column values of pd.DataFrame(grid_search.cv_results_) are consistent.
We might also want to provide standard aliases neg_max_calibration_error and neg_expected_calibration_error to set the norm parameter to make it possible to grid search for the ECE or MCE metrics traditionally used in the literature.
I have not followed recent development, but this is related to this discussion: #17704
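A minimal sketch of what such aliases could look like, assuming the calibration_error(y_true, y_prob, norm=...) signature discussed in this PR (the function is not part of released scikit-learn; the alias names follow the neg_* convention above):

from sklearn.metrics import make_scorer

# greater_is_better=False makes the scorer negate the loss, so grid search
# (which always maximizes) effectively minimizes the calibration error.
# needs_proba=True asks the scorer to call predict_proba on the estimator.
neg_expected_calibration_error = make_scorer(
    calibration_error, greater_is_better=False, needs_proba=True, norm="l1"
)
neg_max_calibration_error = make_scorer(
    calibration_error, greater_is_better=False, needs_proba=True, norm="max"
)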
glemaitre
left a comment
These are just some comments regarding the code itself.
I will read more about the score itself and check whether we are missing anything in the tests.
In particular, I see that we don't test anything when it is used as a scorer; I am not sure we have a common test there.
if pos_label is None:
    pos_label = y_true.max()
y_true = np.array(y_true == pos_label, int)
If the score is not symmetric, I would make pos_label consistent with the scorers that raise an error with string labels.
To be more specific, we should carry out this check:
sklearn/metrics/_ranking.py, lines 553 to 572 (at e89157e)
We could make a small refactoring indeed. So pos_label would be in line with:

    pos_label : int or str, default=None
        The label of the positive class. When pos_label=None, if y_true is in
        {-1, 1} or {0, 1}, pos_label is set to 1, otherwise an error will be raised.
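For illustration, a rough sketch of that validation, adapted from the check in sklearn/metrics/_ranking.py referenced above (the helper name _check_pos_label and the exact message are hypothetical, not the final code):

import numpy as np

def _check_pos_label(pos_label, y_true):
    # Hypothetical helper mirroring the _ranking.py pattern quoted above.
    classes = np.unique(y_true)
    if pos_label is None:
        # Only infer pos_label=1 for the conventional {0, 1} / {-1, 1} encodings.
        if not (np.array_equal(classes, [0, 1])
                or np.array_equal(classes, [-1, 1])
                or np.array_equal(classes, [0])
                or np.array_equal(classes, [-1])
                or np.array_equal(classes, [1])):
            raise ValueError(
                f"y_true takes value in {classes.tolist()} and pos_label is "
                "not specified: either make y_true take value in {0, 1} or "
                "{-1, 1} or pass pos_label explicitly."
            )
        pos_label = 1
    return pos_label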
raise ValueError("y_prob has values outside of [0, 1] range")

labels = np.unique(y_true)
if len(labels) > 2:
Here I think that we should use sklearn.utils.multiclass.type_of_target for consistency:

y_type = type_of_target(y_true)
if y_type != "binary":
    raise ValueError(...)

y_pred = np.array([0.25, 0.25, 0.25, 0.25] + [0.75, 0.75, 0.75, 0.75])
sample_weight = np.array([1, 1, 1, 1] + [3, 3, 3, 3])
assert_almost_equal(
We are using the pattern
assert calibration_error(...) == pytest.approx(0.1875)
instead of assert_almost_equal, since we introduced pytest.
    0.2165063)

def test_calibration_error_raises():
We can parametrize this test and use pytest:

@pytest.mark.parametrize(
    "y_pred",
    [np.array([0.25, 0.25, 0.25, 0.25] + [0.75, 0.75, 0.75]),
     np.array([0.25, 0.25, 0.25, 0.25] + [0.75, 0.75, 0.75, 1.75]),
     ...],
)
def test_calibration_error_raises(y_pred):
    ...
    with pytest.raises(ValueError, match=err_msg):
        calibration_error(y_true, y_pred)
If the err_msg is changing, we can include it in the parametrization as well.
Sorry to add more work, but would it be possible to add a quick note to clarify that the calibration loss metric we implement here is not the same as the one referred to above, and maybe how they relate? (ref: #18051 (comment))

Thanks for this work! In terms of implementation, can't we re-use the calculation of
Format fixes
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
The :func:`calibration_error` function computes the expected and maximum
calibration errors as defined in [1]_ for binary classes.
Would the general ECE for multiclass problems be more desirable, as used in Guo 2017 (Ref 1)? This implementation seems targeted at binary classification. I believe you could use it as-is for multiclass if you pass in y_pred as the maximum predicted probability and y_true as 1 if the predicted class equals the true class and 0 otherwise. But I would think it more desirable to just have the multiclass logic under the hood.
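As a concrete illustration of that reduction (a sketch only; calibration_error is the binary function proposed in this PR, and the values are made up):

import numpy as np

proba = np.array([[0.7, 0.2, 0.1],
                  [0.3, 0.4, 0.3],
                  [0.1, 0.1, 0.8]])   # predicted class probabilities
y_true = np.array([0, 2, 2])          # true class labels

confidence = proba.max(axis=1)                            # top-class probability
correct = (proba.argmax(axis=1) == y_true).astype(int)    # 1 if prediction is right
ece = calibration_error(correct, confidence, norm="l1")   # top-label (multiclass) ECE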
        sample_weight[i_start:i_end])
    / delta_count[i])
if norm == "l2" and reduce_bias:
    delta_debias = (
Is this implementation of the bias term correct? I believe it should be (avg_pred_true[i] * (avg_pred_true[i] - 1) * delta_count[i]) / (count * (delta_count[i] - 1)), going off the definition here. The 1/count doesn't get distributed properly, I think. This will produce NaNs if there's only one element in a bin, so I'm not sure if this was changed for numerical stability reasons. The authors of the paper mention this in their implementation, so it seems the "undefined if bin size is 1" issue is expected.
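A small numeric sketch of the bin-wise term as stated in this comment (the variable names mirror the diff above: avg_pred_true is the mean of y_true per bin, delta_count the bin sizes, count the total sample count; the values are made up):

import numpy as np

avg_pred_true = np.array([0.2, 0.7])   # mean y_true per bin
delta_count = np.array([5, 4])         # samples per bin
count = delta_count.sum()              # total number of samples

delta_debias = (
    avg_pred_true * (avg_pred_true - 1) * delta_count
    / (count * (delta_count - 1))
)
# Division by zero here when a bin holds a single sample (delta_count == 1),
# which is the "undefined if bin size is 1" issue mentioned above.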
We should have a look at https://cran.r-project.org/web/packages/reliabilitydiag/index.html and the accompanying paper https://doi.org/10.1073/pnas.2016191118. First, let us check whether this PR corresponds to their miscalibration measure (MCB). Second, they propose a binning that might help us.

Folks, I've made a separate implementation in #22233 which has some specific benefits (e.g. it is binless). What do you think?

I would much prefer a more general solution for calibration scores via a general score decomposition, as proposed in #23767.
norm : {'l1', 'l2', 'max'}, default='l2'
    Norm method. The l1-norm is the Expected Calibration Error (ECE),
    and the max-norm corresponds to the Maximum Calibration Error (MCE).
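For reference, with the samples partitioned into bins $B_i$ out of $N$ total, the standard definitions from the literature (e.g. Guo et al. 2017, cited in this thread) are

$$\mathrm{ECE} = \sum_i \frac{|B_i|}{N}\,\bigl|\operatorname{acc}(B_i) - \operatorname{conf}(B_i)\bigr|, \qquad \mathrm{MCE} = \max_i \bigl|\operatorname{acc}(B_i) - \operatorname{conf}(B_i)\bigr|$$

where $\operatorname{acc}(B_i)$ is the fraction of positives in bin $B_i$ and $\operatorname{conf}(B_i)$ the mean predicted probability in that bin.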
strategy : {'uniform', 'quantile'}, default='uniform'
    Strategy used to define the widths of the bins.

    uniform
Suggested change: uniform → - 'uniform':
Fix format to meet the sklearn docstring format, like here.
    uniform
        All bins have identical widths.
    quantile
Suggested change: quantile → - 'quantile':
Fix format to meet the sklearn docstring format, like here.
    positive class.

reduce_bias : bool, default=True
    Add debiasing term as in Verified Uncertainty Calibration, A. Kumar.
When are we going to see the calibration loss released?
See discussion on issue #10883.
Closes #18268.
This PR implements calibration losses for binary classifiers.
It also updates the documentation about calibration, fixing in particular inaccurate references to the Brier score.
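A usage sketch of the API this PR proposes (the signature is assumed from the docstring fragments above; calibration_error is not available in released scikit-learn):

import numpy as np
from sklearn.metrics import calibration_error  # proposed in this PR

y_true = np.array([0, 1, 0, 1, 1, 0, 1, 1])
y_prob = np.array([0.1, 0.9, 0.3, 0.8, 0.6, 0.2, 0.7, 0.95])

ece = calibration_error(y_true, y_prob, norm="l1")    # expected calibration error
mce = calibration_error(y_true, y_prob, norm="max")   # maximum calibration error
l2 = calibration_error(y_true, y_prob, norm="l2", reduce_bias=True)  # debiased l2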