[WIP] Brier score binless decomposition #22233
ColdTeapot273K wants to merge 4 commits into scikit-learn:main from
Conversation
Thanks for the PR. However, before reviewing it we should settle the discussion on the related issue: #21774 (comment)
I just tried using your implementation of
I've checked your example. TL;DR: it is working as intended, since this is an exact, binless implementation; but we can do something about it to give more flexibility without introducing bins. Two points:
e.g. this is how you get the maximum refinement loss:

```python
bs, cl, rl = brier_score_loss_decomposition(
    np.array([1, 0]).reshape(-1), np.array([0.5, 0.5]).reshape(-1)
)
# 0.25, 0.0, 0.25
```

The calibration loss here is 0, and that is an expected result. That is also how you can hack the calibration metric in general, by the way: just predict the class balance value (this is countered by considering a joint likelihood metric instead; see McElreath's "Statistical Rethinking", 2nd ed., paragraph 7.2).

Now, this got me thinking that in practical applications we don't often consider numbers which differ only in very distant decimals (e.g. 0.XXXXXX3 and 0.XXXXXX4) as too different. So we might relax the constraint on the 'exactness' of this implementation by introducing some absolute/relative tolerance.
Tolerance in this case would act as an alternative parameter to binning, controlling the degree of "exactness" of this exact implementation.
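To make the tolerance idea concrete, here is a minimal sketch of a two-component (calibration + refinement) Brier decomposition that groups samples by predicted probability rather than by histogram bins. The function name, the `atol` quantization rule, and the grouping strategy are illustrative assumptions for discussion, not the PR's actual implementation:

```python
import numpy as np

def brier_decomposition_binless(y_true, y_prob, atol=0.0):
    """Sketch of a binless two-component Brier decomposition.

    Samples are grouped by predicted probability: exactly when
    atol == 0, or on a grid of step ``atol`` otherwise (a hypothetical
    grouping rule used here only to illustrate the tolerance idea).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    bs = np.mean((y_prob - y_true) ** 2)

    if atol > 0:
        keys = np.round(y_prob / atol)  # merge predictions closer than atol
    else:
        keys = y_prob                   # exact, binless grouping

    cal = ref = 0.0
    n = y_true.shape[0]
    for k in np.unique(keys):
        mask = keys == k
        w = mask.sum() / n              # group weight
        p = y_prob[mask].mean()         # representative prediction of group
        obs = y_true[mask].mean()       # empirical event frequency in group
        cal += w * (p - obs) ** 2       # calibration (reliability) term
        ref += w * obs * (1.0 - obs)    # refinement term
    return bs, cal, ref

# The degenerate example from above: predicting the class balance.
bs, cl, rl = brier_decomposition_binless([1, 0], [0.5, 0.5])
# bs == 0.25, cl == 0.0, rl == 0.25: zero calibration loss, as discussed
```

With `atol=0` the identity `bs == cal + ref` holds exactly for binary targets, since within each group the predictions are identical; with a positive tolerance the identity holds only up to the within-group spread of the predictions, which is exactly the trade-off the tolerance parameter would introduce.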
I would much prefer a more general solution to score decompositions, as proposed in #23767.
Reference Issues/PRs
Closes #21774. See also #18268 and #21718.
What does this implement/fix? Explain your changes.
Described in #21774
Any other comments?