DOC attempt to fix lorenz_curve in plot tweedie regression example#30198

Merged
ogrisel merged 1 commit into scikit-learn:main from m-maggi:doc/lorenz_curve_alpha
Nov 25, 2024

Conversation

@m-maggi
Contributor

@m-maggi m-maggi commented Nov 2, 2024

Reference Issues/PRs

Fix attempt of #28534

What does this implement/fix? Explain your changes.

Take the definition of the Lorenz curve from the "Poisson regression and non-normal loss" example and use it in the "Tweedie regression on insurance claims" example.

Any other comments?

Following the discussion in #28534, it seems to me that the Lorenz curve should not use a linspace for its x values when the data is weighted.
Example snippet to test the behaviour when linspace is used:

import matplotlib.pyplot as plt
import numpy as np


rng = np.random.default_rng(0)
n = 30
y_true = rng.uniform(low=0, high=10, size=n)
y_pred = y_true * rng.uniform(low=0.9, high=1.1, size=n)
exposure = rng.integers(low=0, high=10, size=n)

def lorenz_curve_linspace(frequency, exposure):
    ranking = np.argsort(frequency)
    ranked_frequencies = frequency[ranking]
    ranked_exposure = exposure[ranking]
    cumulated_claims = np.cumsum(ranked_frequencies * ranked_exposure)
    cumulated_claims = cumulated_claims / cumulated_claims[-1]
    cumulated_exposure = np.linspace(0, 1, len(frequency))
    plt.scatter(
        cumulated_exposure,
        cumulated_claims,
        marker=".",
        alpha=0.5,
    )
    return cumulated_exposure, cumulated_claims

y_true_repeated = y_true.repeat(exposure)
y_pred_repeated = y_pred.repeat(exposure)
sample_weight = np.ones_like(y_pred_repeated)
res = lorenz_curve_linspace(y_pred_repeated, sample_weight)

y_true_weighted = y_true
y_pred_weighted = y_pred
sample_weight = exposure
res = lorenz_curve_linspace(y_pred_weighted, sample_weight)

Results in

image

Snippet for the results using the version from "Poisson regression and non-normal loss", which is also the one implemented in this PR:

Details
import matplotlib.pyplot as plt
import numpy as np


rng = np.random.default_rng(0)
n = 30
y_true = rng.uniform(low=0, high=10, size=n)
y_pred = y_true * rng.uniform(low=0.9, high=1.1, size=n)
exposure = rng.integers(low=0, high=10, size=n)

def lorenz_curve(frequency, exposure, weighted=True):
    ranking = np.argsort(frequency)
    ranked_frequencies = frequency[ranking]
    ranked_exposure = exposure[ranking]
    cumulated_claims = np.cumsum(ranked_frequencies * ranked_exposure)
    cumulated_claims = cumulated_claims / cumulated_claims[-1]
    if weighted:
        cumulated_exposure = np.cumsum(ranked_exposure)
        cumulated_exposure = cumulated_exposure / cumulated_exposure[-1]
        plt.scatter(
            cumulated_exposure,
            cumulated_claims,
            marker=".",
            alpha=0.5,
            label="weighted",
        )
    else:
        cumulated_exposure = np.linspace(0, 1, len(frequency))
        plt.scatter(
            cumulated_exposure,
            cumulated_claims,
            marker=".",
            alpha=0.5,
            label="unweighted",
        )
    return cumulated_exposure, cumulated_claims

y_true_repeated = y_true.repeat(exposure)
y_pred_repeated = y_pred.repeat(exposure)
sample_weight = np.ones_like(y_pred_repeated)
res = lorenz_curve(y_pred_repeated, sample_weight, False)

y_true_weighted = y_true
y_pred_weighted = y_pred
sample_weight = exposure
res = lorenz_curve(y_pred_weighted, sample_weight, True)
plt.legend();

image

@github-actions

github-actions Bot commented Nov 2, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 21eed35. Link to the linter CI: here

@adrinjalali
Member

ping @ogrisel @antoinebaker @snath-xoc for reviews here.

Contributor

@OmarManzoor OmarManzoor left a comment

LGTM. Thanks @m-maggi

@ogrisel Could you have a look and merge?

@OmarManzoor OmarManzoor added the "Waiting for Second Reviewer" label (First reviewer is done, need a second one!) Nov 15, 2024
@ogrisel
Member

ogrisel commented Nov 25, 2024

Thanks for the PR. I updated the snippet to compare the 2 strategies on the synthetic repeated/reweighted data and make one plot for each:

Details
# %%
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
n = 30
y_true = rng.uniform(low=0, high=10, size=n)
y_pred = y_true * rng.uniform(low=0.9, high=1.1, size=n)
exposure = rng.integers(low=0, high=10, size=n)


def lorenz_curve_linspace(frequency, exposure, label=None, use_cumulated_exposure=True):
    ranking = np.argsort(frequency)
    ranked_frequencies = frequency[ranking]
    ranked_exposure = exposure[ranking]
    cumulated_claims = np.cumsum(ranked_frequencies * ranked_exposure)
    cumulated_claims = cumulated_claims / cumulated_claims[-1]

    if use_cumulated_exposure:
        cumulated_exposure = np.cumsum(ranked_exposure).astype(np.float64)
        cumulated_exposure /= cumulated_exposure[-1]
    else:
        cumulated_exposure = np.linspace(0, 1, len(frequency))

    plt.scatter(
        cumulated_exposure,
        cumulated_claims,
        marker=".",
        alpha=0.5,
        label=label,
    )
    return cumulated_exposure, cumulated_claims


y_true_repeated = y_true.repeat(exposure)
y_pred_repeated = y_pred.repeat(exposure)
sample_weight_repeated = np.ones_like(y_pred_repeated)
res_repeated = lorenz_curve_linspace(
    y_pred_repeated, sample_weight_repeated, label="repeated"
)

y_true_weighted = y_true
y_pred_weighted = y_pred
sample_weight_weighted = exposure
res_weighted = lorenz_curve_linspace(
    y_pred_weighted, sample_weight_weighted, label="weighted"
)

plt.legend()
plt.title("Lorenz curve using cumulated exposure as x-axis");
# %%
res_repeated = lorenz_curve_linspace(
    y_pred_repeated,
    sample_weight_repeated,
    label="repeated",
    use_cumulated_exposure=False,
)
res_weighted = lorenz_curve_linspace(
    y_pred_weighted,
    sample_weight_weighted,
    label="weighted",
    use_cumulated_exposure=False,
)

plt.legend()
plt.title("Lorenz curve using linear exposure as x-axis");

Here are the resulting plots:

image
image

So this confirms that using the cumulated exposure as the x-axis is the correct way to plot the Lorenz curve. Otherwise, the expected equivalence between repeating samples and reweighting them does not hold.
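The same equivalence can also be checked numerically instead of by eye. The following is a minimal sketch, not part of the PR: `lorenz_points` and `max_gap` are hypothetical helper names, and the exposure is drawn with `low=1` (an assumption that differs from the snippets above) so that no row vanishes when repeated.

```python
import numpy as np


def lorenz_points(frequency, exposure, use_cumulated_exposure=True):
    """Return the (x, y) points of the Lorenz curve without plotting them."""
    ranking = np.argsort(frequency, kind="stable")
    ranked_frequencies = frequency[ranking]
    ranked_exposure = exposure[ranking]
    cumulated_claims = np.cumsum(ranked_frequencies * ranked_exposure)
    cumulated_claims = cumulated_claims / cumulated_claims[-1]
    if use_cumulated_exposure:
        cumulated_exposure = np.cumsum(ranked_exposure) / np.sum(ranked_exposure)
    else:
        cumulated_exposure = np.linspace(0, 1, len(frequency))
    return cumulated_exposure, cumulated_claims


rng = np.random.default_rng(0)
n = 30
y_pred = rng.uniform(low=0, high=10, size=n)
# Assumption: strictly positive integer exposure, so every row is repeated
# at least once.
exposure = rng.integers(low=1, high=10, size=n)

y_pred_repeated = y_pred.repeat(exposure)
ones = np.ones_like(y_pred_repeated)


def max_gap(use_cumulated_exposure):
    """Largest vertical distance between the weighted and repeated curves."""
    x_w, y_w = lorenz_points(y_pred, exposure, use_cumulated_exposure)
    x_r, y_r = lorenz_points(y_pred_repeated, ones, use_cumulated_exposure)
    # Evaluate the repeated curve at the x positions of the weighted curve.
    return np.max(np.abs(np.interp(x_w, x_r, y_r) - y_w))


gap_cumulated = max_gap(use_cumulated_exposure=True)
gap_linspace = max_gap(use_cumulated_exposure=False)
print(f"cumulated exposure: max gap = {gap_cumulated:.2e}")
print(f"linspace:           max gap = {gap_linspace:.2e}")
```

With the cumulated exposure on the x-axis, every point of the weighted curve should coincide with a point of the repeated curve up to floating-point error, whereas with linspace the two curves are expected to differ visibly.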

Member

@ogrisel ogrisel left a comment

I also took a look at the impact on the example figure in the notebook. The Gini score values are a bit larger than what we had in main (the Lorenz curves lie slightly further away from the diagonal), but otherwise the curves are qualitatively quite similar to what we had before. In particular, the text of the example does not need to be updated.
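To make the link between the Gini score and the distance from the diagonal concrete, here is a minimal sketch, assuming the usual definition of the Gini index as one minus twice the area under the Lorenz curve. `gini_from_lorenz` is a hypothetical helper, not the code used in the example, and prepending the origin to the curve is a choice made here.

```python
import numpy as np


def gini_from_lorenz(cumulated_exposure, cumulated_claims):
    """Gini index computed as 1 - 2 * (area under the Lorenz curve)."""
    # Assumption: prepend the origin so the curve starts at (0, 0).
    x = np.concatenate([[0.0], cumulated_exposure])
    y = np.concatenate([[0.0], cumulated_claims])
    # Trapezoidal rule written out by hand.
    area = np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2)
    return 1 - 2 * area


x = np.array([0.25, 0.5, 0.75, 1.0])

# A curve lying on the diagonal: claims are spread evenly over the exposure.
gini_diagonal = gini_from_lorenz(x, x)

# A convex curve below the diagonal: claims concentrate on the riskiest part.
gini_convex = gini_from_lorenz(x, x**2)

print(gini_diagonal, gini_convex)
```

The further the curve bends away from the diagonal, the smaller the area under it and the larger the resulting Gini value, which is consistent with the slightly larger scores observed after the fix.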

+1 for merge. Thanks for the follow-up @m-maggi.

@ogrisel ogrisel merged commit fa5d727 into scikit-learn:main Nov 25, 2024
@ogrisel ogrisel mentioned this pull request Nov 25, 2024
17 tasks
@antoinebaker
Contributor

Sorry, I was too late to give feedback, but I think the xlabel should be changed accordingly, to something like "Cumulative proportion of exposure (from safest to riskiest)" as in the Poisson tutorial, or "Fraction of total exposure\n(ordered by model from safest to riskiest)". The tutorial text that introduces the Lorenz curve should also state that we now plot against the cumulative exposure on the x-axis.
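A minimal sketch of what the relabelled plot could look like, using synthetic data. The y label, the legend entries and the `Agg` backend are assumptions made here for illustration and are not taken from the example:

```python
import matplotlib

matplotlib.use("Agg")  # headless backend, chosen only so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
y_pred = rng.uniform(low=0, high=10, size=30)
exposure = rng.integers(low=1, high=10, size=30)

# Lorenz curve with the cumulated exposure on the x-axis.
ranking = np.argsort(y_pred)
ranked_exposure = exposure[ranking]
cumulated_claims = np.cumsum(y_pred[ranking] * ranked_exposure)
cumulated_claims = cumulated_claims / cumulated_claims[-1]
cumulated_exposure = np.cumsum(ranked_exposure) / ranked_exposure.sum()

fig, ax = plt.subplots()
ax.plot(cumulated_exposure, cumulated_claims, label="model")
ax.plot([0, 1], [0, 1], linestyle="--", color="black", label="diagonal")
# The x label now names the quantity that is actually plotted.
ax.set_xlabel("Fraction of total exposure\n(ordered by model from safest to riskiest)")
ax.set_ylabel("Fraction of total claim amount")  # assumed y label
ax.legend()
```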

@ogrisel
Member

ogrisel commented Nov 26, 2024

@antoinebaker I agree, please feel free to open a PR with that fix and I apologize for merging too early :)

Labels

Documentation Waiting for Second Reviewer First reviewer is done, need a second one!

5 participants