DOC attempt to fix lorenz_curve in plot tweedie regression example #30198
ogrisel merged 1 commit into scikit-learn:main
Conversation
ping @ogrisel @antoinebaker @snath-xoc for reviews here.

Thanks for the PR. I updated the snippet to compare the two strategies on the synthetic repeated/reweighted data and make one plot for each:

# %%
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
n = 30
y_true = rng.uniform(low=0, high=10, size=n)
y_pred = y_true * rng.uniform(low=0.9, high=1.1, size=n)
exposure = rng.integers(low=0, high=10, size=n)


def lorenz_curve_linspace(frequency, exposure, label=None, use_cumulated_exposure=True):
    ranking = np.argsort(frequency)
    ranked_frequencies = frequency[ranking]
    ranked_exposure = exposure[ranking]
    cumulated_claims = np.cumsum(ranked_frequencies * ranked_exposure)
    cumulated_claims = cumulated_claims / cumulated_claims[-1]
    if use_cumulated_exposure:
        cumulated_exposure = np.cumsum(ranked_exposure).astype(np.float64)
        cumulated_exposure /= cumulated_exposure[-1]
    else:
        cumulated_exposure = np.linspace(0, 1, len(frequency))
    plt.scatter(
        cumulated_exposure,
        cumulated_claims,
        marker=".",
        alpha=0.5,
        label=label,
    )
    return cumulated_exposure, cumulated_claims


y_true_repeated = y_true.repeat(exposure)
y_pred_repeated = y_pred.repeat(exposure)
sample_weight_repeated = np.ones_like(y_pred_repeated)
res_repeated = lorenz_curve_linspace(
    y_pred_repeated, sample_weight_repeated, label="repeated"
)

y_true_weighted = y_true
y_pred_weighted = y_pred
sample_weight_weighted = exposure
res_weighted = lorenz_curve_linspace(
    y_pred_weighted, sample_weight_weighted, label="weighted"
)
plt.legend()
plt.title("Lorenz curve using cumulated exposure as x-axis");

# %%
res_repeated = lorenz_curve_linspace(
    y_pred_repeated,
    sample_weight_repeated,
    label="repeated",
    use_cumulated_exposure=False,
)
res_weighted = lorenz_curve_linspace(
    y_pred_weighted,
    sample_weight_weighted,
    label="weighted",
    use_cumulated_exposure=False,
)
plt.legend()
plt.title("Lorenz curve using linear exposure as x-axis");

The resulting plots confirm that using the cumulated exposure as the x-axis of the Lorenz curve is the correct solution; otherwise, the expected equivalence between repeating and reweighting samples does not hold.
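The same equivalence can also be checked numerically rather than visually. The sketch below is only an illustration: `lorenz_points` is a hypothetical plot-free variant of the function above, and the four-sample toy data is hand-picked, not taken from this thread. It compares each weighted point with the point of the repeated curve that corresponds to the last copy of the same sample.

```python
import numpy as np


def lorenz_points(frequency, exposure, use_cumulated_exposure=True):
    """Return the (x, y) points of a Lorenz curve without plotting them."""
    ranking = np.argsort(frequency, kind="stable")
    ranked_frequencies = frequency[ranking]
    ranked_exposure = exposure[ranking].astype(np.float64)
    cumulated_claims = np.cumsum(ranked_frequencies * ranked_exposure)
    cumulated_claims /= cumulated_claims[-1]
    if use_cumulated_exposure:
        x = np.cumsum(ranked_exposure)
        x /= x[-1]
    else:
        x = np.linspace(0, 1, len(frequency))
    return x, cumulated_claims


# Hand-picked toy data (an assumption made for this illustration).
frequency = np.array([1.0, 2.0, 3.0, 4.0])
exposure = np.array([1, 4, 2, 3])

frequency_repeated = frequency.repeat(exposure)
unit_weights = np.ones_like(frequency_repeated)

# Position, in the ranked repeated data, of the last copy of each sample.
last_copy = np.cumsum(exposure[np.argsort(frequency, kind="stable")]) - 1

results = {}
for use_cumulated_exposure in (True, False):
    x_w, y_w = lorenz_points(frequency, exposure, use_cumulated_exposure)
    x_r, y_r = lorenz_points(frequency_repeated, unit_weights, use_cumulated_exposure)
    same_x = bool(np.allclose(x_w, x_r[last_copy]))
    same_y = bool(np.allclose(y_w, y_r[last_copy]))
    results[use_cumulated_exposure] = (same_x, same_y)
    print(f"cumulated exposure={use_cumulated_exposure}: x match={same_x}, y match={same_y}")
```

With the cumulated exposure, both coordinates match. With `np.linspace`, the y values still match but the x positions do not, which is the mismatch visible in the second plot.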
ogrisel
left a comment
I also took a look at the impact on the example figure in the notebook. The Gini score values are a bit larger than what we had in main (the Lorenz curves lie slightly further away from the diagonal), but otherwise the curves are qualitatively quite similar to what we had before. In particular, the text of the example does not need to be updated.
+1 for merge. Thanks for the follow-up @m-maggi.
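For readers wondering how the Gini score relates to the distance from the diagonal, here is a minimal sketch. It assumes the usual definition, one minus twice the area under the Lorenz curve, and computes the area with a hand-written trapezoidal rule. `gini_from_lorenz` and the toy data are hypothetical and are not the code of the example itself.

```python
import numpy as np


def gini_from_lorenz(frequency, exposure):
    """Gini score as 1 - 2 * (area under the exposure-weighted Lorenz curve)."""
    ranking = np.argsort(frequency, kind="stable")
    ranked_exposure = exposure[ranking].astype(np.float64)
    ranked_claims = frequency[ranking] * ranked_exposure
    # Prepend the origin so the curve runs from (0, 0) to (1, 1).
    x = np.concatenate([[0.0], np.cumsum(ranked_exposure)])
    y = np.concatenate([[0.0], np.cumsum(ranked_claims)])
    x /= x[-1]
    y /= y[-1]
    # Trapezoidal rule written out explicitly.
    area = np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0)
    return 1.0 - 2.0 * area


# Hand-picked toy data (an assumption made for this illustration).
frequency = np.array([1.0, 2.0, 3.0, 4.0])
exposure = np.array([1, 4, 2, 3])

gini_weighted = gini_from_lorenz(frequency, exposure)
gini_repeated = gini_from_lorenz(
    frequency.repeat(exposure), np.ones(exposure.sum(), dtype=np.int64)
)
print(gini_weighted, gini_repeated)
```

A curve further below the diagonal encloses less area, so the score grows. Because the x-axis is the cumulated exposure, the weighted and repeated data trace the same polyline and should give the same score up to floating-point error.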
Sorry I was too late to give feedback, but I think the

@antoinebaker I agree. Please feel free to open a PR with that fix, and I apologize for merging too early :)


Reference Issues/PRs
Fix attempt of #28534
What does this implement/fix? Explain your changes.
Take the definition of the Lorenz curve from the "Poisson regression and non-normal loss" example and use it in the "Tweedie regression on insurance claims" example.
Any other comments?
Following the discussion in #28534, it seems to me that the Lorenz curve should not use a linspace for its x values if the data is weighted.
Example snippet to test behaviour when linspace is used:
Results in
Snippet for results using the version of Poisson regression and non-normal loss, also implemented in this PR: