Conversation
@JosephBARBIERDARNAL Could you please address all the comments from #28752?

Yes, I'm ready to do it, but I won't be able to get back to it quickly (2 to 3 months), unfortunately. Hope that's okay.
Working on this PR again. Here are the current updates. Some (likely non-exhaustive) issues:

```python
import matplotlib.pyplot as plt

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import CapCurveDisplay
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000, n_features=20, n_classes=2, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
y_scores = clf.decision_function(X_test)

fig, ax = plt.subplots(ncols=2, dpi=300, figsize=(12, 12))
display = CapCurveDisplay.from_predictions(
    ax=ax[0],
    y_true=y_test,
    y_pred=y_scores,
    name="normalize_scale=False",
    normalize_scale=False,
    plot_chance_level=True,
)
display = CapCurveDisplay.from_predictions(
    ax=ax[1],
    y_true=y_test,
    y_pred=y_scores,
    name="normalize_scale=True",
    normalize_scale=True,
    plot_chance_level=True,
)
```
I'll come back to this PR after the release. This will be one of my priorities to get merged for 1.7.
Ok cool. There are still a few things I need to do anyway before a review (coverage test and adding the #30023 check).
sklearn/metrics/_plot/cap_curve.py (outdated)

```python
# compute cumulative sums for true positives and all cases
y_true_cumulative = np.cumsum(y_true_sorted * sample_weight_sorted)
cumulative_total = np.cumsum(sample_weight_sorted)
```
For information, there was a concurrent PR to fix the lack of cumsum of the sample_weight to define the x-axis of the Lorenz curves in one of our regression examples. To check that this was the correct fix, we ran a quick check on synthetic data with integer valued sample weights to check that there is an exact equivalence between repeated data points and reweighting them by exposure:
Maybe this PR could be expanded to test that this property also holds for `CapCurveDisplay.from_predictions` with non-constant, integer-valued `sample_weight`.
EDIT: I am not entirely sure how to write such a test, but possibly we could use `numpy.interp` to evaluate the CAP curve computed on the repeated data at the x-axis locations of the points of the weighted CAP curve, and check that the two curves match up to a small eps at those locations.
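A rough sketch of such a property test, using a manual cumsum-based CAP computation (mirroring the quoted snippet, not the PR's actual `CapCurveDisplay` API, which this sketch does not assume) and `numpy.interp`:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
y_true = rng.integers(0, 2, size=n)
y_score = rng.normal(size=n)
sample_weight = rng.integers(1, 4, size=n)  # non-constant, integer-valued weights

def weighted_cap(y_true, y_score, sample_weight):
    # Sort by decreasing score; a stable sort keeps repeated points adjacent.
    order = np.argsort(-y_score, kind="stable")
    yt, sw = y_true[order], sample_weight[order]
    x = np.cumsum(sw) / sw.sum()              # weighted fraction of samples targeted
    y = np.cumsum(yt * sw) / (yt * sw).sum()  # weighted fraction of positives captured
    return x, y

# CAP curve on the weighted data.
xw, yw = weighted_cap(y_true, y_score, sample_weight)

# CAP curve on the same data with each point repeated `sample_weight` times
# and unit weights.
y_rep = np.repeat(y_true, sample_weight)
s_rep = np.repeat(y_score, sample_weight)
xr, yr = weighted_cap(y_rep, s_rep, np.ones_like(y_rep, dtype=float))

# Evaluate the repeated-data curve at the x locations of the weighted curve:
# the two curves should coincide there up to a small eps.
y_interp = np.interp(xw, xr, yr)
assert np.allclose(yw, y_interp, atol=1e-9)
```

The equivalence holds because the group boundaries of the repeated data fall exactly on the cumulative-weight positions of the weighted curve.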
ogrisel left a comment:
There are several review comments in #28752 that have not yet been addressed.
I would like to make sure that we do not forget about them when iterating on this PR.
@JosephBARBIERDARNAL to sort things out, please reply in the threads of the review of #28752 and make it explicit which comments have been addressed in #28972 and how, and then mark those threads as resolved.
Also, we need some tests, and to update an existing example where we compare ROC, DET and CAP curves on the same classifier. I suppose this example is the best candidate:
https://scikit-learn.org/stable/auto_examples/model_selection/plot_det.html
@ogrisel That sounds good to me. I won't be able to work on it immediately, but I'll definitely be able to get to it within the next few weeks. I'll ping you and/or @glemaitre for review.
2 things are important for me:

I resolved most of the conversations there. I didn't touch some of them if I wasn't sure. Feel free to ping me if any changes are needed. I'm just not sure what
A few things that I think need to be addressed:
I think we should rename to:

```python
class ROCCurveDisplay(...):
    ...

# Backward compat alias to keep code implemented for scikit-learn 1.7
# and earlier working.
RocCurveDisplay = ROCCurveDisplay
```

We could introduce a deprecation warning, but I am worried that this will break educational resources (e.g. code in blog posts, tutorials, books) for little benefit, so I would rather go with a "soft" deprecation: rename and keep a long-term backward compat alias without warnings. But let's not do that as part of this PR, and rather do the renaming/deprecation in a follow-up instead.
It's already part of the API, so we have to stick with it for the API. My earlier comment is not to put it in section headers or documentation paragraphs without explanations. I prefer introducing the concept with more precise terms such as "the expected curve of a non-informative or constant classifier whose predictions do not depend on the input features".

Note that for ROC and DET, the expected curve of the non-informative classifier in the infinite sample limit (what we name "chance level" in the API) matches exactly the curve of the constant predictor obtained on a finite size test set. For CAP it's more subtle: the "chance level" line only approximately matches the curve of the constant predictor obtained on a finite size test set. The two would only match in the large sample size limit. I think this is an interesting point and I would like to highlight it in this section of the example.
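To illustrate that last point, here's a small numpy sketch (independent of the PR's API, which it does not assume) comparing the finite-sample CAP curve of a constant predictor against the y = x "chance level" diagonal:

```python
import numpy as np

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=50)            # small finite test set
y_const = np.zeros_like(y_true, dtype=float)    # constant predictions

# With constant scores, the ranking is arbitrary: a stable sort keeps the
# original (here random) sample order, so positives arrive at random positions.
order = np.argsort(-y_const, kind="stable")
y_sorted = y_true[order]
x = np.arange(1, len(y_sorted) + 1) / len(y_sorted)
y_curve = np.cumsum(y_sorted) / y_sorted.sum()

# On a finite sample the curve only approximates the y = x "chance level"
# diagonal; the maximum gap shrinks as the sample size grows.
max_gap = np.max(np.abs(y_curve - x))
```

For ROC, by contrast, the constant predictor's curve on a finite test set falls exactly on the diagonal, which is why the distinction only matters for CAP.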
I'm a bit short of time at the moment, but to sum up the TODO:
Feel free to add any other missing elements.
lucyleeow left a comment:
This is a very thorough PR, thanks @JosephBARBIERDARNAL!
A very high level review, mostly nitpicks.
> a :class:`~sklearn.metrics.CAPCurveDisplay`. All parameters are
> stored as attributes.
>
> Read more in the :ref:`User Guide <visualizations>`.

We've decided to add a reference to both the visualizations.rst page and the page in the user guide that talks about the curve, see #31238.
> name : str, default=None
>     Name of CAP curve for labeling. If `None`, name will be set to
>     `"Classifier"` or `"Regressor"`.

I know we do this elsewhere, but I don't think these default names add much and I'd rather not have them. WDYT @glemaitre?
> chance_level_ : matplotlib Artist
>     Curve of the independent classifier.
>
> perfect_level_ : matplotlib Artist
>     Curve of the perfect classifier.

Are these lines only relevant for classification, and not for regression?
Co-authored-by: Lucy Liu <jliu176@gmail.com>
Updated with main just so the CIs are run again.
Thanks for your patience @JosephBARBIERDARNAL! Please let me know if you need any assistance.

This will get the CI green at least.

The other CI failure is because the parameters in the docstring do not match the order of the parameters in the method signature. I think this is the case for all 3: `plot`, `from_estimator` and `from_predictions`.
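A quick way to spot this kind of mismatch locally, as a rough sketch (scikit-learn's CI relies on numpydoc-based validation, so this simplified parser is only an approximation):

```python
import inspect
import re

def docstring_param_order(func):
    """Extract parameter names, in order, from a numpydoc-style docstring."""
    doc = inspect.getdoc(func) or ""
    # Grab the body of the "Parameters" section, up to the next section header.
    match = re.search(r"Parameters\n-{4,}\n(.*?)(?:\n\w+\n-{4,}|\Z)", doc, re.S)
    if match is None:
        return []
    # Parameter entries look like "name : type" at column 0 after dedenting.
    return re.findall(r"^(\w+) :", match.group(1), re.M)

def example(a, b, c=1):
    """Demo function whose docstring lists parameters out of order.

    Parameters
    ----------
    a : int
        First.
    c : int, default=1
        Third (listed before ``b``: mismatch!).
    b : int
        Second.
    """

signature_order = list(inspect.signature(example).parameters)
doc_order = docstring_param_order(example)
# The two orders differ here, which is the kind of error the CI reports.
```

Comparing `doc_order` against `signature_order` for `plot`, `from_estimator` and `from_predictions` would reveal which docstrings need reordering.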
```diff
-    ls="--",
-    color="k",
+    curve_kwargs={"ls": "--", "color": "k"},
```

I think this is why the doc build is failing.
The same suggestion applies at a second location:

```diff
-    ls="--",
-    color="k",
+    curve_kwargs={"ls": "--", "color": "k"},
```
@lucyleeow thanks! I haven't forgotten about that PR, but I've been a little busier than expected. I think there's still a lot of work to be done here, so I hope to get back to it soon!
I happened to be showing someone how to resolve conflicts in git, and this PR was the first one I found with conflicts. Now that the work is done, I may as well push it into your branch. I think there are more CI issues, but I will leave this exercise for the author 😉.

Reference Issue
Fixes #10003.
Supersedes #15176. (edit by @lorentzenchr)
What does this implement/fix?
creation of a CumulativeAccuracyDisplay class for plots
"The CAP of a model represents the cumulative number of positive outcomes along the y-axis versus the corresponding cumulative number of a classifying parameter along the x-axis. The output is called a CAP curve.[1] The CAP is distinct from the receiver operating characteristic (ROC) curve, which plots the true-positive rate against the false-positive rate." (Wikipedia definition)
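As a concrete illustration of the definition above, a minimal numpy sketch of the CAP curve computation (independent of the display class proposed in this PR):

```python
import numpy as np

def cap_curve(y_true, y_score):
    """Sort samples by decreasing score, then return the fraction of samples
    targeted (x) against the cumulative fraction of positives captured (y)."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(y_score), kind="stable")
    y_sorted = y_true[order]
    x = np.arange(1, len(y_sorted) + 1) / len(y_sorted)
    y = np.cumsum(y_sorted) / y_sorted.sum()
    return x, y

x, y = cap_curve([1, 0, 1, 1, 0], [0.9, 0.1, 0.8, 0.4, 0.3])
# Here the top-ranked 60% of samples capture all three positives.
```

A perfect model's curve rises to 1 after targeting only the positives, while a non-informative model's curve stays near the diagonal.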
It's mainly inspired by the `RocCurveDisplay` class.
It's currently a work in progress.
TODO

Binary classification

- `ValueError` in `from_estimator` if the estimator is not fitted or is a classifier that was fitted with more than 3 classes;
- `pos_label` handling of the positive class;
- test `response_method="decision_function"` and `response_method="predict_proba"` for a `LogisticRegression` classifier fit with string labels and for all 3 possible values of `pos_label`;
- `test_display_from_estimator_and_from_prediction`;
- check that `y_true_cumulative` and `cumulative_total` have the same dtype as `y_pred` in the test about `from_predictions`. We can test for `y_pred` passed either as `np.float32` or `np.float64`;
- check that `CAPCurveDisplay.from_estimator(LinearSVC().fit(X, y), ...)` works (even if it does not have a `predict_proba` method). This should cover one of the lines reported as uncovered by codecov;
- use `test_common_curve_display.py` to reuse some generic tests on `CAPCurveDisplay`, and maybe remove redundant tests on invalid inputs from `test_cap_curve_display.py`, if any;
- `despine` argument? This also concerns the other `*Display` classes in scikit-learn. Feel free to open an issue to discuss this with screenshots, e.g. on ROC or PR curves, and your analysis of pros and cons (see the `despine` keyword for ROC and PR curves, #26367). I'm not sure it makes much sense for `ConfusionMatrixDisplay` (?). I'll open an issue (when this PR is merged) for `CAPCurveDisplay`, `PredictionErrorDisplay` and `DetCurveDisplay` because I think they're the only ones that don't have this option.

Regression

- `ValueError` with an informative error message if `y_true` has negative values;
- `ValueError` if all `y_true` are zeros (the plot would be degenerate and would raise a low level `divide by zero` warning with `normalize_scale=True`);
- if `y_true` are zeros, it will be considered a case of classification;
- fit a regressor (e.g. `PoissonRegressor`) and check that the regressor curve lies between the "chance level" and "perfect" curves;
- update the examples (`examples/linear_model/plot_tweedie_regression_insurance_claims.py` and `examples/linear_model/plot_poisson_regression_non_normal_loss.py`) to use the `CAPCurveDisplay` class instead of manually plotting the Lorenz curves.

Other

- `doc/whats_new/upcoming_changes`;
- the visualization docs (`doc/visualization`) to reference this new tool.

Nice to have