
FEA Add metric_at_thresholds #32732

Merged
lucyleeow merged 34 commits into scikit-learn:main from lucyleeow:metric_per_threshold_2
Mar 7, 2026

Conversation

@lucyleeow
Member

@lucyleeow lucyleeow commented Nov 18, 2025

Reference Issues/PRs

Another approach for #31338 (which superseded #25639; closes #25639).
Closes #21391

What does this implement/fix? Explain your changes.

Uses the approach suggested by @jeremiedbb (#31338 (comment)) of re-using confusion_matrix_at_thresholds (previously called _binary_clf_curve) to implement the curve calculation: given a callable metric, it calculates the metric for each threshold.

I think I like this approach better:

  • Brings decision_threshold_curve in line with the other curve functions we have (det_curve, precision_recall_curve, roc_curve), which all use confusion_matrix_at_thresholds.
    • all of these curve functions then undergo the same validation, and fixes or changes are more easily made across all curve functions at once
  • As you can see, far fewer code changes are required for this vs using _CurveScorer
    • we could undo the changes in MNT Moving _CurveScorer from model_selection to metrics #29216
    • parts of _CurveScorer are probably unnecessary for this function, e.g., sign: as this is a stand-alone function (vs being used inside an estimator such as TunedThresholdClassifierCV), sign is probably not needed and the user can manipulate the output as required. Also, _CurveScorer is particular about mapping thresholded labels to the original classes (requiring a call to unique(y)), which isn't really needed in this case
    • having to instantiate a class for this function is probably over-complicating things
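
For readers skimming the thread, the idea can be sketched as a naive loop: evaluate the callable metric on hard predictions obtained by thresholding y_score at each unique score value. This is only an illustration of the output contract; the function name and return order here are assumptions based on the discussion below, and the actual PR derives the counts for all thresholds in one pass via confusion_matrix_at_thresholds rather than looping.

```python
import numpy as np
from sklearn.metrics import accuracy_score

def metric_per_threshold_sketch(y_true, y_score, metric_func):
    # Candidate thresholds: the unique score values in decreasing order,
    # mirroring the other curve functions (roc_curve, det_curve, ...).
    thresholds = np.unique(y_score)[::-1]
    # Naive O(n_thresholds * n_samples) loop: binarize at each threshold
    # and call the metric on the resulting hard predictions.
    scores = np.array(
        [metric_func(y_true, (y_score >= t).astype(int)) for t in thresholds]
    )
    return scores, thresholds

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
scores, thresholds = metric_per_threshold_sketch(y_true, y_score, accuracy_score)
# thresholds: 0.8, 0.4, 0.35, 0.1
# scores:     0.75, 0.5, 0.75, 0.5
```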

Any other comments?

Working on this also raised two questions:

  • I noticed that in sklearn/metrics most modules have a leading _ (e.g., _regression.py). I thought this implied private modules/functions?
  • This function was originally named metric_per_threshold, then changed to decision_threshold_curve; since it is already in the metrics module, the name metric was considered redundant (see FEA Implementation of "threshold-dependent metric per threshold value" curve #25639 (comment)). Is 'decision' meant to imply that this curve is meant to help with determining the threshold to use? Is the term commonly used (genuine question, as I don't have the background in this area)?

cc @jeremiedbb @glemaitre , and just in case you have time @StefanieSenger

@github-actions

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 51a8fbd. Link to the linter CI: here

@lucyleeow lucyleeow changed the title FEA Add decision_threshold_curve FEA Add decision_threshold_curve (approach 2) Nov 18, 2025
@adrinjalali adrinjalali moved this to Todo in Labs Dec 10, 2025
@adrinjalali adrinjalali added this to Labs Dec 10, 2025
@StefanieSenger StefanieSenger moved this from Todo to In progress in Labs Dec 11, 2025
@adrinjalali adrinjalali moved this from In progress to In progress - High Priority in Labs Jan 6, 2026
@lucyleeow
Member Author

ping @jeremiedbb if you have time to take a look, thank you

Member

@jeremiedbb jeremiedbb left a comment


This is exactly what I had in mind. The implementation is a lot more natural than the approach of #31338.

Here are a few comments, but it already looks good.

@StefanieSenger
Member

StefanieSenger commented Jan 27, 2026

This function was originally named metric_per_threshold , then changed to decision_threshold_curve, since it is already in the metric module, the name metric was considered redundant (see #25639 (comment)). Is 'decision' meant to imply that this curve is meant to help with determining the threshold to use? Is term commonly used (genuine question, as I don't have the background in this area)?

I had also stumbled over this, and decision somehow sounds more complicated to me than it should? For me, metric_threshold_curve or something similar would sound more intuitive. Same here, I haven't worked in this area, but I wanted to surface this question, because I think it is important to find a name that users would recognise and be immediately excited about.

Member

@StefanieSenger StefanieSenger left a comment


Thanks for the PR, @lucyleeow! I find this function really useful and also learned a lot while looking through your PR.

Here is some feedback and some questions, too.

@auguste-probabl auguste-probabl moved this from In progress - High Priority to In progress in Labs Feb 9, 2026
@jeremiedbb
Member

The function should be referenced in api_reference.py since it's a public function.

@StefanieSenger
Member

StefanieSenger commented Feb 23, 2026

FYI: I plan to review this PR tomorrow.

Edit: Sorry, I need a bit longer, can only start later today.

array([0.8 , 0.4 , 0.35, 0.1 ])
>>> scores
array([0.75, 0.5 , 0.75, 0.5 ])
"""
Member


What do you think about raising an error if a metric_func is passed that accepts y_score instead of y_pred?

Right now, this just passes and returns useless values.
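
A hedged sketch of what such an early check might look like (purely hypothetical, not part of the PR; the helper name is invented, and as discussed further down, parameter naming in sklearn.metrics is not consistent enough for this to be a hard ValueError, so it is shown as a heuristic only):

```python
import inspect

from sklearn.metrics import accuracy_score, roc_auc_score

def metric_looks_thresholded(metric_func):
    """Heuristic: does the metric's second argument look like hard labels?

    Inspects the parameter name only; this hypothetical guard would
    misfire on metrics such as d2_log_loss_score, whose ``y_pred``
    actually takes predicted probabilities.
    """
    params = list(inspect.signature(metric_func).parameters)
    return len(params) > 1 and params[1] == "y_pred"

print(metric_looks_thresholded(accuracy_score))  # True: expects y_pred
print(metric_looks_thresholded(roc_auc_score))   # False: expects y_score
```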

Member

@StefanieSenger StefanieSenger Feb 26, 2026


Though trying something like this

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import top_k_accuracy_score, metric_at_thresholds

X, y = make_classification(n_samples=100, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
clf = LogisticRegression()
clf.fit(X_train, y_train)
y_score = clf.predict_proba(X_test)[:, 1]
metric_values, thresholds = metric_at_thresholds(y_test, y_score, top_k_accuracy_score)

returns metric_values, thresholds and raises

UndefinedMetricWarning: 'k' (2) greater than or equal to 'n_classes' (2) will result in a perfect score and is therefore meaningless.

(n_thresholds times).

It can still be confusing. I think we can help users if we raise a ValueError early on.

Member Author


I don't think we are consistent enough that y_pred always means thresholded values, e.g., d2_log_loss_score has y_pred, which takes "Predicted probabilities".

I think the docstring updates you suggested should be enough for now, but if we get issues we can look into changing it?

Comment on lines +91 to +92
:func:`~sklearn.metrics.metric_at_thresholds` allows you to easily generate such plots as it
computes the values required for each axis: scores per threshold and threshold values.
Member


Do we want to (later) have a section in the user guide where we show metric_at_thresholds in action?

I'm working through the workflow, and this is what I currently understand:

  • The thresholds we calculate via y_score = clf.predict_proba(X) are dependent on the data (X) we use to get y_score, so they're limited to what exists in that specific dataset
  • if new data (X_test) produces different y_score, its best threshold might not have been evaluated at all before and we cannot be sure the best threshold for X is the best threshold for X_test, too
  • on the other hand, if we tried to evaluate a metric on the best threshold using the same data (X) that we used to find the threshold, we have data leakage

(Can you verify this is correct, @lucyleeow?)

I'd be happy to document the workflow so users can use this new tool safely.

Member


Do we want to (later) have a section in the user guide where we show metric_at_thresholds in action?

The second step will be to create a Display object to easily plot curves (metric vs threshold) from the results of this function.

Member


if new data (X_test) produces different y_score, its best threshold might not have been evaluated at all before and we cannot be sure the best threshold for X is the best threshold for X_test, too
on the other hand, if we tried to evaluate a metric on the best threshold using the same data (X) that we used to find the threshold, we have data leakage

Not sure what you mean by that. The purpose of this function is not to find the optimal threshold (for that we have the TunedThresholdClassifier). It's to plot a metric against all threshold values to visualize how the metric depends on the threshold. The curve may indeed be a bit different depending on the input data but that's not an issue.

Member

@StefanieSenger StefanieSenger Feb 26, 2026


Oh I see, I misunderstood how this function would be used. Thanks for settling this, @jeremiedbb.

Maybe we can be a bit more explicit in the docstring about it being for visualisation purposes, and also refer to TunedThresholdClassifier from there for threshold tuning, @lucyleeow?

Member Author


Made some amendments, see comment for details.

Ground truth (correct) target labels.

y_score : array-like of shape (n_samples,)
Continuous response scores.
Member

@StefanieSenger StefanieSenger Feb 26, 2026


Could we use a more intuitive description? The other functions in this file phrase it like this:

Suggested change
Continuous response scores.
Target scores, can either be probability estimates, confidence values,
or non-thresholded measure of decisions (as returned by
"decision_function" on some classifiers).

(Though I'm not sure what "confidence values" means.)
Would that also be correct here?

Member Author


This does seem to be the phrasing used for most (all?) y_score definitions in _ranking.py.

Searching for "confidence values", I see in SVC we say: "confidence values of :term:decision_function". I can also see some other cases where we say:

probability estimates of the positive class, confidence values, or binary decisions values.

So we seem to be implying "confidence values" is decision function output? Though here, we specifically list output of decision functions in these _ranking.py metrics.

I don't know if "confidence values" is a technical term? I also wonder if it may cause confusion with "confidence intervals" in statistics. I would consider re-phrasing the y_score parameter descriptions, to be honest, but @jeremiedbb would know much more than me. WDYT?

Member


It also seems to me that there's redundancy between "confidence values" and "non-thresholded measure of decision", but I'm not sure. Looking at the blame, all these "confidence values" occurrences come from 11 years ago.

For this function I'd rather be coherent with the docstring of confusion_matrix_at_thresholds:
"Estimated probabilities or output of a decision function."

(We could still amend both to be a bit more precise: "Estimated probabilities of the positive class or output of a decision function.")

Member


"Estimated probabilities of the positive class or output of a decision function."

That's pretty nice.

(And maybe we can remove "confidence values" from the param description on the other functions in a later PR.)

Member Author


Looking at the blame, all these "confidence values" occurrences come from 11 years ago.

Agreed, and it seems we've copy-pasted all the parameter descriptions for y_score.


Done, and I agree about removing "confidence values" - maybe open a PR and see what others have to say? @StefanieSenger are you interested?

Comment on lines +141 to +149
Visualizing thresholds
----------------------

A useful visualization when tuning the decision threshold is a plot of metric values
across different thresholds. This is particularly valuable when there is more than
one metric of interest. The :func:`~sklearn.metrics.metric_at_thresholds` function
computes metric values at each unique score threshold, returning both the metric
array and corresponding threshold values for easy plotting.
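
To make the quoted section concrete, here is a rough plotting sketch. The per-threshold values are recomputed with a naive loop as a stand-in, since the exact output of metric_at_thresholds is not assumed here; scores are computed on the training data purely for illustration.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score

# Fit a classifier and get continuous scores for the positive class.
X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression().fit(X, y)
y_score = clf.predict_proba(X)[:, 1]

# Per-threshold metric values (stand-in for metric_at_thresholds output).
thresholds = np.unique(y_score)[::-1]
f1 = [f1_score(y, (y_score >= t).astype(int), zero_division=0) for t in thresholds]
recall = [recall_score(y, (y_score >= t).astype(int)) for t in thresholds]

# One line per metric of interest, as the user guide section suggests.
fig, ax = plt.subplots()
ax.plot(thresholds, f1, label="F1")
ax.plot(thresholds, recall, label="recall")
ax.set_xlabel("decision threshold")
ax.set_ylabel("metric value")
ax.legend()
fig.savefig("metric_vs_threshold.png")
```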

Member Author


I've moved this section further down to improve the flow of the docs - now all the TunedThresholdClassifierCV stuff is sequentially together.

This does mean that I do not have a metric per threshold plot to refer to.

I thought about adding a plot to the bottom of the example (with the distribution of decision threshold plot) https://scikit-learn.org/dev/auto_examples/model_selection/plot_tuned_decision_threshold.html, but since we set cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42), there are 50 different estimators and 50 lines to plot. I could have done some interpolation (since y_score would be different for each estimator) and a summary statistic (quantile or mean/std bands), but I was not confident of the statistical validity of this.

I imagine once we add the display class we will want to use it in existing example (probably https://scikit-learn.org/dev/auto_examples/model_selection/plot_cost_sensitive_learning.html) and possibly add another example? We could expand this section then as well.

cc @AnneBeyer who may be interested in adding the display class following this PR

Member

@StefanieSenger StefanieSenger left a comment


Now I have reviewed all the remaining tests. There are only a few nits to address, @lucyleeow. Thanks for your work.

Other than these, this looks ready for merging and I have approved.
Maybe @glemaitre wants to have a look, since he had mentioned starting to review this.

@lucyleeow
Member Author

Thanks @StefanieSenger !

I will merge at the end of this week unless @glemaitre reviews.

@lucyleeow
Member Author

Fixing merge conflicts with #32755 was a bit tricky but I think this is fine to merge now. Thanks all!

@lucyleeow lucyleeow merged commit 3acced3 into scikit-learn:main Mar 7, 2026
37 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in Visualization and displays Mar 7, 2026
@github-project-automation github-project-automation bot moved this from In progress to Done in Labs Mar 7, 2026
@lucyleeow lucyleeow deleted the metric_per_threshold_2 branch March 7, 2026 05:58
@vitaliset
Contributor

Thanks for continuing the work on this, @lucyleeow!! :D Glad to see this was merged and to know that scikit-learn users will have more native tools for threshold optimization! 🎉 🎉 🎉

Next mission: get rid of .predict for classifiers! ☠️ hahah


Successfully merging this pull request may close these issues.

add sklearn.metrics Display class to plot Precision/Recall/F1 for probability thresholds

6 participants