
FEA Add metric_at_thresholds #32732

Merged
lucyleeow merged 34 commits into scikit-learn:main from lucyleeow:metric_per_threshold_2
Mar 7, 2026

Conversation

@lucyleeow
Member

@lucyleeow lucyleeow commented Nov 18, 2025

Reference Issues/PRs

Another approach for #31338 (which superseded #25639; closes #25639).
Closes #21391

What does this implement/fix? Explain your changes.

Uses the approach suggested by @jeremiedbb (#31338 (comment)) of re-using confusion_matrix_at_thresholds (previously called _binary_clf_curve) to implement the curve calculation: given a callable metric, it calculates the metric for each threshold.

I think I like this approach better:

  • Brings decision_threshold_curve in line with the other curve functions we have (det_curve, precision_recall_curve, roc_curve), which all use confusion_matrix_at_thresholds.
    • all of these curve functions then undergo the same validation, and fixes or changes are more easily made across all curve functions at once
  • As you can see, far fewer code changes are required for this vs using _CurveScorer
    • we could undo the changes in MNT Moving _CurveScorer from model_selection to metrics #29216
    • parts of _CurveScorer are probably unnecessary for this function, e.g., sign: as this is a stand-alone function (vs being used inside an estimator such as TunedThresholdClassifierCV), sign is probably not needed and the user can manipulate the output as required. Also, _CurveScorer is particular about mapping thresholded labels to the original classes (requiring a call to unique(y)), which isn't really needed in this case
    • having to instantiate a class for this function is probably over-complicating things
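
For readers skimming the thread, the idea can be sketched as a naive loop: evaluate the callable metric on hard predictions obtained by thresholding y_score at each unique score value. This is only an illustration of the output contract; the function name and return order here are assumptions based on the discussion below, and the actual PR derives the counts for all thresholds in one pass via confusion_matrix_at_thresholds rather than looping.

```python
import numpy as np
from sklearn.metrics import accuracy_score

def metric_per_threshold_sketch(y_true, y_score, metric_func):
    # Candidate thresholds: the unique score values in decreasing order,
    # mirroring the other curve functions (roc_curve, det_curve, ...).
    thresholds = np.unique(y_score)[::-1]
    # Naive O(n_thresholds * n_samples) loop: binarize at each threshold
    # and call the metric on the resulting hard predictions.
    scores = np.array(
        [metric_func(y_true, (y_score >= t).astype(int)) for t in thresholds]
    )
    return scores, thresholds

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
scores, thresholds = metric_per_threshold_sketch(y_true, y_score, accuracy_score)
# thresholds: 0.8, 0.4, 0.35, 0.1
# scores:     0.75, 0.5, 0.75, 0.5
```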

Any other comments?

Working on this also raised two questions:

  • I noticed that in sklearn/metrics most modules have a leading _ (e.g., _regression.py). I thought this implied private modules/functions?
  • This function was originally named metric_per_threshold, then changed to decision_threshold_curve; since it is already in the metrics module, the name metric was considered redundant (see FEA Implementation of "threshold-dependent metric per threshold value" curve #25639 (comment)). Is 'decision' meant to imply that this curve is meant to help with determining the threshold to use? Is the term commonly used (genuine question, as I don't have the background in this area)?

cc @jeremiedbb @glemaitre , and just in case you have time @StefanieSenger

@github-actions

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 51a8fbd. Link to the linter CI: here

@lucyleeow lucyleeow changed the title FEA Add decision_threshold_curve FEA Add decision_threshold_curve (approach 2) Nov 18, 2025
@adrinjalali adrinjalali moved this to Todo in Labs Dec 10, 2025
@adrinjalali adrinjalali added this to Labs Dec 10, 2025
@StefanieSenger StefanieSenger moved this from Todo to In progress in Labs Dec 11, 2025
@adrinjalali adrinjalali moved this from In progress to In progress - High Priority in Labs Jan 6, 2026
@lucyleeow
Member Author

ping @jeremiedbb if you have time to take a look, thank you

Member

@jeremiedbb jeremiedbb left a comment


This is exactly what I had in mind. The implementation is a lot more natural than the approach of #31338.

Here are a few comments, but it already looks good.

@StefanieSenger
Member

StefanieSenger commented Jan 27, 2026

This function was originally named metric_per_threshold , then changed to decision_threshold_curve, since it is already in the metric module, the name metric was considered redundant (see #25639 (comment)). Is 'decision' meant to imply that this curve is meant to help with determining the threshold to use? Is term commonly used (genuine question, as I don't have the background in this area)?

I had also stumbled over this, and decision somehow sounds more complicated to me than it should? For me, metric_threshold_curve or something similar would sound more intuitive. Same here, I haven't worked in this area, but I wanted to surface this question, because I think it is important to find a name that users would recognise and be immediately excited about.

Member

@StefanieSenger StefanieSenger left a comment


Thanks for the PR, @lucyleeow! I find this function really useful and also learned a lot while looking through your PR.

Here is some feedback and some questions, too.

@auguste-probabl auguste-probabl moved this from In progress - High Priority to In progress in Labs Feb 9, 2026
@jeremiedbb
Member

The function should be referenced in api_reference.py since it's a public function.

@StefanieSenger
Member

StefanieSenger commented Feb 23, 2026

FYI: I plan to review this PR tomorrow.

Edit: Sorry, I need a bit longer, can only start later today.

array([0.8 , 0.4 , 0.35, 0.1 ])
>>> scores
array([0.75, 0.5 , 0.75, 0.5 ])
"""
Member


What do you think about raising an error if a metric_func is passed that accepts y_score instead of y_pred?

Right now, this just passes and returns useless values.
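
A hedged sketch of what such an early check might look like (purely hypothetical, not part of the PR; the helper name is invented, and as discussed further down, parameter naming in sklearn.metrics is not consistent enough for this to be a hard ValueError, so it is shown as a heuristic only):

```python
import inspect

from sklearn.metrics import accuracy_score, roc_auc_score

def metric_looks_thresholded(metric_func):
    """Heuristic: does the metric's second argument look like hard labels?

    Inspects the parameter name only; this hypothetical guard would
    misfire on metrics such as d2_log_loss_score, whose ``y_pred``
    actually takes predicted probabilities.
    """
    params = list(inspect.signature(metric_func).parameters)
    return len(params) > 1 and params[1] == "y_pred"

print(metric_looks_thresholded(accuracy_score))  # True: expects y_pred
print(metric_looks_thresholded(roc_auc_score))   # False: expects y_score
```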

Member

@StefanieSenger StefanieSenger Feb 26, 2026


Though trying something like this

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import top_k_accuracy_score, metric_at_thresholds

X, y = make_classification(n_samples=100, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
clf = LogisticRegression()
clf.fit(X_train, y_train)
y_score = clf.predict_proba(X_test)[:, 1]
metric_values, thresholds = metric_at_thresholds(y_test, y_score, top_k_accuracy_score)

returns metric_values, thresholds and raises

UndefinedMetricWarning: 'k' (2) greater than or equal to 'n_classes' (2) will result in a perfect score and is therefore meaningless.

(n_thresholds times).

It can still be confusing. I think we can help users if we raise a ValueError early on.

Member Author


I don't think we are consistent enough that y_pred always means thresholded values, e.g., d2_log_loss_score has y_pred, which takes "Predicted probabilities".

I think the docstring updates you suggested should be enough for now, but if we get issues we can look into changing it?

Comment on lines +91 to +92
:func:`~sklearn.metrics.metric_at_thresholds` allows you to easily generate such plots as it
computes the values required for each axis: scores per threshold and threshold values.
Member


Do we want to (later) have a section in the user guide where we show metric_at_thresholds in action?

I'm working through the workflow, and this is what I currently understand:

  • The thresholds we calculate via y_score = clf.predict_proba(X) are dependent on the data (X) we use to get y_score, so they're limited to what exists in that specific dataset
  • if new data (X_test) produces different y_score, its best threshold might not have been evaluated at all before and we cannot be sure the best threshold for X is the best threshold for X_test, too
  • on the other hand, if we tried to evaluate a metric on the best threshold using the same data (X) that we used to find the threshold, we have data leakage

(Can you verify this is correct, @lucyleeow?)

I'd be happy to document the workflow so users can use this new tool safely.

Member


Do we want to (later) have a section in the user guide where we show metric_at_thresholds in action?

The second step will be to create a Display object to easily plot curves (metric vs threshold) from the results of this function.

Member


if new data (X_test) produces different y_score, its best threshold might not have been evaluated at all before and we cannot be sure the best threshold for X is the best threshold for X_test, too
on the other hand, if we tried to evaluate a metric on the best threshold using the same data (X) that we used to find the threshold, we have data leakage

Not sure what you mean by that. The purpose of this function is not to find the optimal threshold (for that we have the TunedThresholdClassifier). It's to plot a metric against all threshold values to visualize how the metric depends on the threshold. The curve may indeed be a bit different depending on the input data but that's not an issue.

Member

@StefanieSenger StefanieSenger Feb 26, 2026


Oh I see, I misunderstood how this function would be used. Thanks for settling this, @jeremiedbb.

Maybe we can be a bit more explicit in the docstring about it being for visualisation purposes, and also refer to TunedThresholdClassifier from there for threshold tuning, @lucyleeow?

Member Author


Made some amendments, see comment for details.

Ground truth (correct) target labels.

y_score : array-like of shape (n_samples,)
Continuous response scores.
Member

@StefanieSenger StefanieSenger Feb 26, 2026


Could we use a more intuitive description? The other functions in this file phrase it like this:

Suggested change
Continuous response scores.
Target scores, can either be probability estimates, confidence values,
or non-thresholded measure of decisions (as returned by
"decision_function" on some classifiers).

(Though I'm not sure what "confidence values" means.)
Would that also be correct here?

Member Author


This does seem to be the phrasing used for most (all?) y_score definitions in _ranking.py.

Searching for "confidence values", I see in SVC we say: "confidence values of :term:decision_function". I can also see some other cases where we say:

probability estimates of the positive class, confidence values, or binary decisions values.

So we seem to be implying "confidence values" is decision function output? Though here, we specifically list output of decision functions in these _ranking.py metrics.

I don't know if "confidence values" is a technical term? I also wonder if it may cause confusion with "confidence intervals" in statistics. I would consider re-phrasing the y_score parameter descriptions, to be honest, but @jeremiedbb would know much more than me. WDYT?

Member


It also seems to me that there's redundancy between "confidence values" and "non-thresholded measure of decision", but I'm not sure. Looking at the blame, all these "confidence values" occurrences come from 11 years ago.

For this function I'd rather be coherent with the docstring of confusion_matrix_at_thresholds:
"Estimated probabilities or output of a decision function."

(We could still amend both to be a bit more precise: "Estimated probabilities of the positive class or output of a decision function.")

Member


"Estimated probabilities of the positive class or output of a decision function."

That's pretty nice.

(And maybe we can remove "confidence values" from the param description on the other functions in a later PR.)

Member Author


Looking at the blame, all these "confidence values" occurrences come from 11 years ago.

Agreed, and it seems we've copy-pasted all the parameter descriptions for y_score.


Done, and I agree about removing "confidence values" - maybe open a PR and see what others have to say? @StefanieSenger are you interested?

Comment on lines +141 to +149
Visualizing thresholds
----------------------

A useful visualization when tuning the decision threshold is a plot of metric values
across different thresholds. This is particularly valuable when there is more than
one metric of interest. The :func:`~sklearn.metrics.metric_at_thresholds` function
computes metric values at each unique score threshold, returning both the metric
array and corresponding threshold values for easy plotting.
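
To make the quoted section concrete, here is a rough plotting sketch. The per-threshold values are recomputed with a naive loop as a stand-in, since the exact output of metric_at_thresholds is not assumed here; scores are computed on the training data purely for illustration.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score

# Fit a classifier and get continuous scores for the positive class.
X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression().fit(X, y)
y_score = clf.predict_proba(X)[:, 1]

# Per-threshold metric values (stand-in for metric_at_thresholds output).
thresholds = np.unique(y_score)[::-1]
f1 = [f1_score(y, (y_score >= t).astype(int), zero_division=0) for t in thresholds]
recall = [recall_score(y, (y_score >= t).astype(int)) for t in thresholds]

# One line per metric of interest, as the user guide section suggests.
fig, ax = plt.subplots()
ax.plot(thresholds, f1, label="F1")
ax.plot(thresholds, recall, label="recall")
ax.set_xlabel("decision threshold")
ax.set_ylabel("metric value")
ax.legend()
fig.savefig("metric_vs_threshold.png")
```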

Member Author


I've moved this section further down to improve the flow of the docs - now all the TunedThresholdClassifierCV stuff is sequentially together.

This does mean that I do not have a metric per threshold plot to refer to.

I thought about adding a plot to the bottom of the example (with the distribution of decision threshold plot) https://scikit-learn.org/dev/auto_examples/model_selection/plot_tuned_decision_threshold.html, but since we set cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42), there are 50 different estimators and 50 lines to plot. I could have done some interpolation (since y_score would be different for each estimator) and a summary statistic (quantile or mean/std bands), but I was not confident of the statistical validity of this.

I imagine once we add the display class we will want to use it in existing example (probably https://scikit-learn.org/dev/auto_examples/model_selection/plot_cost_sensitive_learning.html) and possibly add another example? We could expand this section then as well.

cc @AnneBeyer who may be interested in adding the display class following this PR

Member

@StefanieSenger StefanieSenger left a comment


Now I have reviewed all the remaining tests. There are only a few nits to address, @lucyleeow. Thanks for your work.

Other than these, this looks ready for merging and I have approved.
Maybe @glemaitre wants to have a look, since he had mentioned starting to review this.

@lucyleeow
Member Author

Thanks @StefanieSenger !

I will merge at the end of this week unless @glemaitre reviews.

@lucyleeow
Member Author

Fixing merge conflicts with #32755 was a bit tricky but I think this is fine to merge now. Thanks all!

@lucyleeow lucyleeow merged commit 3acced3 into scikit-learn:main Mar 7, 2026
37 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in Visualization and displays Mar 7, 2026
@github-project-automation github-project-automation bot moved this from In progress to Done in Labs Mar 7, 2026
@lucyleeow lucyleeow deleted the metric_per_threshold_2 branch March 7, 2026 05:58
@vitaliset
Contributor

Thanks for continuing the work on this, @lucyleeow!! :D Glad to see this was merged and to know that scikit-learn users will have more native tools for threshold optimization! 🎉 🎉 🎉

Next mission: get rid of .predict for classifiers! ☠️ hahah


Successfully merging this pull request may close these issues.

add sklearn.metrics Display class to plot Precision/Recall/F1 for probability thresholds

6 participants