FEA Add metric_at_thresholds #32732
Conversation
decision_threshold_curve (approach 2)

ping @jeremiedbb if you have time to take a look, thank you
jeremiedbb
left a comment
This is exactly what I had in mind. The implementation is a lot more natural than the approach of #31338.
Here are a few comments, but it already looks good.
I had also stumbled over this and
Thanks for the PR, @lucyleeow! I find this function really useful and also learned a lot while looking through your PR.
Here is some feedback and some questions, too.
The function should be referenced in
FYI: I plan to review this PR tomorrow. Edit: Sorry, I need a bit longer; I can only start later today.
doc/whats_new/upcoming_changes/sklearn.metrics/32732.major-feature.rst
```
array([0.8 , 0.4 , 0.35, 0.1 ])
>>> scores
array([0.75, 0.5 , 0.75, 0.5 ])
"""
```
What do you think about raising an error if a `metric_func` is passed that accepts `y_score` instead of `y_pred`? Right now, this just passes and returns useless values.
Though trying something like this

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import top_k_accuracy_score, metric_at_thresholds

X, y = make_classification(n_samples=100, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
clf = LogisticRegression()
clf.fit(X_train, y_train)
y_score = clf.predict_proba(X_test)[:, 1]
metric_values, thresholds = metric_at_thresholds(y_test, y_score, top_k_accuracy_score)
```

returns `metric_values`, `thresholds` and raises

```
UndefinedMetricWarning: 'k' (2) greater than or equal to 'n_classes' (2) will result in a perfect score and is therefore meaningless.
```

(n_thresholds times).
It can still be confusing. I think we can help users if we raise a ValueError early on.
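As an aside, here is a rough sketch of what such an early check could look like. Note that `check_metric_func` and its parameter-name heuristic are hypothetical illustrations, not what the PR implements:

```python
import inspect

from sklearn.metrics import f1_score, top_k_accuracy_score


def check_metric_func(metric_func):
    """Hypothetical guard: reject metrics whose second positional
    parameter is named 'y_score', since those expect ranking scores
    rather than thresholded labels."""
    params = list(inspect.signature(metric_func).parameters)
    if len(params) > 1 and params[1] == "y_score":
        raise ValueError(
            f"{metric_func.__name__} takes 'y_score' and is not a "
            "thresholded metric; pass a metric that accepts 'y_pred'."
        )


check_metric_func(f1_score)  # passes: second parameter is 'y_pred'
try:
    check_metric_func(top_k_accuracy_score)  # second parameter is 'y_score'
except ValueError as exc:
    print(exc)
```

This relies on scikit-learn's consistent `y_true, y_pred` / `y_true, y_score` parameter naming, which (as discussed below) is not guaranteed for every metric.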
I don't think we are consistent enough that `y_pred` always means thresholded values; e.g., `d2_log_loss_score` has `y_pred`, which takes "predicted probabilities".
I think the docstring updates you suggested should be enough for now, but if we get issues we can look into changing it?
> :func:`~sklearn.metrics.metric_at_thresholds` allows you to easily generate such plots as it computes the values required for each axis; scores per threshold and threshold values.
Do we want to (later) have a section in the user guide where we show metric_at_thresholds in action?
I'm working through the workflow, and this is what I currently understand:

- the `thresholds` we calculate via `y_score = clf.predict_proba(X)` are dependent on the data (`X`) we use to get `y_score`, so they're limited to what exists in that specific dataset
- if new data (`X_test`) produces different `y_score`, its best threshold might not have been evaluated at all before, and we cannot be sure the best threshold for `X` is the best threshold for `X_test`, too
- on the other hand, if we tried to evaluate a metric on the best threshold using the same data (`X`) that we used to find the threshold, we have data leakage

(Can you verify this is correct, @lucyleeow?)
I'd be happy to document the workflow so users can use this new tool safely.
> Do we want to (later) have a section in the user guide where we show metric_at_thresholds in action?
The second step will be to create a Display object to easily plot curves (metric vs threshold) from the results of this function.
> if new data (`X_test`) produces different `y_score`, its best threshold might not have been evaluated at all before and we cannot be sure the best threshold for `X` is the best threshold for `X_test`, too
>
> on the other hand, if we tried to evaluate a metric on the best threshold using the same data (`X`) that we used to find the threshold, we have data leakage
Not sure what you mean by that. The purpose of this function is not to find the optimal threshold (for that we have the TunedThresholdClassifier). It's to plot a metric against all threshold values to visualize how the metric depends on the threshold. The curve may indeed be a bit different depending on the input data but that's not an issue.
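For readers following along, the metric-vs-threshold idea can be sketched by hand (using `f1_score` as an example metric; this is an illustration of the concept, not the PR's implementation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Evaluate a thresholded metric at every unique score value, which is
# essentially what metric_at_thresholds automates.
X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression().fit(X, y)
y_score = clf.predict_proba(X)[:, 1]

thresholds = np.unique(y_score)  # candidate cut-offs, sorted ascending
metric_values = np.array(
    [f1_score(y, (y_score >= t).astype(int)) for t in thresholds]
)
# metric_values can now be plotted against thresholds to visualize how
# the metric depends on the decision threshold.
```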
Oh I see, I misunderstood how this function would be used. Thanks for settling this, @jeremiedbb.
Maybe we can be a bit more explicit about it being for visualisation purposes then in the docstring and also refer to TunedThresholdClassifier from there for threshold tuning, @lucyleeow?
Made some amendments, see comment for details.
sklearn/metrics/_ranking.py
```
        Ground truth (correct) target labels.

    y_score : array-like of shape (n_samples,)
        Continuous response scores.
```
Could we use a more intuitive description? The other functions in this file phase it like this:
```diff
-        Continuous response scores.
+        Target scores, can either be probability estimates, confidence values,
+        or non-thresholded measure of decisions (as returned by
+        "decision_function" on some classifiers).
```
(Though I'm not sure what "confidence values" means.)
Would that also be correct here?
This does seem to be the phrasing used for most (all?) y_score definitions in _ranking.py.
Searching for "confidence values", I see in SVC we say: "confidence values of :term:`decision_function`". I can also see some other cases where we say:

> probability estimates of the positive class, confidence values, or binary decisions values.

So we seem to be implying "confidence values" is decision function output? Though here, we specifically list output of decision functions in these _ranking.py metrics.
I don't know if "confidence values" is a technical term? I also wonder if it may cause confusion with "confidence intervals" in statistics. I would consider re-phrasing the `y_score` parameter descriptions, to be honest, but @jeremiedbb would know much more than me - WDYT?
It also seems to me that there's redundancy between "confidence values" and "non-thresholded measure of decision", but I'm not sure. Looking at the blame, all these "confidence values" occurrences come from 11 years ago.
For this function I'd rather be coherent with the docstring of confusion_matrix_at_thresholds:
"Estimated probabilities or output of a decision function."
(We could still amend both to be a bit more precise: "Estimated probabilities of the positive class or output of a decision function.")
> "Estimated probabilities of the positive class or output of a decision function."
That's pretty nice.
(And maybe we can remove "confidence values" from the param description on the other functions in a later PR.)
> Looking at the blame, all these "confidence values" occurrences come from 11 years ago.

Agreed, and it seems we've copy-pasted all the parameter descriptions for `y_score`.
Done, and I agree about removing "confidence values" - maybe open a PR and see what others have to say? @StefanieSenger, are you interested?
```
Visualizing thresholds
----------------------

A useful visualization when tuning the decision threshold is a plot of metric values
across different thresholds. This is particularly valuable when there is more than
one metric of interest. The :func:`~sklearn.metrics.metric_at_thresholds` function
computes metric values at each unique score threshold, returning both the metric
array and corresponding threshold values for easy plotting.
```
I've moved this section further down to improve the flow of the docs - now all the TunedThresholdClassifierCV stuff is sequentially together.
This does mean that I do not have a metric per threshold plot to refer to.
I thought about adding a plot to the bottom of the example (with the distribution of decision threshold plot) https://scikit-learn.org/dev/auto_examples/model_selection/plot_tuned_decision_threshold.html, but since we set `cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)`, there are 50 different estimators and 50 lines to plot. I could have done some interpolation (since `y_score` would be different for each estimator) and a summary statistic (quantile or mean/std bands), but I was not confident of the statistical validity of this.
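For what it's worth, the interpolation idea could be sketched with `np.interp` on a shared threshold grid. The curves below are synthetic stand-ins for the 50 per-estimator curves, not the example's actual data, and the statistical caveats above still apply:

```python
import numpy as np

rng = np.random.default_rng(42)
grid = np.linspace(0.0, 1.0, 101)  # common threshold grid

curves = []
for _ in range(50):  # one curve per fitted estimator
    # Fake per-estimator curve: its own thresholds and metric values.
    thresholds = np.sort(rng.uniform(0.0, 1.0, size=30))
    metric_values = np.clip(
        1.0 - np.abs(thresholds - 0.5) + rng.normal(0.0, 0.05, size=30), 0.0, 1.0
    )
    # Interpolate this curve onto the shared grid.
    curves.append(np.interp(grid, thresholds, metric_values))

curves = np.vstack(curves)
mean_curve = curves.mean(axis=0)                      # mean band center
low, high = np.quantile(curves, [0.25, 0.75], axis=0)  # quantile band
```

`mean_curve`, `low`, and `high` could then be drawn as a single line with a shaded band instead of 50 overlapping lines.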
I imagine once we add the display class we will want to use it in an existing example (probably https://scikit-learn.org/dev/auto_examples/model_selection/plot_cost_sensitive_learning.html) and possibly add another example? We could expand this section then as well.
cc @AnneBeyer who may be interested in adding the display class following this PR
StefanieSenger
left a comment
Now I reviewed all the remaining tests. There are only a few nits to address, @lucyleeow. Thanks for your work.
Other than these, this looks ready for merging and I have approved.
Maybe @glemaitre wants to have a look, since he had mentioned starting to review this.
Thanks @StefanieSenger! I will merge at the end of this week unless @glemaitre reviews.
Fixing merge conflicts with #32755 was a bit tricky, but I think this is fine to merge now. Thanks all!
Thanks for continuing the work on this, @lucyleeow!! :D Glad to see this was merged and to know that scikit-learn users will have more native tools for threshold optimization! 🎉 🎉 🎉 Next mission: get rid of
Reference Issues/PRs
Another approach for #31338 (which superseded #25639 - closes #25639)
closes #21391
What does this implement/fix? Explain your changes.
Uses the approach suggested by @jeremiedbb (#31338 (comment)) of re-using `confusion_matrix_at_thresholds` (previously called `_binary_clf_curve`) to implement the curve calculation - given a callable metric, it calculates the metric for each threshold. I think I like this approach better:

- it puts `decision_threshold_curve` in line with the other curve functions we have (`det_curve`, `precision_recall_curve`, `roc_curve`), which all use `confusion_matrix_at_thresholds`
- it avoids relying on `_CurveScorer` (see "Move `_CurveScorer` from `model_selection` to `metrics`" #29216). Some parts of `_CurveScorer` are probably unnecessary for this function, e.g., `sign` - as this is a stand-alone function (vs being used inside an estimator, `TunedThresholdClassifierCV`), this is probably not needed and the user can manipulate the output as required. Also, `_CurveScorer` is particular in mapping thresholded labels to original classes (requiring a call to `unique(y)`), which isn't really needed in this case

Any other comments?
Working on this also raised two questions:

- In `sklearn/metrics`, most modules have a leading `_` (e.g., `_regression.py`). I thought this implied private modules/functions?
- The function was named `metric_per_threshold`, then changed to `decision_threshold_curve`: since it is already in the `metrics` module, the name "metric" was considered redundant (see FEA Implementation of "threshold-dependent metric per threshold value" curve #25639 (comment)). Is 'decision' meant to imply that this curve is meant to help with determining the threshold to use? Is the term commonly used (genuine question, as I don't have the background in this area)?

cc @jeremiedbb @glemaitre, and just in case you have time, @StefanieSenger