FEA Add array API support for brier_score_loss, log_loss, d2_brier_score and d2_log_loss_score #32422
Conversation
check_consistent_length(y_prob, y_true, sample_weight)
if sample_weight is not None:
-    _check_sample_weight(sample_weight, y_true, force_float_dtype=False)
+    _check_sample_weight(sample_weight, y_prob, force_float_dtype=False)
Updated this because we follow the namespace and device of y_prob, since y_true may or may not be on the xp namespace and device.
check_consistent_length(y_prob, y_true, sample_weight)
if sample_weight is not None:
-    _check_sample_weight(sample_weight, y_true, force_float_dtype=False)
+    _check_sample_weight(sample_weight, y_prob, force_float_dtype=False)
Same as above: updated this because we follow the namespace and device of y_prob, since y_true may or may not be on the xp namespace and device.
ogrisel
left a comment
Thanks @OmarManzoor. Here is a pass of feedback.
Maybe @lucyleeow would also be interested in reviewing and express her opinion on my suggestions below.
y_prob_sum = xp.sum(y_prob, axis=1)

if not xp.all(
    xpx.isclose(
It would be great to contribute an xpx.allclose upstream.
I think it can be done but for now it seems like a straightforward change to add an additional xp.all. What do you think?
Using xp.all is ok for now, I was just suggesting a follow-up improvement.
sklearn/metrics/_classification.py
Outdated
if is_y_true_array_api:
    y_true = _convert_to_numpy(y_true, xp=xp_y_true)
Suggested change:

# For classification metrics, both array API compatible and non array
# API compatible inputs are allowed for y_true: in particular arrays
# that store class labels as Python string with an object dtype cannot
# be represented with non-NumPy namespaces. To avoid having to maintain
# two code branches, we always convert y_true to NumPy and move the
# integer encoded output of LabelBinarizer.transform back to the y_prob
# namespace, irrespective of the original y_true namespace.
if is_y_true_array_api:
    y_true = _convert_to_numpy(y_true, xp=xp_y_true)
Actually, I think we could improve the readability of the _validate_multiclass_probabilistic_prediction function by extracting the namespace-aware one-hot encoding of y_true into its own private helper:
def _one_hot_multiclass_target(y_true, target_xp, target_device, labels=None):
    # Ensure that y_true is numpy, call the LabelBinarizer, perform labels
    # consistency checks and move the result to the target namespace and device.
    ...
    return transformed_labels

We could similarly extract the matching code for the binary case out of _validate_binary_probabilistic_prediction:

def _one_hot_binary_target(y_true, target_xp, target_device, pos_label=None):
    ...
    return transformed_labels

Those helpers might also be reusable to improve array API support for other classification metrics for which y_pred is not probabilistic (e.g. ROC AUC, average precision, confusion matrix based metrics and so on).
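A rough, hypothetical sketch of what the multiclass helper could look like. To keep it dependency-free, numpy.unique stands in for the LabelBinarizer call and the labels consistency checks are skipped:

```python
import numpy as np


def _one_hot_multiclass_target(y_true, target_xp, target_device=None, labels=None):
    # Always one-hot encode in NumPy so that string class labels (object
    # dtype) are supported, then move the integer-encoded result to the
    # namespace/device of y_prob. The real helper would use LabelBinarizer
    # and run the labels consistency checks instead of np.unique.
    y_true = np.asarray(y_true)
    classes = np.unique(y_true if labels is None else np.asarray(labels))
    transformed_labels = (y_true[:, None] == classes[None, :]).astype(np.int64)
    kwargs = {} if target_device is None else {"device": target_device}
    return target_xp.asarray(transformed_labels, **kwargs)


onehot = _one_hot_multiclass_target(np.asarray(["ham", "spam", "ham"]), np)
print(onehot)
```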
@OmarManzoor @ogrisel
To clarify, is the intention that all classification metrics should support string y_true (when it is numpy)? E.g., for accuracy_score:
scikit-learn/sklearn/metrics/_classification.py
Lines 407 to 408 in 473fef0
we would want to change this to something similar-ish to what is in _one_hot_encoding_binary_target, like:

xp_y_true = get_namespace(y_true)
y_true = xp.asarray(y_true, dtype=xp_y_true.int64)
y_true, sample_weight = move_to(y_true, sample_weight, xp=xp, device=device)

or even:

xp_y_true = get_namespace(y_true)
if _is_numpy_namespace(xp_y_true):
    y_true = xp.asarray(y_true, dtype=xp_y_true.int64)
y_true, sample_weight = move_to(y_true, sample_weight, xp=xp, device=device)
(just un-resolving for visibility)
)
y_proba_null = np.average(transformed_labels, axis=0, weights=sample_weight)
y_proba_null = np.tile(y_proba_null, (len(y_true), 1))
transformed_labels = xp.astype(transformed_labels, y_proba.dtype, copy=False)
Maybe we could add a dtype param to the one hot encoding helpers.
But that would create confusion when we don't need such a change. I think it makes more sense to do this where it's required.
Ok, but let's keep that possibility in mind if we repeat these .astype calls in future reuses of the _one_hot_encoding_binary/multiclass_target functions.
# `LabelBinarizer` and then transfer the integer encoded output back to the
# target namespace and device.
if is_y_true_array_api:
    y_true = _convert_to_numpy(y_true, xp=xp_y_true)
Note that in a follow-up PR, we could change the label binarizer to spare this forced NumPy conversion (when using numeric class labels). This would involve adding array API support to LabelBinarizer when sparse_output=False.
But we can probably do that in a follow-up PR.
Note that this discussion caused https://github.com/scikit-learn/scikit-learn/pull/30439/files#r1875958580 to stall in the past but I think we can decouple the concerns.
I agree it can be modified to allow the array API when the inputs are already numeric and compatible. However, I'm not sure how much benefit we would get from that, considering it's basically mainly used for encoding labels.
BTW, I wouldn't be opposed to include the changes of #30439 into this PR and also add array API support for Brier score since they are related functions with shared private helpers.
Maybe after merging this one we can complete log_loss and brier_score in one PR? But we can do it in this PR too, as you prefer.
However, I'm not sure how much benefit we would get from that, considering it's basically mainly used for encoding labels.
The main benefit would be cleaner/simpler code.
Maybe after merging this one we can complete log_loss and brier_score in one PR? But we can do it in this PR too. As you would prefer.
No strong opinion.
And it would be interesting to see the impact of not converting to numpy when running the benchmarks of #32422 (comment), which use integer class values in y_true.
A couple of quick benchmarks with n_samples = 1000000 and n_classes = 10:

Avg execution_time for d2_log_loss_score Numpy (main): 0.5043204307556153
Avg execution_time for d2_brier_score Numpy (main): 0.35432188510894774
Avg execution_time for d2_log_loss_score Numpy (current branch): 0.5363539457321167
Avg execution_time for d2_brier_score Numpy (current branch): 0.36591079235076907
Avg execution_time for d2_log_loss_score Pytorch Cuda (current branch): 0.1941575288772583
Avg execution_time for d2_brier_score Pytorch Cuda (current branch): 0.14546685218811034

Approximate speedup of cuda with respect to main: about 2.6x for d2_log_loss_score and 2.4x for d2_brier_score.

Benchmark script:

from time import time

import numpy as np
import torch as xp
from tqdm import tqdm

from sklearn._config import config_context
from sklearn.metrics import d2_brier_score, d2_log_loss_score

n_samples = 1000000
n_classes = 10

d2_log_times = []
d2_brier_times = []
for _ in tqdm(range(10), desc="Numpy (branch)"):
    y_prob = np.random.rand(n_samples, n_classes).astype(np.float64)
    y_prob = y_prob / y_prob.sum(axis=1, keepdims=True)
    y_true = np.random.randint(low=0, high=10, size=(n_samples,))
    start = time()
    d2_log_loss_score(y_true, y_prob)
    d2_log_times.append(time() - start)
    start = time()
    d2_brier_score(y_true, y_prob)
    d2_brier_times.append(time() - start)
avg_d2_log_time = sum(d2_log_times) / 10
avg_d2_brier_time = sum(d2_brier_times) / 10
print(f"Avg execution_time for d2_log_loss_score Numpy (branch): {avg_d2_log_time}")
print(f"Avg execution_time for d2_brier_score Numpy (branch): {avg_d2_brier_time}")

d2_log_times = []
d2_brier_times = []
for _ in tqdm(range(10), desc="Pytorch Cuda (branch)"):
    y_prob = np.random.rand(n_samples, n_classes).astype(np.float64)
    y_prob = y_prob / y_prob.sum(axis=1, keepdims=True)
    y_true = np.random.randint(low=0, high=10, size=(n_samples,))
    with config_context(array_api_dispatch=True):
        y_prob = xp.asarray(y_prob, device="cuda")
        y_true = xp.asarray(y_true, device="cuda")
        start = time()
        d2_log_loss_score(y_true, y_prob)
        d2_log_times.append(time() - start)
        start = time()
        d2_brier_score(y_true, y_prob)
        d2_brier_times.append(time() - start)
avg_d2_log_time = sum(d2_log_times) / 10
avg_d2_brier_time = sum(d2_brier_times) / 10
print(
    f"Avg execution_time for d2_log_loss_score Pytorch Cuda (branch): {avg_d2_log_time}"
)
print(
    f"Avg execution_time for d2_brier_score Pytorch Cuda (branch): {avg_d2_brier_time}"
)
Thanks @OmarManzoor and @virchan. I just merged. Would any of you be interested in a follow-up PR to tackle https://github.com/scikit-learn/scikit-learn/pull/32422/files#r2413404620?
I can try it out, but if @virchan wants to, he is welcome to do so.
Yea, I'd like to work on adding array API support to
@pytest.mark.parametrize(
    "array_namespace, device_, dtype_name", yield_namespace_device_dtype_combinations()
)
def test_probabilistic_metrics_array_api(
@OmarManzoor what do you think about adding this check to test_common.py and adding a check_array_api_binary_continuous_classification_metric?
For context, I was working on #32755 and looking at our array API tests.
I don't think we test for string y_true in the common tests, but if you want to refactor this into the common tests, that is fine.
We actually don't have one for continuous y_score metrics at all.
But yes, string y_true is not tested either.
Should be reasonable to refactor. And then it's all in one place for future ranking metrics.
try:
    pos_label = _check_pos_label_consistency(pos_label, y_true)
except ValueError:
    classes = np.unique(y_true)
Should this use xp and not np? I would think that there could be a case where the input is xp and the classes are not {0, 1} or {-1, 1}, which would cause a ValueError here.
In this case, I don't think we have tested this scenario. If so, I think this case should probably be included in the common testing of #32755, as I want to be able to capture all cases.
It may be easier to fix in that PR, or in a separate PR, and indicate the test is coming in #32755.
cc @OmarManzoor
We cannot use xp here; we would need to use xp_y_true, as xp will result in errors when y_true consists of strings. But yes, we can fix this in the other PR you are working on, or if it is required earlier I can create a separate PR for this one.
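To illustrate the constraint with a minimal standalone example (not scikit-learn code): string class labels only exist as NumPy arrays, so any fallback that inspects the classes has to run in y_true's own namespace:

```python
import numpy as np

# y_true with string labels can only be a NumPy array; array API
# namespaces such as torch or array-api-strict have no string dtype,
# so the unique-classes lookup must run in y_true's namespace
# (xp_y_true), not in y_prob's namespace.
y_true = np.asarray(["ham", "spam", "ham"])
classes = np.unique(y_true)
print(classes.tolist())  # ['ham', 'spam']
```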
)

transformed_labels = lb.transform(y_true)
transformed_labels = target_xp.asarray(transformed_labels, device=target_device)
Why asarray here and not move_to? Similar question for _one_hot_encoding_binary_target.
move_to was added more recently. It doesn't matter much, though, I think; we can use either.
I agree that for numpy-to-any-xp conversions, both xp.asarray and DLPack via move_to should yield similar outcomes, as I don't think any DLPack-enabled namespace will drop the __array__ protocol / numpy compat.
Is it possible that transformed_labels is not numpy though?
Also, I don't think asarray works when transformed_labels is array-api-strict; from #32755: https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=83092&view=logs&j=dde5042c-7464-5d47-9507-31bdd2ee0a3a&t=4bd2dad8-62b3-5bf9-08a5-a9880c530c94
../1/s/sklearn/metrics/_classification.py:229: in _one_hot_encoding_multiclass_target
transformed_labels = target_xp.asarray(transformed_labels, device=target_device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_ = True
labels = None
lb = LabelBinarizer()
target_device = device(type='cpu')
target_xp = <module 'sklearn.externals.array_api_compat.torch' from '/home/vsts/work/1/s/sklearn/externals/array_api_compat/torch/__init__.py'>
transformed_labels = Array([[1],
[0],
[1],
[0]], dtype=array_api_strict.int64)
xp = <module 'array_api_strict' from '/home/vsts/miniforge3/envs/testvenv/lib/python3.13/site-packages/array_api_strict/__init__.py'>
y_true = Array([1, 0, 1, 0], dtype=array_api_strict.int64)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
obj = Array([[1],
[0],
[1],
[0]], dtype=array_api_strict.int64)
dtype = None, device = device(type='cpu'), copy = None, kwargs = {}
def asarray(
obj: (
Array
| bool | int | float | complex
| NestedSequence[bool | int | float | complex]
| SupportsBufferProtocol
),
/,
*,
dtype: DType | None = None,
device: Device | None = None,
copy: bool | None = None,
**kwargs: Any,
) -> Array:
# torch.asarray does not respect input->output device propagation
# https://github.com/pytorch/pytorch/issues/150199
if device is None and isinstance(obj, torch.Tensor):
device = obj.device
> return torch.asarray(obj, dtype=dtype, device=device, copy=copy, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E RuntimeError: could not retrieve buffer from object
copy = None
device = device(type='cpu')
dtype = None
obj = Array([[1],
[0],
[1],
[0]], dtype=array_api_strict.int64)
Slightly off-topic, does any other metric allow mixed array input support, or just the ones in this PR? (just to help me tackle #32755)
I don't think any other metrics handle strings other than the ones in this PR.
Also, the code snippet you shared seems to suggest that the namespace is torch while the array is from array-api-strict. If we want to handle such combinations, and move_to handles this sort of scenario, I think we will need to use it.
Yeah, that was a separate point from the first one. Obviously array-api-strict to torch is more about tests passing, but it does also demonstrate that it is possible / that we cover the case where y_true / transformed_labels is not numpy.
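A sketch of why a move_to-style helper is more robust than a plain asarray here (illustrative only; scikit-learn's actual move_to helper may differ): the DLPack protocol does not depend on the producer implementing __array__ or the buffer protocol, which is exactly what broke in the traceback above.

```python
import numpy as np


def move_to_sketch(array, xp):
    # Prefer DLPack for cross-namespace transfer; fall back to asarray,
    # which relies on __array__ / the buffer protocol and can fail for
    # strict namespaces such as array-api-strict.
    try:
        return xp.from_dlpack(array)
    except (AttributeError, TypeError, RuntimeError, BufferError):
        return xp.asarray(array)


moved = move_to_sketch(np.arange(4.0), np)
print(moved)  # [0. 1. 2. 3.]
```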
I know CI is passing, but locally I get a test error. Note: scikit-learn/sklearn/metrics/_classification.py, line 3557 in 473fef0. Relevant comment: #32422 (comment)
I cannot reproduce on the current branch.
@lucyleeow could you please run pytest with the

Env:
Not super helpful, I know, but this is the kind of thing where I would look at a CI build log environment (e.g. this macOS arm one from a recent main branch) and play the game of "7 differences" compared to your environment. It could be the scipy version, it could be numpy, who knows...
@lucyleeow since the problem seems to come from scipy, I am running scipy 1.16.3 and cannot reproduce the problem.
If it's scipy's xlogy, it reminds me a lot of #32552, which was happening for scipy 1.15 indeed.
#32552 was exactly it! Thanks team!
Reference Issues/PRs
Towards: #26024
What does this implement/fix? Explain your changes.
Adds array API support for brier_score_loss, log_loss, d2_brier_score and d2_log_loss_score.
Any other comments?
CC: @ogrisel