FEA Add array API support for brier_score_loss, log_loss, d2_brier_score and d2_log_loss_score #32422
Conversation
check_consistent_length(y_prob, y_true, sample_weight)
if sample_weight is not None:
-    _check_sample_weight(sample_weight, y_true, force_float_dtype=False)
+    _check_sample_weight(sample_weight, y_prob, force_float_dtype=False)
Updated this because we follow the namespace and device of y_prob, since y_true may or may not be on the xp namespace and device.
check_consistent_length(y_prob, y_true, sample_weight)
if sample_weight is not None:
-    _check_sample_weight(sample_weight, y_true, force_float_dtype=False)
+    _check_sample_weight(sample_weight, y_prob, force_float_dtype=False)
Same as above: updated this because we follow the namespace and device of y_prob, since y_true may or may not be on the xp namespace and device.
ogrisel
left a comment
Thanks @OmarManzoor. Here is a pass of feedback.
Maybe @lucyleeow would also be interested in reviewing and express her opinion on my suggestions below.
y_prob_sum = xp.sum(y_prob, axis=1)

if not xp.all(
    xpx.isclose(
It would be great to contribute an xpx.allclose upstream.
I think it can be done but for now it seems like a straightforward change to add an additional xp.all. What do you think?
Using xp.all is ok for now, I was just suggesting a follow-up improvement.
sklearn/metrics/_classification.py
Outdated
if is_y_true_array_api:
    y_true = _convert_to_numpy(y_true, xp=xp_y_true)
Suggested change:

# For classification metrics, both array API compatible and non array
# API compatible inputs are allowed for y_true: in particular arrays
# that store class labels as Python string with an object dtype cannot
# be represented with non-NumPy namespaces. To avoid having to maintain
# two code branches, we always convert y_true to NumPy and move the
# integer encoded output of LabelBinarizer.transform back to the y_prob
# namespace, irrespective of the original y_true namespace.
if is_y_true_array_api:
    y_true = _convert_to_numpy(y_true, xp=xp_y_true)
Actually, I think we could improve the readability of the _validate_multiclass_probabilistic_prediction function by extracting the namespace-aware one-hot encoding of y_true into its own private helper:
def _one_hot_multiclass_target(y_true, target_xp, target_device, labels=None):
    # Ensure that y_true is numpy, call the LabelBinarizer, perform labels
    # consistency checks and move the result to the target namespace and device.
    ...
    return transformed_labels

We could similarly extract the matching code for the binary case out of _validate_binary_probabilistic_prediction:

def _one_hot_binary_target(y_true, target_xp, target_device, pos_label=None):
    ...
    return transformed_labels

Those helpers might also be reusable to improve array API support for other classification metrics for which y_pred is not probabilistic (e.g. ROC AUC, average precision, confusion matrix based metrics and so on).
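A rough, hypothetical sketch of what the multiclass helper could look like. To keep it dependency-free, numpy.unique stands in for the LabelBinarizer call and the labels consistency checks are skipped:

```python
import numpy as np


def _one_hot_multiclass_target(y_true, target_xp, target_device=None, labels=None):
    # Always one-hot encode in NumPy so that string class labels (object
    # dtype) are supported, then move the integer-encoded result to the
    # namespace/device of y_prob. The real helper would use LabelBinarizer
    # and run the labels consistency checks instead of np.unique.
    y_true = np.asarray(y_true)
    classes = np.unique(y_true if labels is None else np.asarray(labels))
    transformed_labels = (y_true[:, None] == classes[None, :]).astype(np.int64)
    kwargs = {} if target_device is None else {"device": target_device}
    return target_xp.asarray(transformed_labels, **kwargs)


onehot = _one_hot_multiclass_target(np.asarray(["ham", "spam", "ham"]), np)
print(onehot)
```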
@OmarManzoor @ogrisel
To clarify, is the intention that all classification metrics should support string y_true (when it is numpy)? E.g., for accuracy_score:
scikit-learn/sklearn/metrics/_classification.py
Lines 407 to 408 in 473fef0
we would want to change this to something similar-ish to what is in _one_hot_encoding_binary_target, like:

xp_y_true = get_namespace(y_true)
y_true = xp.asarray(y_true, dtype=xp_y_true.int64)
y_true, sample_weight = move_to(y_true, sample_weight, xp=xp, device=device)

or even:

xp_y_true = get_namespace(y_true)
if _is_numpy_namespace(xp_y_true):
    y_true = xp.asarray(y_true, dtype=xp_y_true.int64)
y_true, sample_weight = move_to(y_true, sample_weight, xp=xp, device=device)
(just un-resolving for visibility)
)
y_proba_null = np.average(transformed_labels, axis=0, weights=sample_weight)
y_proba_null = np.tile(y_proba_null, (len(y_true), 1))
transformed_labels = xp.astype(transformed_labels, y_proba.dtype, copy=False)
Maybe we could add a dtype param to the one hot encoding helpers.
But that would create confusion when we don't need such a change. I think it makes more sense to do this where it's required.
Ok, but let's keep that possibility in mind if we repeat these .astype calls in future reuses of the _one_hot_encoding_binary/multiclass_target functions.
# `LabelBinarizer` and then transfer the integer encoded output back to the
# target namespace and device.
if is_y_true_array_api:
    y_true = _convert_to_numpy(y_true, xp=xp_y_true)
Note that in a follow-up PR, we could change the label binarizer to spare this forced NumPy conversion (when using numeric class labels). This would involve adding array API support to LabelBinarizer when sparse_output=False.
But we can probably do that in a follow-up PR.
Note that this discussion caused https://github.com/scikit-learn/scikit-learn/pull/30439/files#r1875958580 to stall in the past but I think we can decouple the concerns.
I agree it can be modified to allow the array API when the inputs are already numeric and compatible. However, I'm not sure how much benefit we would get from that, considering it's basically mainly used for encoding labels.
BTW, I wouldn't be opposed to include the changes of #30439 into this PR and also add array API support for Brier score since they are related functions with shared private helpers.
Maybe after merging this one we can complete log_loss and brier_score in one PR? But we can do it in this PR too, as you prefer.
However, I'm not sure how much benefit we would get from that, considering it's basically mainly used for encoding labels.
The main benefit would be cleaner/simpler code.
Maybe after merging this one we can complete log_loss and brier_score in one PR? But we can do it in this PR too. As you would prefer.
No strong opinion.
And it would be interesting to see the impact of not converting to numpy when running the benchmarks of #32422 (comment), which use integer class values in y_true.
A couple of quick benchmarks with n_samples = 1000000 and n_classes = 10:

Avg execution_time for d2_log_loss_score Numpy (main): 0.5043204307556153
Avg execution_time for d2_brier_score Numpy (main): 0.35432188510894774
Avg execution_time for d2_log_loss_score Numpy (current branch): 0.5363539457321167
Avg execution_time for d2_brier_score Numpy (current branch): 0.36591079235076907
Avg execution_time for d2_log_loss_score Pytorch Cuda (current branch): 0.1941575288772583
Avg execution_time for d2_brier_score Pytorch Cuda (current branch): 0.14546685218811034

Approximate speedup of cuda with respect to main: about 2.6x for d2_log_loss_score and 2.4x for d2_brier_score.

Benchmark script:

from time import time

import numpy as np
import torch as xp
from tqdm import tqdm

from sklearn._config import config_context
from sklearn.metrics import d2_brier_score, d2_log_loss_score

n_samples = 1000000
n_classes = 10

d2_log_times = []
d2_brier_times = []
for _ in tqdm(range(10), desc="Numpy (branch)"):
    y_prob = np.random.rand(n_samples, n_classes).astype(np.float64)
    y_prob = y_prob / y_prob.sum(axis=1, keepdims=True)
    y_true = np.random.randint(low=0, high=10, size=(n_samples,))
    start = time()
    d2_log_loss_score(y_true, y_prob)
    d2_log_times.append(time() - start)
    start = time()
    d2_brier_score(y_true, y_prob)
    d2_brier_times.append(time() - start)
avg_d2_log_time = sum(d2_log_times) / 10
avg_d2_brier_time = sum(d2_brier_times) / 10
print(f"Avg execution_time for d2_log_loss_score Numpy (branch): {avg_d2_log_time}")
print(f"Avg execution_time for d2_brier_score Numpy (branch): {avg_d2_brier_time}")

d2_log_times = []
d2_brier_times = []
for _ in tqdm(range(10), desc="Pytorch Cuda (branch)"):
    y_prob = np.random.rand(n_samples, n_classes).astype(np.float64)
    y_prob = y_prob / y_prob.sum(axis=1, keepdims=True)
    y_true = np.random.randint(low=0, high=10, size=(n_samples,))
    with config_context(array_api_dispatch=True):
        y_prob = xp.asarray(y_prob, device="cuda")
        y_true = xp.asarray(y_true, device="cuda")
        start = time()
        d2_log_loss_score(y_true, y_prob)
        d2_log_times.append(time() - start)
        start = time()
        d2_brier_score(y_true, y_prob)
        d2_brier_times.append(time() - start)
avg_d2_log_time = sum(d2_log_times) / 10
avg_d2_brier_time = sum(d2_brier_times) / 10
print(
    f"Avg execution_time for d2_log_loss_score Pytorch Cuda (branch): {avg_d2_log_time}"
)
print(
    f"Avg execution_time for d2_brier_score Pytorch Cuda (branch): {avg_d2_brier_time}"
)
Thanks @OmarManzoor and @virchan. I just merged. Would any of you be interested in a follow-up PR to tackle https://github.com/scikit-learn/scikit-learn/pull/32422/files#r2413404620?
I can try it out, but if @virchan wants to, he is welcome to do so.
Yea, I'd like to work on adding array API support to
@pytest.mark.parametrize(
    "array_namespace, device_, dtype_name", yield_namespace_device_dtype_combinations()
)
def test_probabilistic_metrics_array_api(
@OmarManzoor what do you think about adding this check to test_common.py and adding a check_array_api_binary_continuous_classification_metric?
For context, I was working on #32755 and looking at our array API tests.
I don't think we test for string y_true in the common tests, but if you want to refactor this into the common tests, that is fine.
We actually don't have one for continuous y_score metrics at all.
But yes, string y_true is not tested either.
Should be reasonable to refactor. And then it's all in one place for future ranking metrics.
try:
    pos_label = _check_pos_label_consistency(pos_label, y_true)
except ValueError:
    classes = np.unique(y_true)
Should this use xp and not np? I would think that there could be a case where the input is xp and the classes are not {0, 1} or {-1, 1}, which would cause a ValueError here.
In this case, I don't think we have tested this scenario. If so, I think this case should probably be included in the common testing of #32755, as I want to be able to capture all cases.
It may be easier to fix in that PR, or in a separate PR, and indicate the test is coming in #32755.
cc @OmarManzoor
We cannot use xp here; we would need to use xp_y_true, as xp will result in errors when y_true consists of strings. But yes, we can fix this in the other PR you are working on, or if it is required earlier I can create a separate PR for this one.
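To illustrate the constraint with a minimal standalone example (not scikit-learn code): string class labels only exist as NumPy arrays, so any fallback that inspects the classes has to run in y_true's own namespace:

```python
import numpy as np

# y_true with string labels can only be a NumPy array; array API
# namespaces such as torch or array-api-strict have no string dtype,
# so the unique-classes lookup must run in y_true's namespace
# (xp_y_true), not in y_prob's namespace.
y_true = np.asarray(["ham", "spam", "ham"])
classes = np.unique(y_true)
print(classes.tolist())  # ['ham', 'spam']
```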
)

transformed_labels = lb.transform(y_true)
transformed_labels = target_xp.asarray(transformed_labels, device=target_device)
Why asarray here and not move_to? Similar question for _one_hot_encoding_binary_target.
move_to was added more recently. It doesn't matter much, though, I think; we can use either.
I agree that for numpy-to-any-xp conversions, both xp.asarray and DLPack via move_to should yield similar outcomes, as I don't think any DLPack-enabled namespace will drop the __array__ protocol / numpy compat.
Is it possible that transformed_labels is not numpy though?
Also, I don't think asarray works when transformed_labels is array-api-strict; from #32755: https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=83092&view=logs&j=dde5042c-7464-5d47-9507-31bdd2ee0a3a&t=4bd2dad8-62b3-5bf9-08a5-a9880c530c94
../1/s/sklearn/metrics/_classification.py:229: in _one_hot_encoding_multiclass_target
transformed_labels = target_xp.asarray(transformed_labels, device=target_device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_ = True
labels = None
lb = LabelBinarizer()
target_device = device(type='cpu')
target_xp = <module 'sklearn.externals.array_api_compat.torch' from '/home/vsts/work/1/s/sklearn/externals/array_api_compat/torch/__init__.py'>
transformed_labels = Array([[1],
[0],
[1],
[0]], dtype=array_api_strict.int64)
xp = <module 'array_api_strict' from '/home/vsts/miniforge3/envs/testvenv/lib/python3.13/site-packages/array_api_strict/__init__.py'>
y_true = Array([1, 0, 1, 0], dtype=array_api_strict.int64)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
obj = Array([[1],
[0],
[1],
[0]], dtype=array_api_strict.int64)
dtype = None, device = device(type='cpu'), copy = None, kwargs = {}
def asarray(
obj: (
Array
| bool | int | float | complex
| NestedSequence[bool | int | float | complex]
| SupportsBufferProtocol
),
/,
*,
dtype: DType | None = None,
device: Device | None = None,
copy: bool | None = None,
**kwargs: Any,
) -> Array:
# torch.asarray does not respect input->output device propagation
# https://github.com/pytorch/pytorch/issues/150199
if device is None and isinstance(obj, torch.Tensor):
device = obj.device
> return torch.asarray(obj, dtype=dtype, device=device, copy=copy, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E RuntimeError: could not retrieve buffer from object
copy = None
device = device(type='cpu')
dtype = None
obj = Array([[1],
[0],
[1],
[0]], dtype=array_api_strict.int64)
Slightly off-topic, does any other metric allow mixed array input support, or just the ones in this PR? (just to help me tackle #32755)
I don't think any other metrics handle strings other than the ones in this PR.
Also, the code snippet you shared seems to suggest that the namespace is torch while the array is from array-api-strict. If we want to handle such combinations, and move_to handles this sort of scenario, I think we will need to use it.
Yeah, that was a separate point from the first one. Obviously array-api-strict to torch is more about tests passing, but it does also demonstrate that it is possible / that we cover the case where y_true / transformed_labels is not numpy.
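A sketch of why a move_to-style helper is more robust than a plain asarray here (illustrative only; scikit-learn's actual move_to helper may differ): the DLPack protocol does not depend on the producer implementing __array__ or the buffer protocol, which is exactly what broke in the traceback above.

```python
import numpy as np


def move_to_sketch(array, xp):
    # Prefer DLPack for cross-namespace transfer; fall back to asarray,
    # which relies on __array__ / the buffer protocol and can fail for
    # strict namespaces such as array-api-strict.
    try:
        return xp.from_dlpack(array)
    except (AttributeError, TypeError, RuntimeError, BufferError):
        return xp.asarray(array)


moved = move_to_sketch(np.arange(4.0), np)
print(moved)  # [0. 1. 2. 3.]
```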
I know CI is passing, but locally I get a test error. Note: scikit-learn/sklearn/metrics/_classification.py, line 3557 in 473fef0. Relevant comment: #32422 (comment)
I cannot reproduce on the current branch.
@lucyleeow could you please run pytest with the

Env:
Not super helpful, I know, but this is the kind of thing where I would look at a CI build log environment (e.g. this macOS arm one from a recent main branch) and play the game of "7 differences" compared to your environment. It could be the scipy version, it could be numpy, who knows...
@lucyleeow since the problem seems to come from scipy, I am running scipy 1.16.3 and cannot reproduce the problem.
If it's scipy's xlogy, it reminds me a lot of #32552, which was happening for scipy 1.15 indeed.
#32552 was exactly it! Thanks team!
Reference Issues/PRs
Towards: #26024
What does this implement/fix? Explain your changes.
Adds array API support for brier_score_loss, log_loss, d2_brier_score and d2_log_loss_score.
Any other comments?
CC: @ogrisel