
FEA Add array API support for brier_score_loss, log_loss, d2_brier_score and d2_log_loss_score#32422

Merged
ogrisel merged 17 commits into scikit-learn:main from OmarManzoor:array-api-d2-classification
Oct 9, 2025

Conversation

@OmarManzoor
Contributor

@OmarManzoor OmarManzoor commented Oct 7, 2025

Reference Issues/PRs

Towards: #26024

What does this implement/fix? Explain your changes.

Adds array API support for

  • brier_score_loss
  • log_loss
  • d2_brier_score
  • d2_log_loss_score

Any other comments?

CC: @ogrisel

@github-actions

github-actions bot commented Oct 7, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: fcf056a. Link to the linter CI: here

check_consistent_length(y_prob, y_true, sample_weight)
if sample_weight is not None:
    _check_sample_weight(sample_weight, y_true, force_float_dtype=False)
    _check_sample_weight(sample_weight, y_prob, force_float_dtype=False)
Contributor Author

Updated this because we follow the namespace and device of y_prob, since y_true may or may not be on the xp namespace and device.

check_consistent_length(y_prob, y_true, sample_weight)
if sample_weight is not None:
    _check_sample_weight(sample_weight, y_true, force_float_dtype=False)
    _check_sample_weight(sample_weight, y_prob, force_float_dtype=False)
Contributor Author

@OmarManzoor OmarManzoor Oct 8, 2025

Same as above: updated this because we follow the namespace and device of y_prob, since y_true may or may not be on the xp namespace and device.

Member

@ogrisel ogrisel left a comment

Thanks @OmarManzoor. Here is a pass of feedback.

Maybe @lucyleeow would also be interested in reviewing and express her opinion on my suggestions below.

y_prob_sum = xp.sum(y_prob, axis=1)

if not xp.all(
    xpx.isclose(
Member

It would be great to contribute an xpx.allclose upstream.

Contributor Author

I think it can be done, but for now adding an additional xp.all seems like a straightforward change. What do you think?

Member

Using xp.all is ok for now, I was just suggesting a follow-up improvement.
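For context, a minimal sketch of what such an xpx.allclose helper could look like, mirroring the `xp.all(xpx.isclose(...))` pattern discussed above. NumPy stands in for the array API namespace `xp`, and the `allclose` function name here is hypothetical, not an existing array-api-extra API:

```python
import numpy as np


# Hypothetical allclose helper: reduce the elementwise isclose mask to a
# single Python boolean, as `xp.all(xpx.isclose(...))` does in the PR.
# NumPy's own `isclose` stands in for `xpx.isclose` in this sketch.
def allclose(xp, a, b, rtol=1e-05, atol=1e-08):
    return bool(xp.all(xp.isclose(a, b, rtol=rtol, atol=atol)))


# Check that per-row probability sums are (numerically) equal to 1.
y_prob_sum = np.array([1.0, 1.0 + 1e-9, 1.0])
sums_ok = allclose(np, y_prob_sum, np.asarray(1.0))
```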

Comment on lines +231 to +232
if is_y_true_array_api:
    y_true = _convert_to_numpy(y_true, xp=xp_y_true)
Member

Suggested change
if is_y_true_array_api:
    y_true = _convert_to_numpy(y_true, xp=xp_y_true)
# For classification metrics, both array API compatible and non array
# API compatible inputs are allowed for y_true: in particular arrays
# that store class labels as Python string with an object dtype cannot
# be represented with non-NumPy namespaces. To avoid having to maintain
# two code branches, we always convert y_true to NumPy and move the
# integer encoded output of LabelBinarizer.transform back to the y_prob
# namespace, irrespective of the original y_true namespace.
if is_y_true_array_api:
    y_true = _convert_to_numpy(y_true, xp=xp_y_true)

Member

@ogrisel ogrisel Oct 8, 2025

Actually, I think we could improve the readability of the _validate_multiclass_probabilistic_prediction function by extracting the namespace aware one-hot encoding of y_true into its own private helper:

def _one_hot_multiclass_target(y_true, target_xp, target_device, labels=None):
    # Ensure that y_true is numpy, call the LabelBinarizer, perform labels consistency
    # checks and move the result to the target namespace and device.
    ...
    return transformed_labels

We could similarly extract the matching code for the binary case out of _validate_binary_probabilistic_prediction:

def _one_hot_binary_target(y_true, target_xp, target_device, pos_label=None):
    ...
    return transformed_labels

Those helpers might also be reusable to improve array API support for other classification metrics for which y_pred is not probabilistic (e.g. ROC AUC, average precision, confusion-matrix-based metrics, and so on).
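To make the suggestion concrete, here is a hedged sketch of what such a helper could do, with plain NumPy standing in for both LabelBinarizer and the target namespace; the name and signature come from the suggestion above, not from merged code, and device handling is omitted:

```python
import numpy as np


# Hypothetical sketch of `_one_hot_multiclass_target`: encode y_true on
# the NumPy side (where string labels are representable), then move the
# integer-encoded result to the target namespace. The real helper would
# use LabelBinarizer and move the output to `target_device` as well.
def one_hot_multiclass_target(y_true, target_xp, target_device=None, labels=None):
    y_true = np.asarray(y_true)  # y_true is always handled in NumPy here
    classes = np.unique(y_true) if labels is None else np.asarray(labels)
    # Map each sample to its (sorted) class index, then one-hot encode.
    indices = np.searchsorted(classes, y_true)
    transformed_labels = np.eye(classes.shape[0], dtype=np.int64)[indices]
    # Move back to the target namespace (NumPy in this sketch).
    return target_xp.asarray(transformed_labels)


encoded = one_hot_multiclass_target(np.array(["ham", "spam", "ham"]), np)
```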

Member

@OmarManzoor @ogrisel
To clarify, is the intention that all classification metrics should support string y_true (when it is numpy)? E.g., for accuracy_score:

xp, _, device = get_namespace_and_device(y_pred)
y_true, sample_weight = move_to(y_true, sample_weight, xp=xp, device=device)

we would want to change this to something similar to what is in _one_hot_encoding_binary_target, like:

xp_y_true = get_namespace(y_true)
y_true = xp.asarray(y_true, dtype=xp_y_true.int64)

y_true, sample_weight = move_to(y_true, sample_weight, xp=xp, device=device)

or even

xp_y_true = get_namespace(y_true)
if _is_numpy_namespace(xp_y_true):
    y_true = xp.asarray(y_true, dtype=xp_y_true.int64)

y_true, sample_weight = move_to(y_true, sample_weight, xp=xp, device=device)
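As a self-contained illustration of the idea in the sketches above, the following encodes a (possibly string) NumPy y_true as integer codes before moving it to the y_pred namespace. NumPy stands in for the target namespace `xp`, and `encode_then_move` is an illustrative name, not a scikit-learn helper:

```python
import numpy as np


# Hypothetical sketch: string/object labels are integer-encoded on the
# NumPy side, since non-NumPy namespaces cannot represent them; numeric
# labels pass through, and the result moves to the target namespace.
def encode_then_move(y_true, target_xp):
    y_true = np.asarray(y_true)
    if y_true.dtype.kind in ("U", "S", "O"):  # string/object class labels
        _, y_true = np.unique(y_true, return_inverse=True)
    return target_xp.asarray(y_true, dtype=target_xp.int64)


codes = encode_then_move(np.array(["no", "yes", "no"]), np)
```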

Member

(just un-resolving for visibility)

@ogrisel ogrisel added the CUDA CI label Oct 8, 2025
@github-actions github-actions bot removed the CUDA CI label Oct 8, 2025
)
y_proba_null = np.average(transformed_labels, axis=0, weights=sample_weight)
y_proba_null = np.tile(y_proba_null, (len(y_true), 1))
transformed_labels = xp.astype(transformed_labels, y_proba.dtype, copy=False)
Member

Maybe we could add a dtype param to the one hot encoding helpers.

Contributor Author

But that could then create confusion when we don't need such a change. I think it makes more sense to do this only where it's required.

Member

Ok, but let's keep that possibility in mind if we repeat these .astype calls in future reuses of the _one_hot_encoding_binary/multiclass_target functions.
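A minimal sketch of the dtype idea being discussed: give the one-hot helper an optional `dtype` so callers don't need a separate astype afterwards. The function name and signature are illustrative only, not scikit-learn's actual API:

```python
import numpy as np


# Hypothetical one-hot helper with an optional dtype parameter. With
# dtype=None the caller gets integer-encoded labels; otherwise the cast
# (e.g. to y_proba.dtype) happens inside the helper.
def one_hot(y_true_int, n_classes, dtype=None):
    out = np.eye(n_classes, dtype=np.int64)[y_true_int]
    if dtype is not None:
        out = out.astype(dtype)  # spare the caller a follow-up astype
    return out


encoded = one_hot(np.array([0, 2, 1]), 3, dtype=np.float32)
```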

# `LabelBinarizer` and then transfer the integer encoded output back to the
# target namespace and device.
if is_y_true_array_api:
    y_true = _convert_to_numpy(y_true, xp=xp_y_true)
Member

@ogrisel ogrisel Oct 8, 2025

Note that in a follow-up PR, we could change the label binarizer to spare this forced NumPy conversion (when using numeric class labels). This would involve adding array API support to LabelBinarizer when sparse_output=False.

But we can probably do that in a follow-up PR.

Note that this discussion caused https://github.com/scikit-learn/scikit-learn/pull/30439/files#r1875958580 to stall in the past but I think we can decouple the concerns.

Contributor Author

@OmarManzoor OmarManzoor Oct 8, 2025

I agree it can be modified to allow the array API when the inputs are already numeric and compatible. However, I'm not sure how much benefit we would get from that, considering it's mainly used for encoding labels.

Member

BTW, I wouldn't be opposed to include the changes of #30439 into this PR and also add array API support for Brier score since they are related functions with shared private helpers.

Contributor Author

@OmarManzoor OmarManzoor Oct 8, 2025

Maybe after merging this one we can complete log_loss and brier_score in one PR? But we can do it in this PR too, whichever you prefer.

Member

However not sure how much benefit we can get from that considering it's basically mainly used for encoding labels.

The main benefit would be cleaner/simpler code.

Member

Maybe after merging this one we can complete log_loss and brier_score in one PR? But we can do it in this PR too. As you would prefer.

No strong opinion.

Member

And it would be interesting to see the impact of not converting to numpy when running the benchmarks of #32422 (comment) which uses integer class values in y_true.

@OmarManzoor
Contributor Author

A couple of quick benchmarks

n_samples = 1000000
n_classes = 10
Avg execution_time for d2_log_loss_score Numpy (main): 0.5043204307556153
Avg execution_time for d2_brier_score Numpy (main): 0.35432188510894774

Avg execution_time for d2_log_loss_score Numpy (current branch): 0.5363539457321167
Avg execution_time for d2_brier_score Numpy (current branch): 0.36591079235076907

Avg execution_time for d2_log_loss_score Pytorch Cuda (current branch): 0.1941575288772583
Avg execution_time for d2_brier_score Pytorch Cuda (current branch): 0.14546685218811034

Approximate speedup of cuda with respect to main:
d2_log_loss_score: 2.5x
d2_brier_score: 2.4x

Details
from time import time

import numpy as np
import torch as xp
from tqdm import tqdm

from sklearn._config import config_context
from sklearn.metrics import d2_brier_score, d2_log_loss_score

n_samples = 1000000
n_classes = 10

d2_log_times = []
d2_brier_times = []
for _ in tqdm(range(10), desc="Numpy (branch)"):
    y_prob = np.random.rand(n_samples, n_classes).astype(np.float64)
    y_prob = y_prob / y_prob.sum(axis=1, keepdims=True)
    y_true = np.random.randint(low=0, high=10, size=(n_samples,))

    start = time()
    d2_log_loss_score(y_true, y_prob)
    d2_log_times.append(time() - start)

    start = time()
    d2_brier_score(y_true, y_prob)
    d2_brier_times.append(time() - start)

avg_d2_log_time = sum(d2_log_times) / 10
avg_d2_brier_time = sum(d2_brier_times) / 10
print(f"Avg execution_time for d2_log_loss_score Numpy (branch): {avg_d2_log_time}")
print(f"Avg execution_time for d2_brier_score Numpy (branch): {avg_d2_brier_time}")


d2_log_times = []
d2_brier_times = []
for _ in tqdm(range(10), desc="Pytorch Cuda (branch)"):
    y_prob = np.random.rand(n_samples, n_classes).astype(np.float64)
    y_prob = y_prob / y_prob.sum(axis=1, keepdims=True)
    y_true = np.random.randint(low=0, high=10, size=(n_samples,))

    with config_context(array_api_dispatch=True):
        y_prob = xp.asarray(y_prob, device="cuda")
        y_true = xp.asarray(y_true, device="cuda")
        start = time()
        d2_log_loss_score(y_true, y_prob)
        d2_log_times.append(time() - start)

        start = time()
        d2_brier_score(y_true, y_prob)
        d2_brier_times.append(time() - start)

avg_d2_log_time = sum(d2_log_times) / 10
avg_d2_brier_time = sum(d2_brier_times) / 10
print(
    f"Avg execution_time for d2_log_loss_score Pytorch Cuda (branch): {avg_d2_log_time}"
)
print(
    f"Avg execution_time for d2_brier_score Pytorch Cuda (branch): {avg_d2_brier_time}"
)

@github-actions github-actions bot removed the CUDA CI label Oct 8, 2025
@OmarManzoor OmarManzoor changed the title FEA Add array API support for d2_brier_score and d2_log_loss_score FEA Add array API support for brier_score_loss, log_loss, d2_brier_score and d2_log_loss_score Oct 8, 2025
Member

@virchan virchan left a comment

LGTM! Thanks, @OmarManzoor!

@ogrisel ogrisel merged commit f4161e7 into scikit-learn:main Oct 9, 2025
42 checks passed
@ogrisel
Member

ogrisel commented Oct 9, 2025

Thanks @OmarManzoor and @virchan. I just merged. Would either of you be interested in a follow-up PR to tackle https://github.com/scikit-learn/scikit-learn/pull/32422/files#r2413404620?

@OmarManzoor OmarManzoor deleted the array-api-d2-classification branch October 9, 2025 16:14
@OmarManzoor
Contributor Author

I can try it out but if @virchan wants to then he is welcome to do so.

@virchan
Member

virchan commented Oct 10, 2025

Yea, I'd like to work on adding array API support to LabelBinarizer for sparse_output=False.

@pytest.mark.parametrize(
    "array_namespace, device_, dtype_name", yield_namespace_device_dtype_combinations()
)
def test_probabilistic_metrics_array_api(
Member

@OmarManzoor what do you think about adding this check to test_common.py and adding a check_array_api_binary_continuous_classification_metric?

For context, I was working on #32755 and was looking at our array API tests.

Contributor Author

I don't think we test for string y_true in the common tests, but if you want to refactor this into the common tests, that is fine.

Member

We actually don't have a common test for continuous y_score metrics at all. But yes, the string case is also not tested.

Should be reasonable to refactor, and then it's all in one place for future ranking metrics.

try:
    pos_label = _check_pos_label_consistency(pos_label, y_true)
except ValueError:
    classes = np.unique(y_true)
Member

Should this use xp and not np? I would think there could be a case where the input is xp and the classes are not {0, 1} or {-1, 1}, which would cause a ValueError here.

In this case, I don't think we have tested this scenario. If so, I think this case should probably be included in the common testing of #32755, as I want to be able to capture all cases.

It may be easier to fix in that PR or in a separate PR, and indicate that the test is coming in #32755.

cc @OmarManzoor

Contributor Author

We cannot use xp here; we would need to use xp_y_true instead, as xp will result in errors when y_true consists of strings. But yes, we can fix this in the other PR you are working on, or if required earlier I can create a separate PR for this one.

)

transformed_labels = lb.transform(y_true)
transformed_labels = target_xp.asarray(transformed_labels, device=target_device)
Member

Why asarray here and not move_to? Similar question for _one_hot_encoding_binary_target.

@OmarManzoor @ogrisel

Contributor Author

@OmarManzoor OmarManzoor Dec 3, 2025

move_to was added more recently. It doesn't matter much though, I think; we can use either.

Member

@ogrisel ogrisel Dec 3, 2025

I agree that for numpy to any xp conversions, both xp.asarray and dlpack via move_to should yield similar outcomes, as I don't think any dlpack enabled namespace will drop the __array__ protocol / numpy compat.
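To illustrate the two conversion routes, here is a small sketch with NumPy on both sides of the exchange: `asarray` goes through the `__array__` protocol, while `from_dlpack` uses the DLPack protocol (available in NumPy since roughly 1.23). With a real non-NumPy target namespace, the DLPack route is what move_to relies on:

```python
import numpy as np

# The same source array converted via both protocols; for NumPy-to-NumPy
# (and NumPy-to-any dlpack-enabled xp) both routes yield equal values.
src = np.arange(4, dtype=np.float64)
via_array_protocol = np.asarray(src)       # __array__ protocol
via_dlpack = np.from_dlpack(src)           # DLPack exchange protocol
routes_agree = bool((via_array_protocol == via_dlpack).all())
```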

Member

Is it possible that transformed_labels is not numpy, though?

Also, I don't think asarray works when transformed_labels is array-api-strict; from #32755: https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=83092&view=logs&j=dde5042c-7464-5d47-9507-31bdd2ee0a3a&t=4bd2dad8-62b3-5bf9-08a5-a9880c530c94 :

Details
../1/s/sklearn/metrics/_classification.py:229: in _one_hot_encoding_multiclass_target
    transformed_labels = target_xp.asarray(transformed_labels, device=target_device)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        _          = True
        labels     = None
        lb         = LabelBinarizer()
        target_device = device(type='cpu')
        target_xp  = <module 'sklearn.externals.array_api_compat.torch' from '/home/vsts/work/1/s/sklearn/externals/array_api_compat/torch/__init__.py'>
        transformed_labels = Array([[1],
       [0],
       [1],
       [0]], dtype=array_api_strict.int64)
        xp         = <module 'array_api_strict' from '/home/vsts/miniforge3/envs/testvenv/lib/python3.13/site-packages/array_api_strict/__init__.py'>
        y_true     = Array([1, 0, 1, 0], dtype=array_api_strict.int64)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

obj = Array([[1],
       [0],
       [1],
       [0]], dtype=array_api_strict.int64)
dtype = None, device = device(type='cpu'), copy = None, kwargs = {}

    def asarray(
        obj: (
        Array
            | bool | int | float | complex
            | NestedSequence[bool | int | float | complex]
            | SupportsBufferProtocol
        ),
        /,
        *,
        dtype: DType | None = None,
        device: Device | None = None,
        copy: bool | None = None,
        **kwargs: Any,
    ) -> Array:
        # torch.asarray does not respect input->output device propagation
        # https://github.com/pytorch/pytorch/issues/150199
        if device is None and isinstance(obj, torch.Tensor):
            device = obj.device
>       return torch.asarray(obj, dtype=dtype, device=device, copy=copy, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E       RuntimeError: could not retrieve buffer from object

copy       = None
device     = device(type='cpu')
dtype      = None
obj        = Array([[1],
       [0],
       [1],
       [0]], dtype=array_api_strict.int64)

Member

@lucyleeow lucyleeow Dec 4, 2025

Slightly off-topic: does any other metric allow mixed array input support, or just the ones in this PR? (just to help me tackle #32755)

Contributor Author

@OmarManzoor OmarManzoor Dec 4, 2025

I don't think any other metrics handle strings other than the ones in this PR.

Contributor Author

Also, the code snippet you shared seems to suggest that the namespace is torch while the array is from array-api-strict. If we want to handle such combinations, and move_to handles this sort of scenario, I think we will need to use it.

Member

Yeah, that was a separate point from the first one. Obviously array-api-strict to torch is more about tests passing, but it does also demonstrate that it is possible, and that we should cover the case, where y_true / transformed_labels is not numpy.

@lucyleeow
Member

lucyleeow commented Dec 3, 2025

I know CI is passing, but locally I get the following test error for test_probabilistic_metrics_array_api with log_loss and d2_log_loss_score:

_________________________________________________________ test_probabilistic_metrics_array_api[array_api_strict-device_2-float32-True-True-log_loss] _________________________________________________________

prob_metric = <function log_loss at 0x7f182e735260>, str_y_true = True, use_sample_weight = True, array_namespace = 'array_api_strict', device_ = array_api_strict.Device('device1'), dtype_name = 'float32'

    @pytest.mark.parametrize(
        "prob_metric", [log_loss,]
    )
    @pytest.mark.parametrize("str_y_true", [False, True])
    @pytest.mark.parametrize("use_sample_weight", [False, True])
    @pytest.mark.parametrize(
        "array_namespace, device_, dtype_name", yield_namespace_device_dtype_combinations()
    )
    def test_probabilistic_metrics_array_api(
        prob_metric, str_y_true, use_sample_weight, array_namespace, device_, dtype_name
    ):
        """Test that :func:`brier_score_loss`, :func:`log_loss`, func:`d2_brier_score`
        and :func:`d2_log_loss_score` work correctly with the array API for binary
    and multi-class inputs.
        """
        xp = _array_api_for_tests(array_namespace, device_)
        sample_weight = np.array([1, 2, 3, 1]) if use_sample_weight else None
    
        # binary case
        extra_kwargs = {}
        if str_y_true:
            y_true_np = np.array(["yes", "no", "yes", "no"])
            y_true_xp_or_np = np.asarray(y_true_np)
            if "brier" in prob_metric.__name__:
                # `brier_score_loss` and `d2_brier_score` require specifying the
                # `pos_label`
                extra_kwargs["pos_label"] = "yes"
        else:
            y_true_np = np.array([1, 0, 1, 0])
            y_true_xp_or_np = xp.asarray(y_true_np, device=device_)
    
        y_prob_np = np.array([0.5, 0.2, 0.7, 0.6], dtype=dtype_name)
        y_prob_xp = xp.asarray(y_prob_np, device=device_)
        metric_score_np = prob_metric(
            y_true_np, y_prob_np, sample_weight=sample_weight, **extra_kwargs
        )
        with config_context(array_api_dispatch=True):
>           metric_score_xp = prob_metric(
                y_true_xp_or_np, y_prob_xp, sample_weight=sample_weight, **extra_kwargs
            )

sklearn/metrics/tests/test_classification.py:3677: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
sklearn/utils/_param_validation.py:218: in wrapper
    return func(*args, **kwargs)
sklearn/metrics/_classification.py:3381: in log_loss
    return _log_loss(
sklearn/metrics/_classification.py:3396: in _log_loss
    loss = -xp.sum(xlogy(transformed_labels, y_pred), axis=1)
../../../miniconda3/envs/skl-array-api/lib/python3.13/site-packages/scipy/special/_support_alternative_backends.py:167: in wrapped
    return f(*args, **kwargs)
../../../miniconda3/envs/skl-array-api/lib/python3.13/site-packages/scipy/special/_support_alternative_backends.py:76: in __xlogy
    temp = x * xp.log(y)
../../../miniconda3/envs/skl-array-api/lib/python3.13/site-packages/array_api_strict/_array_object.py:858: in __mul__
    other = self._check_allowed_dtypes(other, "numeric", "__mul__")
../../../miniconda3/envs/skl-array-api/lib/python3.13/site-packages/array_api_strict/_array_object.py:215: in _check_allowed_dtypes
    res_dtype = _result_type(self.dtype, other.dtype)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

type1 = array_api_strict.int64, type2 = array_api_strict.float32

    def _result_type(type1: DType, type2: DType) -> DType:
        if (type1, type2) in _promotion_table:
            return _promotion_table[type1, type2]
>       raise TypeError(f"{type1} and {type2} cannot be type promoted together")
E       TypeError: array_api_strict.int64 and array_api_strict.float32 cannot be type promoted together

I updated array-api-strict and array-api-compat and used main. But is it just me or can anyone else reproduce? cc @OmarManzoor @ogrisel

Note in _one_hot_encoding_binary_target we do:

y_true_pos = xp_y_true.asarray(y_true == pos_label, dtype=xp_y_true.int64)

Relevant comment: #32422 (comment)
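For reference, the strict-promotion failure above comes from multiplying int64 one-hot labels by float32 probabilities, which array-api-strict refuses to promote. A minimal NumPy sketch of the fix quoted earlier in the diff (`xp.astype(transformed_labels, y_proba.dtype)`): cast the labels to the probability dtype before the xlogy-style product. The xlogy step is reimplemented here with plain NumPy for self-containment:

```python
import numpy as np

# One-hot labels come out of the binarizer as int64; probabilities may
# be float32. Strict array API promotion rejects int64 * float32.
transformed_labels = np.eye(2, dtype=np.int64)[[1, 0, 1]]
y_pred = np.array([[0.5, 0.5], [0.8, 0.2], [0.3, 0.7]], dtype=np.float32)

# The fix: cast labels to the prediction dtype first (int64 -> float32).
labels_f = transformed_labels.astype(y_pred.dtype)

# xlogy(x, y) = x * log(y), with the convention xlogy(0, y) = 0.
per_sample_loss = -np.sum(
    np.where(labels_f != 0, labels_f * np.log(y_pred), 0.0), axis=1
)
```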

@ogrisel
Member

ogrisel commented Dec 3, 2025

I cannot reproduce on the current main:

Details
pytest sklearn/metrics/tests/test_classification.py -vlk "test_probabilistic_metrics_array_api and strict"
======================================================================================================= test session starts =======================================================================================================
platform darwin -- Python 3.13.7, pytest-9.0.1, pluggy-1.6.0 -- /Users/ogrisel/miniforge3/envs/dev/bin/python3.13
cachedir: .pytest_cache
rootdir: /Users/ogrisel/code/scikit-learn
configfile: pyproject.toml
plugins: anyio-4.11.0, xdist-3.8.0, timeout-2.4.0, run-parallel-0.7.0, cov-7.0.0
collected 519 items / 487 deselected / 32 selected                                                                                                                                                                                
Collected 0 items to run in parallel

sklearn/metrics/tests/test_classification.py::test_probabilistic_metrics_array_api[array_api_strict-device_1-float64-False-False-brier_score_loss] PASSED                                                                   [  3%]
sklearn/metrics/tests/test_classification.py::test_probabilistic_metrics_array_api[array_api_strict-device_1-float64-False-False-log_loss] PASSED                                                                           [  6%]
sklearn/metrics/tests/test_classification.py::test_probabilistic_metrics_array_api[array_api_strict-device_1-float64-False-False-d2_brier_score] PASSED                                                                     [  9%]
sklearn/metrics/tests/test_classification.py::test_probabilistic_metrics_array_api[array_api_strict-device_1-float64-False-False-d2_log_loss_score] PASSED                                                                  [ 12%]
sklearn/metrics/tests/test_classification.py::test_probabilistic_metrics_array_api[array_api_strict-device_1-float64-False-True-brier_score_loss] PASSED                                                                    [ 15%]
sklearn/metrics/tests/test_classification.py::test_probabilistic_metrics_array_api[array_api_strict-device_1-float64-False-True-log_loss] PASSED                                                                            [ 18%]
sklearn/metrics/tests/test_classification.py::test_probabilistic_metrics_array_api[array_api_strict-device_1-float64-False-True-d2_brier_score] PASSED                                                                      [ 21%]
sklearn/metrics/tests/test_classification.py::test_probabilistic_metrics_array_api[array_api_strict-device_1-float64-False-True-d2_log_loss_score] PASSED                                                                   [ 25%]
sklearn/metrics/tests/test_classification.py::test_probabilistic_metrics_array_api[array_api_strict-device_1-float64-True-False-brier_score_loss] PASSED                                                                    [ 28%]
sklearn/metrics/tests/test_classification.py::test_probabilistic_metrics_array_api[array_api_strict-device_1-float64-True-False-log_loss] PASSED                                                                            [ 31%]
sklearn/metrics/tests/test_classification.py::test_probabilistic_metrics_array_api[array_api_strict-device_1-float64-True-False-d2_brier_score] PASSED                                                                      [ 34%]
sklearn/metrics/tests/test_classification.py::test_probabilistic_metrics_array_api[array_api_strict-device_1-float64-True-False-d2_log_loss_score] PASSED                                                                   [ 37%]
sklearn/metrics/tests/test_classification.py::test_probabilistic_metrics_array_api[array_api_strict-device_1-float64-True-True-brier_score_loss] PASSED                                                                     [ 40%]
sklearn/metrics/tests/test_classification.py::test_probabilistic_metrics_array_api[array_api_strict-device_1-float64-True-True-log_loss] PASSED                                                                             [ 43%]
sklearn/metrics/tests/test_classification.py::test_probabilistic_metrics_array_api[array_api_strict-device_1-float64-True-True-d2_brier_score] PASSED                                                                       [ 46%]
sklearn/metrics/tests/test_classification.py::test_probabilistic_metrics_array_api[array_api_strict-device_1-float64-True-True-d2_log_loss_score] PASSED                                                                    [ 50%]
sklearn/metrics/tests/test_classification.py::test_probabilistic_metrics_array_api[array_api_strict-device_2-float32-False-False-brier_score_loss] PASSED                                                                   [ 53%]
sklearn/metrics/tests/test_classification.py::test_probabilistic_metrics_array_api[array_api_strict-device_2-float32-False-False-log_loss] PASSED                                                                           [ 56%]
sklearn/metrics/tests/test_classification.py::test_probabilistic_metrics_array_api[array_api_strict-device_2-float32-False-False-d2_brier_score] PASSED                                                                     [ 59%]
sklearn/metrics/tests/test_classification.py::test_probabilistic_metrics_array_api[array_api_strict-device_2-float32-False-False-d2_log_loss_score] PASSED                                                                  [ 62%]
sklearn/metrics/tests/test_classification.py::test_probabilistic_metrics_array_api[array_api_strict-device_2-float32-False-True-brier_score_loss] PASSED                                                                    [ 65%]
sklearn/metrics/tests/test_classification.py::test_probabilistic_metrics_array_api[array_api_strict-device_2-float32-False-True-log_loss] PASSED                                                                            [ 68%]
sklearn/metrics/tests/test_classification.py::test_probabilistic_metrics_array_api[array_api_strict-device_2-float32-False-True-d2_brier_score] PASSED                                                                      [ 71%]
sklearn/metrics/tests/test_classification.py::test_probabilistic_metrics_array_api[array_api_strict-device_2-float32-False-True-d2_log_loss_score] PASSED                                                                   [ 75%]
sklearn/metrics/tests/test_classification.py::test_probabilistic_metrics_array_api[array_api_strict-device_2-float32-True-False-brier_score_loss] PASSED                                                                    [ 78%]
sklearn/metrics/tests/test_classification.py::test_probabilistic_metrics_array_api[array_api_strict-device_2-float32-True-False-log_loss] PASSED                                                                            [ 81%]
sklearn/metrics/tests/test_classification.py::test_probabilistic_metrics_array_api[array_api_strict-device_2-float32-True-False-d2_brier_score] PASSED                                                                      [ 84%]
sklearn/metrics/tests/test_classification.py::test_probabilistic_metrics_array_api[array_api_strict-device_2-float32-True-False-d2_log_loss_score] PASSED                                                                   [ 87%]
sklearn/metrics/tests/test_classification.py::test_probabilistic_metrics_array_api[array_api_strict-device_2-float32-True-True-brier_score_loss] PASSED                                                                     [ 90%]
sklearn/metrics/tests/test_classification.py::test_probabilistic_metrics_array_api[array_api_strict-device_2-float32-True-True-log_loss] PASSED                                                                             [ 93%]
sklearn/metrics/tests/test_classification.py::test_probabilistic_metrics_array_api[array_api_strict-device_2-float32-True-True-d2_brier_score] PASSED                                                                       [ 96%]
sklearn/metrics/tests/test_classification.py::test_probabilistic_metrics_array_api[array_api_strict-device_2-float32-True-True-d2_log_loss_score] PASSED                                                                    [100%]

=============================================================================================== 32 passed, 487 deselected in 0.53s ================================================================================================

@ogrisel
Member

ogrisel commented Dec 3, 2025

@lucyleeow could you please run pytest with the -l flag (to display the values of local variables in the traceback) and update your comment?

@lucyleeow
Member

lucyleeow commented Dec 3, 2025


_______________________________ test_probabilistic_metrics_array_api[array_api_strict-device_2-float32-True-True-d2_log_loss_score] ________________________________

prob_metric = <function d2_log_loss_score at 0x7f4c45045800>, str_y_true = True, use_sample_weight = True, array_namespace = 'array_api_strict'
device_ = array_api_strict.Device('device1'), dtype_name = 'float32'

    @pytest.mark.parametrize(
        "prob_metric", [brier_score_loss, log_loss, d2_brier_score, d2_log_loss_score]
    )
    @pytest.mark.parametrize("str_y_true", [False, True])
    @pytest.mark.parametrize("use_sample_weight", [False, True])
    @pytest.mark.parametrize(
        "array_namespace, device_, dtype_name", yield_namespace_device_dtype_combinations()
    )
    def test_probabilistic_metrics_array_api(
        prob_metric, str_y_true, use_sample_weight, array_namespace, device_, dtype_name
    ):
        """Test that :func:`brier_score_loss`, :func:`log_loss`, func:`d2_brier_score`
        and :func:`d2_log_loss_score` work correctly with the array API for binary
        and mutli-class inputs.
        """
        xp = _array_api_for_tests(array_namespace, device_)
        sample_weight = np.array([1, 2, 3, 1]) if use_sample_weight else None
    
        # binary case
        extra_kwargs = {}
        if str_y_true:
            y_true_np = np.array(["yes", "no", "yes", "no"])
            y_true_xp_or_np = np.asarray(y_true_np)
            if "brier" in prob_metric.__name__:
                # `brier_score_loss` and `d2_brier_score` require specifying the
                # `pos_label`
                extra_kwargs["pos_label"] = "yes"
        else:
            y_true_np = np.array([1, 0, 1, 0])
            y_true_xp_or_np = xp.asarray(y_true_np, device=device_)
    
        y_prob_np = np.array([0.5, 0.2, 0.7, 0.6], dtype=dtype_name)
        y_prob_xp = xp.asarray(y_prob_np, device=device_)
        metric_score_np = prob_metric(
            y_true_np, y_prob_np, sample_weight=sample_weight, **extra_kwargs
        )
        with config_context(array_api_dispatch=True):
>           metric_score_xp = prob_metric(
                y_true_xp_or_np, y_prob_xp, sample_weight=sample_weight, **extra_kwargs
            )

array_namespace = 'array_api_strict'
device_    = array_api_strict.Device('device1')
dtype_name = 'float32'
extra_kwargs = {}
metric_score_np = 0.34612621977432545
prob_metric = <function d2_log_loss_score at 0x7f4c45045800>
sample_weight = array([1, 2, 3, 1])
str_y_true = True
use_sample_weight = True
xp         = <module 'array_api_strict' from '/home/lucy/miniconda3/envs/skl-array-api/lib/python3.13/site-packages/array_api_strict/__init__.py'>
y_prob_np  = array([0.5, 0.2, 0.7, 0.6], dtype=float32)
y_prob_xp  = Array([0.5, 0.2, 0.7, 0.6], dtype=array_api_strict.float32, device=array_api_strict.Device('device1'))
y_true_np  = array(['yes', 'no', 'yes', 'no'], dtype='<U3')
y_true_xp_or_np = array(['yes', 'no', 'yes', 'no'], dtype='<U3')

sklearn/metrics/tests/test_classification.py:3674: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
sklearn/utils/_param_validation.py:218: in wrapper
    return func(*args, **kwargs)
        args       = (array(['yes', 'no', 'yes', 'no'], dtype='<U3'), Array([0.5, 0.2, 0.7, 0.6], dtype=array_api_strict.float32, device=array_api_strict.Device('device1')))
        func       = <function d2_log_loss_score at 0x7f4c45045760>
        func_sig   = <Signature (y_true, y_pred, *, sample_weight=None, labels=None)>
        global_skip_validation = False
        kwargs     = {'sample_weight': array([1, 2, 3, 1])}
        parameter_constraints = {'labels': ['array-like', None], 'sample_weight': ['array-like', None], 'y_pred': ['array-like'], 'y_true': ['array-like']}
        params     = {'labels': None, 'sample_weight': array([1, 2, 3, 1]), 'y_pred': Array([0.5, 0.2, 0.7, 0.6], dtyp...i_strict.float32, device=array_api_strict.Device('device1')), 'y_true': array(['yes', 'no', 'yes', 'no'], dtype='<U3')}
        prefer_skip_nested_validation = True
        to_ignore  = ['self', 'cls']
sklearn/metrics/_classification.py:3871: in d2_log_loss_score
    numerator = _log_loss(
        _          = True
        device_    = array_api_strict.Device('device1')
        labels     = None
        sample_weight = Array([1, 2, 3, 1], dtype=array_api_strict.int64, device=array_api_strict.Device('device1'))
        transformed_labels = Array([[0, 1], [1, 0], [0, 1], [1, 0]], dtype=array_api_strict.int64, device=array_api_strict.Device('device1'))
        xp         = <module 'array_api_strict' from '/home/lucy/miniconda3/envs/skl-array-api/lib/python3.13/site-packages/array_api_strict/__init__.py'>
        y_pred     = Array([[0.5, 0.5], [0.8, 0.2], [0.3, 0.7], [0.39999998, 0.6]], dtype=array_api_strict.float32, device=array_api_strict.Device('device1'))
        y_pred_null = Array([[0.42857143, 0.57142857], [0.42857143, 0.57142857], [0.42857143, 0.57142857], [0.42857143, 0.57142857]], dtype=array_api_strict.float64, device=array_api_strict.Device('device1'))
        y_true     = array(['yes', 'no', 'yes', 'no'], dtype='<U3')
sklearn/metrics/_classification.py:3396: in _log_loss
    loss = -xp.sum(xlogy(transformed_labels, y_pred), axis=1)
        _          = True
        device_    = array_api_strict.Device('device1')
        eps        = 1.1920928955078125e-07
        normalize  = False
        sample_weight = Array([1, 2, 3, 1], dtype=array_api_strict.int64, device=array_api_strict.Device('device1'))
        transformed_labels = Array([[0, 1], [1, 0], [0, 1], [1, 0]], dtype=array_api_strict.int64, device=array_api_strict.Device('device1'))
        xp         = <module 'array_api_strict' from '/home/lucy/miniconda3/envs/skl-array-api/lib/python3.13/site-packages/array_api_strict/__init__.py'>
        y_pred     = Array([[0.5, 0.5], [0.8, 0.2], [0.3, 0.7], [0.39999998, 0.6]], dtype=array_api_strict.float32, device=array_api_strict.Device('device1'))
../../../miniconda3/envs/skl-array-api/lib/python3.13/site-packages/scipy/special/_support_alternative_backends.py:167: in wrapped
    return f(*args, **kwargs)
        args       = (Array([[0, 1], [1, 0], [0, 1], [1, 0]], dtype=array_api_strict.i..., [0.39999998, 0.6]], dtype=array_api_strict.float32, device=array_api_strict.Device('device1')))
        f          = <function _xlogy.<locals>.__xlogy at 0x7f4c444259e0>
        f_name     = 'xlogy'
        kwargs     = {}
        n_array_args = 2
        xp         = <module 'array_api_strict' from '/home/lucy/miniconda3/envs/skl-array-api/lib/python3.13/site-packages/array_api_strict/__init__.py'>
../../../miniconda3/envs/skl-array-api/lib/python3.13/site-packages/scipy/special/_support_alternative_backends.py:76: in __xlogy
    temp = x * xp.log(y)
        x          = Array([[0, 1], [1, 0], [0, 1], [1, 0]], dtype=array_api_strict.int64, device=array_api_strict.Device('device1'))
        xp         = <module 'array_api_strict' from '/home/lucy/miniconda3/envs/skl-array-api/lib/python3.13/site-packages/array_api_strict/__init__.py'>
        y          = Array([[0.5, 0.5], [0.8, 0.2], [0.3, 0.7], [0.39999998, 0.6]], dtype=array_api_strict.float32, device=array_api_strict.Device('device1'))
../../../miniconda3/envs/skl-array-api/lib/python3.13/site-packages/array_api_strict/_array_object.py:858: in __mul__
    other = self._check_allowed_dtypes(other, "numeric", "__mul__")
        other      = Array([[-0.6931472, -0.6931472], [-0.22314353, -1.609438], [-1.2039728, -0.... [-0.9162908, -0.5108256]], dtype=array_api_strict.float32, device=array_api_strict.Device('device1'))
        self       = Array([[0, 1], [1, 0], [0, 1], [1, 0]], dtype=array_api_strict.int64, device=array_api_strict.Device('device1'))
../../../miniconda3/envs/skl-array-api/lib/python3.13/site-packages/array_api_strict/_array_object.py:215: in _check_allowed_dtypes
    res_dtype = _result_type(self.dtype, other.dtype)
        dtype_category = 'numeric'
        op         = '__mul__'
        other      = Array([[-0.6931472, -0.6931472], [-0.22314353, -1.609438], [-1.2039728, -0.... [-0.9162908, -0.5108256]], dtype=array_api_strict.float32, device=array_api_strict.Device('device1'))
        self       = Array([[0, 1], [1, 0], [0, 1], [1, 0]], dtype=array_api_strict.int64, device=array_api_strict.Device('device1'))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

type1 = array_api_strict.int64, type2 = array_api_strict.float32

    def _result_type(type1: DType, type2: DType) -> DType:
        if (type1, type2) in _promotion_table:
            return _promotion_table[type1, type2]
>       raise TypeError(f"{type1} and {type2} cannot be type promoted together")
E       TypeError: array_api_strict.int64 and array_api_strict.float32 cannot be type promoted together

type1      = array_api_strict.int64
type2      = array_api_strict.float32

../../../miniconda3/envs/skl-array-api/lib/python3.13/site-packages/array_api_strict/_dtypes.py:229: TypeError

Env:

array-api-compat          1.12.0                   pypi_0    pypi
array-api-strict          2.4.1                    pypi_0    pypi

@lesteve
Member

lesteve commented Dec 3, 2025

Not super helpful, I know, but this is the kind of thing where I would look at a CI build log environment, e.g. this macOS arm one from a recent main branch, and play the game of "7 differences" against your environment.

It could be the scipy version, it could be numpy, who knows ...

@ogrisel
Member

ogrisel commented Dec 3, 2025

@lucyleeow since the problem seems to come from scipy's xlogy, can you try to see if you are running the latest scipy version?

I am running scipy 1.16.3 and cannot reproduce the problem.

@lesteve
Member

lesteve commented Dec 3, 2025

@lucyleeow since the problem seems to come from scipy's xlogy, can you try to see if you are running the latest scipy version?

If it's scipy xlogy it reminds me a lot of #32552 which was happening for scipy 1.15 indeed.

@lucyleeow
Member

#32552 was exactly it! Thanks team!
