FIX Improve checking for string and multilabel inputs for classification metrics by lucyleeow · Pull Request #33086 · scikit-learn/scikit-learn

lucyleeow · 2026-01-15T10:14:33Z

Reference Issues/PRs

Fixes #33045

What does this implement/fix? Explain your changes.

Mixed string and number inputs do not raise an error for the following thresholded classification metrics:

- accuracy_score
- hamming_loss
- zero_one_loss
- matthews_corrcoef
- confusion_matrix (only when labels is set)

when the mixed string and number inputs are:

- lists (e.g., ['a', 'b'], [1, 0])
- strings in a NumPy array that is NOT of object dtype, e.g., np.array(['a', 'b'])

Details

Mixed string and number inputs are dealt with via _union1d (see #18192)

scikit-learn/sklearn/metrics/_classification.py

Lines 154 to 165 in 6c084a8

    
        unique_values = _union1d(y_true, y_pred, xp)
    except TypeError as e:
        # We expect y_true and y_pred to be of the same data type.
        # If `y_true` was provided to the classifier as strings,
        # `y_pred` given by the classifier will also be encoded with
        # strings. So we raise a meaningful error
        raise TypeError(
            "Labels in y_true and y_pred should be of the same type. "
            f"Got y_true={xp.unique(y_true)} and "
            f"y_pred={xp.unique(y_pred)}. Make sure that the "
            "predictions provided by the classifier coincides with "
            "the true labels."

However, this only works for NumPy arrays where the strings are explicitly given object dtype:

scikit-learn/sklearn/metrics/tests/test_common.py

Lines 1928 to 1933 in 0707e62

    
    def test_metrics_consistent_type_error(metric_name):
        # check that an understable message is raised when the type between y_true
        # and y_pred mismatch
        rng = np.random.RandomState(42)
        y1 = np.array(["spam"] * 3 + ["eggs"] * 2, dtype=object)
        y2 = rng.randint(0, 2, size=y1.size)

This test does not pass if the inputs are lists or if object dtype is not forced. np.union1d first concatenates the arrays and then calls unique on the result. If the concatenated array is of dtype object, unique raises a TypeError; however, if the dtype is not forced to be object, e.g.:

np.union1d(np.array([1,2]), np.array(['a','b']))

the resulting concatenation is of type 'U' and no error is raised when unique is run. (I am not sure why this case was not considered initially, but I think it would not be unusual/rare for the input to be a string numpy array that is NOT of object type)
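The difference between the two cases can be reproduced with plain NumPy. This is a minimal sketch of the behaviour described above, not code from the PR; the variable names are illustrative only.

```python
import numpy as np

numbers = np.array([1, 2])
fixed_width = np.array(["a", "b"])                # fixed-width unicode dtype
as_objects = np.array(["a", "b"], dtype=object)   # Python str objects

# Fixed-width strings: the concatenation is promoted to a unicode dtype,
# so the numbers silently become strings and unique() can sort them.
merged = np.union1d(numbers, fixed_width)
merged_kind = merged.dtype.kind

# Object dtype: the elements stay Python ints and strs, so the sort inside
# unique() is expected to fail when it compares an int with a str.
try:
    np.union1d(numbers, as_objects)
    object_case_raised = False
except TypeError:
    object_case_raised = True

print(merged, merged_kind, object_case_raised)
```

Only the object-dtype case gives the except branch in _check_targets a TypeError to catch, which is why the fixed-width and list inputs slip through.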

The other classification metrics do raise an error, due to a call to unique_labels (see #33045 for details).

This PR enforces:

thresholded metrics should raise an error if passed inconsistent types of inputs.

for all thresholded metrics for all input types by running unique_labels inside _check_targets. This is an opinionated change so let me explain:

- _check_targets is ONLY used within thresholded metrics in _classification.py, so a non-backwards-compatible change is less of an issue.
  - For the metrics where unique_labels is also called, unique_labels is always called after _check_targets.
- _check_targets and unique_labels raise errors for similar problematic input combinations, except:
  - unique_labels checks that label indicator inputs are of the same size (_check_targets does not).
    - This means that the classification metrics (which previously did not run unique_labels) will have this extra check, which I think is a fix.
  - (Less importantly) _check_targets checks target length consistency and does not allow continuous input or any multioutput target types.
- unique_labels is used elsewhere in the code (e.g., LinearDiscriminantAnalysis), so the changes made are backwards compatible and no change in functionality was made.

Cons of this approach:

The error messages in _check_targets are more specific to classification metrics than those in unique_labels. I added unique_labels right at the end of _check_targets, so all the current _check_targets errors are raised first. The only difference is that for mixed str and int inputs, the error message from unique_labels is raised instead, but I have tried to improve this message.

Pros:

- Lets us remove _union1d from _array_api.py, as this was the only place this function was used.
- Passing y types to unique_labels reduces duplicate type_of_target calls (and thus duplicate unique calls; I know we cache unique, but this is only relevant for NumPy arrays, not necessarily other array API supported arrays).

I did also consider simply changing the logic in:

scikit-learn/sklearn/metrics/_classification.py

Lines 154 to 165 in 6c084a8

    
        unique_values = _union1d(y_true, y_pred, xp)
    except TypeError as e:
        # We expect y_true and y_pred to be of the same data type.
        # If `y_true` was provided to the classifier as strings,
        # `y_pred` given by the classifier will also be encoded with
        # strings. So we raise a meaningful error
        raise TypeError(
            "Labels in y_true and y_pred should be of the same type. "
            f"Got y_true={xp.unique(y_true)} and "
            f"y_pred={xp.unique(y_pred)}. Make sure that the "
            "predictions provided by the classifier coincides with "
            "the true labels."

such that list and non-object NumPy inputs are also handled, but went with the above approach. Happy to take other opinions though.

AI usage disclosure

I used AI assistance for:

Code generation (e.g., when writing an implementation or fixing a bug)
Test/benchmark generation
Documentation (including examples)
Research and understanding

Any other comments?

lucyleeow · 2026-01-15T10:26:29Z

    # all of length 3
    EXAMPLES = [
-        (IND, np.array([[0, 1, 1], [1, 0, 0], [0, 0, 1]])),
+        (IND, np.array([[0, 1], [1, 0], [0, 0]])),


Changed because this test would now fail for the (IND, IND) combination (as the two IND have different sizes).

I checked and all other combinations involving IND would error (expected is None), thus I think it is safe to amend this.

Now this example is basically the same as the next entry, so maybe we can remove this line?
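To illustrate the size mismatch that makes the (IND, IND) combination fail, here is a minimal sketch. The helper `indicator_widths_match` is hypothetical and is not scikit-learn's implementation, and `other_ind` is an assumed stand-in for another two-column label indicator entry in EXAMPLES.

```python
import numpy as np

# The entry before and after the change in the diff above.
old_ind = np.array([[0, 1, 1], [1, 0, 0], [0, 0, 1]])  # 3 samples, 3 labels
new_ind = np.array([[0, 1], [1, 0], [0, 0]])           # 3 samples, 2 labels

# Assumed stand-in for another label indicator example with 2 labels.
other_ind = np.array([[1, 0], [0, 1], [1, 1]])


def indicator_widths_match(*ys):
    """Hypothetical check: do all indicator inputs have the same column count?"""
    return len({y.shape[1] for y in ys}) == 1


print(indicator_widths_match(old_ind, other_ind))  # widths 3 and 2 differ
print(indicator_widths_match(new_ind, other_ind))  # both have width 2
```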

lucyleeow · 2026-01-15T10:29:08Z

    if len(set(isinstance(label, str) for label in ys_labels)) > 1:
-        raise ValueError("Mix of label input types (string and number)")
+        msg_details = "Got " + " ".join([f"{xp.unique(y)}" for y in ys])
+        raise ValueError(f"Mix of label input types (string and number); {msg_details}")


We could raise a TypeError here, as _union1d does, but I think it is better to keep it as a ValueError, since it's more of an input value problem: e.g., the predictions (y2) don't match the true labels (y1). This also keeps the change backwards compatible.
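The check in the diff above can be sketched in pure Python as follows. `check_label_types` is a hypothetical helper written for illustration: it works on plain lists and builds the message with `dict.fromkeys` rather than `xp.unique`, so it only approximates what unique_labels does.

```python
def check_label_types(*ys):
    """Hypothetical helper mirroring the mixed str/number check in the diff."""
    all_labels = set()
    for y in ys:
        all_labels.update(y)
    # One flag per label: more than one distinct flag means strings and
    # non-strings are mixed across the inputs.
    if len({isinstance(label, str) for label in all_labels}) > 1:
        # dict.fromkeys keeps first-seen order and drops duplicates.
        details = "Got " + " ".join(str(list(dict.fromkeys(y))) for y in ys)
        raise ValueError(
            f"Mix of label input types (string and number); {details}"
        )
    return all_labels


try:
    check_label_types(["a", "b", "a"], [1, 0, 1])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Because the check relies on `isinstance(label, str)` rather than on sorting, it does not depend on whether the strings arrive as a list, an object array, or a fixed-width string array.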

lucyleeow · 2026-01-15T10:32:43Z

-    ys_types = set(type_of_target(x) for x in ys)
-    if ys_types == {"binary", "multiclass"}:
-        ys_types = {"multiclass"}
+    if ys_types is None:


This was just to avoid a duplicate type_of_target (and thus unique) call (again, we cache, but this is only relevant for arrays with metadata attr) - but happy to remove as this is a public function.

The idea is that passing in ys_types is an optimisation right? Is it really such a slow down/speed up to do this?

Sorry I missed this. I did not do any benchmarking, and again since we do cache, it is only for specific array inputs where it would actually matter. Probably not a big difference overall, but it also did not seem like such a significant change to incorporate it...
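For readers unfamiliar with the pattern under discussion, this is a rough sketch of an optional precomputed argument that skips repeated type detection. `toy_type_of_target` and `resolve_types` are hypothetical and heavily simplified; they are not the scikit-learn functions, and the call counter exists only to make the saved work visible.

```python
calls = {"n": 0}


def toy_type_of_target(y):
    """Hypothetical, simplified stand-in for type_of_target."""
    calls["n"] += 1
    return "binary" if len(set(y)) <= 2 else "multiclass"


def resolve_types(ys, ys_types=None):
    """Compute the target types only when the caller did not pass them in."""
    if ys_types is None:
        ys_types = {toy_type_of_target(y) for y in ys}
    # A mix of binary and multiclass inputs is treated as multiclass.
    if ys_types == {"binary", "multiclass"}:
        ys_types = {"multiclass"}
    return ys_types


ys = [[0, 1, 0], [0, 1, 2]]
first = resolve_types(ys)                    # detects the type of each input
calls_after_first = calls["n"]
second = resolve_types(ys, ys_types=first)   # reuses the known types
calls_after_second = calls["n"]
```

Whether the saving matters in practice depends on how expensive the detection is for a given array type, which, as noted above, was not benchmarked.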

lucyleeow · 2026-01-15T11:04:36Z

Ping @ogrisel 🙏

ogrisel

I think there is a problem with unique_labels when array API dispatch is enabled (even with NumPy only inputs). See below:

ogrisel · 2026-01-30T10:50:27Z

Labeling this as array API because the remaining problems are array API support related.

lucyleeow · 2026-02-03T09:42:36Z

Thanks @ogrisel ! Tests added! I think this is also a good reminder to add array API tests even for small functions that we convert in PRs, and not rely on the fact that it's tested as part of another function.

lucyleeow · 2026-04-01T07:18:53Z

Gentle ping @ogrisel @lorentzenchr, I think this is ready for another review 🙏

ogrisel

Thanks @lucyleeow. Just minor feedback but otherwise LGTM.

lucyleeow · 2026-04-07T06:32:52Z

Thanks @betatim ! I think this is ready for another look 🙏 thank you!

lucyleeow · 2026-04-28T04:34:19Z

Gentle ping @betatim

lorentzenchr · 2026-04-28T06:42:54Z

+    "input_kind",
+    ["array_of_str_objects", "array_of_fixed_width_strings", "list_of_str_objects"],


Suggested change

-    "input_kind",
-    ["array_of_str_objects", "array_of_fixed_width_strings", "list_of_str_objects"],
+    "y_1",
+    [
+        np.array(["spam"] * 3 + ["eggs"] * 2, dtype=object),  # str object
+        np.array(["spam"] * 3 + ["eggs"] * 2),  # fixed width string
+        ["spam"] * 3 + ["eggs"] * 2,
+    ],

Could save the if else logic inside.

lorentzenchr · 2026-04-28T06:47:22Z

+    if y_type == "binary":
+        if unique_labels_.shape[0] > 2:
+            y_type = "multiclass"


Should we rename the values of y_type?
I find the logic strange. y_type was detected to be binary but it has more than 2 unique values. Then it is a false positive detection - or just a bad name.

Looking at the old code and other instances where this type of code occurs e.g., label_binarize:

scikit-learn/sklearn/preprocessing/_label.py

Lines 598 to 607 in 78d1885

    if y_type == "binary":
        if n_classes == 1:
            if sparse_output:
                return _align_api_if_sparse(sp.csr_array((n_samples, 1), dtype=int))
            else:
                Y = xp.zeros((n_samples, 1), dtype=int_dtype_)
                Y += neg_label
                return Y
        elif n_classes >= 3:
            y_type = "multiclass"

I think it would be the (rare) edge case where there are actually more than 2 classes (e.g., a, b, c), y_pred contains 2 classes (e.g., a, b), and y_true contains 2 classes, but a different 2 (e.g., a, c).
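A minimal sketch of that edge case, using plain Python lists. `toy_type` is a hypothetical simplification of the type detection and is not scikit-learn code.

```python
y_true = ["a", "c", "a", "c"]
y_pred = ["a", "b", "a", "b"]


def toy_type(y):
    """Hypothetical simplification: two or fewer distinct labels is binary."""
    return "binary" if len(set(y)) <= 2 else "multiclass"


# Each input looks binary when inspected on its own.
type_true = toy_type(y_true)
type_pred = toy_type(y_pred)

# The union of labels across both inputs has three classes, so the
# combined problem is upgraded to multiclass.
union_labels = sorted(set(y_true) | set(y_pred))
y_type = type_true
if y_type == "binary" and len(union_labels) > 2:
    y_type = "multiclass"

print(type_true, type_pred, union_labels, y_type)
```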

lorentzenchr · 2026-05-01T10:52:47Z

@lucyleeow Can I merge?

lucyleeow · 2026-05-01T13:53:34Z

@lorentzenchr yes thanks!
I'll look into #33086 (comment) separately

lucyleeow added 2 commits January 15, 2026 16:45

wip

7b53956

update unique labels, and test check target

d97311a

github-actions Bot added module:metrics module:utils labels Jan 15, 2026

lucyleeow added 2 commits January 15, 2026 21:19

improve check target set handling

caf57e0

fix unique label error message

0fc1726

whats new

6719a22

lucyleeow mentioned this pull request Jan 16, 2026

TST Add common test for mixed array API inputs for metrics #32755

Merged

ogrisel self-requested a review January 19, 2026 13:54

ogrisel reviewed Jan 30, 2026

Comment thread sklearn/metrics/_classification.py Outdated

Comment thread sklearn/utils/multiclass.py

ogrisel added the Array API label Jan 30, 2026

github-project-automation Bot added this to Array API Jan 30, 2026

ogrisel moved this to In Progress in Array API Jan 30, 2026

lucyleeow added 6 commits February 2, 2026 19:06

Merge branch 'main' into str_int_metrics

da91e31

fix unique array, add tests

0e5214b

fix comment

e558205

add object to test

a0ea6fd

fix test

1f66072

review

36cfb31

lorentzenchr reviewed Feb 27, 2026

Comment thread doc/whats_new/upcoming_changes/sklearn.metrics/33086.fix.rst Outdated

Comment thread doc/whats_new/upcoming_changes/sklearn.metrics/33086.fix.rst Outdated

Comment thread sklearn/utils/tests/test_multiclass.py Outdated

Comment thread sklearn/utils/tests/test_multiclass.py Outdated

lucyleeow added 5 commits February 28, 2026 11:24

Merge branch 'main' into str_int_metrics

9dad79b

review

c1eb7cb

rm pytorch, improve test

36d8c70

merge main

8e1e494

rm pytorch test

3c0b7f0

update for _array_api_for_tests

e3b1994

ogrisel approved these changes Apr 1, 2026

Comment thread sklearn/metrics/tests/test_common.py Outdated

Comment thread sklearn/utils/tests/test_multiclass.py Outdated

Comment thread sklearn/utils/multiclass.py Outdated

betatim reviewed Apr 1, 2026

Comment thread sklearn/utils/tests/test_multiclass.py Outdated

betatim reviewed Apr 1, 2026

Comment thread sklearn/utils/multiclass.py Outdated

lucyleeow added 3 commits April 2, 2026 11:31

Merge branch 'main' into str_int_metrics

109adf0

review

3a2ef78

fix typo

af7df7a

Merge branch 'main' into str_int_metrics

dcb47ec

lorentzenchr approved these changes Apr 28, 2026

lucyleeow added 2 commits May 1, 2026 17:10

Merge branch 'main' into str_int_metrics

628556f

review

ff3dbe8

lorentzenchr changed the title ~~Improve checking for string and multilabel inputs for classification metrics~~ ENH Improve checking for string and multilabel inputs for classification metrics May 1, 2026

lorentzenchr changed the title ~~ENH Improve checking for string and multilabel inputs for classification metrics~~ FIX Improve checking for string and multilabel inputs for classification metrics May 1, 2026

lorentzenchr merged commit 78d1885 into scikit-learn:main May 1, 2026
38 checks passed

github-project-automation Bot moved this from In Progress to Done in Array API May 1, 2026

lucyleeow deleted the str_int_metrics branch May 2, 2026 00:53

lucyleeow mentioned this pull request May 27, 2026

TST Order array API metric tests alphabetically in test_common.py #34133

Merged

This was referenced Jun 8, 2026

fix: Update lock and remove python limit fo pylate and colbert_engine embeddings-benchmark/mteb#4783

Merged

Sklearn 1.9 is not compatible with ZeroShotClassification embeddings-benchmark/mteb#4784

Closed

tejasnaladala mentioned this pull request Jun 9, 2026

fix: Support scikit-learn 1.9 in ZeroShotClassification embeddings-benchmark/mteb#4790

Merged
