Skip to content

FIX Improve checking for string and multilabel inputs for classification metrics#33086

Merged
lorentzenchr merged 23 commits into
scikit-learn:mainfrom
lucyleeow:str_int_metrics
May 1, 2026
Merged

FIX Improve checking for string and multilabel inputs for classification metrics#33086
lorentzenchr merged 23 commits into
scikit-learn:mainfrom
lucyleeow:str_int_metrics

Conversation

@lucyleeow

@lucyleeow lucyleeow commented Jan 15, 2026

Copy link
Copy Markdown
Member

Reference Issues/PRs

Fixes #33045

What does this implement/fix? Explain your changes.

Mixed string and number inputs do not error for the following thresholded classification metrics:

  • accuracy_score
  • hamming_loss
  • zero_one_loss
  • matthews_corrcoef
  • confusion_matrix - only when labels is set

when the mixed string and number inputs are:

  • lists (e.g., ['a', 'b'], [1, 0])
  • string is a numpy array that is NOT object dtype e.g., np.array(['a','b'])
Details

Mixed string and number inputs are dealt with via _union1d (see #18192)

unique_values = _union1d(y_true, y_pred, xp)
except TypeError as e:
# We expect y_true and y_pred to be of the same data type.
# If `y_true` was provided to the classifier as strings,
# `y_pred` given by the classifier will also be encoded with
# strings. So we raise a meaningful error
raise TypeError(
"Labels in y_true and y_pred should be of the same type. "
f"Got y_true={xp.unique(y_true)} and "
f"y_pred={xp.unique(y_pred)}. Make sure that the "
"predictions provided by the classifier coincides with "
"the true labels."

However this only works for numpy arrays, where the string is specifically specified to be object type:

def test_metrics_consistent_type_error(metric_name):
# check that an understable message is raised when the type between y_true
# and y_pred mismatch
rng = np.random.RandomState(42)
y1 = np.array(["spam"] * 3 + ["eggs"] * 2, dtype=object)
y2 = rng.randint(0, 2, size=y1.size)

This test does not pass if the inputs are list or of dtype object is not forced. np.union1d first concatenates the arrays and then performs unique on the result. If the concatenated array is of dtype object, unique gives TypeError, however if object is not forced to be object, e.g.:

np.union1d(np.array([1,2]), np.array(['a','b']))

the resulting concatenation is of type 'U' and no error is raised when unique is run. (I am not sure why this case was not considered initially, but I think it would not be unusual/rare for the input to be a string numpy array that is NOT of object type)

The other classification metrics do raise an error, due to a call to unique_labels (see #33045 for details).

This PR enforces:

thresholded metrics should raise an error if passed inconsistent types of inputs.

for all thresholded metrics for all input types by running unique_labels inside _check_targets. This is an opinionated change so let me explain:

  • _check_targets is ONLY used within thresholded metrics in _classification.py - thus a non backwards compatible change is less of an issue.
    • for the metrics where unique_labels is also called, unique_labels is always called after _check_targets
  • _check_targets and unique_labels raise errors for similar problematic input combinations, except:
    • unique_labels checks label indicator inputs are of the same size (_check_targets does not)
      • this means that the classification metrics (which previously did not run unique_labels) will have this extra check, which I think is a fix.
    • (less importantly) _check_targets checks target length consistency and does not allow continuous input or any multioutput target types
  • unique_labels is used elsewhere in the code (e.g., LinearDiscriminantAnalysis) - thus the changes made are backwards compatible and no changes in functionality was made

Cons of this approach:

  • The error messages in _check_targets are more specific for classification metrics vs those in unique_labels. I added unique_labels right at the end of _check_targets, so all the current _check_targets will be raised in priority. The only difference is that for mixed str and int inputs, the error message from unique_labels is raised instead, but I have tried to improve this message.

Pros:

  • Lets us remove _union1d from _array_api.py as this was the only place this function was used
  • Passing y types to unique_labels reduces duplicate type_of_target calls (and thus duplicate unique calls - I know we cache unique but this is only relevant for numpy arrays, not necessarily other array API supported arrays)

I did also consider simply changing the logic in:

unique_values = _union1d(y_true, y_pred, xp)
except TypeError as e:
# We expect y_true and y_pred to be of the same data type.
# If `y_true` was provided to the classifier as strings,
# `y_pred` given by the classifier will also be encoded with
# strings. So we raise a meaningful error
raise TypeError(
"Labels in y_true and y_pred should be of the same type. "
f"Got y_true={xp.unique(y_true)} and "
f"y_pred={xp.unique(y_pred)}. Make sure that the "
"predictions provided by the classifier coincides with "
"the true labels."

such that list and non object numpy inputs are also handled, but went with the above approach. Happy to take other opinions though.

AI usage disclosure

I used AI assistance for:

  • Code generation (e.g., when writing an implementation or fixing a bug)
  • Test/benchmark generation
  • Documentation (including examples)
  • Research and understanding

Any other comments?

# all of length 3
EXAMPLES = [
(IND, np.array([[0, 1, 1], [1, 0, 0], [0, 0, 1]])),
(IND, np.array([[0, 1], [1, 0], [0, 0]])),

@lucyleeow lucyleeow Jan 15, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed because this test would now fail for the (IND, IND) combination (as the two IND have different sizes).

I checked and all other combinations involving IND would error (expected is None), thus I think it is safe to amend this.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now this example is basically the same as the next entry, so maybe we can remove this line?

if len(set(isinstance(label, str) for label in ys_labels)) > 1:
raise ValueError("Mix of label input types (string and number)")
msg_details = "Got " + " ".join([f"{xp.unique(y)}" for y in ys])
raise ValueError(f"Mix of label input types (string and number); {msg_details}")

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could raise a TypeError here, like what _union1d does, but I think it is better to keep as ValueError as its more of an input value problem as e.g., the predictions/y2 don't match true labels/y1. Also this makes this backwards compatible.

ys_types = set(type_of_target(x) for x in ys)
if ys_types == {"binary", "multiclass"}:
ys_types = {"multiclass"}
if ys_types is None:

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was just to avoid a duplicate type_of_target (and thus unique) call (again, we cache, but this is only relevant for arrays with metadata attr) - but happy to remove as this is a public function.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is that passing in ys_types is an optimisation right? Is it really such a slow down/speed up to do this?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I missed this. I did not do any benchmarking, and again since we do cache, it is only for specific array inputs where it would actually matter. Probably not a big difference overall, but it also did not seem like such a significant change to incorporate it...

@lucyleeow

Copy link
Copy Markdown
Member Author

Ping @ogrisel 🙏

@ogrisel ogrisel left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a problem with unique_labels when array API dispatch is enabled (even with NumPy only inputs). See below:

Comment thread sklearn/metrics/_classification.py Outdated
Comment thread sklearn/utils/multiclass.py
@ogrisel

ogrisel commented Jan 30, 2026

Copy link
Copy Markdown
Member

Labeling this as array API because the remaining problems are array API support related.

@ogrisel ogrisel moved this to In Progress in Array API Jan 30, 2026
@lucyleeow

Copy link
Copy Markdown
Member Author

Thanks @ogrisel ! Tests added! I think this is also a good reminder to add array API tests even for small functions that we convert in PRs, and not rely on the fact that it's tested as part of another function.

Comment thread doc/whats_new/upcoming_changes/sklearn.metrics/33086.fix.rst Outdated
Comment thread doc/whats_new/upcoming_changes/sklearn.metrics/33086.fix.rst Outdated
Comment thread sklearn/utils/tests/test_multiclass.py Outdated
Comment thread sklearn/utils/tests/test_multiclass.py Outdated
@lucyleeow

Copy link
Copy Markdown
Member Author

Gentle ping @ogrisel @lorentzenchr, I think this is ready for another review 🙏

@ogrisel ogrisel left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @lucyleeow. Just minor feedback but otherwise LGTM.

Comment thread sklearn/metrics/tests/test_common.py Outdated
Comment thread sklearn/utils/tests/test_multiclass.py Outdated
Comment thread sklearn/utils/multiclass.py Outdated
Comment thread sklearn/utils/tests/test_multiclass.py Outdated
Comment thread sklearn/utils/multiclass.py Outdated
@lucyleeow

Copy link
Copy Markdown
Member Author

Thanks @betatim ! I think this is ready for another look 🙏 thank you!

@lucyleeow

Copy link
Copy Markdown
Member Author

Gentle ping @betatim

Comment thread doc/whats_new/upcoming_changes/sklearn.metrics/33086.fix.rst Outdated
Comment thread sklearn/metrics/tests/test_common.py Outdated
Comment on lines +1975 to +1976
"input_kind",
["array_of_str_objects", "array_of_fixed_width_strings", "list_of_str_objects"],

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"input_kind",
["array_of_str_objects", "array_of_fixed_width_strings", "list_of_str_objects"],
"y_1",
[
np.array(["spam"] * 3 + ["eggs"] * 2, dtype=object), # str object
np.array(["spam"] * 3 + ["eggs"] * 2), # fixed width string
["spam"] * 3 + ["eggs"] * 2,
],

Could save the if else logic inside.

Comment on lines +160 to +162
if y_type == "binary":
if unique_labels_.shape[0] > 2:
y_type = "multiclass"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we rename the values of y_type?
I find the logic strange. y_type was detected to be binary but it has more than 2 unique values. Then it is a false positive detection - or just a bad name.

@lucyleeow lucyleeow May 2, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the old code and other instances where this type of code occurs e.g., label_binarize:

if y_type == "binary":
if n_classes == 1:
if sparse_output:
return _align_api_if_sparse(sp.csr_array((n_samples, 1), dtype=int))
else:
Y = xp.zeros((n_samples, 1), dtype=int_dtype_)
Y += neg_label
return Y
elif n_classes >= 3:
y_type = "multiclass"

I think it would be the (rare) edge case where there are actually >2 classes (e.g., a, b, c) , and y_pred contains 2 classes (e.g., a b) and y_true contains 2 classes but a different 2 classes (e.g., a, c).

@lorentzenchr

Copy link
Copy Markdown
Member

@lucyleeow Can I merge?

@lucyleeow

Copy link
Copy Markdown
Member Author

@lorentzenchr yes thanks!
I'll look into #33086 (comment) separately

@lorentzenchr lorentzenchr changed the title Improve checking for string and multilabel inputs for classification metrics ENH Improve checking for string and multilabel inputs for classification metrics May 1, 2026
@lorentzenchr lorentzenchr changed the title ENH Improve checking for string and multilabel inputs for classification metrics FIX Improve checking for string and multilabel inputs for classification metrics May 1, 2026
@lorentzenchr lorentzenchr merged commit 78d1885 into scikit-learn:main May 1, 2026
38 checks passed
@github-project-automation github-project-automation Bot moved this from In Progress to Done in Array API May 1, 2026
@lucyleeow lucyleeow deleted the str_int_metrics branch May 2, 2026 00:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

Mixed string/numeric when input is list for classification metrics

4 participants