Mixed string/numeric when input is list for classification metrics #33045

Description

@lucyleeow

Warning

This is not a good first (or second) issue to contribute. If you are interested in contributing to scikit-learn please have a look at our contributing doc and in particular the section Issues for new contributors.

Noticed while working on #32755

Classification metrics (CLASSIFICATION_METRICS from test_common.py; all of them are defined in sklearn/metrics/_classification.py, although not every metric in _classification.py is in this list) vary in how they handle mixed string and numeric inputs (e.g., y_true is string and y_pred is numeric).

  • precision_recall_fscore_support and friends (precision_score, recall_score, f1_score, fbeta_score and the related jaccard_score) - raise ValueError:
    • via _check_set_wise_labels, which calls unique_labels, which does not allow a "mix of string and integer labels"
    • note that these metrics do take a pos_label parameter, so theoretically when y_true is string, we could use y_true == pos_label to convert it to [1,0]. This would not make sense if y_pred was string.
  • confusion_matrix (and metrics that use it - balanced_accuracy_score, cohen_kappa_score) - raises ValueError ONLY when labels=None:
    • labels=None: calls unique_labels like above
    • labels is set: AFAICT no error, but the confusion matrix would contain only zeros. E.g., if y_true is string and labels is set to all possible unique values in y_true, we would do the following conversion:

if need_index_conversion:
    label_to_ind = {label: index for index, label in enumerate(labels)}
    y_pred = np.array([label_to_ind.get(label, n_labels + 1) for label in y_pred])
    y_true = np.array([label_to_ind.get(label, n_labels + 1) for label in y_true])

Since none of the string labels are present in the numeric y_pred, y_pred would be converted to an array consisting only of n_labels + 1 values.
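
To make the effect of this conversion concrete, here is a minimal plain-Python sketch. It is not scikit-learn's implementation: sketch_confusion_matrix is a hypothetical helper, it uses lists instead of NumPy arrays, and it assumes that pairs with an out-of-range index are simply dropped before counting.

```python
def sketch_confusion_matrix(y_true, y_pred, labels):
    """Hypothetical stand-in that mirrors the index conversion quoted above."""
    n_labels = len(labels)
    label_to_ind = {label: index for index, label in enumerate(labels)}
    # Labels that are not in `labels` are mapped to the out-of-range index n_labels + 1.
    true_idx = [label_to_ind.get(label, n_labels + 1) for label in y_true]
    pred_idx = [label_to_ind.get(label, n_labels + 1) for label in y_pred]
    matrix = [[0] * n_labels for _ in range(n_labels)]
    for t, p in zip(true_idx, pred_idx):
        # Assumption: only pairs where both indices are in range are counted.
        if t < n_labels and p < n_labels:
            matrix[t][p] += 1
    return matrix


labels = ["apple", "orange"]
y_true = ["apple", "orange", "apple"]

# Numeric predictions never match a string label, so nothing is counted.
mixed = sketch_confusion_matrix(y_true, [2, 3, 2], labels)

# String predictions are counted as usual.
matching = sketch_confusion_matrix(y_true, ["apple", "apple", "apple"], labels)
```

With mixed inputs every prediction falls outside the label range, so no cell is ever incremented and no error is raised.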

  • multilabel_confusion_matrix - always raises ValueError (again via unique_labels):
    • unlike confusion_matrix, even when labels is given, we call unique_labels(y_true, y_pred) to get all labels present, and if any are not provided in labels they are added.
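
All of the ValueError paths listed above go through unique_labels. The sketch below is a simplified plain-Python stand-in for the kind of check that rejects mixed label types; check_label_types is a hypothetical name, and both the logic and the error message are assumptions made for illustration rather than a copy of the library code.

```python
def check_label_types(*ys):
    """Hypothetical stand-in: collect labels and reject a string/number mix."""
    found = set()
    for y in ys:
        found.update(y)
    # If some labels are str and others are not, the set below has two elements.
    if len({isinstance(label, str) for label in found}) > 1:
        raise ValueError("mix of string and integer labels")
    return sorted(found)


# Same-typed inputs pass and yield the union of labels.
numeric_union = check_label_types([0, 1], [1, 2])

# Mixed inputs are rejected.
try:
    check_label_types(["apple", "orange"], [2, 3])
    mixed_rejected = False
except ValueError:
    mixed_rejected = True
```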

The following all DO NOT error:

  • accuracy_score - no error but result would always be 0 (as score = y_true == y_pred would always be 0). Note not relevant for multilabel cases as input needs to be label indicator matrix.
  • hamming_loss - no error but result would always be 1. Again not relevant for multilabel.
  • zero_one_loss - no error but result would always be 1 (when normalize=True; with normalize=False it would be the number of samples). Again not relevant for multilabel.
  • matthews_corrcoef - no error but result would always be 0. Note that we transform y_true and y_pred using LabelEncoder fit on y_true and y_pred concat (meaning y_true and y_pred will be numbers, but not contain any numbers in common), before we pass to confusion_matrix.
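
The "worst value" results for accuracy_score, hamming_loss and zero_one_loss follow from element-wise equality never holding between a string and a number. A minimal plain-Python sketch of that reasoning (not the scikit-learn code path; the variable names are illustrative):

```python
y_true = ["apple", "orange", "apple"]
y_pred = [2, 3, 2]

# A string label never equals a numeric label, so every comparison is False.
matches = [t == p for t, p in zip(y_true, y_pred)]

accuracy = sum(matches) / len(matches)          # fraction of exact matches
error_fraction = 1 - accuracy                   # normalized misclassification rate
error_count = len(matches) - sum(matches)       # unnormalized misclassification count
```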

For these classification metrics, which require y_pred to be thresholded predictions, I don't think it makes sense to have mixed string and numeric inputs (e.g., how would you match ['apple', 'orange', 'apple'] to [2,3,2]?). Indeed, for the metrics that do not error, the result is the 'worst' value.
For metrics that accept pos_label the way forward is less clear. Note that for several continuous classification metrics we do use something like y_true == pos_label, thus allowing y_true to be string.
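
As a rough illustration of the y_true == pos_label idea, the sketch below binarizes a string y_true and scores it against a numeric y_pred. This is a hypothetical possibility, not current behavior: binarize_by_pos_label is an invented helper, and the sketch assumes that y_pred is already encoded with 1 meaning the positive class.

```python
def binarize_by_pos_label(y, pos_label):
    """Hypothetical helper: 1 where the label equals pos_label, else 0."""
    return [int(label == pos_label) for label in y]


y_true = ["apple", "orange", "apple", "orange"]
y_pred = [1, 0, 0, 0]  # assumption: 1 already denotes the positive class

y_true_bin = binarize_by_pos_label(y_true, pos_label="apple")

# Count true positives, false positives and false negatives on the binary encoding.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true_bin, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true_bin, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true_bin, y_pred))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
```

The conversion only resolves the mismatch when exactly one side is string and the other side's encoding of "positive" is known, which is why the way forward is less clear here.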

Note that I will open a separate issue around this for continuous classification metrics (i.e. those in _ranking.py).

cc @ogrisel
