Better error message when passing un-sortable data to the Encoders #12621

@readyready15728

Description

When an unexpected or unusable value appears in the input, scikit-learn's error handling leaves something to be desired: the resulting errors are cryptic and confusing.

Steps/Code to Reproduce

Example:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Simulate missing value in data
feature_with_missing_value = pd.Series([1, 2, 3, '?', 42, 69])
LabelEncoder().fit_transform(feature_with_missing_value.values)

Something similar happened when I absentmindedly concatenated two CSV files without removing the second file's header, then tried to use LabelEncoder with that header row sandwiched in the middle of the data.

I believe I have also run into this issue with pipelines and missing values, but that was a while ago and I eventually figured out what was happening, so unfortunately I can't reproduce that error.

Expected Results

A reasonable error message that addresses the issue in a clear and direct manner. Here is an example of what that would look like:

https://www.kaggle.com/c/titanic/discussion/26976

Error in predict.randomForest(rf, extractFeatures(test)) : missing values in newdata

Because scikit-learn's algorithms currently only accept numerical input (AFAIK), any non-numerical data should be treated as missing values or otherwise seen as aberrant.
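To make the request concrete, here is a rough sketch of the kind of check that could run before the data is sorted. It is not scikit-learn code: the function name `check_uniform_types`, its `name` parameter, and the wording of the message are all hypothetical, and it only handles plain Python objects such as those in an object-dtype array.

```python
def check_uniform_types(values, name="input"):
    """Hypothetical helper (not part of scikit-learn).

    Raises a readable TypeError when `values` mixes strings with other
    types, which is the situation that makes sorting fail on Python 3.
    """
    # Remember where each type was first seen: type name -> (index, value).
    first_seen = {}
    for index, value in enumerate(values):
        first_seen.setdefault(type(value).__name__, (index, value))

    if "str" in first_seen and len(first_seen) > 1:
        details = ", ".join(
            "{} first seen at index {} ({!r})".format(type_name, index, value)
            for type_name, (index, value) in sorted(first_seen.items())
        )
        raise TypeError(
            "{} mixes strings with other types and cannot be sorted: {}. "
            "Clean or convert the data to a single type before "
            "encoding.".format(name, details)
        )
    return values
```

Called on the list from the reproduction, `[1, 2, 3, '?', 42, 69]`, this raises a TypeError that names index 3 and the value `'?'`, which points straight at the offending entry.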

Actual Results

Traceback (most recent call last):
  File "poop.py", line 5, in <module>
    LabelEncoder().fit_transform(feature_with_missing_value)
  File "/usr/local/lib/python3.5/dist-packages/sklearn/preprocessing/label.py", line 112, in fit_transform
    self.classes_, y = np.unique(y, return_inverse=True)
  File "/usr/local/lib/python3.5/dist-packages/numpy/lib/arraysetops.py", line 223, in unique
    return _unique1d(ar, return_index, return_inverse, return_counts)
  File "/usr/local/lib/python3.5/dist-packages/numpy/lib/arraysetops.py", line 280, in _unique1d
    perm = ar.argsort(kind='mergesort' if return_index else 'quicksort')
TypeError: unorderable types: str() < int()
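The traceback shows the failure coming from the sort inside `np.unique`. The same root cause can be shown with plain Python, since Python 3 refuses to order a `str` against an `int`. The sketch below illustrates this and one possible workaround, converting every value to a string first. Whether string labels are acceptable is an assumption about the use case, and the helper `try_sort` is made up for illustration.

```python
values = [1, 2, 3, '?', 42, 69]


def try_sort(items):
    """Illustrative helper: return the sorted list, or the error text on failure."""
    try:
        return sorted(items)
    except TypeError as exc:
        # The exact wording of the message depends on the Python version.
        return "TypeError: {}".format(exc)


# Mixed int/str values cannot be ordered on Python 3.
mixed_result = try_sort(values)

# Converting everything to str makes the values comparable again,
# at the cost of turning the numeric labels into strings.
as_strings = try_sort([str(v) for v in values])

print(mixed_result)
print(as_strings)
```

Note that the string workaround only hides the symptom: the `'?'` placeholder is still encoded as if it were a legitimate class.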

Versions

Home:

Linux-4.4.0-137-generic-i686-with-Ubuntu-16.04-xenial
Python 3.5.2 (default, Nov 23 2017, 16:37:01) 
[GCC 5.4.0 20160609]
NumPy 1.14.2
SciPy 1.0.1
Scikit-Learn 0.19.1

Google Cloud:

System
------
   machine: Linux-4.14.33+-x86_64-with-debian-9.5
    python: 3.5.3 (default, Sep 27 2018, 17:25:39)  [GCC 6.3.0 20170516]
executable: /usr/bin/python3
BLAS
----
  lib_dirs:
    macros:
cblas_libs: cblas
Python deps
-----------
setuptools: 40.6.2
     scipy: 1.1.0
       pip: 9.0.1
    Cython: None
     numpy: 1.15.4
    pandas: 0.23.4
   sklearn: 0.20.0
