Better error message when passing un-sortable data to the Encoders #12621
Description
scikit-learn's error handling when an unexpected or unusable value appears in the input leaves something to be desired: the resulting errors are cryptic and confusing.
Steps/Code to Reproduce
Example:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Simulate missing value in data
feature_with_missing_value = pd.Series([1, 2, 3, '?', 42, 69])
LabelEncoder().fit_transform(feature_with_missing_value.values)
Something similar also happened when I absentmindedly concatenated two CSV files without ensuring that only one header appeared at the top of the file, then attempted to use LabelEncoder with the second header sandwiched in the middle.
I believe I have also hit this issue with pipelines and missing values, but that was a while ago and I eventually figured out what was happening, so unfortunately I can't reproduce that error.
Expected Results
A reasonable error message that addresses the issue in a clear and direct manner. Here is an example of what that would look like:
https://www.kaggle.com/c/titanic/discussion/26976
Error in predict.randomForest(rf, extractFeatures(test)) : missing values in newdata
Because scikit-learn's algorithms currently only accept numerical input (AFAIK), any non-numerical data should be treated as missing values or otherwise seen as aberrant.
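To make the request concrete, here is a minimal sketch of the kind of check that could produce a clearer message. It is pure Python and only an illustration: the function name `check_uniform_types` and the wording of the message are my own assumptions, not anything that exists in scikit-learn.

```python
def check_uniform_types(values):
    """Raise a descriptive TypeError when `values` mixes types that cannot be sorted.

    Hypothetical helper for illustration only; not part of scikit-learn.
    """
    values = list(values)
    try:
        # The encoder sorts the unique values, so sorting is the operation to guard.
        sorted(values)
    except TypeError as exc:
        # Report the type names involved instead of the bare comparison failure.
        type_names = sorted({type(v).__name__ for v in values})
        raise TypeError(
            "Encoder input must be uniformly strings or numbers; "
            "got mixed types: {}".format(type_names)
        ) from exc
    return values


try:
    check_uniform_types([1, 2, 3, '?', 42, 69])
except TypeError as err:
    message = str(err)
    print(message)
```

Uniformly typed input such as `[3, 1, 2]` or `['b', 'a']` passes through unchanged, so a check like this would only affect the failing case.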
Actual Results
Traceback (most recent call last):
  File "poop.py", line 5, in <module>
    LabelEncoder().fit_transform(feature_with_missing_value)
  File "/usr/local/lib/python3.5/dist-packages/sklearn/preprocessing/label.py", line 112, in fit_transform
    self.classes_, y = np.unique(y, return_inverse=True)
  File "/usr/local/lib/python3.5/dist-packages/numpy/lib/arraysetops.py", line 223, in unique
    return _unique1d(ar, return_index, return_inverse, return_counts)
  File "/usr/local/lib/python3.5/dist-packages/numpy/lib/arraysetops.py", line 280, in _unique1d
    perm = ar.argsort(kind='mergesort' if return_index else 'quicksort')
TypeError: unorderable types: str() < int()
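The traceback above does not say which entries are at fault, which is what made the stray CSV header hard to spot. As a rough sketch of the diagnostic I ended up wanting, the snippet below groups positions by type name. `positions_by_type` is a hypothetical helper written for this report, not an existing scikit-learn or pandas function.

```python
from collections import defaultdict


def positions_by_type(values):
    """Map each type name found in `values` to the positions where it occurs.

    Hypothetical helper for illustration only; not part of scikit-learn or pandas.
    """
    positions = defaultdict(list)
    for index, value in enumerate(values):
        positions[type(value).__name__].append(index)
    return dict(positions)


# Same data as in the reproduction example, as a plain list.
report = positions_by_type([1, 2, 3, '?', 42, 69])
print(report)
```

For the example data this puts the lone `'?'` under `str` at position 3, with every other position under `int`, which points straight at the offending entry.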
Versions
Home:
Linux-4.4.0-137-generic-i686-with-Ubuntu-16.04-xenial
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609]
NumPy 1.14.2
SciPy 1.0.1
Scikit-Learn 0.19.1
Google Cloud:
System
------
machine: Linux-4.14.33+-x86_64-with-debian-9.5
python: 3.5.3 (default, Sep 27 2018, 17:25:39) [GCC 6.3.0 20170516]
executable: /usr/bin/python3
BLAS
----
lib_dirs:
macros:
cblas_libs: cblas
Python deps
-----------
setuptools: 40.6.2
scipy: 1.1.0
pip: 9.0.1
Cython: None
numpy: 1.15.4
pandas: 0.23.4
sklearn: 0.20.0