Handling of missing values in the CategoricalEncoder

Currently the CategoricalEncoder doesn't handle missing values (you get an error about unorderable types or about nan being an unknown category for numerical types). 
This came up in an issue about imputing missing values for categoricals (https://github.com/scikit-learn/scikit-learn/issues/2888), but independently of whether we add such abilities to Imputer, we should also discuss how CategoricalEncoder itself could handle missing values in different ways.
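
The "unorderable types" error can be reproduced without the encoder. A minimal sketch, assuming the failure comes from sorting the unique values of a column that mixes strings with `np.nan` or `None` (the helper name `try_sort_categories` is made up for illustration and is not part of scikit-learn):

```python
def try_sort_categories(values):
    """Sort the unique values; return the sorted list, or the TypeError text."""
    try:
        return sorted(set(values))
    except TypeError as exc:
        return "TypeError: {}".format(exc)


# Strings only: sorting the unique categories works.
print(try_sort_categories(["b", "a", "b"]))
# A float NaN next to strings cannot be ordered in Python 3.
print(try_sort_categories(["b", float("nan"), "a"]))
# The same holds for None next to strings.
print(try_sort_categories(["b", None, "a"]))
```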

Possible ways to deal with missing values (np.nan or None):

* Raise an error when missing values are present:
    * This is still a good default I think (we should only make sure to provide a better error message than the one that is currently raised).
* Ignore missing values (treat as unknown):
    * This would give a row of all zeros for dummy encoding, and would not be implemented for ordinal encoding.
    * In this way, it behaves like unknown categories with `handle_unknown='ignore'`, except that it can also occur in the training data.
* Regard missing values as a separate category:
    * For ordinal encoding this would give an additional integer, for dummy encoding an additional column.
    * Something similar is available in `pd.get_dummies` if you specify the `dummy_na=True` keyword.
    * Implementation-wise, a problem is that if your categories consist of a couple of string values and a missing value (np.nan or None), they become unorderable, while in the CategoricalEncoder we normally sort the unique categories. As a possible solution, we could fall back in such a case to sorting the non-missing ones first and then adding np.nan at the end.
    * This would be similar to an indicator feature.
* Preserve as NaN:
    * From a comment by @amueller (https://github.com/scikit-learn/scikit-learn/issues/2888#issuecomment-353128945). I suppose the idea would be to first treat it as a separate category and then, before returning the result, replace that category with NaN again (so it can be imputed after encoding).
    * This might make sense only for ordinal encoding, unless we want a full row of NaNs for the dummy case.
      This option could actually also be a way to deal with the "imputing categorical features" problem (see also next bullet), as it allows an easier and more flexible combination of encoding / imputing.
* Impute missing values (e.g. with a 'most_frequent' option):
    * Personally I think this one should be left to `Imputer` itself, but adding it here instead could limit the scope of `Imputer` to numerical features.
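
To make the "ignore" and "separate category" options concrete for dummy encoding, here is a rough sketch on a single column. It is not the CategoricalEncoder implementation: `is_missing`, `sorted_categories` and `one_hot`, as well as the option names `"ignore"` and `"category"`, are hypothetical. It also uses the sorting fallback mentioned above (sort the non-missing categories, then append np.nan at the end).

```python
import numpy as np


def is_missing(value):
    """True for None and float NaN, the two missing markers discussed here."""
    return value is None or (isinstance(value, float) and np.isnan(value))


def sorted_categories(values, include_missing=False):
    """Sort the non-missing unique values; optionally append NaN at the end."""
    categories = sorted({v for v in values if not is_missing(v)})
    if include_missing and any(is_missing(v) for v in values):
        categories.append(np.nan)
    return categories


def one_hot(values, handle_missing="ignore"):
    """Hypothetical dummy encoding of one column."""
    as_category = handle_missing == "category"
    categories = sorted_categories(values, include_missing=as_category)
    n_known = sum(not is_missing(c) for c in categories)
    out = np.zeros((len(values), len(categories)), dtype=int)
    for row, value in enumerate(values):
        if is_missing(value):
            if as_category:
                out[row, n_known] = 1  # extra last column for missing values
            # "ignore": the row stays all zeros
        else:
            out[row, categories.index(value)] = 1
    return categories, out


values = ["b", np.nan, "a", None]
print(one_hot(values, handle_missing="ignore")[1])
print(one_hot(values, handle_missing="category")[1])
```

With "ignore" the two missing rows are all zeros; with "category" they get a 1 in the additional last column.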


Those options (or a subset of them) could be added as an additional keyword to the CategoricalEncoder. Possible names: `handle_missing`, `handle_na`, `missing_values`
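
As an illustration of how such a keyword could behave for ordinal encoding, here is a minimal sketch. Everything in it is an assumption for discussion purposes: the function `ordinal_encode`, the keyword name `handle_missing`, and the option values `"error"`, `"category"` and `"preserve"` are not existing scikit-learn API.

```python
import numpy as np


def is_missing(value):
    """True for None and float NaN."""
    return value is None or (isinstance(value, float) and np.isnan(value))


def ordinal_encode(values, handle_missing="error"):
    """Hypothetical ordinal encoding of one column with a handle_missing option."""
    if handle_missing not in ("error", "category", "preserve"):
        raise ValueError("unsupported handle_missing: {!r}".format(handle_missing))
    missing = [is_missing(v) for v in values]
    if handle_missing == "error" and any(missing):
        # A clearer message than the "unorderable types" error.
        raise ValueError(
            "Input contains missing values (NaN or None); "
            "choose a handle_missing option other than 'error'."
        )
    categories = sorted({v for v, m in zip(values, missing) if not m})
    codes = {cat: code for code, cat in enumerate(categories)}
    # Float output so that NaN can be stored for "preserve".
    out = np.empty(len(values), dtype=float)
    for i, (value, is_na) in enumerate(zip(values, missing)):
        if not is_na:
            out[i] = codes[value]
        elif handle_missing == "category":
            out[i] = len(categories)  # one extra integer after the known ones
        else:  # "preserve": keep NaN so it can be imputed after encoding
            out[i] = np.nan
    return out


print(ordinal_encode(["b", None, "a"], handle_missing="category"))
print(ordinal_encode(["b", None, "a"], handle_missing="preserve"))
```

The "ignore" option is left out here because, as noted above, it would not be implemented for ordinal encoding.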

Related to discussions in https://github.com/scikit-learn/scikit-learn/issues/2888 and https://github.com/scikit-learn/scikit-learn/pull/9012#issuecomment-352207204

Example notebook on a toy dataframe showing the current problem with missing data in categorical features: http://nbviewer.jupyter.org/gist/jorisvandenbossche/736cead26ab65116ff4de18015b0b324

Handling of missing values in the CategoricalEncoder #10465
