Currently the CategoricalEncoder doesn't handle missing values (you get an error about unorderable types or about nan being an unknown category for numerical types).
This came up in an issue about imputing missing values for categoricals (#2888), but independently of whether we add such abilities to Imputer, we should also discuss how CategoricalEncoder itself could handle missing values in different ways.
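To make the "unorderable types" error concrete, here is a minimal sketch in plain Python. It does not call CategoricalEncoder and is not necessarily the exact code path inside the encoder; it only shows that sorting a mix of strings and np.nan fails on Python 3:

```python
import numpy as np

values = ["b", np.nan, "a"]  # string categories plus one missing value

try:
    sorted(values)
    sort_failed = False
except TypeError as exc:
    # On Python 3, str and float cannot be compared with "<"
    sort_failed = True
    print("sorting failed:", exc)
```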
Possible ways to deal with missing values (np.nan or None):
Raise an error when missing values are present:
I think this is still a good default (we should only make sure to provide a better error message than the one currently raised).
Ignore missing values (treat as unknown):
This would give a row of all zeros for dummy encoding, and would not be implemented for ordinal encoding.
In this way, it is similar in behaviour to unknown categories with handle_unknown='ignore', apart from the fact that it can also occur in the training data.
Regard missing value as a separate category
For ordinal encoding this would give an additional integer, for dummy encoding an additional column.
Something similar is available in pd.get_dummies if you specify the dummy_na=True keyword.
Implementation-wise, a problem that would occur is that if your categories consist of a couple of string values and a missing value (np.nan or None), they become unorderable, while in the CategoricalEncoder we normally sort the unique categories (as a possible solution, we could fall back in such a case to sorting the non-missing ones first and then adding np.nan at the end).
This would be similar to an indicator feature.
Preserve as NaN:
From the comment of @amueller (Improve Imputer 'most_frequent' strategy #2888 (comment)), I suppose the idea would be to first treat it as a separate category, but replace that category with NaN again before returning the result (so it can be imputed after encoding).
This might make sense only for ordinal encoding, unless we want a full row of NaNs for the dummy case.
This option could actually also be a way to deal with the "imputing categorical features" problem (see also next bullet), as it allows an easier and more flexible combination of encoding / imputing.
Impute missing values (e.g. with a 'most_frequent' option):
Personally I think this one should be left to Imputer itself, but adding it here instead could limit the scope of Imputer to numerical features.
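For the dummy-encoding case, the "ignore" and "separate category" options above can be illustrated with pd.get_dummies, which skips missing values by default and adds an extra column with dummy_na=True. This is only an illustration of the output shapes on a toy Series, not a proposal for how CategoricalEncoder would implement it:

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", np.nan, "a"])

# "Ignore": the missing value gets no column, so its row is all zeros
ignored = pd.get_dummies(s)

# "Separate category": an additional column marks the missing value
separate = pd.get_dummies(s, dummy_na=True)

print(ignored)
print(separate)
```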
Those options (or a subset of them) could be added as an additional keyword to the CategoricalEncoder. Possible names: handle_missing, handle_na, missing_values
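As a rough sketch of what such a keyword could look like for ordinal encoding: the function `ordinal_encode` and the option values `'error'`, `'category'` and `'preserve'` below are hypothetical names made up for illustration, not an existing scikit-learn API. It uses the fallback mentioned above of sorting only the non-missing categories and putting the missing one last:

```python
import numpy as np
import pandas as pd


def ordinal_encode(values, handle_missing="error"):
    """Hypothetical helper: ordinal-encode a 1D sequence of categories."""
    if handle_missing not in ("error", "category", "preserve"):
        raise ValueError("unknown handle_missing option: %r" % handle_missing)

    values = np.asarray(values, dtype=object)
    missing = pd.isnull(values)  # True for both np.nan and None

    if handle_missing == "error" and missing.any():
        raise ValueError("Input contains missing values (np.nan or None)")

    # Sort only the non-missing categories, so the result stays orderable
    categories = sorted(set(values[~missing]))
    mapping = {cat: code for code, cat in enumerate(categories)}

    # Start from NaN everywhere, then fill in codes for the known values
    codes = np.full(len(values), np.nan)
    codes[~missing] = [mapping[v] for v in values[~missing]]

    if handle_missing == "category":
        # Missing values get one additional integer, placed after the rest
        codes[missing] = len(categories)
    # With "preserve", missing entries simply stay NaN for later imputation

    return codes, categories


codes, categories = ordinal_encode(["b", None, "a", np.nan],
                                   handle_missing="category")
print(categories, codes)
```

With `handle_missing="preserve"` the same input keeps NaN in the missing positions, which is what would allow imputing after encoding.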
Related to discussions in #2888 and #9012 (comment)
Example notebook on a toy dataframe showing the current problem with missing data in categorical features: http://nbviewer.jupyter.org/gist/jorisvandenbossche/736cead26ab65116ff4de18015b0b324