[MRG] Add support for infrequent categories in OneHotEncoder and OrdinalEncoder#13833
[MRG] Add support for infrequent categories in OneHotEncoder and OrdinalEncoder#13833NicolasHug wants to merge 10 commits intoscikit-learn:masterfrom
Conversation
jnothman
left a comment
There was a problem hiding this comment.
Is it appropriate to just add the infrequent categories to those dropped (in OHE at least)? Or is there some fundamental difference in their handling that I'm not considering now?
|
|
||
| from .base import _transform_selected | ||
| from .label import _encode, _encode_check_unknown | ||
| from .label import _encode, _encode_check_unknown, _encode_numpy |
There was a problem hiding this comment.
can't we just use _encode rather than _encode_numpy?
| be dropped for each feature. None if all the transformed features will | ||
| be retained. | ||
|
|
||
| infrequent_indices_: list of arrays of shape(n_infrequent_categories) |
There was a problem hiding this comment.
| infrequent_indices_: list of arrays of shape(n_infrequent_categories) | |
| infrequent_indices_ : list of arrays of shape (n_infrequent_categories,) |
| be retained. | ||
|
|
||
| infrequent_indices_: list of arrays of shape(n_infrequent_categories) | ||
| ``infrequent_indices_[i]`` contains a list of indices in |
There was a problem hiding this comment.
| ``infrequent_indices_[i]`` contains a list of indices in | |
| ``infrequent_indices_[i]`` contains an array of indices in |
sklearn/preprocessing/_encoders.py
Outdated
| self.drop_idx_ = self._compute_drop_idx() | ||
|
|
||
| # check if user wants to manually drop a feature that is | ||
| # infrequent: this is not allowed |
There was a problem hiding this comment.
it doesn't seem natural to drop some features that are infrequent, and not drop some.
it would also make the code even more complex.
I don't think it is appropriate. with drop=first, that would result in treating the first category as infrequent. |
|
Minor feedback: wouldn't it be more appropriate to always bucket the infrequent categories in 0? This way, regardless of the cardinality of a categorical variable, we always know in which bucket the infrequent categories end up. |
|
implementation wise, we would still need to remap the frequent categories anyway as for user interface: then drop='first' becomes ambiguous |
|
can you fix the merge conflicts? |
|
We do not have a proper place to put the infrequent category for OrdinalEncoder. No matter where we put it, it doesn’t keep lexicon ordering of the original features. |
|
Can you please elaborate Thomas? |
|
I have clothing size categories (which have a natural order): XS, S, M, L, ReallyLarge. Currently, our OrdinalEncoder will map: Lets say we are okay with this, now let ReallyLarge and XS be "infrequent", what integer value do we encode the infrequent category? There are been ideas to do: Or "Infrequent" itself does not have an ordering when compared to the other categories since it is a combination of categories. |
|
Yes, in the context of truly ordinal categories (whether or not we are
supporting lexical ordering) it doesn't make a lot of sense. Absorbing that
category into a neighbour is more appropriate.
|
…frequent_categories
|
Conflicts should be resolved @amueller |
jnothman
left a comment
There was a problem hiding this comment.
I'm a bit confused by the combination of the unchecked TODO list and the "MRG" designation. Can you clarify what work you still intend to do here?
| be dropped for each feature. None if all the transformed features will | ||
| be retained. | ||
|
|
||
| infrequent_indices_: list of arrays of shape(n_infrequent_categories) |
There was a problem hiding this comment.
| infrequent_indices_: list of arrays of shape(n_infrequent_categories) | |
| infrequent_indices_ : list of arrays of shape (n_infrequent_categories,) |
|
The TODO list is up to date. Given that working on encoders is a) extremely complex and b) subject to never-ending discussions (#12893), I want to make sure I have initial approval on the design and features of the PR before putting more work into it. |
jnothman
left a comment
There was a problem hiding this comment.
There don't seem to be parameter docstrings for max_levels. Makes it hard to review that it's correct.
|
I added a docstring @jnothman . The general expected features are illustrated in |
|
Per #14563 (comment), OrdinalEncoder should provide options for representing dropped categories including "merge-down", "merge-up", "extra" (better names welcome). |
|
I want to wait for #14972 before getting back to this. The code is way too complicated IMHO. |
|
Superseded |
Fixes: #12153
Closes #12264
Not strictly MRG but I need input.
This PR adds a
max_levelsparameter to OneHotEncoder and OrdinalEncoder.Categories with the last amount of support are considered 'infrequent' and are mapped into a specific column.
TODO:
min_frequencyparameterhandle_unkownto address Handle Error Policy in OrdinalEncoder #13488 (comment)ping @jnothman @ogrisel @amueller @thomasjpfan, please LMK if this is going in the right direction.
(You can look at
test_infrequent_categories_sanity()for some sort of guide to this new feature.)