ENH Add 'if_binary' option to drop argument of OneHotEncoder #16245
rth merged 27 commits into scikit-learn:master
Conversation
glemaitre
left a comment
Thanks. It looks pretty good.
A bit related to categories="dtype" in #15396. I could imagine the alternative being to make this a breaking change in 1.0. WDYT? cc @thomasjpfan

Do you mean ...

Yes, never mind my comment for this PR, particularly since this doesn't add a new parameter. The comment was more about how to make this the default in the long term.

@thomasjpfan would be great if you could have a look ;)
rth
left a comment
Thanks for the PR @rushabh-v !
So this currently drops the first column for any input with 2 classes, not just binary:

>>> OneHotEncoder(drop=None).fit_transform([['a'], ['b'], ['a']]).A
array([[1., 0.],
       [0., 1.],
       [1., 0.]])
>>> OneHotEncoder(drop='if_binary').fit_transform([['a'], ['b'], ['a']]).A
array([[0.],
       [1.],
       [0.]])

I would expect this to only take effect for actually binary columns, not any input with 2 classes. You could add it as a counter example in unit tests.
Enabling it with drop='if_binary' only when the column is of dtype bool might be easier. For int/float I suppose one could apply it if np.unique(X[col]) is [0, 1], but it's a bit more expensive. Let's wait for input from other reviewers before making changes.
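The dtype/value check suggested here could look something like the following sketch. The helper name is hypothetical and just illustrates the rule: bool columns pass, and numeric columns pass only when their values are drawn from {0, 1}.

```python
import numpy as np

def is_strictly_binary(col):
    # A column is "strictly binary" in this sense if it is boolean,
    # or numeric with values drawn only from {0, 1}.
    col = np.asarray(col)
    if col.dtype == bool:
        return True
    if np.issubdtype(col.dtype, np.number):
        return set(np.unique(col)) <= {0, 1}
    return False

print(is_strictly_binary([0, 1, 1, 0]))      # numeric {0, 1} -> True
print(is_strictly_binary(['a', 'b', 'a']))   # two classes, but not binary -> False
```

As noted above, the `np.unique` call makes this more expensive than simply counting categories, which is part of the trade-off being discussed.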
@rth how would you tag a feature as "binary"? I would somehow assume that a feature with 2 categories should be close to a binary feature, isn't it?
I think here we are assuming that two categories means binary. The point of
this is that as long as there are only two categories, encoding both is
merely redundant.
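The redundancy point can be checked numerically: for a two-category feature the two one-hot columns always sum to one, so either column fully determines the other (a minimal sketch using numpy and scikit-learn):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = [['yes'], ['no'], ['yes'], ['yes']]
full = OneHotEncoder().fit_transform(X).toarray()
# The two columns are complementary: keeping one loses no information.
print(np.array_equal(full[:, 0], 1 - full[:, 1]))  # True
```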
OK that also works. Though, I think binary features have a very specific meaning: either a column with binary dtype or a column with [0, 1]. A column with ["yes", "no"] or ["true", "false"] can be considered binary, but IMO it's up to the initial preprocessing to convert it to binary (also I have rarely seen this in practice). It might be best to avoid that vocabulary in this PR. Say if you have a column with countries ... It might be OK to keep the name
(since this PR also drops columns with 1 category I think). WDYT?

No, it doesn't drop columns with 1 category.
Thanks for the confirmation. It is currently behaving strangely in that case though, which probably needs fixing:

>>> OneHotEncoder().fit_transform([['a'], ['a'], ['a']]).A
array([[1.],
       [1.],
       [1.]])
>>> OneHotEncoder(drop='if_binary').fit_transform([['a'], ['a'], ['a']]).A
array([[0.],
       [0.],
       [0.]])

In general just to clarify: implementation wise, is this expected to behave identically to ...
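For the record, the behavior that eventually shipped leaves single-category features intact under 'if_binary' (as I read the final OneHotEncoder documentation; worth double-checking against your installed scikit-learn version):

```python
from sklearn.preprocessing import OneHotEncoder

X = [['a'], ['a'], ['a']]
# One category is not "binary": the column is kept, not zeroed out.
Xt = OneHotEncoder(drop='if_binary').fit_transform(X).toarray()
print(Xt)
```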
In every case when there is a binary feature, ... But IMO, for features with one category ...
The above example with drop='first':

>>> OneHotEncoder(drop='first').fit_transform([['a'], ['a'], ['a']]).A
array([], shape=(3, 0), dtype=float64)

In terms of implementation I think we might want to change the signature of ...

>>> est = OneHotEncoder(drop='if_binary')
>>> est.fit(np.atleast_2d(['a', 'b', 'c']))
OneHotEncoder(categories='auto', drop='if_binary',
              dtype=<class 'numpy.float64'>, handle_unknown='error',
              sparse=True)
>>> est.drop_idx_
array([0, 0, 0])

In general we should compute ...
And even if we change the logic of ...
Yes, it may have some additional cost, but basically the logic for drop should be ... so we cannot fix drop_idx_ in transform, as the determination of "binary" features needs to happen based on train data in fit.
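The fit-time rule described above could be sketched as a standalone helper (hypothetical code mirroring the idea, not the PR's actual implementation):

```python
import numpy as np

def compute_drop_idx(categories, drop):
    """Per-feature index of the category to drop, or None to keep all."""
    if drop == 'first':
        return np.array([0] * len(categories), dtype=object)
    if drop == 'if_binary':
        # "Binary" is decided from the categories seen during fit,
        # which is why this cannot be deferred to transform.
        return np.array([0 if len(cats) == 2 else None
                         for cats in categories], dtype=object)
    return None

cats = [np.array(['a', 'b']), np.array(['x', 'y', 'z'])]
print(compute_drop_idx(cats, 'if_binary'))  # [0 None]
```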
What I meant was that we can edit the ... And in case of ... correct me if I am wrong!
Yes, that is what I mean. Something like, in fit:

self.drop_idx_ = self._compute_drop_idx(self.categories)

with

def _compute_drop_idx(self, categories):
    ...
    elif (isinstance(self.drop, str) and self.drop == 'first'):
        return np.array([0 if len(col_cats) > 2 else None
                         for col_cats in categories], dtype=object)

and then in ...
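For comparison, the drop_idx_ attribute that eventually shipped is an object array holding None for features where nothing is dropped (assuming a recent scikit-learn release):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([['a', 'x'], ['b', 'y'], ['a', 'z']])
enc = OneHotEncoder(drop='if_binary').fit(X)
# First feature is binary -> category index 0 dropped;
# second feature has 3 categories -> None (nothing dropped).
print(enc.drop_idx_)
```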
So right now I am considering implementing it with an object array just to check if it works fine or not. Then we will change it if ints are required.

I'm still getting a 500 from codecov, not sure what's going on. Not merging just yet in case @thomasjpfan wants to have a look?
thomasjpfan
left a comment
inverse_transform needs to be updated and tested with the new 'if_binary' option.
Another small decision: if we look across the codebase, do we tend to use underscores, spaces, hyphens or other separators for words in param values? Should it be ...

+1 for 'if_binary'

@thomasjpfan I have updated the ...
thomasjpfan
left a comment
I am happy with if_binary as the keyword option.
def test_one_hot_encoder_inverse_if_binary():
    X = [['Male', 1],
We can make the input a numpy array:

X = np.array([['Male', 1],
              ['Female', 3],
              ['Female', 2]])

and then

assert_array_equal(ohe.inverse_transform(X_tr), X)
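A self-contained version of the suggested test might look like this. The names follow the snippet above; the integer column is written as strings because a mixed numpy array would be coerced to strings anyway:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

def test_one_hot_encoder_inverse_if_binary():
    X = np.array([['Male', '1'],
                  ['Female', '3'],
                  ['Female', '2']])
    ohe = OneHotEncoder(drop='if_binary')
    X_tr = ohe.fit_transform(X)
    # inverse_transform must reconstruct the dropped binary category.
    assert np.array_equal(ohe.inverse_transform(X_tr), X)

test_one_hot_encoder_inverse_if_binary()
```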
sklearn/preprocessing/_encoders.py
else:
    n_transformed_features = sum(len(cats) - 1
                                 for cats in self.categories_)
if isinstance(self.drop, str) and self.drop == 'if_binary':
Please use elif if it is appropriate here
why it says ...
Co-Authored-By: Thomas J Fan <thomasjpfan@gmail.com>
All green now!

Thanks @rushabh-v and all reviewers!
Fixes #12502