ENH Add 'if_binary' option to drop argument of OneHotEncoder #16245
rth merged 27 commits into scikit-learn:master
Conversation
glemaitre
left a comment
Thanks. It looks pretty good.
A bit related to categories="dtype" in #15396. I could imagine the alternative being to make this a breaking change in 1.0. WDYT? cc @thomasjpfan

Do you mean ...

Yes, never mind my comment for this PR, particularly since this doesn't add a new parameter. The comment was more about how to make this the default in the long term.

@thomasjpfan would be great if you could have a look ;)
rth
left a comment
Thanks for the PR @rushabh-v !
So this currently drops the first column for any input with 2 classes, not just binary:

>>> OneHotEncoder(drop=None).fit_transform([['a'], ['b'], ['a']]).A
array([[1., 0.],
       [0., 1.],
       [1., 0.]])
>>> OneHotEncoder(drop='if_binary').fit_transform([['a'], ['b'], ['a']]).A
array([[0.],
       [1.],
       [0.]])

I would expect this to only take effect for actually binary columns, not any input with 2 classes. You could add it as a counter example in unit tests.
Enabling it with drop='if_binary' only when the column is of dtype bool might be easier. For int/float I suppose one could apply it if np.unique(X[col]) is [0, 1], but it's a bit more expensive. Let's wait for input from other reviewers before making changes.
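The dtype/value check suggested here could look something like the following sketch. The helper name is hypothetical and just illustrates the rule: bool columns pass, and numeric columns pass only when their values are drawn from {0, 1}.

```python
import numpy as np

def is_strictly_binary(col):
    # A column is "strictly binary" in this sense if it is boolean,
    # or numeric with values drawn only from {0, 1}.
    col = np.asarray(col)
    if col.dtype == bool:
        return True
    if np.issubdtype(col.dtype, np.number):
        return set(np.unique(col)) <= {0, 1}
    return False

print(is_strictly_binary([0, 1, 1, 0]))      # numeric {0, 1} -> True
print(is_strictly_binary(['a', 'b', 'a']))   # two classes, but not binary -> False
```

As noted above, the `np.unique` call makes this more expensive than simply counting categories, which is part of the trade-off being discussed.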
@rth how would you tag a feature as "binary"? I would somehow assume that a feature with 2 categories should be close to a binary feature, isn't it?
I think here we are assuming that two categories means binary. The point of
this is that as long as there are only two categories, encoding both is
merely redundant.
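The redundancy point can be checked numerically: for a two-category feature the two one-hot columns always sum to one, so either column fully determines the other (a minimal sketch using numpy and scikit-learn):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = [['yes'], ['no'], ['yes'], ['yes']]
full = OneHotEncoder().fit_transform(X).toarray()
# The two columns are complementary: keeping one loses no information.
print(np.array_equal(full[:, 0], 1 - full[:, 1]))  # True
```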
OK that also works. Though, I think binary features have a very specific meaning: either a column with binary dtype or a column with [0, 1]. A column with ["yes", "no"] or ["true", "false"] can be considered binary, but IMO it's up to the initial preprocessing to convert it to binary (also I have rarely seen this in practice). It might be best to avoid that vocabulary in this PR. Say if you have a column with countries ... It might be OK to keep the name
(since this PR also drops columns with 1 category I think). WDYT?

No, it doesn't drop columns with 1 category.
Thanks for the confirmation. It is currently behaving strangely in that case though, which probably needs fixing:

>>> OneHotEncoder().fit_transform([['a'], ['a'], ['a']]).A
array([[1.],
       [1.],
       [1.]])
>>> OneHotEncoder(drop='if_binary').fit_transform([['a'], ['a'], ['a']]).A
array([[0.],
       [0.],
       [0.]])

In general just to clarify: implementation wise, is this expected to behave identically to ...
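For the record, the behavior that eventually shipped leaves single-category features intact under 'if_binary' (as I read the final OneHotEncoder documentation; worth double-checking against your installed scikit-learn version):

```python
from sklearn.preprocessing import OneHotEncoder

X = [['a'], ['a'], ['a']]
# One category is not "binary": the column is kept, not zeroed out.
Xt = OneHotEncoder(drop='if_binary').fit_transform(X).toarray()
print(Xt)
```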
In every case when there is a binary feature, ... But IMO, for features with one category ...
The above example with drop='first':

>>> OneHotEncoder(drop='first').fit_transform([['a'], ['a'], ['a']]).A
array([], shape=(3, 0), dtype=float64)

In terms of implementation I think we might want to change the signature of ...

>>> est = OneHotEncoder(drop='if_binary')
>>> est.fit(np.atleast_2d(['a', 'b', 'c']))
OneHotEncoder(categories='auto', drop='if_binary',
              dtype=<class 'numpy.float64'>, handle_unknown='error',
              sparse=True)
>>> est.drop_idx_
array([0, 0, 0])

In general we should compute ...
And even if we change the logic of ...
Yes, it may have some additional cost, but basically the logic for drop should be ... so we cannot fix drop_idx_ in transform, as the determination of "binary" features needs to happen based on train data in fit.
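The fit-time rule described above could be sketched as a standalone helper (hypothetical code mirroring the idea, not the PR's actual implementation):

```python
import numpy as np

def compute_drop_idx(categories, drop):
    """Per-feature index of the category to drop, or None to keep all."""
    if drop == 'first':
        return np.array([0] * len(categories), dtype=object)
    if drop == 'if_binary':
        # "Binary" is decided from the categories seen during fit,
        # which is why this cannot be deferred to transform.
        return np.array([0 if len(cats) == 2 else None
                         for cats in categories], dtype=object)
    return None

cats = [np.array(['a', 'b']), np.array(['x', 'y', 'z'])]
print(compute_drop_idx(cats, 'if_binary'))  # [0 None]
```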
What I meant was that we can edit the ... And in case of ... correct me if I am wrong!
Yes, that is what I mean. Something like, in fit:

self.drop_idx_ = self._compute_drop_idx(self.categories)

with

def _compute_drop_idx(self, categories):
    ...
    elif (isinstance(self.drop, str) and self.drop == 'first'):
        return np.array([0 if len(col_cats) > 2 else None
                         for col_cats in categories], dtype=object)

and then in ...
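For comparison, the drop_idx_ attribute that eventually shipped is an object array holding None for features where nothing is dropped (assuming a recent scikit-learn release):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([['a', 'x'], ['b', 'y'], ['a', 'z']])
enc = OneHotEncoder(drop='if_binary').fit(X)
# First feature is binary -> category index 0 dropped;
# second feature has 3 categories -> None (nothing dropped).
print(enc.drop_idx_)
```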
So right now I am considering implementing it with an object array just to check if it works fine or not. Then we will change it if ints are required.

I'm still getting a 500 from codecov, not sure what's going on. Not merging just yet in case @thomasjpfan wants to have a look?
thomasjpfan
left a comment
inverse_transform needs to be updated and tested with the new 'if_binary' option.
Another small decision: if we look across the codebase, do we tend to use underscores, spaces, hyphens or other separators for words in param values? Should it be ...

+1 for 'if_binary'

@thomasjpfan I have updated the ...
thomasjpfan
left a comment
I am happy with if_binary as the keyword option.
def test_one_hot_encoder_inverse_if_binary():
    X = [['Male', 1],
We can make the input a numpy array:

X = np.array([['Male', 1],
              ['Female', 3],
              ['Female', 2]])

and then

assert_array_equal(ohe.inverse_transform(X_tr), X)
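A self-contained version of the suggested test might look like this. The names follow the snippet above; the integer column is written as strings because a mixed numpy array would be coerced to strings anyway:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

def test_one_hot_encoder_inverse_if_binary():
    X = np.array([['Male', '1'],
                  ['Female', '3'],
                  ['Female', '2']])
    ohe = OneHotEncoder(drop='if_binary')
    X_tr = ohe.fit_transform(X)
    # inverse_transform must reconstruct the dropped binary category.
    assert np.array_equal(ohe.inverse_transform(X_tr), X)

test_one_hot_encoder_inverse_if_binary()
```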
sklearn/preprocessing/_encoders.py
else:
    n_transformed_features = sum(len(cats) - 1
                                 for cats in self.categories_)
if isinstance(self.drop, str) and self.drop == 'if_binary':
Please use elif if it is appropriate here
why it says ...
Co-Authored-By: Thomas J Fan <thomasjpfan@gmail.com>
All green now!

Thanks @rushabh-v and all reviewers!
Fixes #12502