[WIP] "other"/min_freq in OneHot and OrdinalEncoder by datajanko · Pull Request #12264 · scikit-learn/scikit-learn

datajanko · 2018-10-03T19:23:37Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Currently, adds the option to add a frequency threshold to OneHot- and OrdinalEncoder.
All categories below this threshold are determined, sorted and mapped to the first category.
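A minimal sketch of the grouping described above, not the code in this PR. It assumes that `min_freq` is a relative frequency and that "first category" means the first of the sorted low-frequency categories; `group_low_frequency` is a hypothetical helper name.

```python
from collections import Counter


def group_low_frequency(values, min_freq):
    """Map every category rarer than min_freq to a single representative."""
    counts = Counter(values)
    n = len(values)
    # Categories whose relative frequency is strictly below the threshold.
    rare = sorted(c for c, k in counts.items() if k / n < min_freq)
    if not rare:
        return list(values)
    representative = rare[0]  # assumption: first of the sorted rare categories
    rare_set = set(rare)
    return [representative if v in rare_set else v for v in values]


colors = ["red"] * 6 + ["blue"] * 2 + ["green", "amber"]
grouped = group_low_frequency(colors, min_freq=0.2)
print(sorted(set(grouped)))
```

Here "green" and "amber" each make up 10% of the data, so both end up under the label "amber", while "blue" (exactly 20%) is kept.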

What needs to be done?

Adds min_df with implementation to Ordinal- and OneHotEncoder
Example in examples/ folder
Documentation
Probably add more tests and remove some tests
Add an option to name the "other" group -> What to do if the categories are not object/str? What happens if "other" is already there?
With a threshold, encoders are not "really" invertible anymore -> add at least documentation?
Check whether the function names are appropriate
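The invertibility point in the list above can be illustrated with a small sketch. The mapping below is made up for illustration and only assumes that two grouped categories share one code:

```python
# Hypothetical ordinal mapping in which "amber" and "green" were grouped.
forward = {"amber": 0, "green": 0, "blue": 1, "red": 2}

# The inverse can keep only one category per code; here the first in sort order.
inverse = {}
for category in sorted(forward):
    inverse.setdefault(forward[category], category)

original = ["green", "red", "amber"]
round_trip = [inverse[forward[v]] for v in original]
print(round_trip)  # "green" comes back as "amber"
```

Any `inverse_transform` on grouped data has to pick some representative, so this loss would at least need to be documented.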

Any other comments?

Further points of extension:

Instead of min_freq, add top_n categories. Moreover, one could use integers instead of floats in min_freq. top_n and min_freq could interact
Allow an array of frequencies for each feature
One could provide a mapping to group certain values together in a category. It might be, though, that a different encoder would be more suitable
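One way the `top_n` and integer `min_freq` ideas could interact, as a rough sketch. The semantics are assumptions rather than anything decided in this PR: an integer is read as an absolute count, a float as a fraction of the samples, and `top_n` is applied after the threshold. `frequent_categories` is a hypothetical helper.

```python
from collections import Counter


def frequent_categories(values, min_freq=0, top_n=None):
    """Return the sorted categories that survive both filters."""
    counts = Counter(values)
    n = len(values)
    # Assumption: int means absolute count, float means fraction of samples.
    threshold = min_freq if isinstance(min_freq, int) else min_freq * n
    # most_common() orders by descending count, so slicing keeps the top_n.
    kept = [c for c, k in counts.most_common() if k >= threshold]
    if top_n is not None:
        kept = kept[:top_n]
    return sorted(kept)


values = ["a"] * 5 + ["b"] * 3 + ["c"] * 2 + ["d"]
print(frequent_categories(values, min_freq=2))
print(frequent_categories(values, min_freq=0.25))
print(frequent_categories(values, min_freq=2, top_n=1))
```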

amueller · 2018-10-03T20:45:44Z

thanks. Looks like you've got merge conflicts, though :-/


datajanko · 2018-10-04T08:50:19Z

So I was able to rebase, but encountered another error. Will push an update later

jorisvandenbossche · 2018-10-04T09:03:25Z

General design question: the docstring you added says "group low frequent categories together", but above you say "All categories below this threshold are ... mapped to the first category.".
I would understand the first as combining all low-frequency categories into a separate category, which is different from the first category (which one is first also depends on the sort order of the categories).
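To make the two readings concrete, here is a sketch for the ordinal case. `ordinal_codes` and its `separate_group` flag are hypothetical names used only for this illustration, and "first category" is assumed to mean the first rare category in sort order:

```python
def ordinal_codes(categories, rare, separate_group):
    """Return a category -> integer code mapping under one of the two readings."""
    cats = sorted(categories)
    rare = set(rare)
    if not rare:
        return {c: i for i, c in enumerate(cats)}
    if separate_group:
        # Reading 1: frequent categories get 0..k-1, all rare ones share code k.
        frequent = [c for c in cats if c not in rare]
        mapping = {c: i for i, c in enumerate(frequent)}
        for c in rare:
            mapping[c] = len(frequent)
    else:
        # Reading 2: rare categories borrow the code of the first rare category.
        first = min(rare)
        kept = [c for c in cats if c not in rare or c == first]
        mapping = {c: i for i, c in enumerate(kept)}
        for c in rare:
            mapping[c] = mapping[first]
    return mapping


categories = ["amber", "blue", "green", "red"]
rare = ["amber", "green"]
print(ordinal_codes(categories, rare, separate_group=True))
print(ordinal_codes(categories, rare, separate_group=False))
```

Under the second reading the code of the merged group depends on where its first member falls in the sort order, which is the concern raised above.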

jnothman

I'm not sure why you want to do this before encoding. Can it be done in the _encode functions in any case?

Although this does get trickier when you try to do it in an OrdinalEncoder context.

jnothman · 2018-10-04T00:14:22Z

sklearn/preprocessing/_encoders.py

            0.20 and will be removed in 0.22.
            You can use the ``ColumnTransformer`` instead.

+    min_freq: float, default=0


space before colon, please

jnothman · 2018-10-04T00:14:29Z

sklearn/preprocessing/_encoders.py

            You can use the ``ColumnTransformer`` instead.

+    min_freq: float, default=0
+        group low frequent categories together


be more specific, please. This should describe what the parameter is.

datajanko · 2018-10-04T11:11:08Z

@jorisvandenbossche
Currently, because it's easier, I'm just selecting the first element as the key of the new group (categories can be ints, so a string "other" will not always be feasible; the groups are always sorted). That's why I chose this easy solution.

If one wants to add a new label, one has to check whether the label is already there or find a label automatically. This can become complex. I'm open to suggestions on what to do here.
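A sketch of what finding a label automatically could look like. This is only one possible policy, and `pick_other_label` is a hypothetical helper: string categories get "other" with a numeric suffix if needed, integer categories get an unused integer, and anything else is rejected.

```python
def pick_other_label(categories, base="other"):
    """Return a label for the grouped category that is not already in use."""
    existing = set(categories)
    if all(isinstance(c, str) for c in existing):
        label, suffix = base, 0
        while label in existing:
            suffix += 1
            label = f"{base}_{suffix}"
        return label
    if all(isinstance(c, int) for c in existing):
        # For integer categories, use an integer that does not occur yet.
        return max(existing) + 1
    raise TypeError("mixed or unsupported category types")


print(pick_other_label(["a", "b"]))
print(pick_other_label(["a", "other"]))
print(pick_other_label([1, 2, 5]))
```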

@jnothman
I think it can be done inside the _encode function as well. However, I thought it good practice to separate concerns, which is why I wanted to keep things separate here. Moreover, the logic in _encode will become more complex, especially if one adds a top_n keyword. But in that case it might even make sense to encode the keywords according to rank. So there are arguments for adding this to _encode.
If you prefer that, I'll just move everything there.

I'll update the documentation asap

jorisvandenbossche · 2018-10-04T11:16:30Z

I don't have the answer, but I do think that it should not be based on what is "easier" to implement. I think both ways are possible (although one is more complex than the other), and we should choose the behaviour we want based on what makes most sense from a machine learning point of view.

NicolasHug · 2019-04-17T21:55:01Z

@datajanko are you planning to work on this again soonish?

If not I'll give it a try ;)

datajanko · 2019-04-18T09:59:15Z

Currently, my schedule is quite rough, so please go for it. However, if I recall correctly, there was some helpful work on adding NaN support in OneHotEncoder or OrdinalEncoder. I think using that would be the easiest way to implement the feature without changing too much. I don't know the status of that issue, though, and can't find it.

datajanko · 2019-04-18T20:25:00Z

You should not proceed here until missing values are added to the onehot encoder, see #13028 #11996
The idea was to treat the groups below the minimum frequencies as missing, thus reusing the code from there.

NicolasHug · 2019-04-19T18:32:29Z

Could you please expand a bit, @datajanko?

I don't understand why we need to wait for nan support here.

datajanko · 2019-04-20T09:21:57Z

So the simplest idea we had was: map all the low-frequency groups to NaN and then use the implementation with nan. This would mean a low implementation effort
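Roughly, the idea could look like the sketch below. It uses `None` as a stand-in for NaN and a toy one-hot routine with an extra "missing" column, since the actual missing-value support from #13028 / #11996 is not shown in this thread. Both helper names are hypothetical.

```python
from collections import Counter


def mask_rare_as_missing(values, min_freq):
    """Replace categories rarer than min_freq (a fraction) with None."""
    counts = Counter(values)
    n = len(values)
    return [v if counts[v] / n >= min_freq else None for v in values]


def one_hot_with_missing(values):
    """One column per observed category plus a last column for missing values."""
    cats = sorted({v for v in values if v is not None})
    index = {c: i for i, c in enumerate(cats)}
    rows = []
    for v in values:
        row = [0] * (len(cats) + 1)
        row[index[v] if v is not None else len(cats)] = 1
        rows.append(row)
    return cats, rows


values = ["x", "x", "x", "y", "z"]
masked = mask_rare_as_missing(values, min_freq=0.3)
cats, rows = one_hot_with_missing(masked)
print(masked)
print(cats, rows)
```

The rare categories "y" and "z" both land in the missing column, so no separate grouping logic is needed in the encoder itself.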

Besides, I don't recall the details precisely, but I think I had some issues in my implementation related to nan values (could be related to non existing values in the test set). I just recall: wait for the nan implementation.

However, you are of course free to choose any approach you like, and maybe I just overlooked an obvious direct solution here.

FedericoV · 2019-04-22T23:00:04Z

Hi @datajanko - are you planning to continue to work on this? I was trying to solve the exact same problem.

NicolasHug · 2019-04-22T23:32:38Z

I'm on it @FedericoV

FedericoV · 2019-04-22T23:37:37Z

Cool, let me know if you need me to test out a new branch @NicolasHug

FedericoV · 2019-06-03T21:12:41Z

Hi @NicolasHug - did you go ahead and make any headway on this or did you decide to abandon it for now?

NicolasHug · 2019-06-03T21:22:04Z

I implemented #13833. It's waiting for feedback. These things are much more complicated than they look

FedericoV · 2019-06-03T21:33:20Z

Oh awesome, thank you so much! I didn't mean to be impatient, I just hadn't seen your PR. Do you mind if I play with it a bit? I might be able to catch some bugs.


NicolasHug · 2019-06-03T21:48:16Z

Of course! that'd be very helpful

> These things are much more complicated than they look

And don't worry that wasn't directed to you, more like a rant to myself ;)

amueller · 2022-03-25T20:14:01Z

Should we say closed via #16018? That one doesn't have OrdinalEncoder, though.

thomasjpfan · 2022-03-25T20:24:03Z

I am okay with closing. Only OneHotEncoder was requested in the original issue. If OrdinalEncoder is requested, it should be able to use the same code in #16018. (By moving the new methods for infrequent categories up to the parent _BaseEncoder class.)

Thank you @datajanko for looking into the issue!

J42994 added 2 commits October 4, 2018 07:39

add utility function to group low frequent values

b823d91

provide tests: - tests for different frequency values - otherwise tests similar to that of _encode

add min_freq to ordinal and onehot encoder

512a567

- adds min_freq keyword to ordinal and onehot encoder and adds the necessary calls to BaseEncoder - improves tests on _group_values - adds tests that ensure that fit does not alter the inputarray.

datajanko force-pushed the GroupingEncoder branch from 16113a7 to 512a567 Compare October 4, 2018 07:42

fixes errors concerned accessing group

b73942e

jnothman reviewed Oct 4, 2018


jorisvandenbossche mentioned this pull request Oct 4, 2018

Add "other" / min_frequency option to OneHotEncoder #12153

Closed

jnothman mentioned this pull request Mar 21, 2019

Handle Error Policy in OrdinalEncoder #13488

Closed

NicolasHug mentioned this pull request May 8, 2019

[MRG] Add support for infrequent categories in OneHotEncoder and OrdinalEncoder #13833

Closed

4 tasks

amueller added the Waiting for Reviewer label Aug 6, 2019

github-actions bot added the module:preprocessing label Mar 2, 2020

Base automatically changed from master to main January 22, 2021 10:50

cmarmo added the Superseded (PR has been replaced by a newer PR) label and removed the Waiting for Reviewer label Feb 5, 2022

thomasjpfan closed this Mar 25, 2022
