[WIP] "other"/min_freq in OneHot and OrdinalEncoder#12264
[WIP] "other"/min_freq in OneHot and OrdinalEncoder#12264datajanko wants to merge 3 commits intoscikit-learn:mainfrom
Conversation
|
thanks. Looks like you've got merge conflicts, though :-/ |
provide tests: - tests for different frequency values - otherwise tests similar to that of _encode
- adds min_freq keyword to ordinal and onehot encoder and adds the necessary calls to BaseEncoder - improves tests on _group_values - adds tests that ensure that fit does not alter the inputarray.
16113a7 to
512a567
Compare
|
So I was able to rebase, but encountered another error. Will push an update later |
|
General design question: the docstring you added says "group low frequent categories together", but above you say "All categories below this threshold are ... mapped to the first category.". |
jnothman
left a comment
There was a problem hiding this comment.
I'm not sure why you want to do this before encoding. Can it be done in the _encode functions in any case?
Although this does get trickier when you try to do it in an OrdinalEncoder context.
| 0.20 and will be removed in 0.22. | ||
| You can use the ``ColumnTransformer`` instead. | ||
|
|
||
| min_freq: float, default=0 |
| You can use the ``ColumnTransformer`` instead. | ||
|
|
||
| min_freq: float, default=0 | ||
| group low frequent categories together |
There was a problem hiding this comment.
be more specific, please. This should describe what the parameter is.
|
@jorisvandenbossche If one wants to add a new label, one has to check if the label is there already of find a label automatically. This can become complex. I'm open to suggestion on what to do here. @jnothman I'll update the documentation asap |
|
I don't have the answer, but I only think that it should not be based on "easier" to implement. I think both ways are possible (although the one more complex than the other), and we should choose what behaviour we want based on what makes most sense from a machine learning point of view. |
|
@datajanko are you planning to work on this again soonish? If not I'll give it a try ;) |
|
Currently, my schedule is quite rough, so please go for it. However, if I recall correctly, there was some helpful work on adding nan support in onehotencoder or ordinal encoder. I think using that would be the easiest way to implement the feature without changing too much. I don't know the status on the issue though and can't find it. |
|
Could you please expand a bit @datajanko please? I don't understand why we need to wait for nan support here. |
|
So the simplest idea we had was: map all the low-frequency groups to NaN and then use the implementation with nan. This would mean a low implementation effort Besides, I don't recall the details precisely, but I think I had some issues in my implementation related to nan values (could be related to non existing values in the test set). I just recall: wait for the nan implementation. However, you are of course free to choose any approach you like and maybe I just oversaw an obvious direct solution here. |
|
Hi @datajanko - are you planning to continue to work on this? I was trying to solve the exact same problem. |
|
I'm on it @FedericoV |
|
Cool, let me know if you need me to test out a new branch @NicolasHug |
|
Hi @NicolasHug - did you go ahead and make any headway on this or did you decide to abandon it for now? |
|
I implemented #13833. It's waiting for feedback. These things are much more complicated than they look |
|
Oh awesome, thank you so much! I didn't mean to be impatient, I just
hadn't seen your PR. Do you mind if I play with it a bit? I might be able
to catch some bugs.
…On Mon, Jun 3, 2019 at 2:24 PM Nicolas Hug ***@***.***> wrote:
I implemented #13833
<#13833>. It's waiting
for feedback. These things are much more complicated than they look
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#12264?email_source=notifications&email_token=AAEZ25VXIXUCP4TJSION7Z3PYWDXDA5CNFSM4FY2ZX42YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODW2XLTY#issuecomment-498431439>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEZ25XUGSSGNMNF6IC44M3PYWDXDANCNFSM4FY2ZX4Q>
.
|
|
Of course! that'd be very helpful
And don't worry that wasn't directed to you, more like a rant to myself ;) |
|
Should we say closed via #16018? That one doesn't have OrdinalEncoder, though. |
|
I am okay with closing. Only Thank you @datajanko for looking into the issue! |
Reference Issues/PRs
Fixes: #12153
What does this implement/fix? Explain your changes.
Currently, adds the option to add a frequency threshold to OneHot- and OrdinalEncoder.
All categories below this threshold are determined, sorted and mapped to the first category.
What needs to be done?
min_dfwith implementation to Ordinal- and OneHotEncoderexamples/folderothergroup -> What to do if not object/str? What happens ifotheralready there?Any other comments?
Further points of extension:
min_freqaddtop_ncategories. Moreover, one could use integers instead of floats inmin_freq.top_nandmin_freqcould interact