[WIP] Make LabelEncoder more friendly to new labels#3243
mjbommar wants to merge 59 commits into scikit-learn:master from
|
What about selecting a fixed "default" label? I assume this would be at
On 5 June 2014 03:42, Michael Bommarito <notifications@github.com> wrote:
|
|
@jnothman, good idea. I would normally hit it with |
|
@jnothman, should I add a new example to preprocessing.rst that shows how to handle this? I think this issue of handling unseen categorical labels is a very common pitfall for people and I seem to run into it very often when teaching. |
|
Yes, I think an example would be helpful. |
|
@jnothman, another subtle point about Which do we want?
|
|
As a categorical label, NaN seems a bit strange altogether, given that it is a float. Is the option needed? But if it's there, yes, I'd say upcast to a float type (could use find_common_type). |
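The upcasting point can be illustrated with a small NumPy sketch. It uses `np.result_type` in place of the `find_common_type` mentioned above (an assumption made for this example, since recent NumPy releases deprecate `find_common_type`), and the `codes` array is made-up data, not output from this PR.

```python
import numpy as np

# Hypothetical integer codes, as an encoder would produce for known labels.
codes = np.array([0, 2, 1])

# An integer array cannot hold NaN: the assignment raises ValueError.
try:
    codes[1] = np.nan
    int_accepts_nan = True
except ValueError:
    int_accepts_nan = False

# Ask NumPy which dtype can represent both the integer codes and a float NaN.
target = np.result_type(codes.dtype, np.float64)

# Upcast a copy, then mark the second entry as an unseen label.
upcast = codes.astype(target)
upcast[1] = np.nan
```

The cost of this design is that every encoded array becomes floating point as soon as one unseen label appears, which is part of why NaN is an awkward choice for a categorical code.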
|
Well, this is the same issues that I think we know deterministically that the indices will be integer unless upcast, so do we need to use |
|
Oh perhaps not. I was thinking of the case where a smaller float is
On 9 June 2014 08:33, Michael Bommarito <notifications@github.com> wrote:
|
|
But I'm not sure find_common_type helps there anyway.
On 9 June 2014 08:44, Joel Nothman <jnothman@student.usyd.edu.au> wrote:
|
|
OK, the version I have currently pushed has the proposed float/int logic. |
|
@jnothman, just wanted to see if you were waiting on anything from me on this. I think I've addressed your comments thus far but wanted to make sure. |
|
I've not got further than looking at the PR description! It's a busy week, and I'm overseas for the next two, so I'm avoiding promises to review atm. |
…transform with new labels
…bels=update w/ searchsorted
…g, cleaning after removing np.nan.
|
Cleanly rebased final PR pending. |
|
Closing for PR #3483. |
This PR intends to make preprocessing.LabelEncoder more friendly for production/pipeline usage by adding a new_labels constructor argument.

Instead of always raising ValueError for unseen/new labels in transform, LabelEncoder may be initialized with new_labels as:

- "raise": current behavior, i.e., raise ValueError; to remain default behavior
- "nan": return np.nan for unseen/new labels
- "update": update classes_ with new IDs [N, ..., N+m-1] for m new labels and assign
- "label": set newly seen labels to have fixed class new_label_class=-1

Tests and documentation updates included.

(edit: adding "label" to list for quick summary)