[MRG+2] Refactor CategoricalEncoder into OneHotEncoder (with deprecated kwargs) and OrdinalEncoder#10523
Conversation
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
f4893a2 to
6346358
Compare
|
In the meantime, I added deprecation warnings for the old behaviour, keywords and attributes. |
jnothman
left a comment
There was a problem hiding this comment.
Clarifying the difference between encoded_input=False/True, and what will happen to it in two versions' time, still needs elucidation in the docs.
| Parameter names mapped to their values. | ||
| """ | ||
| out = dict() | ||
| for key in self._get_param_names(): |
There was a problem hiding this comment.
This is getting a bit adventurous of you! Propose this separately. It's not exclusive to this change.
There was a problem hiding this comment.
It somehow is, in the sense that it is quite essential for this PR, as otherwise just showing the repr of the new OneHotEncoder shows deprecation warnings.
Did it happen before that keyword arguments were deprecated? How was it dealt with then?
I can certainly do it in a separate PR, but then that PR would be a blocker for this one IMO (which is not necessarily a problem, so fine for me to do that).
There was a problem hiding this comment.
I don't get this. Can you elaborate? If deprecated keyword arguments are used, they raise a DeprecationWarning during fit, right? Why would the repr do that?
There was a problem hiding this comment.
This is to ensure get_params() does not raise any deprecation warnings. See the bigger non-inline comment with an overview of the deprecation handling
sklearn/preprocessing/_encoders.py
Outdated
| The categories of each feature determined during fitting | ||
| (in order corresponding with output of ``transform``). | ||
|
|
||
| Deprecated Attributes |
There was a problem hiding this comment.
I'm sure Numpydoc doesn't handle this.
sklearn/preprocessing/_encoders.py
Outdated
| will be all zeros. In the inverse transform, an unknown category | ||
| will be denoted as None. | ||
|
|
||
| Deprecated Parameters |
There was a problem hiding this comment.
I'm sure Numpydoc doesn't handle this. Just use .. deprecated:: 0.20, yeah?
| is set to 'ignore' and an unknown category is encountered during | ||
| transform, the resulting one-hot encoded columns for this feature | ||
| will be all zeros. In the inverse transform, an unknown category | ||
| will be denoted as None. |
There was a problem hiding this comment.
(Perhaps add a note that this can be used to handle missing values)
sklearn/preprocessing/_encoders.py
Outdated
|
|
||
| encoded_input=False : categorical features that still need to be | ||
| encoded. | ||
| encoded_input=True : already integer encoded data, and the categories |
There was a problem hiding this comment.
"In the range 0...(n values - 1)"
sklearn/preprocessing/_encoders.py
Outdated
|
|
||
| The used categories can be found in the ``categories_`` attribute. | ||
|
|
||
| encoded_input : boolean |
There was a problem hiding this comment.
I think ordinal_input is better than encoded_input, especially seeing as we introduce an OrdinalEncoder...
sklearn/preprocessing/_encoders.py
Outdated
| encoded_input : boolean | ||
| How to interpret the input data: | ||
|
|
||
| encoded_input=False : categorical features that still need to be |
There was a problem hiding this comment.
Don't use : descr. The correct syntax for definition lists is:
encoded_input=False
categorical features ...
I think the default should be 'auto', not None, with the default changing to False in two versions.
Do you think this option will be useful (for efficiency, for error handling, or for quality assurance) after the deprecation is finished? If not, should it just be called legacy_mode?
Does this setting affect the handling of n_values? Can one use encoded_input=True and categories together?
This seems to mostly affect the handling of errors, in that even if a column isn't active in input, that value is not considered unknown....?
sklearn/preprocessing/_encoders.py
Outdated
| OneHotEncoder(categories='auto', dtype=<... 'numpy.float64'>, | ||
| encoded_input=True, handle_unknown='error', sparse=True) | ||
| >>> enc.transform([[0, 1, 1]]).toarray() | ||
| array([[ 1., 0., 0., 1., 0., 0., 1., 0., 0.]]) |
There was a problem hiding this comment.
Didn't this used to only show active features? Shouldn't this have columns for [x0=0, x0=1, x1=0, x0=1, x0=2, x2=0, x2=1, x2=3]? Here it has 9 columns, not 8.
sklearn/preprocessing/_encoders.py
Outdated
| OneHotEncoder(categories='auto', dtype=<... 'numpy.float64'>, | ||
| encoded_input=True, handle_unknown='error', sparse=True) | ||
| >>> enc.transform([[0, 1, 1]]).toarray() | ||
| array([[ 1., 0., 0., 1., 0., 0., 1., 0., 0.]]) |
There was a problem hiding this comment.
Didn't this used to only show active features? Shouldn't this have columns for [x0=0, x0=1, x1=0, x0=1, x0=2, x2=0, x2=1, x2=3]? Here it has 9 columns, not 8.
|
Some quick answers
See my (long) comment on the issue asking about this: #10521
Good question, this is one of the things I didn't really think through yet. In principle, you can pass
Yes I know, but wanted it to be very explicit now to see the consequences (for initial reviewing). I didn't bother yet fully updating the documentation as well. |
|
I think we may have previously handled such warnings in get_params using a
check_warnings context, IIRC, but it resulted in dangerous race conditions
in parallel processing.
We usually don't bother raising warnings for constructor parameter access:
if the user doesn't have that parameter set at fit, they are unlikely to
try to access it otherwise in saved code. I know, it's a bit dodgy.
I personally think legacy_mode='auto' is the way to go, switching on the
basis of the data whether to use only the new params/attributes, or the
old. Most people don't care about the differences between the old and new
model. It can either default to False in v0.22 then disappear in v0.23 or
0.24, or False can become the *only* behaviour in v0.22 before it
disappears in v0.23 or 0.24.
|
|
Last commit added the logic to deal with all different case (to raise a warning or not, to use legacy mode or not). Still have to clean-up a lot, so don't review in detail, but could already take a look at |
jnothman
left a comment
There was a problem hiding this comment.
I've not looked at fit/transform
sklearn/preprocessing/_encoders.py
Outdated
|
|
||
| def _handle_deprecations(self, X): | ||
|
|
||
| if self.categories != 'auto': |
There was a problem hiding this comment.
Let's make self.categories='legacy' by default, so that the user can select auto explicitly.
There was a problem hiding this comment.
Yesterday I made a commit to make the default categories=None for the same purpose, but only pushed it now (the last commit).
I personally find that cleaner, as 'legacy' would not give the correct meaning if you are using string data.
sklearn/preprocessing/_encoders.py
Outdated
| self.dtype = dtype | ||
| self.handle_unknown = handle_unknown | ||
|
|
||
| if n_values is not None: |
There was a problem hiding this comment.
why don't you just use self.n_values = n_values? The warning is done there...
There was a problem hiding this comment.
why don't you just use self.n_values = n_values? The warning is done there...
Because I also want to deprecated the access / writing of the attribute n_values on the class object.
There was a problem hiding this comment.
Huh? But doing self.n_values = n_values here will call the setter and raise the warning, just as you have done.
sklearn/preprocessing/_encoders.py
Outdated
|
|
||
| if self._legacy_mode: | ||
| # TODO not with _transform_selected ?? | ||
| self._fit_transform_old(X) |
| ``X[:, i]``. Each feature value should be | ||
| in ``range(n_values[i])`` | ||
|
|
||
| categorical_features : "all" or array of indices or mask |
There was a problem hiding this comment.
I'm not sure users will be happy that we are deprecating this! But I suppose that in the deprecation notice, we'll be able to point to ColumnTransformer?
There was a problem hiding this comment.
regarding categorical_features: in principle I can keep this as this is not related to the inherent behaviour of how the encoding works in OneHotEncoder (legacy or new behaviour), but: it makes the implementation more complex, is not done in any other transformer, and indeed can be replaced with ColumnTransformer.
sklearn/preprocessing/_encoders.py
Outdated
| self._legacy_mode = True | ||
|
|
||
| if self._deprecated_categorical_features != 'all': | ||
| self._legacy_mode = True |
There was a problem hiding this comment.
This isn't safe to do if categories has been set
There was a problem hiding this comment.
Yes, I know, but I don't want to implement this ability for the new behaviour.
And, either you were using OneHotEncoder already, and then categories was not set and this is OK, or either you did already update your usage (eg setting categories instead of n_values) but then you can directly update for this deprecated keyword as well, or either you are new to this class and then you shouldn't use it.
What I can do is detect that categories is set by the user (and not internally set), and in that case just raise a plain error here instead of a warning.
There was a problem hiding this comment.
Absolutely. Error if categorical_features is set and not legacy_mode (including if string input)
jnothman
left a comment
There was a problem hiding this comment.
I've not looked at fit/transform
|
Test failing? Is this waiting for reviews? |
|
This is waiting for your opinion about the general idea in #10521 (I would say, let's keep the general discussion there for now, would need to look back at this PR to know its actual status) |
|
OK, made some small updates:
|
|
Great work!
|
|
I'm happy with it as it is iirc |
|
I am also ok with deprecating I would also introduce a stub Even if |
|
This PR doesn't yet remove CategoricalEncoder. I'm okay with Olivier's suggestion: class CategoricalEncoder:
"Removed"
def __init__(*args, **kwargs):
raise RuntimeError('CategoricalEncoder briefly existed in 0.19dev. '
'Its functionality has been rolled into '
'OneHotEncoder and OrdinalEncoder. This stub '
'will be removed in version 0.21.')The only problem I see with it is that an ImportError would warn sooner. |
It does, but will add the stub as proposed above.
Yeah, I was also thinking about that. But I don't see a way to both support |
8966ad3 to
79f2e92
Compare
|
oh yeah... i always forget to load diff when looking for things
|
|
@amueller are you OK with going merging this as is? (thus, deprecating |
|
I'm okay with that, yes.
|
jnothman
left a comment
There was a problem hiding this comment.
Are we otherwise ready to give this a whirl?
sklearn/preprocessing/data.py
Outdated
|
|
||
| class CategoricalEncoder: | ||
| """ | ||
| CategoricalEncoder briefly existed in 0.19dev. Its functionality |
|
Great work !
|
|
Great work, @jorisvandenbossche! |
|
Yesssss !
Congratulations team
Sent from my phone. Please forgive typos and briefness.
…On Jun 21, 2018, 12:14, at 12:14, Joel Nothman ***@***.***> wrote:
Merged #10523.
--
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub:
#10523 (comment)
|
|
yes, I'm ok with this for now ;) Thank you for all the work, this is great! |
Possibly fixes #10521
This does split the CategoricalEncoder into two separate classes for onehot and ordinal encoding, and then integrates the onehot encoding into the existing OneHotEncoder.
I also moved them to a separate file, so for reviewing actual changes might be better to not view the first commit.
I also did not yet introduce deprecation warnings for the old kwargs / attributes (or computed the new attributes in the old setting), but I infer based on the passed data to fit whether it was accepted before (and in this case we should raise a deprecation warning) and if not directly use the new behaviour. This 'inferring' of the behaviour can be overwritten with a newly added a
encoded_input=True/Falsekeyword.The main drawback for new users of the OneHotEncoder is the 'pollution' with the old keywords and attributes in the docstring, repr of the object and tab completion.