Support for strings, negative integers, and anything else that can be put in an object array.
Specify discrete values using the classes parameter instead of n_values.
Uses a LabelEncoder instance for each column.
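The per-column encoding described above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper name `one_hot_object_columns` is hypothetical, and a plain sorted lookup table stands in for the per-column LabelEncoder so the sketch depends only on NumPy. It assumes the values within a single column are hashable and mutually sortable.

```python
import numpy as np


def one_hot_object_columns(X):
    """One-hot encode each column of a 2-D object array (hypothetical helper).

    Returns the dense encoded array and the list of classes found per column.
    """
    X = np.asarray(X, dtype=object)
    n_samples, n_features = X.shape
    blocks, classes = [], []
    for j in range(n_features):
        column = X[:, j].tolist()
        # Stand-in for a per-column LabelEncoder: sorted unique values.
        col_classes = sorted(set(column))
        index = {value: k for k, value in enumerate(col_classes)}
        codes = np.array([index[value] for value in column])
        # One indicator column per class of this feature.
        block = np.zeros((n_samples, len(col_classes)))
        block[np.arange(n_samples), codes] = 1.0
        blocks.append(block)
        classes.append(col_classes)
    return np.hstack(blocks), classes


# Strings and negative integers side by side in one object array.
X = [["red", -1], ["blue", 3], ["red", 3]]
encoded, classes = one_hot_object_columns(X)
```

Here the first two output columns correspond to the classes of the string feature and the last two to the classes of the integer feature.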
Changes
Changes _transform_selected to _apply_selected, giving it the option of not returning the transformed values.
_apply_selected no longer accepts lists. It has to be given a NumPy array (np.ndarray). This is because the input can have np.object dtype, which cannot always be cast as a whole to np.int or np.float. The transformed and non-transformed parts of the array are converted to the specified type before returning.
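To make the two changes above concrete, here is a rough sketch of what such a helper might look like. The signature, the `return_transformed` flag, and the choice to return `None` when it is off are all assumptions made for illustration; the actual private function in this PR may differ.

```python
import numpy as np


def apply_selected(X, transform, selected, dtype=np.float64,
                   return_transformed=True):
    """Apply `transform` to the selected columns only (hypothetical sketch)."""
    if not isinstance(X, np.ndarray):
        # Lists are rejected: an object array cannot always be cast as a whole.
        raise TypeError("X must be a numpy array, got %r" % type(X).__name__)
    mask = np.zeros(X.shape[1], dtype=bool)
    mask[selected] = True
    transformed = transform(X[:, mask])
    if not return_transformed:
        # Assumption: the callable was run only for its side effects
        # (for example fitting), so there is nothing to return.
        return None
    # Cast each part separately, then stack transformed columns first.
    untouched = X[:, ~mask].astype(dtype)
    return np.hstack([np.asarray(transformed, dtype=dtype), untouched])


X = np.array([["ab", 1.5], ["c", 2.5]], dtype=object)
# Toy transform: replace each string by its length.
lengths = lambda cols: np.array([[len(v) for v in row] for row in cols])
out = apply_selected(X, lengths, selected=[0])
```

Casting the two parts separately is what lets a string column and a float column coexist in the input.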
@MechCoder @amueller np.in1d is not supported on older NumPy versions because merge sort was not supported for all data types prior to version 1.6. Should I write a custom implementation of in1d, or raise a ValueError on object arrays when an older NumPy version is installed?
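For reference, the custom-implementation option could be as small as the sketch below. It is only one possible approach, not what the PR ships: the function name `in1d_object` is hypothetical, and it swaps the sort-based algorithm for a Python set lookup, which assumes every element is hashable.

```python
import numpy as np


def in1d_object(values, candidates):
    """Membership test for object arrays without sorting (hypothetical)."""
    values = np.asarray(values, dtype=object).ravel()
    # A set lookup avoids sorting, so no merge sort support is needed.
    lookup = set(np.asarray(candidates, dtype=object).ravel().tolist())
    return np.array([v in lookup for v in values.tolist()], dtype=bool)


mask = in1d_object(["a", -2, "z"], ["a", "b", -2])
```

The result is a boolean mask with one entry per element of `values`, flattened the same way np.in1d flattens its input.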
Apologies if this comment is not helpful. Isn't the name OneHotEncoder pretty famous in the data domain? Wouldn't it be good to retain it (maybe in addition to the newly introduced CategoricalEncoder)?
I don't think it's "famous". I think most people don't know what it means. People from R are very confused that you need to do anything to work with categorical data.
Somewhat along the lines of @rvraghav93 (and why reviewing the whole thing might not be so interesting until this is resolved): It seems as if this is functionally a superset of the current OneHotEncoder. It is unclear to me what benefit lies in the name change. It's great that the interface and mechanism will be simplified, but I don't see why we should confuse or upset existing OneHotEncoder (and especially OneHotEncoder('auto')) users for the sake of a changing data structure.
@jnothman Because the attributes of OneHotEncoder like n_values_ and active_features_ don't hold any meaning for this implementation and it would be unnecessarily complex to support them exactly how they behave now.
The point is that the vast majority of users of OneHotEncoder will not actually access those attributes, so your "unnecessarily complex" is actually just a few lines of code in a property getter that will disappear after a deprecation period. Deprecating the whole class, on the other hand, makes life hard for many users. I don't think it's hard for us as the developers to accept that burden rather than passing one onto a large set of user-code maintainers.
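A property getter of the kind suggested here might look like the following sketch. The class name `EncoderSketch`, the `classes_` attribute, and the way `n_values_` is derived from it are assumptions made for illustration, not the behaviour of any released OneHotEncoder.

```python
import warnings


class EncoderSketch:
    """Toy stand-in for an encoder that keeps an old attribute alive."""

    def __init__(self, classes):
        # Hypothetical fitted state: one list of classes per column.
        self.classes_ = classes

    @property
    def n_values_(self):
        # Old attribute kept only for the deprecation period.
        warnings.warn(
            "n_values_ is deprecated; use classes_ instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        # Assumption: the old attribute maps to the class count per column.
        return [len(column_classes) for column_classes in self.classes_]


enc = EncoderSketch([["a", "b"], [-1, 3, 7]])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    n_values = enc.n_values_
```

Existing code that reads the old attribute keeps working and sees a warning, and the getter can be deleted once the deprecation period ends.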
Following the discussion with @amueller and @MechCoder
The new features and changes listed at the top apply to CategoricalEncoder.