Can't provide feature indices for OneHotEncoder in pipeline #8539

@amueller

Let's say I want to apply a transformation, such as imputation or one-hot encoding (or scaling, which currently doesn't support this), to only some of the features in a pipeline.
I could provide the indices of the columns I want to transform. But if there are any previous steps in the pipeline, they might drop or rearrange the features in some arbitrary way (like OneHotEncoder does), so the indices no longer point at the intended columns.

Example

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Imputer, OneHotEncoder

# assume the second feature is categorical and the third is continuous
X = [[np.nan, np.nan, 5], [np.nan, 1, 3], [np.nan, 1, np.nan]]

# categorical_features=[1] is meant to select the second feature of X
pipe = make_pipeline(
    Imputer(strategy='most_frequent'),
    OneHotEncoder(categorical_features=[1], sparse=False)
)

pipe.fit_transform(X)

array([[ 0., 1., 1.],
       [ 1., 0., 1.],
       [ 1., 0., 1.]])

Desired outcome:

array([[ 1., 5.],
       [ 1., 3.],
       [ 1., 3.]])

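Here the Imputer discards the first feature because it contains only missing values, so the categorical feature moves from index 1 to index 0 and categorical_features=[1] ends up pointing at the continuous feature. Even if each output feature corresponded to exactly one input feature, and we knew which one, there would be no way to specify this in OneHotEncoder. This might look contrived, but it is a pretty obvious use case whenever you have per-column metadata.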
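The index drift in the example above can be reproduced without scikit-learn. The following is a minimal pure-Python sketch, not library code: `drop_all_missing_columns` is a hypothetical stand-in for any pipeline step that removes columns, and it is only meant to show that a positional index computed against the original data selects a different feature afterwards.

```python
import math


def is_missing(value):
    """Return True for float NaN, the missing-value marker used here."""
    return isinstance(value, float) and math.isnan(value)


def drop_all_missing_columns(rows):
    """Hypothetical step: remove every column that contains only missing values."""
    n_cols = len(rows[0])
    keep = [j for j in range(n_cols) if not all(is_missing(r[j]) for r in rows)]
    return [[r[j] for j in keep] for r in rows]


nan = float("nan")
X = [[nan, nan, 5.0], [nan, 1.0, 3.0], [nan, 1.0, nan]]
Xt = drop_all_missing_columns(X)

# The index was chosen against the original X (second column = categorical).
categorical_index = 1
original_column = [row[categorical_index] for row in X]
selected_column = [row[categorical_index] for row in Xt]

# After the first column is dropped, index 1 points at the continuous feature.
print(original_column)
print(selected_column)
```

The same positional index yields the categorical values before the step and the continuous values after it, which is the mismatch described in this issue.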

The only solution I see is to carry a column index (or column names) along through the pipeline and allow passing that instead of positions.
Given my experience with .iloc vs .loc in pandas, I'm not entirely happy with the prospect.
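As a rough illustration of what carrying a column index along could look like, here is a pure-Python sketch under stated assumptions. It is not a proposed scikit-learn API: the function `drop_all_missing_columns` and the idea of returning a list of labels next to the data are hypothetical and exist only for this example.

```python
import math

nan = float("nan")


def drop_all_missing_columns(rows, labels):
    """Hypothetical step: drop all-missing columns and return the surviving labels."""
    keep = [
        j for j in range(len(labels))
        if not all(isinstance(r[j], float) and math.isnan(r[j]) for r in rows)
    ]
    return [[r[j] for j in keep] for r in rows], [labels[j] for j in keep]


X = [[nan, nan, 5.0], [nan, 1.0, 3.0], [nan, 1.0, nan]]
labels = [0, 1, 2]  # original column positions, used as stable labels

Xt, labels_t = drop_all_missing_columns(X, labels)

# Resolve the label of the categorical feature to its current position.
categorical_label = 1
position = labels_t.index(categorical_label)
categorical_column = [row[position] for row in Xt]

print(labels_t, position)
```

A later step would then be configured with the label rather than the position. This also shows the ambiguity mentioned above: once labels are integers, `1` can mean either "the column labelled 1" or "the column at position 1", much like .loc versus .iloc.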

cc @mfeurer

Conceptually somewhat related to #8480 and scikit-learn/enhancement_proposals#5 as they deal with feature meta-data.

[and then I introduced hierarchical indices over columns into scikit-learn.... not]
