Can't provide feature indices for OneHotEncoder in pipeline #8539
Description
Let's say I want to apply a transformation, such as imputation or one-hot encoding (or scaling, which currently doesn't support this), to only some of the features in a pipeline.
I could provide the indices of the columns I want to transform. But if there are any previous steps in the pipeline, they might re-arrange the features in some arbitrary way (like OneHotEncoder does).
Example
import numpy as np
# assume the second feature is categorical and the third is continuous
X = [[np.NaN, np.NaN, 5], [np.NaN, 1, 3], [np.NaN, 1, np.NaN]]
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Imputer, OneHotEncoder
pipe = make_pipeline(Imputer(strategy='most_frequent'), OneHotEncoder(categorical_features=[1], sparse=False))
pipe.fit_transform(X)
array([[ 0., 1., 1.],
       [ 1., 0., 1.],
       [ 1., 0., 1.]])
desired outcome:
array([[ 1., 5.],
       [ 1., 3.],
       [ 1., 3.]])
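To make the index shift explicit, here is a minimal pure-Python sketch. It assumes the imputation step discards any column that has no observed values, which is what the output above suggests. The helpers `is_missing` and `drop_all_missing_columns` are hypothetical names for illustration, not scikit-learn functions.

```python
import math


def is_missing(value):
    """True for float NaN, False for any other value."""
    return isinstance(value, float) and math.isnan(value)


def drop_all_missing_columns(rows):
    """Drop columns with no observed value; return (rows, kept original indices)."""
    n_cols = len(rows[0])
    kept = [j for j in range(n_cols)
            if not all(is_missing(row[j]) for row in rows)]
    return [[row[j] for j in kept] for row in rows], kept


nan = float("nan")
X = [[nan, nan, 5], [nan, 1, 3], [nan, 1, nan]]

X_kept, kept_indices = drop_all_missing_columns(X)

# The user meant "the second ORIGINAL feature" when writing index 1.
categorical_index = 1

# After the drop, position 1 holds original feature 2 (the continuous one).
feature_at_position = kept_indices[categorical_index]
print(kept_indices)         # [1, 2]
print(feature_at_position)  # 2
```

The positional index stays valid syntactically but silently changes meaning, which is why the encoder ends up one-hot encoding the continuous feature.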
Even if each output feature corresponded to exactly one input feature, and we knew which one, there would be no way to specify this in OneHotEncoder. This example might look contrived, but it is a pretty common use case whenever you have per-column metadata.
The only solution I see is to carry a column index (or column names) along through the pipeline and allow passing that to the transformer.
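As a rough illustration of that idea (not an existing scikit-learn API), the sketch below carries a list of column names next to the data, so a later step can select features by name rather than by position. The names `"empty"`, `"category"`, and `"amount"` and the helpers `drop_columns` and `select_by_name` are all hypothetical.

```python
def drop_columns(rows, names, to_drop):
    """Remove the named columns and return (rows, names) with names kept in sync."""
    keep = [j for j, name in enumerate(names) if name not in to_drop]
    return [[row[j] for j in keep] for row in rows], [names[j] for j in keep]


def select_by_name(rows, names, wanted):
    """Look up columns by name, so earlier re-orderings or drops do not matter."""
    positions = [names.index(name) for name in wanted]
    return [[row[p] for p in positions] for row in rows]


# Hypothetical names for the three original features (values already imputed).
names = ["empty", "category", "amount"]
X = [[0.0, 1.0, 5.0], [0.0, 1.0, 3.0], [0.0, 1.0, 3.0]]

# An earlier step drops a column; the names travel with the data.
X_step, names_step = drop_columns(X, names, to_drop={"empty"})

# A later step asks for "category" by name and still gets the right feature.
categorical = select_by_name(X_step, names_step, ["category"])
print(names_step)   # ['category', 'amount']
print(categorical)  # [[1.0], [1.0], [1.0]]
```

The cost is that every step has to propagate the names faithfully, which is the same positional-versus-label bookkeeping that makes .iloc vs .loc awkward.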
Given my experience of .iloc vs .loc in pandas, I'm not entirely happy with the prospect.
cc @mfeurer
Conceptually somewhat related to #8480 and scikit-learn/enhancement_proposals#5 as they deal with feature meta-data.
[and then I introduced hierarchical indices over columns into scikit-learn.... not]