Missing features removal with SimpleImputer #16426
Closed
Labels
Enhancement, Moderate (anything that requires some knowledge of conventions and best practices), module:impute
Description
Code sample
In the sample code below, SimpleImputer removes a column from the dataset:
>>> from sklearn.impute import SimpleImputer
>>> import numpy as np
>>> imp = SimpleImputer()
>>> imp.fit([[0, np.nan], [1, np.nan]])
>>> imp.transform([[0, np.nan], [1, 1]])
array([[0.],
       [1.]])

Problem description
Currently sklearn.impute.SimpleImputer silently removes features whose value is np.nan in every training sample.
This can break later pipeline steps because the dataset's shape has changed, e.g.
dataset[:, columns_to_impute_with_median] = imp.fit_transform(dataset[:, columns_to_impute_with_median])

Possible solutions
For the problematic features, either keep their values if valid or impute the fill_value during transform. I suggest adding a new parameter to trigger this behaviour with a warning highlighting the referred features.
I'm willing to implement this feature and would welcome any advice.
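
To make the proposal concrete, here is a minimal NumPy-only sketch of the suggested behaviour. It is not SimpleImputer's implementation: the helper names fit_mean_keep_empty and transform_keep_empty and the default fill_value of 0.0 are hypothetical and chosen for illustration only. Features with no observed training values are kept and filled with fill_value, and a warning lists them.

```python
import warnings

import numpy as np


def fit_mean_keep_empty(X_train, fill_value=0.0):
    """Per-column means that ignore NaN (hypothetical helper, not sklearn API).

    Columns with no observed value get `fill_value` instead of being dropped.
    """
    X_train = np.asarray(X_train, dtype=float)
    missing = np.isnan(X_train)
    counts = (~missing).sum(axis=0)                      # observed values per column
    sums = np.where(missing, 0.0, X_train).sum(axis=0)   # sum of observed values
    empty = counts == 0
    if empty.any():
        warnings.warn(
            f"Features {np.flatnonzero(empty).tolist()} have no observed "
            f"values; imputing fill_value={fill_value} instead of dropping them.",
            UserWarning,
        )
    # np.maximum avoids a division by zero for the empty columns.
    return np.where(empty, fill_value, sums / np.maximum(counts, 1))


def transform_keep_empty(X, statistics):
    """Replace NaN with the per-column statistic; valid values are kept."""
    X = np.asarray(X, dtype=float)
    return np.where(np.isnan(X), statistics, X)


# Same data as the code sample in this issue.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the warning for this demo only
    stats = fit_mean_keep_empty([[0, np.nan], [1, np.nan]])
out = transform_keep_empty([[0, np.nan], [1, 1]], stats)
print(out.shape)  # (2, 2): the second column is preserved
```

With this approach the output keeps the input's number of columns, so an assignment such as the one shown in the problem description would still line up.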