
[MRG] Parameter for stacking missing indicator into imputer #12583

Merged
jnothman merged 53 commits into scikit-learn:master from DanilBaibak:Parameter-For-Stacking-MissingIndicator-Into-Imputer
Apr 9, 2019

@DanilBaibak
Contributor

Fixes #11886

A new parameter, add_indicator, was added to SimpleImputer. It allows stacking a MissingIndicator transform onto the output of the imputer's transform.
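
For illustration, here is a minimal sketch of what the parameter does, assuming a scikit-learn release that includes add_indicator (0.21 or later). The toy array and variable names are made up for this example and are not taken from the PR's tests.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (hypothetical): each column has exactly one missing value.
X = np.array([
    [1.0, np.nan],
    [np.nan, 4.0],
    [3.0, 6.0],
])

# With add_indicator=True the imputed columns come first, followed by
# one indicator column per feature that had missing values during fit.
imputer = SimpleImputer(strategy="mean", add_indicator=True)
Xt = imputer.fit_transform(X)

print(Xt.shape)
print(Xt)
```

The first two output columns hold the mean-imputed values and the last two hold the missingness flags for the same features.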

@amueller
Member

Sweet. Can you add an entry to the 0.21 whatsnew? Thanks!

Member

@jeremiedbb jeremiedbb left a comment

This is a nice feature! Thanks. A few comments below.

@jeremiedbb
Member

Hmm, I found a case where it will fail. When one column is full of missing values, the SimpleImputer will drop that column (except for the "constant" strategy) and the MissingIndicator will have a feature mismatch at transform time.
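
A rough sketch of that mismatch, using made-up data and default settings (strategy="mean" for the imputer, features="missing-only" for the indicator). Recent scikit-learn versions may also emit a warning about the skipped feature.

```python
import numpy as np
from sklearn.impute import MissingIndicator, SimpleImputer

# Toy data (hypothetical): the middle column is entirely missing.
X = np.array([
    [1.0, np.nan, 2.0],
    [np.nan, np.nan, 4.0],
    [3.0, np.nan, 6.0],
])

# The mean of an all-missing column is undefined, so the imputer drops it.
imputed = SimpleImputer(strategy="mean").fit_transform(X)

# The indicator still reports that column, as an all-True mask.
indicator_est = MissingIndicator()
indicator = indicator_est.fit_transform(X)

print(imputed.shape)           # imputed output keeps input columns 0 and 2
print(indicator_est.features_) # indicator covers input columns 0 and 1
```

The imputed output and the indicator output therefore refer to different sets of input columns.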

First, you can add it to the test, i.e. add one column full of missing values, to make the test cover more cases.

Then, to fix this, one possibility is to not fit the MissingIndicator in fit, but to fit_transform it in transform instead. The MissingIndicator does not hold any information that the SimpleImputer does not already have, so it does not even need to be an attribute of the SimpleImputer. @amueller @jnothman what do you think?

@DanilBaibak
Contributor Author

@jeremiedbb, good catch with the fully missing column! I adjusted the code and added tests. What do you think about it now?

@amueller
Member

I think transform shouldn't modify the estimator. Why can't we fit in fit and call transform after dropping columns?

I don't like the current solution since it adds a constant 1 column which is not useful imho.

@jeremiedbb
Member

jeremiedbb commented Nov 20, 2018

> I don't like the current solution since it adds a constant 1 column which is not useful imho.

I agree. In the case of a column full of missing values, we should also drop the column full of 1s from the output of MissingIndicator.

@DanilBaibak
Contributor Author

@amueller and @jeremiedbb, I see your point and agree. But the current solution does the same if you do it like this:

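from sklearn.impute import MissingIndicator, SimpleImputer
from sklearn.pipeline import make_pipeline, make_union

# marker and X_test are defined in the test under discussion
pipeline = make_pipeline(
    make_union(
        SimpleImputer(missing_values=marker),
        MissingIndicator(missing_values=marker)
    )
)

pipeline.fit_transform(X_test)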

In theory, if we drop the column full of 1s, people who currently use make_union could be affected. What do you think?
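
To make the comparison concrete, here is a small sketch under the assumption that add_indicator behaves as in released scikit-learn versions (0.21 or later). The data and variable names are invented for this example. It checks whether the make_union output and the add_indicator output agree when one column is entirely missing.

```python
import numpy as np
from sklearn.impute import MissingIndicator, SimpleImputer
from sklearn.pipeline import make_union

# Toy data (hypothetical): the middle column is entirely missing.
X = np.array([
    [1.0, np.nan, 2.0],
    [np.nan, np.nan, 4.0],
    [3.0, np.nan, 6.0],
])

# Existing approach: stack the two transformers side by side.
stacked = make_union(SimpleImputer(), MissingIndicator()).fit_transform(X)

# Approach from this PR: let the imputer append the indicator itself.
with_flag = SimpleImputer(add_indicator=True).fit_transform(X)

same = stacked.shape == with_flag.shape and np.allclose(stacked, with_flag)
print(same)
print(stacked[:, -1])  # indicator column for the all-missing feature
```

In both outputs the last column is the constant indicator for the all-missing feature, which is the redundant column under discussion.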

@amueller
Member

That's a good point. We need to decide to either:

  1. make it inconsistent and drop the 1s here
  2. keep the redundant 1s in both places
  3. change the behavior in MissingIndicator to always drop constant columns.

I'm somewhat leaning towards 3, though it requires a deprecation cycle.

@DanilBaibak
Contributor Author

It seems that automatically dropping a constant column is not always a good choice. Here's a small example:

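import numpy as np
from sklearn.impute import MissingIndicator, SimpleImputer
from sklearn.pipeline import make_pipeline, make_union

# Every column has some missing values, but none is entirely missing
X_train = np.array([
    [1, 1, np.nan],
    [np.nan, 2, 6],
    [2, np.nan, 3],
    [3, 3, 9]
])

# The first column is entirely missing
X_test = np.array([
    [np.nan, 1, 5],
    [np.nan, 2, np.nan],
    [np.nan, np.nan, 3]
])

pipeline = make_pipeline(
    make_union(
        SimpleImputer(),
        MissingIndicator()
    )
)
pipeline.fit(X_train)

pipeline.transform(X_train)
pipeline.transform(X_test)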

For X_test we have a whole column of 1s, but not for X_train.

I would vote for option 2 because, to be honest, it's hard to imagine that users will try to work with a dataset that contains a whole column of empty values. Such a column would be useless even with the "constant" strategy for SimpleImputer.

@jeremiedbb
Member

@DanilBaibak I think Joel meant in another PR, once this one is done

@DanilBaibak
Contributor Author

DanilBaibak commented Apr 1, 2019

Ok! Just added it in hot pursuit 😄

Member

@jnothman jnothman left a comment

Looking good following merge of #13491

Member

@NicolasHug NicolasHug left a comment

A few comments

Member

@jnothman jnothman left a comment

Yes! Thanks.

@jnothman jnothman merged commit 2252e1f into scikit-learn:master Apr 9, 2019
@jnothman
Member

jnothman commented Apr 9, 2019

Thanks @banilo!!

@DanilBaibak
Contributor Author

Glad to help 😊

jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this pull request Apr 25, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019

Development

Successfully merging this pull request may close these issues.

add_indicator switch in imputers

6 participants