Skip to content

[MRG] skip_complete flag for IterativeImputer#14806

Merged
glemaitre merged 14 commits intoscikit-learn:masterfrom
sergeyf:iterativeimputer_dont_skip
Sep 9, 2019
Merged

[MRG] skip_complete flag for IterativeImputer#14806
glemaitre merged 14 commits intoscikit-learn:masterfrom
sergeyf:iterativeimputer_dont_skip

Conversation

@sergeyf
Copy link
Copy Markdown
Contributor

@sergeyf sergeyf commented Aug 25, 2019

Reference Issues/PRs

Fixes #13773 and #14383

What does this implement/fix? Explain your changes.

Before, IterativeImputer had some unintuitive behavior when there were features with no missing values at fit but not at transform. This PR turns off this behavior by default, and adds a flag (skip_complete) to turn it back on. A test is added, the docstring is updated.

This change also required a few minor changes to fix edge cases that were impossible before.

Paging @jnothman and @glemaitre.

"random"
A random order for each round.

skip_non_missing_features : boolean, optional (default=False)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should have:

    skip : 'none' (default), 'complete'
        Which features to use the initial imputation for, instead of iterative imputation.
        'complete' avoids learning an iterative imputation model for features that
        have no missing values in fitting, so will be efficient if you expect those features
        not require imputation. 

We could consider supporting a list of feature indices too.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like boolean a little better because it's easy to remember what the options are. People could easily confuse 'none' for None.

As for supporting feature indices: I can't really think of when I'd want to do this and how exactly it would work. Let's keep it simple for now.

@jnothman
Copy link
Copy Markdown
Member

jnothman commented Aug 26, 2019 via email

@sergeyf
Copy link
Copy Markdown
Contributor Author

sergeyf commented Aug 26, 2019

When you know that some features will never have missing values is when you would specify them explicitly.

Doesn't skip_non_missing_features=True do this automatically? It finds them for you so you don't have to specify them yourself!

@sergeyf
Copy link
Copy Markdown
Contributor Author

sergeyf commented Aug 26, 2019

To be clear about my reluctance to add more bells/whistles. Partially, I think the current version solves the problem we have, and partially I don't have a ton of time to make this more complex/fancy. We can always add more stuff later if we need it?

@jnothman
Copy link
Copy Markdown
Member

jnothman commented Aug 26, 2019 via email

@sergeyf
Copy link
Copy Markdown
Contributor Author

sergeyf commented Aug 26, 2019

Right. How about skip_complete but still as boolean?

@sergeyf sergeyf changed the title [MRG] skip_non_missing_features flag for IterativeImputer [MRG] skip_complete flag for IterativeImputer Aug 26, 2019
Copy link
Copy Markdown
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM!

train = [[1], [2]]
imputer = IterativeImputer()
imputer.fit(train)
assert imputer.n_iter_ == 0
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test it with missing value too?

for a_, b_, loc_, scale_
in zip(a, b, mus, sigmas)]
# get posterior samples if there is at least one missing value
if np.sum(missing_row_mask) > 0:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes the function appear quite nested. Might be more readable if we handle the == 0 case with a return.

Copy link
Copy Markdown
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM!

@sergeyf
Copy link
Copy Markdown
Contributor Author

sergeyf commented Aug 27, 2019

Thanks - I've addressed the comments!

@sergeyf
Copy link
Copy Markdown
Contributor Author

sergeyf commented Sep 3, 2019

@jnothman Anything else here?

Copy link
Copy Markdown
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine with the feature. Just some cosmetic changes.
Could you add some entries in what's new. Maybe we should have 2 entries:

  • one for skip_complete;
  • one for the resolution with a single feature.

sergeyf and others added 7 commits September 9, 2019 08:31
Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@sergeyf
Copy link
Copy Markdown
Contributor Author

sergeyf commented Sep 9, 2019

@glemaitre Thanks, everything you suggested/asked for is done!

@glemaitre glemaitre merged commit 9f7d3f9 into scikit-learn:master Sep 9, 2019
@glemaitre
Copy link
Copy Markdown
Member

Thanks @sergeyf

@LuxMiranda
Copy link
Copy Markdown

My hero! @sergeyf

@sergeyf sergeyf deleted the iterativeimputer_dont_skip branch September 9, 2019 21:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

IterativeImputer behaviour on missing nan's in fit data

4 participants