Skip to content

ENH Add inverse_transform feature to SimpleImputer#17612

Merged
jnothman merged 17 commits intoscikit-learn:masterfrom
postmalloc:impute_inverse_transform
Jun 25, 2020
Merged

ENH Add inverse_transform feature to SimpleImputer#17612
jnothman merged 17 commits intoscikit-learn:masterfrom
postmalloc:impute_inverse_transform

Conversation

@postmalloc
Copy link
Copy Markdown
Contributor

@postmalloc postmalloc commented Jun 16, 2020

Reference Issues/PRs

Fixes #17590

What does this implement/fix? Explain your changes.

Implements inverse_transform feature in SimpleImputer. The behavior of this method
is such that it returns a new array with the imputed values replaced with the original
missing_values.

  • Add inverse_transform method to SimpleImputer
  • Add test cases for inverse_transform

Any other comments?

Can this feature be implemented for other imputers? Also, I would like to evaluate the performance once the approach is finalized.

@postmalloc postmalloc marked this pull request as draft June 16, 2020 15:05
@postmalloc postmalloc marked this pull request as ready for review June 16, 2020 17:31
@postmalloc postmalloc changed the title [WIP] Add inverse_transform feature to SimpleImputer [MRG] Add inverse_transform feature to SimpleImputer Jun 17, 2020
Copy link
Copy Markdown
Member

@TomDLT TomDLT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good !
Can you add an entry in doc/whats_new/0.24.rst ?

postmalloc and others added 3 commits June 18, 2020 04:32
Co-authored-by: Tom Dupré la Tour <tom.dupre-la-tour@m4x.org>
Co-authored-by: Tom Dupré la Tour <tom.dupre-la-tour@m4x.org>
@postmalloc
Copy link
Copy Markdown
Contributor Author

This is looking good !
Can you add an entry in doc/whats_new/0.24.rst ?

Thanks Tom, updated whats_new.

Copy link
Copy Markdown
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good start

assert idx == idx_order


def test_simple_imputation_inverse_transform():
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you parametrize the test with different missing_values type.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing for two missing_values, [-1, np.nan]

[6, 7, np.nan, -1],
[8, 9, 0, np.nan]
])
X_2 = np.array([
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would try 2 more arrays: one where all columns have missing data and one where there is missing data one column every two

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for spotting this!
I overlooked a major scenario where features get dropped for being completely empty in the transform step. The process of regenerating original data is not very elegant when columns are dropped resulting in a different shape. I think changes suggested by PR #16695 would make things easier here.
The logic I wrote right now is iterative. It works only when both the fitting data and transform data are compatible i.e., both have values missing in same features. Let me know if there's a better way to handle this limitation.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what I mean was more

    X_3 = np.array([
        [1, missing_value, 5, 9],
        [missing_value, 4, missing_value, missing_value],
        [2, missing_value, 7, missing_value],
        [missing_value, 3, missing_value, 8]
    ])

    X_4 = np.array([
        [1, 1, 1, 3],
        [missing_value, 2, missing_value, 1],
        [2, 3, 3, 4],
        [missing_value, 4, missing_value, 2]
    ])

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. I update the test cases

postmalloc and others added 4 commits June 18, 2020 10:15
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@postmalloc
Copy link
Copy Markdown
Contributor Author

Hi @glemaitre, do let me know if this can be implemented better

Copy link
Copy Markdown
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of changes more.


Returns
-------
original_X : ndarray, shape (n_samples, n_features)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
original_X : ndarray, shape (n_samples, n_features)
original_X : ndarray of shape (n_samples, n_features)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would call it X_original


missing_feature_count = len(self.indicator_.features_)

# Split the augmented array into imputed array and its missing
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need for the comment

# Split the augmented array into imputed array and its missing
# indicator mask.
feature_count = X.shape[1] - missing_feature_count
imputed_arr = X[:, :feature_count].copy()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
imputed_arr = X[:, :feature_count].copy()
array_imputed = X[:, :feature_count].copy()

feature_count = X.shape[1] - missing_feature_count
imputed_arr = X[:, :feature_count].copy()
missing_mask = X[:, feature_count:].astype(np.bool)
orig_cols = len(self.statistics_)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
orig_cols = len(self.statistics_)
n_features_original = len(self.statistics_)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference with what you called feature_count originally?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feature_count is the count of features that not completely empty. And orig_cols is the original number of features (including empty features). Changed their names to make it clearer.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok I see

# Below, we iteratively regenerate the original array
# by keeping track of features eliminated in `transform` step
# for being completely empty.
imputed_ptr, orig_ptr = 0, 0
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is weird to use ptr because we don't have pointers actually. Do you mean col for column.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use idx if we deal with indices thought.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I actually mean column indices. Changed the names now.

Comment on lines +1388 to +1391
@pytest.mark.parametrize(
"missing_value",
[-1, np.nan]
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@pytest.mark.parametrize(
"missing_value",
[-1, np.nan]
)
@pytest.mark.parametrize("missing_value", [-1, np.nan])

imputer = SimpleImputer(missing_values=missing_value, strategy='mean',
add_indicator=True)

X_1_trans = imputer.fit_transform(X_1)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iterate over X_1 ... X_4 in a for loop

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test case for X_2 relies on X_1, so kept them separate. Added a loop over X_3 and X_4.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the difference. You create an imputer, transform, inverse_transform, and check the original X with the twice transformed array no?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @glemaitre that leaving these two out of the loop looks inexplicable to the reader.

Copy link
Copy Markdown
Contributor Author

@postmalloc postmalloc Jun 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

X_2 is different in the sense that we only perform transform on it using the imputer that is fit on X_1. For X_3 and X_4, we do a fresh fit_transform. The purpose of X_2 was to test how the imputer would perform on data that is not used for fitting.

assert_array_equal(X_3_orig, X_3)
assert_array_equal(X_4_orig, X_4)

with pytest.raises(ValueError, match="add_indicator=True"):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should put this in a separate test

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this into a separate test

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would match for Got 'add_indicator={self.add_indicator}' to be sure that we replace by the expected value

[6, 7, np.nan, -1],
[8, 9, 0, np.nan]
])
X_2 = np.array([
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what I mean was more

    X_3 = np.array([
        [1, missing_value, 5, 9],
        [missing_value, 4, missing_value, missing_value],
        [2, missing_value, 7, missing_value],
        [missing_value, 3, missing_value, 8]
    ])

    X_4 = np.array([
        [1, 1, 1, 3],
        [missing_value, 2, missing_value, 1],
        [2, 3, 3, 4],
        [missing_value, 4, missing_value, 2]
    ])

@postmalloc
Copy link
Copy Markdown
Contributor Author

@glemaitre Thank you for the thorough review! I incorporated the suggested changes.

X_original[:, self.indicator_.features_] = missing_mask
full_mask = X_original.astype(np.bool)

imputed_idx, orig_idx = 0, 0
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use original_idx

add_indicator=True)

X_1_trans = imputer.fit_transform(X_1)
X_1_orig = imputer.inverse_transform(X_1_trans)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

orig is not the right name -> inv_trans instead


for X in [X_3, X_4]:
X_trans = imputer.fit_transform(X)
X_orig = imputer.inverse_transform(X_trans)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inv_trans

Copy link
Copy Markdown
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only nitpicking. Almost good to go.

Comment on lines +1446 to +1448
imputer = SimpleImputer(missing_values=missing_value,
strategy="mean")
X_1_trans = imputer.fit_transform(X_1)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move these line outside from the raises. It will be more explicit to know which line is raising the error

assert_array_equal(X_3_orig, X_3)
assert_array_equal(X_4_orig, X_4)

with pytest.raises(ValueError, match="add_indicator=True"):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would match for Got 'add_indicator={self.add_indicator}' to be sure that we replace by the expected value

@postmalloc
Copy link
Copy Markdown
Contributor Author

Made the changes!

Copy link
Copy Markdown
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise this LGTM!

imputer = SimpleImputer(missing_values=missing_value, strategy='mean',
add_indicator=True)

X_1_trans = imputer.fit_transform(X_1)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @glemaitre that leaving these two out of the loop looks inexplicable to the reader.

Co-authored-by: Joel Nothman <joel.nothman@gmail.com>
@postmalloc
Copy link
Copy Markdown
Contributor Author

Thanks @jnothman! Committed your suggestion.

@jnothman jnothman changed the title [MRG] Add inverse_transform feature to SimpleImputer ENH Add inverse_transform feature to SimpleImputer Jun 25, 2020
@jnothman jnothman merged commit 0a3ab41 into scikit-learn:master Jun 25, 2020
@jnothman
Copy link
Copy Markdown
Member

Thanks @d3b0unce!

@postmalloc
Copy link
Copy Markdown
Contributor Author

Thank you @TomDLT, @glemaitre, and @jnothman! This was my first contribution to scikit-learn ever. Made a ton of rookie mistakes. Thank you for patiently correcting me. :)

@glemaitre
Copy link
Copy Markdown
Member

@d3b0unce Congrats!!! You are ready for the next one :P

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

MissingIndicator.inverse_transform and SimpleImputer.inverse_transform needed

4 participants