ENH Add inverse_transform feature to SimpleImputer by postmalloc · Pull Request #17612 · scikit-learn/scikit-learn

postmalloc · 2020-06-16T14:20:53Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Implements an inverse_transform method for SimpleImputer. The method returns
a new array in which the imputed values are replaced with the original
missing_values placeholder.

Add inverse_transform method to SimpleImputer
Add test cases for inverse_transform
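
For readers who want to see the round trip described above, here is a minimal sketch. It assumes scikit-learn 0.24 or later (where this method is available) and uses made-up toy data; the variable names are illustrative and not taken from the PR.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (made up for illustration): NaN marks the missing entries.
X = np.array([
    [1.0, np.nan, 3.0],
    [4.0, 5.0, np.nan],
    [7.0, 8.0, 9.0],
])

# add_indicator=True appends one mask column per feature that had missing
# values during fit; inverse_transform relies on those mask columns.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean",
                        add_indicator=True)

X_trans = imputer.fit_transform(X)          # imputed values + mask columns
X_inv = imputer.inverse_transform(X_trans)  # imputed entries set back to NaN

print(X_trans.shape)
print(X_inv)
```

The transformed array is wider than the input because of the appended mask columns, while the inverse has the input's shape again.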

Any other comments?

Can this feature be implemented for other imputers? Also, I would like to evaluate the performance once the approach is finalized.

sklearn/impute/tests/test_impute.py

sklearn/impute/_base.py

TomDLT

This is looking good!
Can you add an entry in doc/whats_new/v0.24.rst?

sklearn/impute/_base.py

Co-authored-by: Tom Dupré la Tour <tom.dupre-la-tour@m4x.org>

postmalloc · 2020-06-18T05:06:12Z

> This is looking good!
> Can you add an entry in doc/whats_new/v0.24.rst?

Thanks Tom, updated whats_new.

glemaitre

This is a good start

doc/whats_new/v0.24.rst

sklearn/impute/_base.py

glemaitre · 2020-06-18T09:57:35Z

sklearn/impute/tests/test_impute.py

        assert idx == idx_order
+
+
+def test_simple_imputation_inverse_transform():


Could you parametrize the test with different missing_values types?

Now testing two missing_values: -1 and np.nan.

glemaitre · 2020-06-18T09:59:04Z

sklearn/impute/tests/test_impute.py

+        [6, 7, np.nan, -1],
+        [8, 9, 0, np.nan]
+    ])
+    X_2 = np.array([


I would try 2 more arrays: one where all columns have missing data, and one where every other column has missing data.

Thanks for spotting this!
I overlooked a major scenario where features get dropped in the transform step for being completely empty. Regenerating the original data is not very elegant when columns are dropped, because the result has a different shape. I think the changes suggested by PR #16695 would make things easier here.
The logic I wrote right now is iterative. It works only when the fitting data and the transform data are compatible, i.e., both have values missing in the same features. Let me know if there's a better way to handle this limitation.
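
To make the dropped-column problem concrete, here is a NumPy-only sketch of the reconstruction idea. It is not the PR's implementation (which is iterative and lives in SimpleImputer); the function `restore_missing` and its parameters `valid_cols` and `indicator_cols` are hypothetical names introduced only for this illustration.

```python
import numpy as np


def restore_missing(X_aug, valid_cols, indicator_cols, n_features,
                    missing_value=np.nan):
    """Rebuild an (n_samples, n_features) array from an augmented array.

    X_aug holds the imputed columns for `valid_cols` (features that were
    not completely empty), followed by one 0/1 mask column per entry of
    `indicator_cols`. All names here are hypothetical.
    """
    n_valid = len(valid_cols)
    imputed = X_aug[:, :n_valid]
    mask = X_aug[:, n_valid:].astype(bool)

    # Start from an all-missing array so that dropped (completely empty)
    # features come back as missing without any special handling.
    X_original = np.full((X_aug.shape[0], n_features), missing_value,
                         dtype=float)
    X_original[:, valid_cols] = imputed

    # Spread the mask columns back to their original positions, then
    # overwrite the imputed entries with the missing-value marker.
    full_mask = np.zeros(X_original.shape, dtype=bool)
    full_mask[:, indicator_cols] = mask
    X_original[full_mask] = missing_value
    return X_original


# Column 2 was completely empty and is dropped; column 1 had one gap that
# was imputed with 6.0 (the mean of 5.0 and 7.0).
X_aug = np.array([
    [1.0, 6.0, 4.0, 1.0, 1.0],
    [2.0, 5.0, 6.0, 0.0, 1.0],
    [3.0, 7.0, 8.0, 0.0, 1.0],
])
X_restored = restore_missing(X_aug, valid_cols=[0, 1, 3],
                             indicator_cols=[1, 2], n_features=4)
print(X_restored)
```

Like the limitation noted above, this only works when the mask columns cover every feature that actually had missing values in the transformed data.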

What I meant was more like:

X_3 = np.array([
    [1, missing_value, 5, 9],
    [missing_value, 4, missing_value, missing_value],
    [2, missing_value, 7, missing_value],
    [missing_value, 3, missing_value, 8]
])
X_4 = np.array([
    [1, 1, 1, 3],
    [missing_value, 2, missing_value, 1],
    [2, 3, 3, 4],
    [missing_value, 4, missing_value, 2]
])

Thank you. I updated the test cases.

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

postmalloc · 2020-06-23T05:24:53Z

Hi @glemaitre, do let me know if this can be implemented better

glemaitre

A couple more changes.

glemaitre · 2020-06-23T13:03:57Z

sklearn/impute/_base.py

+
+        Returns
+        -------
+        original_X : ndarray, shape (n_samples, n_features)


Suggested change

-        original_X : ndarray, shape (n_samples, n_features)
+        original_X : ndarray of shape (n_samples, n_features)

I would call it X_original

sklearn/impute/_base.py

glemaitre · 2020-06-23T13:04:52Z

sklearn/impute/_base.py

+
+        missing_feature_count = len(self.indicator_.features_)
+
+        # Split the augmented array into imputed array and its missing


no need for the comment

glemaitre · 2020-06-23T13:05:09Z

sklearn/impute/_base.py

+        # Split the augmented array into imputed array and its missing
+        # indicator mask.
+        feature_count = X.shape[1] - missing_feature_count
+        imputed_arr = X[:, :feature_count].copy()


Suggested change

-        imputed_arr = X[:, :feature_count].copy()
+        array_imputed = X[:, :feature_count].copy()

glemaitre · 2020-06-23T13:05:25Z

sklearn/impute/_base.py

+        feature_count = X.shape[1] - missing_feature_count
+        imputed_arr = X[:, :feature_count].copy()
+        missing_mask = X[:, feature_count:].astype(np.bool)
+        orig_cols = len(self.statistics_)


Suggested change

-        orig_cols = len(self.statistics_)
+        n_features_original = len(self.statistics_)

What is the difference with what you called feature_count originally?

feature_count is the number of features that are not completely empty, while orig_cols is the original number of features (including empty features). I changed their names to make this clearer.
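
The difference between the two counts can be checked on a fitted imputer. The sketch below assumes scikit-learn 0.24 or later with default settings, where a completely empty feature is dropped by transform; the toy data is made up, and the variable names follow the ones discussed in this thread.

```python
import warnings

import numpy as np
from sklearn.impute import SimpleImputer

# Column 2 is completely empty; column 1 has a single missing value.
X = np.array([
    [1.0, np.nan, np.nan, 4.0],
    [2.0, 5.0, np.nan, 6.0],
    [3.0, 7.0, np.nan, 8.0],
])

imputer = SimpleImputer(strategy="mean", add_indicator=True)
with warnings.catch_warnings():
    # Some versions warn about features without any observed values.
    warnings.simplefilter("ignore")
    X_trans = imputer.fit_transform(X)

# One statistic is stored per input feature, including the empty one.
n_features_original = len(imputer.statistics_)
# One mask column is appended per feature with missing values at fit time.
missing_feature_count = len(imputer.indicator_.features_)
# What remains after removing the mask columns are the non-empty features.
feature_count = X_trans.shape[1] - missing_feature_count

print(n_features_original, missing_feature_count, feature_count)
```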

glemaitre · 2020-06-23T13:12:08Z

sklearn/impute/_base.py

+        # Below, we iteratively regenerate the original array
+        # by keeping track of features eliminated in `transform` step
+        # for being completely empty.
+        imputed_ptr, orig_ptr = 0, 0


It is odd to use ptr because these are not actually pointers. Do you mean col, for column?

You can use idx if we are dealing with indices, though.

Yes, I actually mean column indices. Changed the names now.

glemaitre · 2020-06-23T13:13:14Z

sklearn/impute/tests/test_impute.py

+@pytest.mark.parametrize(
+    "missing_value",
+    [-1, np.nan]
+)


Suggested change

-@pytest.mark.parametrize(
-    "missing_value",
-    [-1, np.nan]
-)
+@pytest.mark.parametrize("missing_value", [-1, np.nan])

glemaitre · 2020-06-23T13:14:11Z

sklearn/impute/tests/test_impute.py

+    imputer = SimpleImputer(missing_values=missing_value, strategy='mean',
+                            add_indicator=True)
+
+    X_1_trans = imputer.fit_transform(X_1)


Iterate over X_1 ... X_4 in a for loop

The test case for X_2 relies on X_1, so kept them separate. Added a loop over X_3 and X_4.

I don't see the difference. You create an imputer, transform, inverse_transform, and compare the original X with the twice-transformed array, no?

I agree with @glemaitre that leaving these two out of the loop looks inexplicable to the reader.

X_2 is different in the sense that we only perform transform on it using the imputer that is fit on X_1. For X_3 and X_4, we do a fresh fit_transform. The purpose of X_2 was to test how the imputer would perform on data that is not used for fitting.
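
The distinction being drawn for X_2 can be sketched as follows: the imputer is fit on one array and then only transform and inverse_transform are applied to another. This assumes scikit-learn 0.24 or later; the arrays are made up for illustration and are not the ones in the PR's test.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Fitting data: features 1 and 2 contain missing values.
X_1 = np.array([
    [1.0, 2.0, np.nan],
    [4.0, np.nan, 6.0],
    [7.0, 8.0, 9.0],
])
# Unseen data with missing values in the same features as X_1.
X_2 = np.array([
    [10.0, np.nan, 12.0],
    [13.0, 14.0, np.nan],
])

imputer = SimpleImputer(strategy="mean", add_indicator=True).fit(X_1)

# Only transform here: the fill values come from X_1, not from X_2.
X_2_trans = imputer.transform(X_2)
X_2_inv_trans = imputer.inverse_transform(X_2_trans)

print(X_2_trans)
print(X_2_inv_trans)
```

As noted earlier in the thread, the round trip is only expected to hold when the unseen data has values missing in the same features as the fitting data.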

glemaitre · 2020-06-23T13:14:45Z

sklearn/impute/tests/test_impute.py

+    assert_array_equal(X_3_orig, X_3)
+    assert_array_equal(X_4_orig, X_4)
+
+    with pytest.raises(ValueError, match="add_indicator=True"):


You should put this in a separate test

Moved this into a separate test

I would match on "Got 'add_indicator={self.add_indicator}'" to make sure the expected value is substituted into the message.

glemaitre · 2020-06-23T13:16:50Z

sklearn/impute/tests/test_impute.py

+        [6, 7, np.nan, -1],
+        [8, 9, 0, np.nan]
+    ])
+    X_2 = np.array([


What I meant was more like:

X_3 = np.array([
    [1, missing_value, 5, 9],
    [missing_value, 4, missing_value, missing_value],
    [2, missing_value, 7, missing_value],
    [missing_value, 3, missing_value, 8]
])
X_4 = np.array([
    [1, 1, 1, 3],
    [missing_value, 2, missing_value, 1],
    [2, 3, 3, 4],
    [missing_value, 4, missing_value, 2]
])

postmalloc · 2020-06-25T07:07:55Z

@glemaitre Thank you for the thorough review! I incorporated the suggested changes.

glemaitre · 2020-06-25T07:26:12Z

sklearn/impute/_base.py

+        X_original[:, self.indicator_.features_] = missing_mask
+        full_mask = X_original.astype(np.bool)
+
+        imputed_idx, orig_idx = 0, 0


use original_idx

glemaitre · 2020-06-25T07:28:15Z

sklearn/impute/tests/test_impute.py

+                            add_indicator=True)
+
+    X_1_trans = imputer.fit_transform(X_1)
+    X_1_orig = imputer.inverse_transform(X_1_trans)


orig is not the right name -> inv_trans instead

glemaitre · 2020-06-25T07:28:39Z

sklearn/impute/tests/test_impute.py

+
+    for X in [X_3, X_4]:
+        X_trans = imputer.fit_transform(X)
+        X_orig = imputer.inverse_transform(X_trans)


inv_trans

glemaitre

Only nitpicking. Almost good to go.

glemaitre · 2020-06-25T07:30:32Z

sklearn/impute/tests/test_impute.py

+        imputer = SimpleImputer(missing_values=missing_value,
+                                strategy="mean")
+        X_1_trans = imputer.fit_transform(X_1)


Move these lines outside the raises block. That makes it explicit which line is raising the error.
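
A sketch of the layout being asked for, with only the call that is expected to fail inside the context manager. It assumes pytest and scikit-learn 0.24 or later; the test name and data are illustrative, and the match pattern is deliberately loose rather than the exact message discussed in this thread.

```python
import numpy as np
import pytest
from sklearn.impute import SimpleImputer


def test_inverse_transform_requires_indicator():
    X = np.array([[1.0, np.nan], [3.0, 4.0]])

    # Setup stays outside the raises block: if fitting or transforming
    # failed, the test should error out instead of passing by accident.
    imputer = SimpleImputer(strategy="mean")  # add_indicator defaults to False
    X_trans = imputer.fit_transform(X)

    # Only the call under test sits inside the context manager.
    with pytest.raises(ValueError, match="add_indicator"):
        imputer.inverse_transform(X_trans)
```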

glemaitre · 2020-06-25T07:31:44Z

sklearn/impute/tests/test_impute.py

+    assert_array_equal(X_3_orig, X_3)
+    assert_array_equal(X_4_orig, X_4)
+
+    with pytest.raises(ValueError, match="add_indicator=True"):


I would match on "Got 'add_indicator={self.add_indicator}'" to make sure the expected value is substituted into the message.

postmalloc · 2020-06-25T08:48:28Z

Made the changes!

jnothman

Otherwise this LGTM!

jnothman · 2020-06-25T12:20:16Z

sklearn/impute/tests/test_impute.py

+    imputer = SimpleImputer(missing_values=missing_value, strategy='mean',
+                            add_indicator=True)
+
+    X_1_trans = imputer.fit_transform(X_1)


I agree with @glemaitre that leaving these two out of the loop looks inexplicable to the reader.

sklearn/impute/tests/test_impute.py

Co-authored-by: Joel Nothman <joel.nothman@gmail.com>

postmalloc · 2020-06-25T14:09:27Z

Thanks @jnothman! Committed your suggestion.

jnothman · 2020-06-25T14:14:03Z

Thanks @d3b0unce!

postmalloc · 2020-06-25T14:17:47Z

Thank you @TomDLT, @glemaitre, and @jnothman! This was my first contribution to scikit-learn ever. Made a ton of rookie mistakes. Thank you for patiently correcting me. :)

glemaitre · 2020-06-25T14:19:36Z

@d3b0unce Congrats!!! You are ready for the next one :P

Added inverse_transform method to SimpleImputer

4c78402

github-actions bot added the module:impute label Jun 16, 2020

postmalloc added 2 commits June 16, 2020 14:35

Fix linting errors

6f79624

Put back tilde

7480de0

postmalloc marked this pull request as draft June 16, 2020 15:05

Unit test for impute inverse_transform

c13af81

postmalloc marked this pull request as ready for review June 16, 2020 17:31

TomDLT reviewed Jun 16, 2020

postmalloc added 3 commits June 17, 2020 08:18

Add extra test case; remove for loop

2708ffc

Update test cases; Add limitation note in docstring

29516c0

Renamed variables

79623d4

postmalloc changed the title ~~[WIP] Add inverse_transform feature to SimpleImputer~~ [MRG] Add inverse_transform feature to SimpleImputer Jun 17, 2020

TomDLT approved these changes Jun 17, 2020

postmalloc and others added 3 commits June 18, 2020 04:32

Change docstring wording for clarity

8f5cb74

Co-authored-by: Tom Dupré la Tour <tom.dupre-la-tour@m4x.org>

Update docstring in impute_transform

61b8ce4

Co-authored-by: Tom Dupré la Tour <tom.dupre-la-tour@m4x.org>

Updated whats_new; fixed linting

d927ee2

glemaitre reviewed Jun 18, 2020

postmalloc and others added 4 commits June 18, 2020 10:15

Update sklearn/impute/_base.py - fit_transform docstring

2f35d13

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

Update sklearn/impute/_base.py

b2fc6ca

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

Update sklearn/impute/_base.py

5f876d7

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

Changes to take care of missing features; More test cases; Updated wh…

dbbd4d1

…ats_new

glemaitre reviewed Jun 23, 2020

Refactor variable names and test cases

3ea9f0e

glemaitre reviewed Jun 25, 2020

Change variable names; refactor exception test case

06be9fb

jnothman approved these changes Jun 25, 2020

jnothman reviewed Jun 25, 2020

Add a note in test case for clarity

12c8da3

Co-authored-by: Joel Nothman <joel.nothman@gmail.com>

jnothman changed the title ~~[MRG] Add inverse_transform feature to SimpleImputer~~ ENH Add inverse_transform feature to SimpleImputer Jun 25, 2020

jnothman merged commit 0a3ab41 into scikit-learn:master Jun 25, 2020

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Jul 17, 2020

ENH Add inverse_transform feature to SimpleImputer (scikit-learn#17612)

d1596f0

txntxn mentioned this pull request Aug 21, 2020

MissingIndicator.inverse_transform and SimpleImputer.inverse_transform needed #17590

Closed

jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020

ENH Add inverse_transform feature to SimpleImputer (scikit-learn#17612)

74fd7bf
