ENH Add inverse_transform feature to SimpleImputer #17612
jnothman merged 17 commits into scikit-learn:master
Conversation
TomDLT
left a comment
This is looking good!
Can you add an entry in doc/whats_new/0.24.rst?
Co-authored-by: Tom Dupré la Tour <tom.dupre-la-tour@m4x.org>
Thanks Tom, updated whats_new.
sklearn/impute/tests/test_impute.py
Outdated
    assert idx == idx_order

def test_simple_imputation_inverse_transform():
Could you parametrize the test with different missing_values types?
Testing for two missing_values: [-1, np.nan].
    [6, 7, np.nan, -1],
    [8, 9, 0, np.nan]
])
X_2 = np.array([
I would try 2 more arrays: one where all columns have missing data, and one where every other column has missing data.
Thanks for spotting this!
I overlooked a major scenario where features get dropped for being completely empty in the transform step. Regenerating the original data is not very elegant when columns are dropped, since the result has a different shape. I think the changes suggested by PR #16695 would make things easier here.
The logic I wrote is currently iterative. It works only when the fitting data and the transform data are compatible, i.e., both have values missing in the same features. Let me know if there's a better way to handle this limitation.
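The reconstruction step being discussed can be sketched in plain NumPy. This is a hypothetical illustration, not the PR's actual code: `valid_mask` and `indicator_features` stand in for what the fitted imputer knows (roughly `self.statistics_` and `self.indicator_.features_`).

```python
import numpy as np

# Hypothetical sketch of the inverse step. After transform with
# add_indicator=True the output is
# [imputed non-empty columns | missing-indicator mask columns].
# To recover the original shape we re-insert the dropped (all-missing)
# columns and put the missing marker wherever the mask says a value
# was imputed.
def inverse_transform_sketch(X_trans, valid_mask, indicator_features,
                             missing_value=np.nan):
    n_kept = int(valid_mask.sum())             # columns kept by transform
    imputed = X_trans[:, :n_kept]              # imputed data part
    mask = X_trans[:, n_kept:].astype(bool)    # indicator part

    n_samples, n_original = X_trans.shape[0], valid_mask.size
    X_original = np.zeros((n_samples, n_original))
    X_original[:, valid_mask] = imputed        # put kept columns back
    full_mask = np.zeros((n_samples, n_original), dtype=bool)
    full_mask[:, indicator_features] = mask    # expand mask to full width
    X_original[full_mask] = missing_value      # restore missing entries
    return X_original
```

With vectorized masking like this, no iterative column bookkeeping is needed even when empty columns were dropped.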
What I meant was more:
X_3 = np.array([
[1, missing_value, 5, 9],
[missing_value, 4, missing_value, missing_value],
[2, missing_value, 7, missing_value],
[missing_value, 3, missing_value, 8]
])
X_4 = np.array([
[1, 1, 1, 3],
[missing_value, 2, missing_value, 1],
[2, 3, 3, 4],
[missing_value, 4, missing_value, 2]
])
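A round-trip check along the lines suggested above could look like this (assumes scikit-learn >= 0.24, where SimpleImputer.inverse_transform landed; here every column has at least one missing value and none is fully empty):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Fit-transform then inverse_transform should reproduce the input
# exactly: non-missing entries are untouched, and imputed entries are
# replaced back with the missing marker.
for missing_value in (-1, np.nan):
    X = np.array([[1., missing_value, 5., 9.],
                  [missing_value, 4., missing_value, missing_value],
                  [2., missing_value, 7., missing_value],
                  [missing_value, 3., missing_value, 8.]])
    imputer = SimpleImputer(missing_values=missing_value, strategy="mean",
                            add_indicator=True)
    X_round_trip = imputer.inverse_transform(imputer.fit_transform(X))
    np.testing.assert_array_equal(X_round_trip, X)  # nan-aware equality
```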
Thank you. I updated the test cases.
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Hi @glemaitre, do let me know if this can be implemented better.
sklearn/impute/_base.py
Outdated
Returns
-------
original_X : ndarray, shape (n_samples, n_features)
Suggested change:
- original_X : ndarray, shape (n_samples, n_features)
+ original_X : ndarray of shape (n_samples, n_features)
sklearn/impute/_base.py
Outdated
missing_feature_count = len(self.indicator_.features_)

# Split the augmented array into imputed array and its missing
sklearn/impute/_base.py
Outdated
# Split the augmented array into imputed array and its missing
# indicator mask.
feature_count = X.shape[1] - missing_feature_count
imputed_arr = X[:, :feature_count].copy()
Suggested change:
- imputed_arr = X[:, :feature_count].copy()
+ array_imputed = X[:, :feature_count].copy()
sklearn/impute/_base.py
Outdated
feature_count = X.shape[1] - missing_feature_count
imputed_arr = X[:, :feature_count].copy()
missing_mask = X[:, feature_count:].astype(np.bool)
orig_cols = len(self.statistics_)
Suggested change:
- orig_cols = len(self.statistics_)
+ n_features_original = len(self.statistics_)
What is the difference with what you called feature_count originally?
feature_count is the count of features that are not completely empty, and orig_cols is the original number of features (including empty features). Changed their names to make this clearer.
sklearn/impute/_base.py
Outdated
# Below, we iteratively regenerate the original array
# by keeping track of features eliminated in `transform` step
# for being completely empty.
imputed_ptr, orig_ptr = 0, 0
It is weird to use ptr because we don't actually have pointers. Do you mean col for column?
You can use idx if we deal with indices, though.
Yes, I actually meant column indices. Changed the names now.
sklearn/impute/tests/test_impute.py
Outdated
@pytest.mark.parametrize(
    "missing_value",
    [-1, np.nan]
)
Suggested change:
- @pytest.mark.parametrize(
-     "missing_value",
-     [-1, np.nan]
- )
+ @pytest.mark.parametrize("missing_value", [-1, np.nan])
imputer = SimpleImputer(missing_values=missing_value, strategy='mean',
                        add_indicator=True)

X_1_trans = imputer.fit_transform(X_1)
Iterate over X_1 ... X_4 in a for loop
The test case for X_2 relies on X_1, so kept them separate. Added a loop over X_3 and X_4.
I don't see the difference. You create an imputer, transform, inverse_transform, and check the original X against the twice-transformed array, no?
I agree with @glemaitre that leaving these two out of the loop looks inexplicable to the reader.
X_2 is different in the sense that we only perform transform on it, using the imputer that was fit on X_1. For X_3 and X_4, we do a fresh fit_transform. The purpose of X_2 is to test how the imputer performs on data that was not used for fitting.
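That unseen-data scenario could be sketched like this (assumes scikit-learn >= 0.24; the arrays here are illustrative, not the PR's actual fixtures, and both columns of X_1 contain missing values so the fitted indicator covers every column X_2 needs):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Fit on X_1, then transform / inverse_transform X_2, which the
# imputer never saw during fitting.
X_1 = np.array([[1., np.nan], [np.nan, 3.], [4., 5.]])
X_2 = np.array([[np.nan, 6.], [7., np.nan]])
imputer = SimpleImputer(missing_values=np.nan, strategy="mean",
                        add_indicator=True)
imputer.fit(X_1)
X_2_round_trip = imputer.inverse_transform(imputer.transform(X_2))
# np.nan is restored exactly where X_2 had missing entries
```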
sklearn/impute/tests/test_impute.py
Outdated
assert_array_equal(X_3_orig, X_3)
assert_array_equal(X_4_orig, X_4)

with pytest.raises(ValueError, match="add_indicator=True"):
You should put this in a separate test
Moved this into a separate test
I would match for Got 'add_indicator={self.add_indicator}' to be sure that we replace it with the expected value.
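The separate error-path test being discussed could look roughly like this (assumes scikit-learn >= 0.24; the error message is matched loosely on "add_indicator" here, though a stricter match as suggested above is preferable):

```python
import numpy as np
import pytest
from sklearn.impute import SimpleImputer

# Without add_indicator=True the imputer has no record of which entries
# were imputed, so inverse_transform must raise.
X = np.array([[1., np.nan], [2., 3.]])
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_trans = imputer.fit_transform(X)
with pytest.raises(ValueError, match="add_indicator"):
    imputer.inverse_transform(X_trans)
```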
@glemaitre Thank you for the thorough review! I incorporated the suggested changes.
sklearn/impute/_base.py
Outdated
X_original[:, self.indicator_.features_] = missing_mask
full_mask = X_original.astype(np.bool)

imputed_idx, orig_idx = 0, 0
sklearn/impute/tests/test_impute.py
Outdated
                        add_indicator=True)

X_1_trans = imputer.fit_transform(X_1)
X_1_orig = imputer.inverse_transform(X_1_trans)
orig is not the right name -> use inv_trans instead.
sklearn/impute/tests/test_impute.py
Outdated
for X in [X_3, X_4]:
    X_trans = imputer.fit_transform(X)
    X_orig = imputer.inverse_transform(X_trans)
glemaitre
left a comment
Only nitpicking. Almost good to go.
sklearn/impute/tests/test_impute.py
Outdated
imputer = SimpleImputer(missing_values=missing_value,
                        strategy="mean")
X_1_trans = imputer.fit_transform(X_1)
Move these lines outside of the raises block. It will be more explicit which line is raising the error.
Made the changes!
Co-authored-by: Joel Nothman <joel.nothman@gmail.com>
Thanks @jnothman! Committed your suggestion.
Thanks @d3b0unce!
Thank you @TomDLT, @glemaitre, and @jnothman! This was my first contribution to scikit-learn ever. Made a ton of rookie mistakes. Thank you for patiently correcting me. :)
@d3b0unce Congrats!!! You are ready for the next one :P
Reference Issues/PRs
Fixes #17590
What does this implement/fix? Explain your changes.
Implements the inverse_transform feature in SimpleImputer. The behavior of this method is such that it returns a new array with the imputed values replaced with the original missing_values.
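A minimal usage sketch of the feature (assumes scikit-learn >= 0.24; inverse_transform requires add_indicator=True, since the appended indicator mask is what tells the imputer which entries were originally missing):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1., 2.], [np.nan, 4.], [5., np.nan]])
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_trans = imputer.fit_transform(X)       # imputed values + indicator mask
X_back = imputer.inverse_transform(X_trans)
# X_back has np.nan restored wherever a value was imputed
```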
Adds an inverse_transform method to SimpleImputer.
Any other comments?
Can this feature be implemented for other imputers? Also, I would like to evaluate the performance once the approach is finalized.