[MRG] ENH add an option to drop full missing features in MissingIndicator#13491
[MRG] ENH add an option to drop full missing features in MissingIndicator#13491jeremiedbb wants to merge 11 commits intoscikit-learn:masterfrom
Conversation
sklearn/impute.py
Outdated
| other values will be marked as False. | ||
|
|
||
| features : str, optional | ||
| features : {"missing-only", "all", "not-constant"}, optional |
There was a problem hiding this comment.
If find not-constant terrible... Please help me find a better name :) !
There was a problem hiding this comment.
It's mostly for internal use, so don't worry! But "some-missing" might be better.
|
It also fixes a bug: when X is sparse the mask would contain explicit zeros (all non missing values become explicit zeros). Here's an example to illustrate this: from sklearn.impute import MissingIndicator
from scipy.sparse import csr_matrix
X = csr_matrix([[0, 1, 2],
[1, 2, 0],
[2, 0, 1]])
mi = MissingIndicator(features='all', missing_values=1)
print(mi.fit_transform(X))
(1, 0) True
(2, 0) False
(0, 1) True
(1, 1) False
(0, 2) False
(2, 2) TrueAll the '2' became explicit zeros. |
|
Not-constant => varying?
|
| "in fit.".format(features_diff_fit_trans)) | ||
|
|
||
| if (self.features_.size > 0 and | ||
| self.features_.size < self._n_features): |
There was a problem hiding this comment.
I removed the first condition. If we want only features with missing, and there's not any, then the mask should be empty. Before, it would return the mask of all features.
I'm not a fan either :/ Discussing irl with Joris, he finds 'not-constant' explicit enough. Maybe it's not too bad after all. I put the PR on MRG and maybe we'll reach a consensus with other reviewers :) |
jnothman
left a comment
There was a problem hiding this comment.
It seems a bit silly to maintain these three options. We could consider phasing out missing-only? Not important.
doc/whats_new/v0.21.rst
Outdated
| :user:`Jérémie du Boisberranger <jeremiedbb>`. | ||
|
|
||
| - |Fix| Fixed a bug in :class:`MissingIndicator` when ``X`` is sparse. All the | ||
| non-zero missing values used to become explicit False is the transformed data. |
sklearn/impute.py
Outdated
| other values will be marked as False. | ||
|
|
||
| features : str, optional | ||
| features : {"missing-only", "all", "not-constant"}, optional |
There was a problem hiding this comment.
It's mostly for internal use, so don't worry! But "some-missing" might be better.
sklearn/impute.py
Outdated
| if missing_values_mask.format == 'csc' | ||
| else np.unique(missing_values_mask.indices)) | ||
| if self.features in ('missing-only', 'not-constant'): | ||
| if imputer_mask.format == 'csc': |
There was a problem hiding this comment.
this can be achieved with imputer_mask.getnnz(axis=0)
There was a problem hiding this comment.
Why would you make it simple when you can make it complicated :D ?
sklearn/impute.py
Outdated
| return imputer_mask, features_with_missing | ||
| if self.features == 'all': | ||
| features_indices = np.arange(X.shape[1]) | ||
| else: |
There was a problem hiding this comment.
this would be clearer as:
elif self.features == 'missing-only':
features_indices = np.flatnonzero(n_missing)
else:
features_indices = np.flatnonzero(np.logical_and(n_missing < X.shape[0], n_missing > 0))
sklearn/tests/test_impute.py
Outdated
| (np.array([[-1, 1], [1, 2]]), np.array([[-1, 1], [1, 2]]), | ||
| {'features': 'random', 'sparse': 'auto'}, | ||
| "'features' has to be either 'missing-only' or 'all'"), | ||
| "'features' has to be either 'missing-only', 'all' or 'not-constant'"), |
sklearn/tests/test_impute.py
Outdated
| Xt = mi.fit_transform(X) | ||
|
|
||
| nnz = Xt.getnnz() | ||
|
|
There was a problem hiding this comment.
You could just add an assertion elsewhere that Xt.nnz == Xt.sum(). You shouldn't need a new test, nor a specified expected value.
|
I should add: otherwise LGTM |
Aaah that's better indeed ! |
qinhanmin2014
left a comment
There was a problem hiding this comment.
How will you define error_on_new if you add the new option?
We can still implement #12583 without this PR? (e.g., use features='all' and pick the columns we need)
I doubt whether it's important to consider features containing only missing values.
sklearn/impute.py
Outdated
| if self.features in ("missing-only", "some-missing"): | ||
| features_diff_fit_trans = np.setdiff1d(features, self.features_) | ||
| if (self.error_on_new and features_diff_fit_trans.size > 0): | ||
| raise ValueError("The features {} have missing values " |
There was a problem hiding this comment.
we need to update the error message?
sklearn/tests/test_impute.py
Outdated
|
|
||
|
|
||
| def test_missing_indicator_sparse_no_explicit_zeros(): | ||
| # Check that non missing values don't become explicit zeeros in the mask |
I wouldn't modify it: raise an error if a feature has missing values at transform but not at fit. Not keeping the full missing at fit does not interfere with that. I wouldn't raise an error if a feature has full missing at fit and only some missing at transform, to keep the same behavior as the SimpleImputer (although for this one full missing can't be fitted at all).
Yes we can do that. The idea was to have imputers share similar behaviors, and this behavior is justified by the fact that constant features bring absolutely no information to learn. |
But now, when
You're right that constant features are not informative, but we have things like |
As long as you can define |
right. I did not catch that. |
|
Finally I updated
not raising in the second case would have added a lot more complexity to the code (keeping track of which feature has been dropped for which reason). Let me know what you think of this behavior. |
|
Hmm, I think the new definition is difficult to understand (and the name |
|
I think the idea in the long term is to only keep one option with the behavior of |
Let's see what @jnothman thinks. |
|
I'd be happy with that, but we should only raise a deprecation warning when
there would be a difference in output...
|
So you think current definition of |
|
I think that change is unhelpful and possibly destructive, thanks for
asking. I think features newly with present values are still not of any
interest to the user.
|
|
So there's -2 on current definition of error_on_new. |
|
You're probably right. I'm quite ambivalent about the all-missing behaviour
as long as it is tested.
|
|
Ok then. I won't close it however because it also fixes a bug. I'll rename it instead. |
Apart from the bug, you can also add some notes about the all-missing behaviour. |
well there's no special behavior in that case. I don't think it's worth adding a note on a rare edge case which we don't treat differently. Actually I'm closing this one and opening a new one to ease the review because the thread will be unrelated. |
That's fine. Apologies again for the wrong comments above. |
|
opened #13562 for the fixes only |
This PR should help #12583 to move forward.
Following discussions in #12583, adding a option to SimpleImputer to stack a MissingIndicator which is consistent with the behavior of the SimpleImputer without breaking backward compatibility requires adding an option to the MissingIndicator to drop the columns full of missing values.
To do that I added another possible value for the
featuresparameter.I wonder if we want to keep the 3 options, or raise a future warning saying that 'only-missing' will take the new option behavior in 2 releases and deprecate the new option. After all keeping features with full missing values does not really make sense. What' your opinion about that ?