[MRG] BUG: add support for non numeric values in MissingIndicator#13046
[MRG] BUG: add support for non numeric values in MissingIndicator#13046qinhanmin2014 merged 24 commits intoscikit-learn:masterfrom
Conversation
sklearn/utils/estimator_checks.py
Outdated
| msg = "argument must be a string or a number" | ||
| assert_raises_regex(TypeError, msg, estimator.fit, X, y) | ||
| else: | ||
| # If the estimator support strings passed as object dtype, |
There was a problem hiding this comment.
Actually X[0, 0] = {'foo': 'bar'} is not a string entry, but a dict entry. So I think the test was correct. (Otherwise it would have been needed to modify it for the support of strings in the SimpleImputer).
There was a problem hiding this comment.
The error is also not raised for SimpleImputer with the strategy supporting mean. The difference is that the common tests are using default parameter which is the mean strategy for the SimpleImputer.
In [9]: imputer = SimpleImputer(strategy='most_frequent')
In [11]: X = np.array([[{'foo': 'bar'}, 'b', 'c']], dtype=object)
In [12]: imputer.fit(X)
Out[12]:
SimpleImputer(copy=True, fill_value=None, missing_values=nan,
strategy='most_frequent', verbose=0)I have to check what is going wrong in check array which is the one that should raise this error.
There was a problem hiding this comment.
The error is raised by np.asarray only when we make the conversion. So it means that we neeed to raise such error when dtype=object.
There was a problem hiding this comment.
I think check_array needs an update to properly deal with non-numeric input
There was a problem hiding this comment.
I think
check_arrayneeds an update to properly deal with non-numeric input
See also #12148
jnothman
left a comment
There was a problem hiding this comment.
Should we be testing the error produced when string dtype is passed? Otherwise looks good.
sklearn/impute.py
Outdated
| force_all_finite=force_all_finite) | ||
| _check_inputs_dtype(X, self.missing_values) | ||
| if X.dtype.kind not in ("i", "u", "f", "O"): | ||
| raise ValueError("Missing indicator does not support data with " |
There was a problem hiding this comment.
Missing indicator -> MissingIndicator
| ALLOW_NAN = ['Imputer', 'SimpleImputer', 'MissingIndicator', | ||
| 'MaxAbsScaler', 'MinMaxScaler', 'RobustScaler', 'StandardScaler', | ||
| 'PowerTransformer', 'QuantileTransformer'] | ||
| SUPPORT_STRING = ['SimpleImputer', 'MissingIndicator'] |
There was a problem hiding this comment.
surely OneHotEncoder, OrdinalEncoder and meta-estimators will belong here too.
There was a problem hiding this comment.
Can we do that in another PR. I think that we should also factorize the input validation as well. It is quite redundant and we could have a common test then.
There was a problem hiding this comment.
Yes, it probably belongs in estimator tags... (perhaps post-#8022)
It is covered there: |
jnothman
left a comment
There was a problem hiding this comment.
Fine. If you use "MissingIndicator" instead of "missing indicator" this is good to merge by me.
sklearn/tests/test_impute.py
Outdated
| "'sparse' has to be a boolean or 'auto'"), | ||
| (np.array([['a', 'b'], ['c', 'a']], dtype=str), | ||
| np.array([['a', 'b'], ['c', 'a']], dtype=str), | ||
| {}, "Missing indicator does not support data with dtype")] |
jorisvandenbossche
left a comment
There was a problem hiding this comment.
Minor comments, looks good to me
| err_msg=err_msg.format(True)) | ||
| assert_ae(X_trans, X_true, err_msg=err_msg.format(True)) | ||
|
|
||
|
|
There was a problem hiding this comment.
pep8, does not need to be removed
Co-Authored-By: glemaitre <g.lemaitre58@gmail.com>
|
Just need to merge master for the pep8 failure |
|
and I mess up my merging (basically I rebase again) but this should be fine with the final squashing. |
qinhanmin2014
left a comment
There was a problem hiding this comment.
We need to refactor some input validations and some common tests, but I agree that we can do these things in another PR.
…cikit-learn#13046)" This reverts commit 29c51c4.
…cikit-learn#13046)" This reverts commit 29c51c4.
Reference Issues/PRs
closes #13035
What does this implement/fix? Explain your changes.
Add support to non-numeric values in MissingIndicator. I flag this PR as BUG since SimpleImputer is already supporting non-numeric data and that we cannot use the MissingIndicator with the SimpleImputer in 0.20. I think that it should go in 0.20.3
I added 3 tests: