[MRG+1] MAINT dissociate nan and inf in check_array#10459
[MRG+1] MAINT dissociate nan and inf in check_array#10459TomDLT merged 12 commits intoscikit-learn:masterfrom
Conversation
|
@ashimb9 do you want to take from here? |
|
@glemaitre Hmm I would not mind working on it but I also think it might be less efficient for me to jump into it midway through. So, please free to complete it if you wish but if you prefer that I work on it, then that would be fine by me too. |
|
@jnothman I think that it is ready for a first review. I am not sure if we should support |
|
@ashimb9 You can have a look as well. |
jnothman
left a comment
There was a problem hiding this comment.
I find this style of testing quite hard to follow, because related tests are not grouped together.
You could have one test for valid and one for invalid, or one for both. You can stack parametrize decorations to test the cartesian product of inputs:
pytest.mark.parametrize('value', 'force_all_finite', [(np.inf, False), (np.inf, 'allow-nan'), (np.nan, False)]
pytest.mark.parametrize('retype', [np.asarray, sparse.csr_matrix])
def test_force_all_finite_valid(value, force_all_finite, retype):
X = retype(...)
...
| ) | ||
| def test_check_array_inf_error(X_inf, accept_sparse): | ||
| X_inf[0, 0] = np.inf | ||
| with pytest.raises(ValueError): |
There was a problem hiding this comment.
should we use match to check the error message?
sklearn/utils/validation.py
Outdated
|
|
||
| if isinstance(force_all_finite, six.string_types): | ||
| if force_all_finite != 'allow-nan': | ||
| raise ValueError('When force_all_finite is a string, it should be ' |
There was a problem hiding this comment.
Use the same error message as below.
sklearn/utils/validation.py
Outdated
| 'equal to "allow-nan". Got "{}" instead.'.format( | ||
| force_all_finite)) | ||
| elif not isinstance(force_all_finite, bool): | ||
| raise ValueError('force_all_finite should be a bool or a string. Got ' |
There was a problem hiding this comment.
a string -> 'allow-nan'
not tested
| "retype", | ||
| [np.asarray, sp.csr_matrix] | ||
| ) | ||
| def test_check_array_valid(value, force_all_finite, retype): |
There was a problem hiding this comment.
sorry, I might have made a mistake: this name should mention finiteness
sklearn/utils/validation.py
Outdated
| # false positives from overflow in sum method. | ||
| if (X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum()) | ||
| and not np.isfinite(X).all()): | ||
| if (not allow_nan and X.dtype.char in np.typecodes['AllFloat'] |
There was a problem hiding this comment.
I think it would be easier to read if you have:
if allow_nan:
...
else:
...
sklearn/utils/validation.py
Outdated
| raise ValueError("Input contains NaN, infinity" | ||
| " or a value too large for %r." % X.dtype) | ||
| elif (allow_nan and X.dtype.char in np.typecodes['AllFloat'] | ||
| and not np.isfinite(X[~np.isnan(X)].sum()) |
There was a problem hiding this comment.
np.isnan(X) already spends the memory that .sum() is trying to avoid. You do not need both the .sum() and the .all() conditions if you're using np.isnan(X). if np.isnan().any() is sufficient and, without chunking, optimal I think.
There was a problem hiding this comment.
I am not really sure what you mean. we would like to test for infinity only, therefore shall I do the following:
if ... and not np.isfinite(x[~np.isnan(x)]).all():There was a problem hiding this comment.
Sorry, I meant to say np.isinf
There was a problem hiding this comment.
It might be worth leaving in the O(1) memory case actually, as if all is finite, you can move on. Sorry for my mistakes above.
jnothman
left a comment
There was a problem hiding this comment.
Apart from cosmetic things, LGTM
sklearn/utils/validation.py
Outdated
| raise ValueError("Input contains NaN, infinity" | ||
| " or a value too large for %r." % X.dtype) | ||
| if allow_nan: | ||
| if (X.dtype.char in np.typecodes['AllFloat'] |
There was a problem hiding this comment.
Now I look again I think these common conditions can be pulled to the outer if:
if finite sum:
pass
else:
func, msg = (np.isinf, 'Infinity') if allow_nan else ...
if func(X).any(): ...
sklearn/utils/validation.py
Outdated
|
|
||
| force_all_finite : boolean (default=True) | ||
| Whether to raise an error on np.inf and np.nan in X. | ||
| force_all_finite : boolean or str {'allow-nan'}, (default=True) |
There was a problem hiding this comment.
Drop str and {}. Just Boolean or '...'
sklearn/utils/validation.py
Outdated
| else: | ||
| _assert_all_finite(spmatrix.data) | ||
| if force_all_finite == 'allow-nan': | ||
| _assert_all_finite(spmatrix.data, allow_nan=True) |
There was a problem hiding this comment.
Put the comparison direct into the call rather than using if unnecessarily.
sklearn/utils/validation.py
Outdated
| # list of accepted types. | ||
| dtype = dtype[0] | ||
|
|
||
| if isinstance(force_all_finite, six.string_types): |
There was a problem hiding this comment.
How about if force_all_finite not in (...): raise
sklearn/utils/validation.py
Outdated
| % (array.ndim, estimator_name)) | ||
| if force_all_finite: | ||
| if force_all_finite == 'allow-nan': | ||
| _assert_all_finite(array, allow_nan=True) |
sklearn/utils/validation.py
Outdated
|
|
||
|
|
||
| def assert_all_finite(X): | ||
| is_float = X.dtype.char in np.typecodes['AllFloat'] |
There was a problem hiding this comment.
(FWIW, we can just be doing X.dtype.kind in 'fc') here.
sklearn/utils/validation.py
Outdated
| pass | ||
| elif is_float: | ||
| msg_err = "Input contains {} or a value too large for {!r}." | ||
| cond_err, type_err = ((np.isinf(X).any(), 'infinity') if allow_nan |
There was a problem hiding this comment.
Oh. I see I forgot to negate isfinite in my suggestion. Putting the conditions in like this is not very readable. Sorry.
|
ping @lesteve ;) |
|
… On 18 Jan 2018 11:56 pm, "Tom Dupré la Tour" ***@***.***> wrote:
Merged #10459 <#10459>.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#10459 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz6z3rj1fgom8AnD09pd4l5RSXsRadks5tLz-LgaJpZM4RcC5Y>
.
|
|
I'm probably not following this 100%, but both Imputer and MICEImputer currently have |
|
Because you want to block false and it was a bug in the imputerrwhich was removing the column containing inf instead of raising an error
|
|
Gotcha! Should I merge to master and then change it to |
|
@sergeyf merge and change it |
|
@glemaitre Will do. Just to confirm, we want some more complex logic now: Right? |
|
Yes, I suppose that's right.
…On 20 January 2018 at 07:06, Sergey Feldman ***@***.***> wrote:
@glemaitre <https://github.com/glemaitre> Will do.
Just to confirm, we want some more complex logic now:
if self.missing_values == "NaN":
X = check_array(X, force_all_finite='allow-nan')
else:
X = check_array(X, force_all_finite=True)
Right?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#10459 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz60nyQYIfT0R_HQhLwGrW5W0rcq5Dks5tMPXVgaJpZM4RcC5Y>
.
|
|
Or you can use an inline if: `check_array(X, force_all_finite='allow-nan'
if self.missing_values == 'NaN' else True)`
|
|
Yes, I think that's what it means. We should not support NaN in a feature
matrix if missing_values != NaN. Sorry if that's a pain to change!
|
|
I think it's all good once the check_array has the flag in it?
On Jan 22, 2018 3:19 PM, "Joel Nothman" <notifications@github.com> wrote:
Yes, I think that's what it means. We should not support NaN in a feature
matrix if missing_values != NaN. Sorry if that's a pain to change!
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#10459 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABya7DrKqOCEskuxQrZOx9RFCLF1m6N8ks5tNReQgaJpZM4RcC5Y>
.
|
Reference Issues/PRs
closes #10455
What does this implement/fix? Explain your changes.
Any other comments?