[MRG] EHN handle NaN value in QuantileTransformer #10437
rth merged 38 commits into scikit-learn:master
Conversation
@jnothman I have 2 questions:
what's the harm in silencing the warning?
I don't see any. I would go for that solution. However, …
Sure. Backport and inform other contributors at #10404
Backporting is a headache, in fact. I tried, and we would have to bring too much code from numpy to be able to support it. Speaking IRL with @ogrisel, we propose to raise a NotImplementedError when there is NaN, require the numpy nanfunctions (numpy >= 1.9), and ask users to upgrade.
@lesteve @jnothman I wanted to mock the version of numpy to make sure that the error in the test is raised properly. I used …
Is pytest-mock really needed? Can we not use either:
+1 on that one. It seems good enough to me. Thanks
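For reference, pytest's built-in `monkeypatch` fixture is one way to fake the numpy version without pytest-mock (the test body here is only a sketch; the real test would exercise the estimator):

```python
import numpy
import pytest


def test_error_on_old_numpy(monkeypatch):
    # Pretend we run against an old numpy; the version string is illustrative.
    monkeypatch.setattr(numpy, '__version__', '1.8.2')
    assert numpy.__version__ == '1.8.2'
    # ...here the real test would fit QuantileTransformer on NaN data
    # and assert that NotImplementedError is raised...
```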
jnothman
left a comment
Do you think we should support other missing_values indicators?
sklearn/preprocessing/data.py
Outdated
# for forward transform, match the output PDF
if not inverse:
    X_col = output_distribution.ppf(X_col)
# comparison with NaN will raise a warning which we make silent
If it is a numpy error, you can use np.errstate.
this is a scipy warning
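Since the warning originates in scipy rather than in numpy's floating-point machinery, np.errstate would not silence it; a `warnings.catch_warnings` block would (a generic sketch, not the PR's actual code):

```python
import warnings

import numpy as np
from scipy import stats

X_col = np.array([0.1, 0.5, np.nan, 0.9])
with warnings.catch_warnings():
    # silence any RuntimeWarning emitted while comparing/evaluating NaN
    warnings.simplefilter('ignore', RuntimeWarning)
    X_col = stats.norm.ppf(X_col)  # NaN simply propagates through ppf
```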
I gave it some thought and tried a couple of things. I have the following questions:
Having implemented both approaches (midway), I think that only replacing missing_values by NaN is sufficient.
sklearn/preprocessing/data.py
Outdated
def _check_inputs(self, X, accept_sparse_negative=False):
    """Check inputs before fit and transform"""
    if sparse.issparse(X):
check_array will convert the matrix into a float dtype. I wanted to compare while the data can still be int.
However, I could also use np.isclose, which can handle NaN as well.
sklearn/preprocessing/data.py
Outdated
        and not np.isfinite(X[~np.isnan(X)]).all()):
    raise ValueError("Input contains infinity"
                     " or a value too large for %r." % X.dtype)
if np.count_nonzero(self._mask_missing_values):
It is strange to have this in a function whose argument is X, not _mask_missing_values.
I suspect you should avoid storing this mask as an attribute.
This function will go away once #10455 is addressed.
sklearn/preprocessing/data.py
Outdated
    self._percentile_func = np.nanpercentile
else:
    raise NotImplementedError(
        'QuantileTransformer does not handle NaN value with'
Is it easy enough to just implement in sklearn.utils.fixes:
def nanpercentile(a, q):
    return np.percentile(np.compress(~np.isnan(a), a), q)
seeing as we don't use the other features of nanpercentile?
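A quick sanity check of that fallback (note that the argument order of np.compress is `(condition, array)`); on 1-D data it matches numpy's own nanpercentile:

```python
import numpy as np


def nanpercentile(a, q):
    """Percentile of the non-NaN entries of a 1-D array."""
    return np.percentile(np.compress(~np.isnan(a), a), q)


a = np.array([1.0, np.nan, 3.0, 5.0])
median = nanpercentile(a, 50)  # median of [1, 3, 5]
```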
It is annoying to have parts of the library with different minimum numpy requirements. It means that code is not portable across supported platforms.
True ... we will always get float as output. So it should definitely be in a separate PR.
@jnothman I added a test_common file. Could you check that the creation of the instance is OK, or would you see another way to create the estimator instance on the fly (a dict, for instance)?
jnothman
left a comment
Nice test. It essentially also tests that the estimators are feature-wise... So we could in theory remove some existing tests
@pytest.mark.parametrize(
    "est, X, n_missing",
    _generate_tuple_transformer_missing_value()
I don't get why this is better than just parameterizing est directly
We could even consider a list of all feature-wise preprocessors then xfail some...
We could even consider a list of all feature-wise preprocessors then xfail some...
I agree. We could switch to this behaviour once a majority of those preprocessors support this feature.
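A direct-parameterization sketch with xfail could look like this (the estimator names in the list are placeholders for whatever feature-wise preprocessors would be covered):

```python
import pytest

FEATURE_WISE_PREPROCESSORS = [
    pytest.param('QuantileTransformer'),  # NaN-aware after this PR
    pytest.param('StandardScaler',        # hypothetical xfail example
                 marks=pytest.mark.xfail(reason='no NaN support yet')),
]


@pytest.mark.parametrize('name', FEATURE_WISE_PREPROCESSORS)
def test_missing_value_handling(name):
    # the real test would fit/transform data containing NaN here
    assert isinstance(name, str)
```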
    rng.randint(X.shape[1], size=n_missing)] = np.nan
X_train, X_test = train_test_split(X)
# sanity check
assert not np.all(np.isnan(X_train), axis=0).any()
Should probably also check that there are NaNs in both train and test
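The extra sanity check could be as simple as the following (the seed, shapes, and the plain slicing split are illustrative stand-ins for the test's train_test_split):

```python
import numpy as np

rng = np.random.RandomState(42)
X = rng.normal(size=(100, 4))
# scatter 50 NaNs at random positions
X[rng.randint(100, size=50), rng.randint(4, size=50)] = np.nan
X_train, X_test = X[:75], X[75:]

# NaNs must be present in both halves, but no column may be all NaN
assert np.isnan(X_train).any() and np.isnan(X_test).any()
assert not np.all(np.isnan(X_train), axis=0).any()
```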
jnothman
left a comment
Please add a docstring note along the lines of "NaNs are treated as missing values: disregarded in fit, and maintained in transform". Perhaps that's too terse.
X_col = .5 * (np.interp(X_col, quantiles, self.references_)
              - np.interp(-X_col, -quantiles[::-1],
                          -self.references_[::-1]))
X_col[isfinite_mask] = .5 * (
FWIW, it's possible that np.ma would handle the non-missing case more efficiently than using an ad-hoc mask. I've not checked.
Playing around, I think that it will trigger the same number of copies.
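The ad-hoc mask approach under discussion, reduced to a minimal self-contained sketch (the quantiles/references values are made up):

```python
import numpy as np

quantiles = np.array([0.0, 1.0, 2.0, 3.0])
references = np.linspace(0.0, 1.0, 4)
X_col = np.array([0.5, np.nan, 2.5])

isfinite_mask = ~np.isnan(X_col)
out = X_col.copy()
# average the forward and reversed interpolations; NaNs are left untouched
out[isfinite_mask] = 0.5 * (
    np.interp(X_col[isfinite_mask], quantiles, references)
    - np.interp(-X_col[isfinite_mask], -quantiles[::-1], -references[::-1]))
```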
X_train, X_test = train_test_split(X)
# sanity check
assert not np.all(np.isnan(X_train), axis=0).any()
assert np.any(X_train, axis=0).all()
We check that there is some NaN in each column.
Oops, yes, I forgot to check for NaN there :)
from sklearn.datasets import load_iris
@lesteve We have two approvals here. Do you want to make a quick review before merging this, hopefully? :)
sklearn/preprocessing/data.py
Outdated
"""Force the output of nanpercentile to be finite."""
percentile = nanpercentile(column_data, percentiles)
with np.errstate(invalid='ignore'):  # hide NaN comparison warnings
    if np.all(np.isclose(percentile, np.nan, equal_nan=True)):
Why not:
if np.all(np.isnan(percentile))
If you use that, I think you can remove the with np.errstate(...).
True. No idea how I came up with something so complex.
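For comparison, the simplified check; np.isnan does not trigger any floating-point warning, so no errstate guard is needed:

```python
import numpy as np

percentile = np.array([np.nan, np.nan, np.nan])
all_nan = np.all(np.isnan(percentile))               # whole column was NaN
some_ok = np.all(np.isnan(np.array([1.0, np.nan])))  # at least one finite value
```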
sklearn/preprocessing/data.py
Outdated
percentile = nanpercentile(column_data, percentiles)
with np.errstate(invalid='ignore'):  # hide NaN comparison warnings
    if np.all(np.isclose(percentile, np.nan, equal_nan=True)):
        warnings.warn("All samples in a column of X are NaN.")
Maybe you can mention in the warning that you are returning 0 for all the quantiles?
It would be nice if you could test that you get the warning when expected.
Bonus points if you check that you do not get any warning when none is expected.
I am not sure anymore why I force percentile to be finite.
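Testing both the presence and the absence of the warning could be sketched like this (`quantiles_of` is a toy stand-in for the PR's helper, not its real code):

```python
import warnings

import numpy as np
import pytest


def quantiles_of(column):
    """Toy stand-in: warn and return zeros when the column is all NaN."""
    if np.all(np.isnan(column)):
        warnings.warn("All samples in a column of X are NaN.")
        return np.zeros(3)
    return np.percentile(column, [25, 50, 75])


# the warning is raised for an all-NaN column...
with pytest.warns(UserWarning, match="All samples"):
    q_nan = quantiles_of(np.full(5, np.nan))

# ...and no warning at all for a finite column
with warnings.catch_warnings():
    warnings.simplefilter("error")  # turn any warning into an error
    q_ok = quantiles_of(np.arange(5.0))
```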
I added a test to check that the inverse transform behaves properly in the common test.
I suppose the problem with making quantiles NaN is that for finite data
passed to transform, you'd get NaN when transformed. That sort of makes
sense... I suppose if NaN is in the data we assume it will be handled
downstream.
That's true. But it seems more logical to map finite values to NaN if during training we did not learn anything (due to a full-NaN column). So I think the current behaviour is OK.
@lesteve you can have a second look at it and tell us if it makes sense to you.
ping @lesteve @qinhanmin2014
LGTM. Given that there are already two +1s and Loic's comments were addressed, as far as I can tell, I will merge when CI is green.
Reference Issues/PRs
partially addresses #10404
What does this implement/fix? Explain your changes.
NaNs are handled and ignored during processing in the QuantileTransformer.
Any other comments?
TODO:
Fix check_array after addressing [RFC] Dissociate NaN and Inf when considering force_all_finite in check_array #10455