[MRG] EHN handle NaN value in QuantileTransformer #10437
rth merged 38 commits into scikit-learn:master
Conversation
@jnothman I have 2 questions:
what's the harm in silencing the warning?
I don't see any. I would go for that solution. However, …
Sure. Backport and inform other contributors at #10404
Backporting is a headache, in fact. I tried, and we would have to bring too much code from numpy to be able to support it. Speaking IRL with @ogrisel, we propose to raise a NotImplementedError when there is NaN, require the numpy nanfunctions (numpy >= 1.9), and ask users to upgrade.
@lesteve @jnothman I wanted to mock the version of numpy to make sure that the error in the test is raised properly. I used …
Is pytest-mock really needed? Can we not use either:
+1 on that one. It seems good enough to me. Thanks
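For reference, pytest's built-in `monkeypatch` fixture is one way to fake the numpy version without pytest-mock (the test body here is only a sketch; the real test would exercise the estimator):

```python
import numpy
import pytest


def test_error_on_old_numpy(monkeypatch):
    # Pretend we run against an old numpy; the version string is illustrative.
    monkeypatch.setattr(numpy, '__version__', '1.8.2')
    assert numpy.__version__ == '1.8.2'
    # ...here the real test would fit QuantileTransformer on NaN data
    # and assert that NotImplementedError is raised...
```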
jnothman
left a comment
Do you think we should support other missing_values indicators?
sklearn/preprocessing/data.py
Outdated
# for forward transform, match the output PDF
if not inverse:
    X_col = output_distribution.ppf(X_col)
# comparison with NaN will raise a warning which we make silent
If it is a numpy error, you can use np.errstate.
this is a scipy warning
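Since the warning originates in scipy rather than in numpy's floating-point machinery, np.errstate would not silence it; a `warnings.catch_warnings` block would (a generic sketch, not the PR's actual code):

```python
import warnings

import numpy as np
from scipy import stats

X_col = np.array([0.1, 0.5, np.nan, 0.9])
with warnings.catch_warnings():
    # silence any RuntimeWarning emitted while comparing/evaluating NaN
    warnings.simplefilter('ignore', RuntimeWarning)
    X_col = stats.norm.ppf(X_col)  # NaN simply propagates through ppf
```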
I gave it some thought and tried a couple of things. I have the following questions:
Having implemented both approaches (midway), I think that only replacing missing_values by NaN is sufficient.
sklearn/preprocessing/data.py
Outdated
def _check_inputs(self, X, accept_sparse_negative=False):
    """Check inputs before fit and transform"""
    if sparse.issparse(X):
check_array will convert the matrix into a float dtype. I wanted to compare while the data can still be int.
However, I could also use np.isclose, which can handle NaN as well.
sklearn/preprocessing/data.py
Outdated
        and not np.isfinite(X[~np.isnan(X)]).all()):
    raise ValueError("Input contains infinity"
                     " or a value too large for %r." % X.dtype)
if np.count_nonzero(self._mask_missing_values):
It is strange to have this in a function whose argument is X, not _mask_missing_values.
I suspect you should avoid storing this mask as an attribute.
This function will go away once #10455 is addressed.
sklearn/preprocessing/data.py
Outdated
    self._percentile_func = np.nanpercentile
else:
    raise NotImplementedError(
        'QuantileTransformer does not handle NaN value with'
Is it easy enough to just implement in sklearn.utils.fixes:
def nanpercentile(a, q):
    return np.percentile(np.compress(~np.isnan(a), a), q)
seeing as we don't use the other features of nanpercentile?
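A quick sanity check of that fallback (note that the argument order of np.compress is `(condition, array)`); on 1-D data it matches numpy's own nanpercentile:

```python
import numpy as np


def nanpercentile(a, q):
    """Percentile of the non-NaN entries of a 1-D array."""
    return np.percentile(np.compress(~np.isnan(a), a), q)


a = np.array([1.0, np.nan, 3.0, 5.0])
median = nanpercentile(a, 50)  # median of [1, 3, 5]
```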
It is annoying to have parts of the library with different minimum numpy requirements. It means that code is not portable across supported platforms.
True ... we will always get float as output. So it should definitely be in a separate PR.
@jnothman I added a test_common file. Could you check that the creation of the instance is OK, or would you see another way to create the estimator instance on the fly (a dict, for instance)?
jnothman
left a comment
Nice test. It essentially also tests that the estimators are feature-wise... So we could in theory remove some existing tests
@pytest.mark.parametrize(
    "est, X, n_missing",
    _generate_tuple_transformer_missing_value()
I don't get why this is better than just parameterizing est directly
We could even consider a list of all feature-wise preprocessors then xfail some...
We could even consider a list of all feature-wise preprocessors then xfail some...
I agree. We could switch to this behaviour once a majority of those preprocessors support this feature.
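A direct-parameterization sketch with xfail could look like this (the estimator names in the list are placeholders for whatever feature-wise preprocessors would be covered):

```python
import pytest

FEATURE_WISE_PREPROCESSORS = [
    pytest.param('QuantileTransformer'),  # NaN-aware after this PR
    pytest.param('StandardScaler',        # hypothetical xfail example
                 marks=pytest.mark.xfail(reason='no NaN support yet')),
]


@pytest.mark.parametrize('name', FEATURE_WISE_PREPROCESSORS)
def test_missing_value_handling(name):
    # the real test would fit/transform data containing NaN here
    assert isinstance(name, str)
```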
    rng.randint(X.shape[1], size=n_missing)] = np.nan
X_train, X_test = train_test_split(X)
# sanity check
assert not np.all(np.isnan(X_train), axis=0).any()
Should probably also check that there are NaNs in both train and test
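The extra sanity check could be as simple as the following (the seed, shapes, and the plain slicing split are illustrative stand-ins for the test's train_test_split):

```python
import numpy as np

rng = np.random.RandomState(42)
X = rng.normal(size=(100, 4))
# scatter 50 NaNs at random positions
X[rng.randint(100, size=50), rng.randint(4, size=50)] = np.nan
X_train, X_test = X[:75], X[75:]

# NaNs must be present in both halves, but no column may be all NaN
assert np.isnan(X_train).any() and np.isnan(X_test).any()
assert not np.all(np.isnan(X_train), axis=0).any()
```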
jnothman
left a comment
Please add a docstring note along the lines of "NaNs are treated as missing values: disregarded in fit, and maintained in transform". Perhaps that's too terse.
X_col = .5 * (np.interp(X_col, quantiles, self.references_)
              - np.interp(-X_col, -quantiles[::-1],
                          -self.references_[::-1]))
X_col[isfinite_mask] = .5 * (
FWIW, it's possible that np.ma would handle the non-missing case more efficiently than using an ad-hoc mask. I've not checked.
Playing around, I think that it will trigger the same number of copies.
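The ad-hoc mask approach under discussion, reduced to a minimal self-contained sketch (the quantiles/references values are made up):

```python
import numpy as np

quantiles = np.array([0.0, 1.0, 2.0, 3.0])
references = np.linspace(0.0, 1.0, 4)
X_col = np.array([0.5, np.nan, 2.5])

isfinite_mask = ~np.isnan(X_col)
out = X_col.copy()
# average the forward and reversed interpolations; NaNs are left untouched
out[isfinite_mask] = 0.5 * (
    np.interp(X_col[isfinite_mask], quantiles, references)
    - np.interp(-X_col[isfinite_mask], -quantiles[::-1], -references[::-1]))
```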
X_train, X_test = train_test_split(X)
# sanity check
assert not np.all(np.isnan(X_train), axis=0).any()
assert np.any(X_train, axis=0).all()
We check that there is some NaN in each column.
Oops, yes, I forgot to check for NaN there :)
from sklearn.datasets import load_iris
@lesteve We have two approvals here. Do you want to make a quick review before merging this, hopefully? :)
sklearn/preprocessing/data.py
Outdated
"""Force the output of nanpercentile to be finite."""
percentile = nanpercentile(column_data, percentiles)
with np.errstate(invalid='ignore'):  # hide NaN comparison warnings
    if np.all(np.isclose(percentile, np.nan, equal_nan=True)):
Why not:
if np.all(np.isnan(percentile))
If you use that, I think you can remove the with np.errstate(...).
True. No idea how I came up with something so complex.
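For comparison, the simplified check; np.isnan does not trigger any floating-point warning, so no errstate guard is needed:

```python
import numpy as np

percentile = np.array([np.nan, np.nan, np.nan])
all_nan = np.all(np.isnan(percentile))               # whole column was NaN
some_ok = np.all(np.isnan(np.array([1.0, np.nan])))  # at least one finite value
```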
sklearn/preprocessing/data.py
Outdated
percentile = nanpercentile(column_data, percentiles)
with np.errstate(invalid='ignore'):  # hide NaN comparison warnings
    if np.all(np.isclose(percentile, np.nan, equal_nan=True)):
        warnings.warn("All samples in a column of X are NaN.")
Maybe you can mention in the warning that you are returning 0 for all the quantiles?
It would be nice if you could test that you get the warning when expected.
Bonus points if you check that you do not get any warning when none is expected.
I am not sure anymore why I force percentile to be finite.
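Testing both the presence and the absence of the warning could be sketched like this (`quantiles_of` is a toy stand-in for the PR's helper, not its real code):

```python
import warnings

import numpy as np
import pytest


def quantiles_of(column):
    """Toy stand-in: warn and return zeros when the column is all NaN."""
    if np.all(np.isnan(column)):
        warnings.warn("All samples in a column of X are NaN.")
        return np.zeros(3)
    return np.percentile(column, [25, 50, 75])


# the warning is raised for an all-NaN column...
with pytest.warns(UserWarning, match="All samples"):
    q_nan = quantiles_of(np.full(5, np.nan))

# ...and no warning at all for a finite column
with warnings.catch_warnings():
    warnings.simplefilter("error")  # turn any warning into an error
    q_ok = quantiles_of(np.arange(5.0))
```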
I added a test to check that the inverse transform behaves properly in the common test.
I suppose the problem with making quantiles NaN is that for finite data
passed to transform, you'd get NaN when transformed. That sort of makes
sense... I suppose if NaN is in the data we assume it will be handled
downstream.
That's true. But it seems more logical to map finite values to NaN if during training we did not learn anything (due to a full-NaN column). So I think the current behaviour is OK.
@lesteve you can have a second look at it and tell us if it makes sense to you.
ping @lesteve @qinhanmin2014
LGTM. Given that there are already two +1s and Loic's comments were addressed, as far as I can tell, I will merge when CI is green.
Reference Issues/PRs
partially addresses #10404
What does this implement/fix? Explain your changes.
NaNs are handled and ignored during processing in the QuantileTransformer.
Any other comments?
TODO:
Fix check_array after addressing [RFC] Dissociate NaN and Inf when considering force_all_finite in check_array #10455