increment_mean_and_var can now handle NaN values #10618
pinakinathc wants to merge 26 commits into scikit-learn:master
Conversation
sklearn/utils/extmath.py (Outdated)

```python
sum_func = np.nansum if ignore_nan else np.sum
new_sum = sum_func(X, axis=0)
if not isinstance(new_sum, np.ndarray):
    new_sum *= np.ones(X.shape[1], dtype=np.float)
```

This and other similar lines do not have test coverage. Make sure all the cases you intend to handle are tested.
@jnothman I am constantly getting these mismatches in the calculated array values. Since I am getting no error on my local system, it looks like the only way to figure out which line of code is creating this error is to undo all the new code and implement it one step at a time, checking whether each step gets a green tick. So, in short: please let me know your views before I start doing that, as this kind of approach is going to take a lot of waiting time, since Travis and AppVeyor are extremely slow.
I must admit that it appears quite perplexing for something like this. If you really need to keep pushing to test your changes, you can limit the tests to the relevant modules by modifying the CI test invocation.
(Then again, it seems AppVeyor is failing with the most recent Cython.)
Nah, it looks like Cython should have nothing to do with it. Perhaps the numpy version. Not sure... :\
Behaviour pertaining to numerical stability could have changed across numpy versions. Are you sure that when you do …
jnothman left a comment:

The docs for nansum may also give a relevant clue:

> In NumPy versions <= 1.8.0 NaN is returned for slices that are all-NaN or empty. In later versions zero is returned.
Though that's not going to be the problem for numpy 1.10 :|
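For illustration, this is the all-NaN-slice behaviour the quoted docs describe on any reasonably recent NumPy (the array here is made up):

```python
import numpy as np

X = np.array([[1.0, np.nan],
              [2.0, np.nan]])

# On NumPy >= 1.9, an all-NaN column sums to 0 rather than NaN,
# so nansum never propagates NaN into the column totals.
col_sums = np.nansum(X, axis=0)
print(col_sums)  # [3. 0.]
```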
@jnothman I made np.ones() float because I suspected that the division was somehow becoming integer division, i.e. 9/2 = 4 and not 4.5. But even after making dtype=float, the same errors repeat, so that is not the source of the problem (it was quite illogical anyway). I should probably move one step at a time; that would easily locate the source of the error.
As long as `__future__.division` is imported, that should not be an issue.

I'll see if I have a moment to try to replicate the test failures.
Yes, downgrading numpy to 1.10.4 is sufficient to trigger the errors. I've not investigated the cause.
sklearn/utils/extmath.py (Outdated)

```python
if not isinstance(new_sum, np.ndarray):
    new_sum *= np.ones(X.shape[1], dtype=np.float)

new_sample_count = np.count_nonzero(~np.isnan(X), axis=0)
```
This is the problem line: numpy < 1.12 does not support `axis` in `count_nonzero`. Unfortunately, it does not trigger a `TypeError` either, and just silently counts the total number of nonzeros. However, `np.sum(~np.isnan(X), axis=0)` will give the same result, as will `len(X) - np.sum(np.isnan(X), axis=0)`.
sklearn/utils/extmath.py (Outdated)

```python
sum_func = np.nansum if ignore_nan else np.sum
new_sum = sum_func(X, axis=0)
if not isinstance(new_sum, np.ndarray):
    new_sum *= np.ones(X.shape[1])
```
sklearn/utils/extmath.py (Outdated)

```python
new_sample_count = np.sum(~np.isnan(X), axis=0)
if not isinstance(new_sample_count, np.ndarray):
    # If the input array is 1D
    new_sample_count *= np.ones(X.shape[1])
```
sklearn/utils/extmath.py (Outdated)

```python
new_sample_count = np.sum(~np.isnan(X), axis=0)
if not isinstance(new_sample_count, np.ndarray):
    # If the input array is 1D
```
This comment does not match the condition.
@jnothman Can you please review this?
glemaitre left a comment:

Update the docstring of `updated_sample_count` and `last_sample_count` to reflect the internal change.

For the moment, these are my comments. But frankly, I am getting lost with what is happening.
sklearn/utils/extmath.py (Outdated)

```python
# old = stats until now
# new = the current increment
# updated = the aggregated stats
if not isinstance(last_sample_count, np.ndarray):
```
Since we don't have that much code calling `_incremental_mean_and_var` in the code base, I would change `last_sample_count` and `updated_sample_count` to always be ndarrays. So remove this statement.
@glemaitre Do you mean that I should change `last_sample_count` and `updated_sample_count` to ndarrays? This will require changes in various files, like:

- sklearn/decomposition/incremental_pca.py
- sklearn/decomposition/tests/test_incremental_pca.py
- sklearn/preprocessing/tests/test_data.py
- sklearn/utils/tests/test_extmath.py
sklearn/utils/extmath.py (Outdated)

```python
sum_func = np.nansum if ignore_nan else np.sum
new_sum = sum_func(X, axis=0)
if not isinstance(new_sum, np.ndarray):
    new_sum *= np.ones(X.shape[-1])
```
We don't need that. X will always be 2D, so the sum should always be an ndarray, shouldn't it?
sklearn/utils/extmath.py (Outdated)

```python
new_sample_count = np.sum(~np.isnan(X), axis=0)
if not isinstance(new_sample_count, np.ndarray):
    # If the input array is 1D
```
sklearn/utils/extmath.py (Outdated)

```python
updated_sample_count = last_sample_count + new_sample_count

updated_mean = (last_sum + new_sum) / updated_sample_count
updated_mean[np.isinf(updated_mean)] = 0
```
Why do we care about inf here? It should fail with inf, shouldn't it?

Oh, is it because of a division by zero? You need to comment it then.
sklearn/utils/extmath.py (Outdated)

```python
    (last_sum / last_over_new_count - new_sum) ** 2)
updated_variance = updated_unnormalized_variance / updated_sample_count
updated_variance[np.isnan(updated_variance)] = 0
updated_variance[np.isinf(updated_variance)] = 0
```
Add a comment that this is due to the division by zero.
sklearn/utils/extmath.py (Outdated)

```python
updated_variance[np.isinf(updated_variance)] = 0

# return vector only when required
if (updated_sample_count[0] == updated_sample_count).all():
```
I don't get what this statement is about.
sklearn/utils/tests/test_extmath.py (Outdated)

```python
               [np.nan, np.nan, np.nan, np.nan, np.nan]])
X1 = A[:3, :]
X2 = np.array([np.nan, np.nan, np.nan, np.nan, np.nan])
X_means, X_variances, X_count = \
```
sklearn/utils/tests/test_extmath.py (Outdated)

```python
X1 = A[:3, :]
X2 = np.array([np.nan, np.nan, np.nan, np.nan, np.nan])
X_means, X_variances, X_count = \
    _incremental_mean_and_var(X1, [0, 0, 0, 0, 0], [0, 0, 0, 0, 0],
```
Use numpy arrays directly; we should not accept lists.
sklearn/utils/tests/test_extmath.py (Outdated)

```python
               [np.nan, np.nan, np.nan, np.nan, np.nan],
               [np.nan, np.nan, np.nan, np.nan, np.nan]])
X1 = A[:3, :]
X2 = np.array([np.nan, np.nan, np.nan, np.nan, np.nan])
```
X cannot be 1D. You can do a fully 2D matrix with only NaN.
Yes, but I considered those changes minimal, since this is one estimator and mainly tests.
@jnothman @glemaitre Can you please review the code now? I have made all the changes according to your previous review.
```python
               [np.nan, np.nan, np.nan, np.nan, np.nan]])
X1 = A[:3, :]
X2 = A[3:, :]
X_means, X_variances, X_count = _incremental_mean_and_var(
```
sklearn/utils/tests/test_extmath.py (Outdated)

```python
               [600, np.nan, 170, 430, 300],
               [np.nan, np.nan, np.nan, np.nan, np.nan],
               [np.nan, np.nan, np.nan, np.nan, np.nan]])
X1 = A[:3, :]
```
It would be better if you just wrote out the relevant portion of A here.
sklearn/utils/tests/test_extmath.py (Outdated)

```python
               [np.nan, np.nan, np.nan, np.nan, np.nan],
               [np.nan, np.nan, np.nan, np.nan, np.nan]])
X1 = A[:3, :]
X2 = A[3:, :]
```
Your X2 is all NaN. While this is a good test case to have, we really need to test whether it works with a succession of not-all-NaN data as well.

Or perhaps you just need the corresponding test of `Scaler.partial_fit`, which I suspect does not currently accumulate the total count correctly.
```python
    self.n_samples_seen_)
_incremental_mean_and_var(
    X, self.mean_, self.var_,
    self.n_samples_seen_ * np.ones(X.shape[1]))
```
You need to keep `n_samples_seen_` for each feature from iteration to iteration; I don't see how this could work at the moment. And yet, for backwards compatibility, we need to report only a scalar in cases that are not affected by this PR (i.e. where there are no NaNs, or perhaps where `n_samples_seen_` is constant even if there were NaNs).

For example, you might compress the updated count to a scalar `if not np.any(np.diff(n_samples_seen_))`.
I don't get it. From the changes, I thought that `self.n_samples_seen_` should always be an array.

> And yet, for backwards compatibility, we need to report only a scalar in cases that are not affected by this PR (i.e. where there are no NaNs, or perhaps where n_samples_seen_ is constant even if there were NaNs).

I thought that it would be easier to change to arrays only from now on. Only incremental_pca is affected apart from the StandardScaler, and those functions are private, so the end user should not care.
`n_samples_seen_` is not private, IMO.
@jnothman @glemaitre Sorry for being inactive for the past week. Shall I keep `self.n_samples_seen_` a vector or a scalar?

PS: As of now, `self.n_samples_seen_` is a vector in both StandardScaler and incremental_pca.
IMO it would be best for backwards compatibility to keep a scalar in the case when there are no NaNs or, for simplicity, in the case when all `n_samples_seen` are equal.
@jnothman @glemaitre I'll make it generalised, i.e. if all `n_samples_seen` are equal, it will return a scalar instead of a vector.
```python
    last_sample_count=self.n_samples_seen_)
_incremental_mean_and_var(
    X, last_mean=self.mean_, last_variance=self.var_,
    last_sample_count=self.n_samples_seen_ * np.ones(n_features))
```
We should not be supporting NaNs here. I think maybe we should still allow passing in a scalar `n_samples_seen_`, and `_incremental_mean_and_var` can broadcast it to `n_features` wide if appropriate.
```diff
 assert_array_less(zero, scaler_incr.scale_ + epsilon)
 # (i+1) because the Scaler has been already fitted
-assert_equal((i + 1), scaler_incr.n_samples_seen_)
+assert_almost_equal((i + 1), scaler_incr.n_samples_seen_)
```
I think you mean `assert_array_equal`, not `almost_equal`, if you are now trying to compare integer arrays rather than integer scalars.
sklearn/utils/extmath.py (Outdated)

```diff
 def _incremental_mean_and_var(X, last_mean=.0, last_variance=None,
-                              last_sample_count=0):
+                              last_sample_count=0, ignore_nan=True):
```
Maybe, to be strict, this should be `False` by default.
sklearn/utils/extmath.py (Outdated)

```python
new_sample_count = np.sum(~np.isnan(X), axis=0)
updated_sample_count = last_sample_count + new_sample_count

warnings.filterwarnings('ignore')  # as division by 0 might happen
```
This needs to be in a `catch_warnings` context, or else it changes the setting globally henceforth.

In fact, I think you should be using `with np.errstate`.

@jnothman I can do it, but the rest of the function only has calculations, which are mostly divisions. Hence if I use `with np.errstate`, then practically all of the remaining code (until the return statement) will come under it.

But you need a context manager in any case, unless you modify your operands.
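A small illustration of the `np.errstate` approach, assuming (as elsewhere in this thread) that division-by-zero results are masked to zero afterwards; the arrays are made up:

```python
import numpy as np

last_sum = np.array([10.0, 0.0])
new_sum = np.array([5.0, 0.0])
updated_sample_count = np.array([3, 0])

# np.errstate suppresses floating-point warnings only inside the block,
# unlike a bare warnings.filterwarnings('ignore'), which leaks globally.
with np.errstate(divide='ignore', invalid='ignore'):
    updated_mean = (last_sum + new_sum) / updated_sample_count

# Features that have seen no samples produce NaN/inf; zero them out.
updated_mean[~np.isfinite(updated_mean)] = 0
print(updated_mean)  # [5. 0.]
```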
```python
# updated = the aggregated stats
last_sum = last_mean * last_sample_count
new_sum = X.sum(axis=0)
sum_func = np.nansum if ignore_nan else np.sum
```
I wonder if we should start with `if not isnan(X.sum()): ignore_nan = False`, and then use fast paths that don't involve triplicating the memory like `~isnan(new_sum)` does.

Most of the lines in this function need to be modified to be able to ignore NaN values. If we want to avoid expensive `isnan`-style computations, there are two options:

- create a separate code path that computes without ignoring NaN, and another that computes ignoring NaN values
- for each line of code that includes an `isnan`-style function, check whether `ignore_nan` is true or false and write the code accordingly, like: `sum_func = np.nansum if ignore_nan else np.sum`
sklearn/utils/extmath.py (Outdated)

```python
    last_over_new_count / updated_sample_count *
    (last_sum / last_over_new_count - new_sum) ** 2)
# updated_unnormalized_variance can be both NaN or Inf
updated_unnormalized_variance[np.isnan(
```
Can we just use `updated_unnormalized_variance[np.logical_not(new_sample_count)] = 0` or something similar?

@jnothman Yeah, sure, but not exactly `new_sample_count`: rather `updated_unnormalized_variance[np.logical_not(updated_sample_count)] = 0`, because consider the case:

- `last_sample_count = [1, 3, 5, 7, 9]`
- `new_sample_count = [3, 1, 0, 5, 9]`
- so `updated_unnormalized_variance[2] != 0` just because `new_sample_count[2] == 0`

Fine. Seems simpler than isnan and isinf to me.
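A small illustration of the count-based masking just discussed (the values are made up):

```python
import numpy as np

updated_sample_count = np.array([4, 0, 2])
updated_unnormalized_variance = np.array([1.5, np.nan, 0.25])

# Zero out features whose total sample count is 0; this replaces the
# separate isnan/isinf masking with one mask derived from the counts.
updated_unnormalized_variance[np.logical_not(updated_sample_count)] = 0
print(updated_unnormalized_variance)  # [1.5  0.   0.25]
```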
```python
error_string_transform = ("Estimator doesn't check for NaN and inf in"
                          " transform.")
for X_train in [X_train_nan, X_train_inf]:
    if np.any(np.isnan(X_train)) and name in ALLOW_NAN:
```
…ikit-learn into sparseMatrix-test
Sparse matrix test
@pinakinathc OK. Ping me when you have addressed all the points to be reviewed.
Reference Issues/PRs

#10457: check if `incremental_mean_and_var` gives a green tick without failing in numerical_stability
What does this implement/fix? Explain your changes.
Any other comments?