increment_mean_and_var can now handle NaN values #10618
pinakinathc wants to merge 26 commits into scikit-learn:master
Conversation
sklearn/utils/extmath.py (Outdated)

```python
sum_func = np.nansum if ignore_nan else np.sum
new_sum = sum_func(X, axis=0)
if not isinstance(new_sum, np.ndarray):
    new_sum *= np.ones(X.shape[1], dtype=np.float)
```

This and other similar lines do not have test coverage. Make sure all the cases you intend to handle are tested.
@jnothman I am constantly getting these mismatches in the calculated array values. Since I am getting no error on my local system, it looks like the only way to figure out which line of code is creating this error is to undo all the new code and implement it one step at a time, checking whether each step gets a green tick. So, in short: please let me know your views before I start doing that, as this kind of approach is going to take a lot of waiting time, since Travis and AppVeyor are extremely slow.
I must admit that it appears quite perplexing for something like this. If you really need to keep pushing to test your changes, you can limit the tests to the relevant modules by modifying the CI test invocation.
(Then again, it seems AppVeyor is failing with the most recent Cython.)
Nah, it looks like Cython should have nothing to do with it. Perhaps the numpy version. Not sure... :\
Behaviour pertaining to numerical stability could have changed across numpy versions. Are you sure that when you do …
jnothman left a comment:

The docs for nansum may also give a relevant clue:

> In NumPy versions <= 1.8.0 NaN is returned for slices that are all-NaN or empty. In later versions zero is returned.
Though that's not going to be the problem for numpy 1.10 :|
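For illustration, this is the all-NaN-slice behaviour the quoted docs describe on any reasonably recent NumPy (the array here is made up):

```python
import numpy as np

X = np.array([[1.0, np.nan],
              [2.0, np.nan]])

# On NumPy >= 1.9, an all-NaN column sums to 0 rather than NaN,
# so nansum never propagates NaN into the column totals.
col_sums = np.nansum(X, axis=0)
print(col_sums)  # [3. 0.]
```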
@jnothman I made np.ones() float because I suspected that the division was somehow becoming integer division, i.e. 9/2 = 4 and not 4.5. But even after making dtype=float, the same errors repeat, so that is not the source of the problem (it was quite illogical anyway). I should probably move one step at a time; that would easily locate the source of the error.
As long as `__future__.division` is imported, that should not be an issue.

I'll see if I have a moment to try to replicate the test failures.
Yes, downgrading numpy to 1.10.4 is sufficient to trigger the errors. I've not investigated the cause.
sklearn/utils/extmath.py (Outdated)

```python
if not isinstance(new_sum, np.ndarray):
    new_sum *= np.ones(X.shape[1], dtype=np.float)

new_sample_count = np.count_nonzero(~np.isnan(X), axis=0)
```
This is the problem line: numpy < 1.12 does not support `axis` in `count_nonzero`. Unfortunately, it does not trigger a `TypeError` either, and just silently counts the total number of nonzeros. However, `np.sum(~np.isnan(X), axis=0)` will give the same result, as will `len(X) - np.sum(np.isnan(X), axis=0)`.
sklearn/utils/extmath.py (Outdated)

```python
sum_func = np.nansum if ignore_nan else np.sum
new_sum = sum_func(X, axis=0)
if not isinstance(new_sum, np.ndarray):
    new_sum *= np.ones(X.shape[1])
```
sklearn/utils/extmath.py (Outdated)

```python
new_sample_count = np.sum(~np.isnan(X), axis=0)
if not isinstance(new_sample_count, np.ndarray):
    # If the input array is 1D
    new_sample_count *= np.ones(X.shape[1])
```
sklearn/utils/extmath.py (Outdated)

```python
new_sample_count = np.sum(~np.isnan(X), axis=0)
if not isinstance(new_sample_count, np.ndarray):
    # If the input array is 1D
```
This comment does not match the condition.
@jnothman Can you please review this?
glemaitre left a comment:

Update the docstring of `updated_sample_count` and `last_sample_count` to reflect the internal change.

For the moment, these are my comments. But frankly, I am getting lost with what is happening.
sklearn/utils/extmath.py (Outdated)

```python
# old = stats until now
# new = the current increment
# updated = the aggregated stats
if not isinstance(last_sample_count, np.ndarray):
```
Since we don't have that much code calling `_incremental_mean_and_var` in the code base, I would change `last_sample_count` and `updated_sample_count` to always be ndarrays. So remove this statement.
@glemaitre Do you mean that I should change `last_sample_count` and `updated_sample_count` to ndarrays? This will require changes in various files, like:

- sklearn/decomposition/incremental_pca.py
- sklearn/decomposition/tests/test_incremental_pca.py
- sklearn/preprocessing/tests/test_data.py
- sklearn/utils/tests/test_extmath.py
sklearn/utils/extmath.py (Outdated)

```python
sum_func = np.nansum if ignore_nan else np.sum
new_sum = sum_func(X, axis=0)
if not isinstance(new_sum, np.ndarray):
    new_sum *= np.ones(X.shape[-1])
```
We don't need that. X will always be 2D, so the sum should always be an ndarray, shouldn't it?
sklearn/utils/extmath.py (Outdated)

```python
new_sample_count = np.sum(~np.isnan(X), axis=0)
if not isinstance(new_sample_count, np.ndarray):
    # If the input array is 1D
```
sklearn/utils/extmath.py (Outdated)

```python
updated_sample_count = last_sample_count + new_sample_count

updated_mean = (last_sum + new_sum) / updated_sample_count
updated_mean[np.isinf(updated_mean)] = 0
```
Why do we care about inf here? It should fail with inf, shouldn't it?

Oh, is it because of a division by zero? You need to comment it then.
sklearn/utils/extmath.py (Outdated)

```python
    (last_sum / last_over_new_count - new_sum) ** 2)
updated_variance = updated_unnormalized_variance / updated_sample_count
updated_variance[np.isnan(updated_variance)] = 0
updated_variance[np.isinf(updated_variance)] = 0
```
Add a comment that this is due to the division by zero.
sklearn/utils/extmath.py (Outdated)

```python
updated_variance[np.isinf(updated_variance)] = 0

# return vector only when required
if (updated_sample_count[0] == updated_sample_count).all():
```
I don't get what this statement is about.
sklearn/utils/tests/test_extmath.py (Outdated)

```python
               [np.nan, np.nan, np.nan, np.nan, np.nan]])
X1 = A[:3, :]
X2 = np.array([np.nan, np.nan, np.nan, np.nan, np.nan])
X_means, X_variances, X_count = \
```
sklearn/utils/tests/test_extmath.py (Outdated)

```python
X1 = A[:3, :]
X2 = np.array([np.nan, np.nan, np.nan, np.nan, np.nan])
X_means, X_variances, X_count = \
    _incremental_mean_and_var(X1, [0, 0, 0, 0, 0], [0, 0, 0, 0, 0],
```
Use numpy arrays directly; we should not accept lists.
sklearn/utils/tests/test_extmath.py (Outdated)

```python
               [np.nan, np.nan, np.nan, np.nan, np.nan],
               [np.nan, np.nan, np.nan, np.nan, np.nan]])
X1 = A[:3, :]
X2 = np.array([np.nan, np.nan, np.nan, np.nan, np.nan])
```
X cannot be 1D. You can do a fully 2D matrix with only NaN.
Yes, but I considered those changes minimal, since this is one estimator and mainly tests.
@jnothman @glemaitre Can you please review the code now? I have made all the changes according to your previous review.
```python
               [np.nan, np.nan, np.nan, np.nan, np.nan]])
X1 = A[:3, :]
X2 = A[3:, :]
X_means, X_variances, X_count = _incremental_mean_and_var(
```
sklearn/utils/tests/test_extmath.py (Outdated)

```python
               [600, np.nan, 170, 430, 300],
               [np.nan, np.nan, np.nan, np.nan, np.nan],
               [np.nan, np.nan, np.nan, np.nan, np.nan]])
X1 = A[:3, :]
```
It would be better if you just wrote out the relevant portion of A here.
sklearn/utils/tests/test_extmath.py (Outdated)

```python
               [np.nan, np.nan, np.nan, np.nan, np.nan],
               [np.nan, np.nan, np.nan, np.nan, np.nan]])
X1 = A[:3, :]
X2 = A[3:, :]
```
Your X2 is all NaN. While this is a good test case to have, we really need to test whether it works with a succession of not-all-NaN data as well.

Or perhaps you just need the corresponding test of `Scaler.partial_fit`, which I suspect does not currently accumulate the total count correctly.
```python
    self.n_samples_seen_)
_incremental_mean_and_var(
    X, self.mean_, self.var_,
    self.n_samples_seen_ * np.ones(X.shape[1]))
```
You need to keep `n_samples_seen_` for each feature from iteration to iteration; I don't see how this could work at the moment. And yet, for backwards compatibility, we need to report only a scalar in cases that are not affected by this PR (i.e. where there are no NaNs, or perhaps where `n_samples_seen_` is constant even if there were NaNs).

For example, you might compress the updated count to a scalar `if not np.any(np.diff(n_samples_seen_))`.
I don't get it. From the changes, I thought that `self.n_samples_seen_` should always be an array.

> And yet, for backwards compatibility, we need to report only a scalar in cases that are not affected by this PR (i.e. where there are no NaNs, or perhaps where n_samples_seen_ is constant even if there were NaNs).

I thought that it would be easier to change to arrays only from now on. Only incremental_pca is affected apart from the StandardScaler, and those functions are private, so the end user should not care.
`n_samples_seen_` is not private, IMO.
@jnothman @glemaitre Sorry for being inactive for the past week. Shall I keep `self.n_samples_seen_` a vector or a scalar?

PS: As of now, `self.n_samples_seen_` is a vector in both StandardScaler and incremental_pca.
IMO it would be best for backwards compatibility to keep a scalar in the case when there are no NaNs or, for simplicity, in the case when all `n_samples_seen` are equal.
@jnothman @glemaitre I'll make it generalised, i.e. if all `n_samples_seen` are equal, it will return a scalar instead of a vector.
```python
    last_sample_count=self.n_samples_seen_)
_incremental_mean_and_var(
    X, last_mean=self.mean_, last_variance=self.var_,
    last_sample_count=self.n_samples_seen_ * np.ones(n_features))
```
We should not be supporting NaNs here. I think maybe we should still allow passing in a scalar `n_samples_seen_`, and `_incremental_mean_and_var` can broadcast it to `n_features` wide if appropriate.
```diff
 assert_array_less(zero, scaler_incr.scale_ + epsilon)
 # (i+1) because the Scaler has been already fitted
-assert_equal((i + 1), scaler_incr.n_samples_seen_)
+assert_almost_equal((i + 1), scaler_incr.n_samples_seen_)
```
I think you mean `assert_array_equal`, not `almost_equal`, if you are now trying to compare integer arrays rather than integer scalars.
sklearn/utils/extmath.py (Outdated)

```diff
 def _incremental_mean_and_var(X, last_mean=.0, last_variance=None,
-                              last_sample_count=0):
+                              last_sample_count=0, ignore_nan=True):
```
Maybe, to be strict, this should be `False` by default.
sklearn/utils/extmath.py (Outdated)

```python
new_sample_count = np.sum(~np.isnan(X), axis=0)
updated_sample_count = last_sample_count + new_sample_count

warnings.filterwarnings('ignore')  # as division by 0 might happen
```
This needs to be in a `catch_warnings` context, or else it changes the setting globally henceforth.

In fact, I think you should be using `with np.errstate`.

@jnothman I can do it, but the rest of the function only has calculations, which are mostly divisions. Hence if I use `with np.errstate`, then practically all of the remaining code (until the return statement) will come under it.

But you need a context manager in any case, unless you modify your operands.
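A small illustration of the `np.errstate` approach, assuming (as elsewhere in this thread) that division-by-zero results are masked to zero afterwards; the arrays are made up:

```python
import numpy as np

last_sum = np.array([10.0, 0.0])
new_sum = np.array([5.0, 0.0])
updated_sample_count = np.array([3, 0])

# np.errstate suppresses floating-point warnings only inside the block,
# unlike a bare warnings.filterwarnings('ignore'), which leaks globally.
with np.errstate(divide='ignore', invalid='ignore'):
    updated_mean = (last_sum + new_sum) / updated_sample_count

# Features that have seen no samples produce NaN/inf; zero them out.
updated_mean[~np.isfinite(updated_mean)] = 0
print(updated_mean)  # [5. 0.]
```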
```python
# updated = the aggregated stats
last_sum = last_mean * last_sample_count
new_sum = X.sum(axis=0)
sum_func = np.nansum if ignore_nan else np.sum
```
I wonder if we should start with `if not isnan(X.sum()): ignore_nan = False`, and then use fast paths that don't involve triplicating the memory like `~isnan(new_sum)` does.

Most of the lines in this function need to be modified to be able to ignore NaN values. If we want to avoid expensive `isnan`-style computations, there are two options:

- create a separate code path that computes without ignoring NaN, and another that computes ignoring NaN values
- for each line of code that includes an `isnan`-style function, check whether `ignore_nan` is true or false and write the code accordingly, like: `sum_func = np.nansum if ignore_nan else np.sum`
sklearn/utils/extmath.py (Outdated)

```python
    last_over_new_count / updated_sample_count *
    (last_sum / last_over_new_count - new_sum) ** 2)
# updated_unnormalized_variance can be both NaN or Inf
updated_unnormalized_variance[np.isnan(
```
Can we just use `updated_unnormalized_variance[np.logical_not(new_sample_count)] = 0` or something similar?

@jnothman Yeah, sure, but not exactly `new_sample_count`: rather `updated_unnormalized_variance[np.logical_not(updated_sample_count)] = 0`, because consider the case:

- `last_sample_count = [1, 3, 5, 7, 9]`
- `new_sample_count = [3, 1, 0, 5, 9]`
- so `updated_unnormalized_variance[2] != 0` just because `new_sample_count[2] == 0`

Fine. Seems simpler than isnan and isinf to me.
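A small illustration of the count-based masking just discussed (the values are made up):

```python
import numpy as np

updated_sample_count = np.array([4, 0, 2])
updated_unnormalized_variance = np.array([1.5, np.nan, 0.25])

# Zero out features whose total sample count is 0; this replaces the
# separate isnan/isinf masking with one mask derived from the counts.
updated_unnormalized_variance[np.logical_not(updated_sample_count)] = 0
print(updated_unnormalized_variance)  # [1.5  0.   0.25]
```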
```python
error_string_transform = ("Estimator doesn't check for NaN and inf in"
                          " transform.")
for X_train in [X_train_nan, X_train_inf]:
    if np.any(np.isnan(X_train)) and name in ALLOW_NAN:
```
…ikit-learn into sparseMatrix-test
Sparse matrix test
@pinakinathc OK. Ping me when you have addressed all the points to be reviewed.
Reference Issues/PRs

#10457: check if `incremental_mean_and_var` gives a green tick without failing in numerical_stability
What does this implement/fix? Explain your changes.
Any other comments?