[MRG] Fix for float16 overflow on accumulator operations #13010
rth merged 6 commits into scikit-learn:master
Conversation
rth
left a comment
Thank you @baluyotraf !
It might be good to add a non-regression test for overflow in StandardScaler with float16.
sklearn/utils/__init__.py
Outdated
# Use at least float64 for the accumulating functions to avoid precision issues;
# see https://github.com/numpy/numpy/issues/9393
# The float64 is also retained as it is in case the float overflows
def safe_acc_op(op, x, *args, **kwargs):
Please make this private, maybe more verbose (_safe_accumulate_op), and move it to utils.extmath.
…tils.extmath. Also fixed some line lengths to fit the 80 limit (scikit-learn#13007)
Moved the function to extmath and added the test. I also verified that the test fails on master and that it passes in this branch. Thanks for the review. o/
jnothman
left a comment
Thanks!
Please add an entry to the change log at doc/whats_new/v0.21.rst. Like the other entries there, please reference this pull request with :issue: and credit yourself (and any other contributors, if applicable) with :user:.
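For illustration, an entry of the requested form might look like the snippet below. The wording and section placement are assumptions; only the :issue: number and :user: name come from this thread.

```rst
- |Fix| :class:`preprocessing.StandardScaler` no longer overflows when
  fitted on float16 data, by accumulating statistics in float64.
  :issue:`13010` by :user:`baluyotraf`.
```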
sklearn/utils/extmath.py
Outdated
    updated_variance = None
else:
    new_unnormalized_variance = np.nanvar(X, axis=0) * new_sample_count
    new_unnormalized_variance = \
We prefer line continuations to use parentheses rather than backslash where possible.
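For example (an illustrative snippet, not the PR's actual code):

```python
import numpy as np

X = np.arange(6.0).reshape(3, 2)
new_sample_count = 3

# Backslash continuation (works, but discouraged):
new_unnormalized_variance = \
    np.nanvar(X, axis=0) * new_sample_count

# Parenthesized continuation (preferred style):
new_unnormalized_variance = (
    np.nanvar(X, axis=0) * new_sample_count)
```

Parentheses survive editor reflows and trailing whitespace, which is why most style guides (including PEP 8) prefer them over backslashes.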
I think I saw a backslash somewhere, so I kind of went along with it. I'll take note of this.
# Overflow calculations may cause -inf, inf, or nan. Since there is no nan
# input, all of the outputs should be finite. This may be redundant since a
# FloatingPointError exception will be thrown on overflow above.
assert np.all(np.isfinite(X_scaled))
I think it makes more sense to check that the output is identical to when the input is high precision. We may also want to check that the scaled features preserve the input dtype (although surely we have another test for that).
I tested it out before and found that the output is off after 2 or 3 decimal places. Should we cast the input during fit and cast it back to float16? It's kind of similar to #12333, only this time the imprecision is in the results rather than the mean.
Wouldn't you expect it to be off after 2 or 3 decimal places with float16?
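That expectation follows from the format itself; a quick back-of-the-envelope check (illustrative, not part of the PR):

```python
import numpy as np

# float16 has a 10-bit mantissa: eps (the ulp at 1.0) is 2**-10.
eps = float(np.finfo(np.float16).eps)

# The spacing between adjacent representable values scales with magnitude,
# so in the range [4.0, 8.0) it is eps * 4 = 2**-8, roughly 0.0039: only
# about 2 decimal places of the result are reliable there.
ulp_at_4 = eps * 4.0
```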
Would a test like this be enough?
def test_scaler_float16_overflow():
    # Test that the scaler does not overflow on float16 numpy arrays
    rng = np.random.RandomState(0)
    # float16 has a maximum of 65504. In the worst case 5 * 200000 is
    # 1,000,000, which is enough to overflow the data type
    X = rng.uniform(5, 10, [200000, 1]).astype(np.float16)
    with np.errstate(over='raise'):
        scaler = StandardScaler().fit(X)
        X_scaled = scaler.transform(X)
    # Calculate the float64 equivalent to verify the result
    X_scaled_f64 = StandardScaler().fit_transform(X.astype(np.float64))
    # Overflow calculations may cause -inf, inf, or nan. Since there is no nan
    # input, all of the outputs should be finite. This may be redundant since a
    # FloatingPointError exception will be thrown on overflow above.
    assert np.all(np.isfinite(X_scaled))
    # The normal distribution is very unlikely to go above 4. At 4.0-8.0 the
    # float16 precision is 2^-8, which is around 0.004. Thus only 2 decimals
    # are checked to account for precision differences.
    assert_array_almost_equal(X_scaled, X_scaled_f64, decimal=2)
There are CI failures, btw.
Kind of you to show your working. Looks great (especially if it also passes)!
…ult with respect to their precisions (scikit-learn#13007)
This did not fix #5602?
…ler (scikit-learn#13010)" This reverts commit 2ff7649.
Reference Issues/PRs
This fixes #13007
What does this implement/fix? Explain your changes.
A dtype of float64 is passed when using numpy-based accumulator functions to prevent overflow. This is only done for floating point inputs.
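As a rough illustration of the mechanism (a standalone sketch, not the PR's code): passing dtype=np.float64 keeps the accumulator in high precision even though the input array stays float16.

```python
import numpy as np

# 200000 values of 5.0 sum to 1,000,000, far above float16's max of 65504
X = np.full((200000, 1), 5.0, dtype=np.float16)

# Accumulating in the input dtype can overflow to inf:
naive_sum = np.nansum(X, axis=0)

# With a float64 accumulator the sum is exact:
safe_sum = np.nansum(X, axis=0, dtype=np.float64)  # array([1000000.])
```

The same dtype argument is accepted by the other accumulators involved here (np.sum, np.nanvar, etc.), which is what makes a single wrapper practical.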