[MRG] Increase mean precision for large float32 arrays #12338
rth merged 7 commits into scikit-learn:master from
Conversation
Can you confirm that there is an existing test ensuring the input's dtype is maintained after scaling, and add one if not?

I didn't see such a test, so I added one. Note that it fails even before this change if we include np.float128, due to the check_array call in StandardScaler.partial_fit with dtype=FLOAT_DTYPES; this seems fine, as a clear warning is given:
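A dtype-preservation test of the kind requested above can be sketched as follows (the array shape and random seed here are illustrative assumptions, not taken from the PR's actual test):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# StandardScaler should return output with the same float dtype as its input,
# even though internal mean/variance accumulation may use higher precision.
for dtype in (np.float32, np.float64):
    X = np.random.RandomState(0).rand(100, 3).astype(dtype)
    X_scaled = StandardScaler().fit_transform(X)
    assert X_scaled.dtype == dtype
```

This mirrors the property under discussion: the transform output keeps the input dtype, while the precision fix only affects the internal computation.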
jnothman left a comment:
Actually, does the dtype of mean_ etc. change? This might at least deserve a note in what's new, if not a test. Otherwise this is looking good!

I'd be okay with including this in 0.20.X.
The dtype of mean_ and scale_ is always float64, which is also the case without this change. Should I add a check for this to the test?
Sure, thanks!
rth left a comment:
LGTM apart from the comment below, and https://github.com/scikit-learn/scikit-learn/pull/12338/files#r224266370 still needs addressing (by removing copy=True).

Great, I think those issues have been addressed now.
rth left a comment:
LGTM, please add a what's new entry in doc/whats_new/v0.20.rst under the 0.20.1 section. As far as I can tell, the estimators affected by this are preprocessing.StandardScaler and decomposition.IncrementalPCA.
Use at least float64 when computing mean in _incremental_mean_and_var. This avoids precision issues with np.mean on long multidimensional float32 arrays.
Thanks, merging; the AppVeyor failure is unrelated.
doc/whats_new/v0.20.rst:

    precision issues when using float32 datasets. Affects
    :class:`preprocessing.StandardScaler` and
    :class:`decomposition.IncrementalPCA`.
    :issue:`12333` by :user:`bauks <bauks>`.
Actually, this should reference the PR, and we can't mention a private function in the what's new, only the effect of this change on public estimators. I'll fix it.
…learn#12338)" This reverts commit 91f7a68.
Reference Issues/PRs
Fixes #12333.
What does this implement/fix? Explain your changes.
Uses at least float64 precision when computing mean in _incremental_mean_and_var.
This avoids precision issues with np.mean on long multidimensional float32 arrays, as discussed in numpy/numpy#9393.
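The NumPy behaviour being worked around can be reproduced directly. A minimal sketch, assuming an illustrative array shape and fill value (neither is taken from the PR); passing dtype=np.float64 to the reduction is essentially what the fix does inside the mean computation:

```python
import numpy as np

# A long 2D float32 array: reducing along axis 0 accumulates in float32,
# so rounding error grows with the number of rows (numpy/numpy#9393).
X = np.full((4_000_000, 2), 0.1, dtype=np.float32)

mean32 = X.mean(axis=0)                    # float32 accumulator, may drift
mean64 = X.mean(axis=0, dtype=np.float64)  # float64 accumulator, accurate

# The float64 result stays close to 0.1; the float32 result is no closer.
print(mean32, mean64)
```

The fix raises the accumulator precision only; inputs and transformed outputs keep their original float32 dtype.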