[MRG] Increase mean precision for large float32 arrays #12338
rth merged 7 commits into scikit-learn:master from
Conversation
Can you confirm that there is an existing test ensuring the input's dtype is maintained after scaling, and add one if not?

I didn't see such a test, so I added one. Note that it fails even before this change if we include np.float128, due to the check_array call in StandardScaler.partial_fit with dtype=FLOAT_DTYPES; this seems fine, as a clear warning is given:
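A dtype-preservation test of the kind requested above can be sketched as follows (the array shape and random seed here are illustrative assumptions, not taken from the PR's actual test):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# StandardScaler should return output with the same float dtype as its input,
# even though internal mean/variance accumulation may use higher precision.
for dtype in (np.float32, np.float64):
    X = np.random.RandomState(0).rand(100, 3).astype(dtype)
    X_scaled = StandardScaler().fit_transform(X)
    assert X_scaled.dtype == dtype
```

This mirrors the property under discussion: the transform output keeps the input dtype, while the precision fix only affects the internal computation.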
jnothman left a comment:
Actually, does the dtype of mean_ etc. change? This might at least deserve a note in what's new, if not a test. Otherwise this is looking good!

I'd be okay with including this in 0.20.X.
The dtype of mean_ and scale_ is always float64, which is also the case without this change. Should I add a check for this to the test?
Sure, thanks!
rth left a comment:
LGTM apart from the comment below, and https://github.com/scikit-learn/scikit-learn/pull/12338/files#r224266370 still needs addressing (by removing copy=True).

Great, I think those issues have been addressed now.
rth left a comment:
LGTM, please add a what's new entry in doc/whats_new/v0.20.rst under the 0.20.1 section. As far as I can tell, the estimators affected by this are preprocessing.StandardScaler and decomposition.IncrementalPCA.
Use at least float64 when computing mean in _incremental_mean_and_var. This avoids precision issues with np.mean on long multidimensional float32 arrays.
Thanks, merging; the AppVeyor failure is unrelated.
doc/whats_new/v0.20.rst:

    precision issues when using float32 datasets. Affects
    :class:`preprocessing.StandardScaler` and
    :class:`decomposition.IncrementalPCA`.
    :issue:`12333` by :user:`bauks <bauks>`.
Actually, this should reference the PR, and we can't mention a private function in the what's new, only the effect of this change on public estimators. I'll fix it.
…learn#12338)" This reverts commit 91f7a68.
Reference Issues/PRs
Fixes #12333.
What does this implement/fix? Explain your changes.
Uses at least float64 precision when computing mean in _incremental_mean_and_var.
This avoids precision issues with np.mean on long multidimensional float32 arrays, as discussed in numpy/numpy#9393.
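The NumPy behaviour being worked around can be reproduced directly. A minimal sketch, assuming an illustrative array shape and fill value (neither is taken from the PR); passing dtype=np.float64 to the reduction is essentially what the fix does inside the mean computation:

```python
import numpy as np

# A long 2D float32 array: reducing along axis 0 accumulates in float32,
# so rounding error grows with the number of rows (numpy/numpy#9393).
X = np.full((4_000_000, 2), 0.1, dtype=np.float32)

mean32 = X.mean(axis=0)                    # float32 accumulator, may drift
mean64 = X.mean(axis=0, dtype=np.float64)  # float64 accumulator, accurate

# The float64 result stays close to 0.1; the float32 result is no closer.
print(mean32, mean64)
```

The fix raises the accumulator precision only; inputs and transformed outputs keep their original float32 dtype.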