BUG: StandardScaler partial_fit overflows #5602
Status: Open
Labels: Bug, Moderate (requires some knowledge of conventions and best practices), help wanted, module:preprocessing
Description
The recent implementation of partial_fit for StandardScaler can overflow. One use case is transforming an indefinitely long stream of data, which is problematic with the current implementation: to compute the running mean, it keeps track of the running sample sum.

Here is code to reproduce the behavior. Simulating a long stream of data would take a long time; instead, I use samples with a very large norm, which has the same effect. The same batch is presented to the transformer many times, so the mean should stay the same.
```python
from sklearn.preprocessing import StandardScaler
import numpy as np

rng = np.random.RandomState(0)

def gen_1d_uniform_batch(min_, max_, n):
    return rng.uniform(min_, max_, size=(n, 1))

max_f = np.finfo(np.float64).max / 1e5
min_f = max_f / 1e2
stream_dim = 100
batch_dim = 500000

print("mean overflow: batch vs online on %d repetitions" % stream_dim)

# Batch fit: the mean is finite.
X = gen_1d_uniform_batch(min_=min_f, max_=min_f, n=batch_dim)
scaler = StandardScaler(with_std=False).fit(X)
print(scaler.mean_)
# [ 1.79769313e+301]

# Online fit: the running sample sum overflows.
iscaler = StandardScaler(with_std=False)
batch = gen_1d_uniform_batch(min_=min_f, max_=min_f, n=batch_dim)
for _ in range(stream_dim):
    iscaler = iscaler.partial_fit(batch)
# RuntimeWarning: overflow encountered in add
#   updated_mean = (last_sum + new_sum) / updated_sample_count
print(iscaler.mean_)
# [ inf]
```
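One way to avoid the overflow (a sketch, not scikit-learn's actual code) is to store the running mean itself rather than the running sum, and shift it toward each batch mean. The helper name `update_mean` below is hypothetical, and the constant is chosen small enough that a single batch mean is still representable in float64:

```python
import numpy as np

def update_mean(mean, count, batch):
    """Incremental per-feature mean update that never forms the raw running sum.

    mean  : current running mean, shape (n_features,)
    count : number of samples seen so far
    batch : new samples, shape (n_samples, n_features)
    """
    n_new = batch.shape[0]
    batch_mean = batch.mean(axis=0)
    total = count + n_new
    # Shift the old mean toward the batch mean; every intermediate
    # quantity stays on the order of the data, so nothing overflows.
    mean = mean + (batch_mean - mean) * (n_new / total)
    return mean, total

# Same shape of experiment as the reproduction above, with huge values.
big = np.full((500000, 1), np.finfo(np.float64).max / 1e7)
mean, count = np.zeros(1), 0
for _ in range(100):
    mean, count = update_mean(mean, count, big)
print(mean)  # stays finite instead of overflowing to inf
```

The trade-off is a small loss of precision per update compared with an exact sum, which is usually acceptable for streaming data.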