BUG: StandardScaler partial_fit overflows #5602

@giorgiop

Description

The recent implementation of partial_fit for StandardScaler can overflow. A natural use case is transforming an indefinitely long stream of data, but that is problematic with the current implementation: to compute the running mean, we keep track of the running sample sum, which grows without bound.
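A minimal illustration of the failure mode (not scikit-learn code, just the arithmetic): accumulating a raw sum of large float64 values overflows long before the mean itself would, while the incremental form mean += (x - mean) / count never stores a quantity larger than the data.

```python
import numpy as np

huge = np.finfo(np.float64).max / 1e2  # near the float64 ceiling
n_batches = 1000

# Sum-based running mean: the accumulator overflows even though
# the true mean (= huge) is perfectly representable.
with np.errstate(over="ignore"):
    total = np.float64(0.0)
    for _ in range(n_batches):
        total += huge
    sum_based_mean = total / n_batches

# Incremental running mean: the update stays within the range of
# the data itself, so it cannot overflow here.
mean, count = 0.0, 0
for _ in range(n_batches):
    count += 1
    mean += (huge - mean) / count

print(np.isinf(sum_based_mean))  # True: the sum overflowed to inf
print(np.isfinite(mean))         # True: incremental update stays finite
```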

Here is the code to reproduce the behavior. Simulating a genuinely long stream of data would take a long time; instead, I use samples with a very large norm, but the effect is the same. The same batch is presented to the transformer many times, so the mean should stay the same.

from sklearn.preprocessing import StandardScaler
import numpy as np

rng = np.random.RandomState(0)

def gen_1d_uniform_batch(min_, max_, n):
    return rng.uniform(min_, max_, size=(n, 1))

max_f = np.finfo(np.float64).max / 1e5
min_f = max_f / 1e2
stream_dim = 100
batch_dim = 500000
print("mean overflow: batch vs online on %d repetitions" % stream_dim)

X = gen_1d_uniform_batch(min_=min_f, max_=min_f, n=batch_dim)

scaler = StandardScaler(with_std=False).fit(X)
print(scaler.mean_)
# [  1.79769313e+301]

iscaler = StandardScaler(with_std=False)
batch = gen_1d_uniform_batch(min_=min_f, max_=min_f, n=batch_dim)
for _ in range(stream_dim):
    iscaler = iscaler.partial_fit(batch)
# RuntimeWarning: overflow encountered in add
#   updated_mean = (last_sum + new_sum) / updated_sample_count

print(iscaler.mean_)
# [ inf]
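For reference, a sketch of how the update could avoid the overflow, assuming nothing about scikit-learn internals: stable_partial_mean below is a hypothetical helper (not part of the library) that folds each batch into the running mean directly, via the pairwise form mean_new = mean_old + (batch_mean - mean_old) * n_batch / n_total, instead of tracking the raw sum.

```python
import numpy as np

def stable_partial_mean(last_mean, last_count, X):
    """Incremental mean update that never stores a raw running sum.

    Hypothetical helper, not scikit-learn API. The update
    mean_new = mean_old + (batch_mean - mean_old) * n_batch / n_total
    only ever holds values in the range of the data itself.
    """
    n_batch = X.shape[0]
    n_total = last_count + n_batch
    batch_mean = X.mean(axis=0)
    updated_mean = last_mean + (batch_mean - last_mean) * (n_batch / n_total)
    return updated_mean, n_total

# Same kind of huge-norm, constant-valued data as in the reproduction above.
rng = np.random.RandomState(0)
big = np.finfo(np.float64).max / 1e7
X = rng.uniform(big, big, size=(500000, 1))

mean, count = np.zeros(1), 0
for _ in range(100):
    mean, count = stable_partial_mean(mean, count, X)

print(mean)  # stays finite near 1.8e+301 instead of overflowing to inf
```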
