BUG: StandardScaler partial_fit overflows #5602
Status: Open
Labels: Bug, Moderate (requires some knowledge of conventions and best practices), help wanted, module:preprocessing
Description
The recent implementation of partial_fit for StandardScaler can overflow. One use case is transforming an indefinitely long stream of data, which is problematic with the current implementation: to compute the running mean, it keeps track of the running sample sum.

Here is code to reproduce the behavior. Simulating a long stream of data would take a long time; instead, I use samples with a very large norm, which has the same effect. The same batch is presented to the transformer many times, so the mean should stay the same.
```python
from sklearn.preprocessing import StandardScaler
import numpy as np

rng = np.random.RandomState(0)

def gen_1d_uniform_batch(min_, max_, n):
    return rng.uniform(min_, max_, size=(n, 1))

max_f = np.finfo(np.float64).max / 1e5
min_f = max_f / 1e2
stream_dim = 100
batch_dim = 500000

print("mean overflow: batch vs online on %d repetitions" % stream_dim)

# Batch fit: the mean is finite.
X = gen_1d_uniform_batch(min_=min_f, max_=min_f, n=batch_dim)
scaler = StandardScaler(with_std=False).fit(X)
print(scaler.mean_)
# [ 1.79769313e+301]

# Online fit: the running sample sum overflows.
iscaler = StandardScaler(with_std=False)
batch = gen_1d_uniform_batch(min_=min_f, max_=min_f, n=batch_dim)
for _ in range(stream_dim):
    iscaler = iscaler.partial_fit(batch)
# RuntimeWarning: overflow encountered in add
#   updated_mean = (last_sum + new_sum) / updated_sample_count
print(iscaler.mean_)
# [ inf]
```
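One way to avoid the overflow (a sketch, not scikit-learn's actual code) is to store the running mean itself rather than the running sum, and shift it toward each batch mean. The helper name `update_mean` below is hypothetical, and the constant is chosen small enough that a single batch mean is still representable in float64:

```python
import numpy as np

def update_mean(mean, count, batch):
    """Incremental per-feature mean update that never forms the raw running sum.

    mean  : current running mean, shape (n_features,)
    count : number of samples seen so far
    batch : new samples, shape (n_samples, n_features)
    """
    n_new = batch.shape[0]
    batch_mean = batch.mean(axis=0)
    total = count + n_new
    # Shift the old mean toward the batch mean; every intermediate
    # quantity stays on the order of the data, so nothing overflows.
    mean = mean + (batch_mean - mean) * (n_new / total)
    return mean, total

# Same shape of experiment as the reproduction above, with huge values.
big = np.full((500000, 1), np.finfo(np.float64).max / 1e7)
mean, count = np.zeros(1), 0
for _ in range(100):
    mean, count = update_mean(mean, count, big)
print(mean)  # stays finite instead of overflowing to inf
```

The trade-off is a small loss of precision per update compared with an exact sum, which is usually acceptable for streaming data.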