-
-
Notifications
You must be signed in to change notification settings - Fork 12k
Description
A potential performance issue relating to hash collisions from distinct NaNs was recently reported and fixed in core Python. In brief, the fact that all NaNs hash to 0 makes it easy to accidentally invoke quadratic-time behaviour from NaNs going into a set or dictionary. The fix involved changing the hash of a float NaN to match the usual object hash.
The core issue affects NumPy, too (and in fact one of the examples given in the Python bug report to demonstrate the issue used np.float64 rather than float); NumPy may want to consider changing the hash of floating-point NaNs, both to address the potential performance issue, and to remain compatible with Python.
Reproducing code example:
The following code takes several hours to execute on my machine:
>>> import collections, numpy as np
>>> x = np.full(1000000, np.nan)
>>> c = collections.Counter(x)The cause is that (a) the float64 NaN objects realised from the array all have distinct id, and so are considered unequal for the purposes of dict and set membership, together with (b) all of those NaNs have the same hash value of 0. Then we get a perfect storm of hash collisions while building the dictionary underlying the Counter object.
NumPy/Python version information:
>>> import sys, numpy; print(numpy.__version__, sys.version)
1.20.2 3.9.4 (default, Apr 6 2021, 11:23:37)
[Clang 11.0.3 (clang-1103.0.32.62)]