Description
In the pull request gh-9834, I reported that a call such as
np.random.hypergeometric(2**55, 2**55, 10, size=20)
where good and bad are ridiculously large and nsample is less than or equal to 10, incorrectly generates an array of all zeros. As explained in that pull request, the problem is caused by the limited precision of the floating point calculations in the underlying C code.
There are also problems when nsample is larger. For example, these calls will hang the Python interpreter:
np.random.hypergeometric(2**62-1, 2**62-1, 26, size=2)
np.random.hypergeometric(2**62-1, 2**62-1, 11, size=12)
Moreover, the following call generates samples that are not correctly distributed:
np.random.hypergeometric(2**48, 2**47, 12, size=10000000)
I have run repeated tests and checked the distribution using either the chi-square test or the G-test (i.e. likelihood ratio test). When the arguments are extremely large, as in the call above, the distribution is no longer correct.
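For reference, here is a minimal sketch of the kind of chi-square check described above. It is not the exact script I used: the helper name `hypergeom_chisquare`, the seed, and the small arguments in the usage line are illustrative assumptions, and the expected counts are computed exactly with `math.comb` (Python 3.8+).

```python
import math
import numpy as np


def hypergeom_chisquare(ngood, nbad, nsample, size, seed=0):
    """Return (chi-square statistic, degrees of freedom) for one sample.

    Hypothetical helper for illustration; not part of numpy.
    """
    rng = np.random.RandomState(seed)
    sample = rng.hypergeometric(ngood, nbad, nsample, size=size)

    # Support of the hypergeometric distribution.
    k_min = max(0, nsample - nbad)
    k_max = min(nsample, ngood)

    # Exact probabilities from integer binomial coefficients.
    total = math.comb(ngood + nbad, nsample)
    pmf = np.array([
        math.comb(ngood, k) * math.comb(nbad, nsample - k) / total
        for k in range(k_min, k_max + 1)
    ])

    # Observed counts for each value in the support.
    observed = np.bincount(sample - k_min, minlength=len(pmf))
    expected = size * pmf

    stat = float(np.sum((observed - expected) ** 2 / expected))
    return stat, len(pmf) - 1


# Small arguments, where the sampler is expected to behave well.
stat, dof = hypergeom_chisquare(50, 40, 12, size=1_000_000)
print(stat, dof)
```

The statistic is then compared with the chi-square distribution with `dof` degrees of freedom. A statistic far out in the tail, seen consistently across repeated runs, indicates that the samples are not correctly distributed.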
All these problems are connected to floating point calculations in which important information ends up having a magnitude on the scale of the ULP of the variables in the computation.
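To illustrate the scale involved, the sketch below shows how coarse double precision is near 2**55. It only demonstrates the rounding effect; it does not reproduce the actual computation in the C code.

```python
import numpy as np

big = 2.0**55  # same magnitude as good and bad in the first example

# Spacing between adjacent doubles just above 2**55 (one ULP): 8.0.
ulp = np.spacing(big)

# A change of 1 is well below that spacing, so it is lost to rounding.
absorbed_add = (big + 1.0 == big)
absorbed_sub = (big - 1.0 == big)

print(ulp, absorbed_add, absorbed_sub)
```

With nsample around 10, the quantities that distinguish one outcome from another are about the same size as this spacing, so they are rounded away.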
A quick fix is to simply disallow such large arguments. I'll create a pull request in which the function will raise a ValueError if good + bad exceeds some safe maximum. That maximum is still to be determined, but preliminary experiments show that something on the order of 2**35 works. I don't expect the change to have any impact on existing code--it is hard to imagine anyone actually using the function with such large values. It is still worthwhile fixing the issue, if only to prevent hanging the interpreter when someone accidentally gives huge arguments to the function.
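Until such a check exists in numpy itself, the idea can be sketched as a wrapper in user code. The name `checked_hypergeometric` and the constant `SAFE_MAX_TOTAL` are hypothetical, the value 2**35 is only the preliminary estimate mentioned above, and the wrapper assumes scalar arguments.

```python
import numpy as np

# Assumption: preliminary estimate only; the real limit is still to be determined.
SAFE_MAX_TOTAL = 2**35


def checked_hypergeometric(ngood, nbad, nsample, size=None):
    """Hypothetical wrapper that rejects totals too large to sample safely."""
    total = int(ngood) + int(nbad)
    if total > SAFE_MAX_TOTAL:
        raise ValueError(
            "ngood + nbad = %d exceeds the assumed safe maximum %d"
            % (total, SAFE_MAX_TOTAL)
        )
    return np.random.hypergeometric(ngood, nbad, nsample, size=size)
```

With this in place, a call such as `checked_hypergeometric(2**62-1, 2**62-1, 26, size=2)` raises a ValueError immediately instead of hanging the interpreter.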
A better fix will be to improve the implementation, but that will almost certainly require changes to the stream of variates produced by the function. Such a change will have to wait until the change in numpy.random's reproducibility policy has been implemented.