BUG: random: Problems with hypergeometric with ridiculously large arguments. #11443

@WarrenWeckesser

In the pull request gh-9834, I reported that a call such as

np.random.hypergeometric(2**55, 2**55, 10, size=20)

where good and bad are ridiculously large and nsample is less than or equal to 10, incorrectly generates an array of all zeros. As explained in that pull request, the problem is caused by the limited precision of the floating point calculations in the underlying C code.

There are also problems when nsample is larger. For example, these calls will hang the Python interpreter:

np.random.hypergeometric(2**62-1, 2**62-1, 26, size=2)
np.random.hypergeometric(2**62-1, 2**62-1, 11, size=12)

Moreover, the following call generates samples that are not correctly distributed:

np.random.hypergeometric(2**48, 2**47, 12, size=10000000)

I have run repeated tests and checked the distribution using either the chi-square test or the G-test (i.e., the likelihood-ratio test). When the arguments are this big, the samples no longer follow the hypergeometric distribution.
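
For reference, a goodness-of-fit check along these lines can be sketched as follows. This is not the test code behind the results above, only a minimal illustration; the helper names `hypergeom_pmf` and `chi_square_statistic` are hypothetical. The expected probabilities are computed with exact integer arithmetic, so the reference values themselves are not affected by floating point rounding.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(good, bad, nsample):
    """Exact PMF of the number of 'good' items in nsample draws.

    Returns a list of Fractions indexed by k = 0..nsample.
    """
    total = comb(good + bad, nsample)
    return [Fraction(comb(good, k) * comb(bad, nsample - k), total)
            for k in range(nsample + 1)]


def chi_square_statistic(observed_counts, good, bad, nsample):
    """Pearson chi-square statistic of observed counts vs. the exact PMF.

    observed_counts[k] is the number of samples equal to k.
    Bins with zero expected count are skipped.
    """
    counts = [int(c) for c in observed_counts]
    n = sum(counts)
    stat = 0.0
    for obs, p in zip(counts, hypergeom_pmf(good, bad, nsample)):
        expected = float(n * p)
        if expected > 0:
            stat += (obs - expected) ** 2 / expected
    return stat
```

The observed counts could be obtained with something like `np.bincount(samples, minlength=nsample + 1)`. The statistic would then be compared against a chi-square distribution with one fewer degree of freedom than the number of bins used.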

All these problems are connected to floating point calculations in which important information ends up having a magnitude on the scale of the ULP of the variables in the computation.
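
To make the scale of the problem concrete, here is a small illustration of the spacing between adjacent double precision values at the magnitudes used above. It assumes the C code works with IEEE 754 float64 (C `double`) values and uses `math.ulp`, which requires Python 3.9 or later; it does not reproduce the actual computation in the C code.

```python
import math

x = float(2**55)

# Spacing between adjacent float64 values at 2**55.
print(math.ulp(x))  # 8.0

# Adding 1 is smaller than half the spacing, so it is lost to rounding.
print(x + 1.0 == x)  # True

# 2**62 - 1 is not representable; converting it to float rounds to 2**62.
print(int(float(2**62 - 1)) == 2**62 - 1)  # False
```

Any intermediate quantity that differs from these values by only a few units is therefore rounded away before it can influence the result.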

A quick fix is to simply disallow such large arguments. I'll create a pull request in which the function will raise a ValueError if good + bad exceeds some safe maximum. That maximum is still to be determined, but preliminary experiments show that something on the order of 2**35 works. I don't expect the change to have any impact on existing code--it is hard to imagine anyone actually using the function with such large values. It is still worthwhile fixing the issue, if only to prevent hanging the interpreter when someone accidentally gives huge arguments to the function.
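
Until such a check exists in the library, the same idea can be sketched on the Python side. This is only an illustration of the proposed guard, not the planned implementation: `check_population`, `checked_hypergeometric`, and `MAX_TOTAL` are hypothetical names, the limit of 2**35 is a placeholder taken from the rough estimate above, and only scalar integer arguments are handled.

```python
import numpy as np

# Placeholder limit: the safe maximum is still to be determined;
# 2**35 is only the rough order of magnitude mentioned in this issue.
MAX_TOTAL = 2**35


def check_population(good, bad, max_total=MAX_TOTAL):
    """Raise ValueError if good + bad exceeds max_total (scalar ints only)."""
    total = int(good) + int(bad)  # Python ints, so the sum cannot overflow
    if total > max_total:
        raise ValueError(
            f"good + bad = {total} exceeds the supported maximum {max_total}"
        )


def checked_hypergeometric(good, bad, nsample, size=None):
    """Validate the population size, then defer to np.random.hypergeometric."""
    check_population(good, bad)
    return np.random.hypergeometric(good, bad, nsample, size=size)
```

With this wrapper, `checked_hypergeometric(2**55, 2**55, 10, size=20)` raises a ValueError instead of returning zeros, while ordinary arguments pass through unchanged.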

A better fix will be to improve the implementation, but that will almost certainly require changes to the stream of variates produced by the function. Such a change will have to wait until the change in numpy.random's reproducibility policy has been implemented.
