BUG: random: Fix formula in rk_hypergeometric_hrua by WarrenWeckesser · Pull Request #11138 · numpy/numpy

WarrenWeckesser · 2018-05-22T20:19:37Z

The HRUA algorithm is described in Stadlober's thesis "Sampling from
Poisson, binomial and hypergeometric distributions: ratio of uniforms
as a simple and fast alternative." The pseudo-code for the algorithm
begins on page 82, where the parameter c is defined to be

c = sqrt((N - n)*n*p*q/(N - 1) + 1/2)

N is the population size, n is the sample size, p = M/N is the
fraction of the first type (i.e. the fraction of "good" values or
successes) and q is 1 - p.

The C code takes advantage of the symmetry and uses

m = min(sample, popsize - sample);

as the sample size to be computed, where popsize is (not surprisingly)
the population size (i.e. good + bad), and sample is the requested
sample size.

The formula for c above is translated in the C code as

d7 = sqrt((double)(popsize - m) * sample * d4 * d5 / (popsize - 1) + 0.5);

where d4 and d5 are p and q, resp.

The problem is that sample should have been replaced by m in that
formula. It should be

d7 = sqrt((double)(popsize - m) * m * d4 * d5 / (popsize - 1) + 0.5);

I haven't thoroughly analyzed the implications of this mistake. It will
only have an effect when sample is greater than half of popsize.
(When sample is less than or equal to half of popsize, m == sample.)
A few tests of the expected values of the distribution don't show
any obvious biases, but I haven't run it through a rigorous statistical
testing procedure.

It appears that it simply means that the bound for the rejection
method is too big, so the acceptance ratio of the method is much
lower than it should be. This can be seen in a test such as:

In [9]: %timeit np.random.hypergeometric(10, 190, 189, size=2500000)
2.07 s ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [10]: %timeit np.random.hypergeometric(10, 190, 11, size=2500000)
1.14 s ± 1.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Those two computations should take the same amount of time.
After the fixing the formula, they do:

In [4]: %timeit np.random.hypergeometric(10, 190, 189, size=2500000)
1.14 s ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: %timeit np.random.hypergeometric(10, 190, 11, size=2500000)
1.14 s ± 694 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

The HRUA algorithm is described in Stadlober's thesis "Sampling from Poisson, binomial and hypergeometric distributions: ratio of uniforms as a simple and fast alternative." The pseudo-code for the algorithm begins on page 82, where the parameter c is defined to be c = sqrt((N - n)*n*p*q/(N - 1) + 1/2) `N` is the population size, `n` is the sample size, `p` = `M/N` is the fraction of the first type (i.e. the fraction of "good" values or successes) and `q` is `1 - p`. The C code takes advantage of the symmetry and uses m = min(sample, popsize - sample); as the sample size to be computed, where `popsize` is (not surprisingly) the population size (i.e. `good + bad`), and `sample` is the requested sample size. The formula for `c` above is translated in the C code as d7 = sqrt((double)(popsize - m) * sample * d4 * d5 / (popsize - 1) + 0.5); where `d4` and `d5` are `p` and `q`, resp. The problem is that `sample` should have been replaced by `m` in that formula. It should be d7 = sqrt((double)(popsize - m) * m * d4 * d5 / (popsize - 1) + 0.5); I haven't thoroughly analyzed the implications of this mistake. It will only have an effect when `sample` is greater than half of `popsize`. (When `sample` is less than or equal to half of `popsize`, `m == sample`.) A few tests of the expected values of the distribution don't show any obvious biases, but I haven't run it through a rigorous statistical testing procedure. It appears that it simply means that the bound for the rejection method is too big, so the acceptance ratio of the method is much lower than it should be. This can be seen in a test such as: In [9]: %timeit np.random.hypergeometric(10, 190, 189, size=2500000) 2.07 s ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) In [10]: %timeit np.random.hypergeometric(10, 190, 11, size=2500000) 1.14 s ± 1.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) Those two computations should take the same amount of time. After the fixing the formula, they do: In [4]: %timeit np.random.hypergeometric(10, 190, 189, size=2500000) 1.14 s ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) In [5]: %timeit np.random.hypergeometric(10, 190, 11, size=2500000) 1.14 s ± 694 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

WarrenWeckesser · 2018-05-22T20:23:19Z

This changes the stream of values produced by numpy.random.hypergeometric(), so the test for repeatability will fail.

bashtage · 2018-05-22T22:34:03Z

Can you verify the problem using a Q-Q or similar plot? If this is indeed a bug then it should be fixable despite breaking the stream.

WarrenWeckesser · 2018-05-22T22:43:04Z

The mistake is pretty clear; at this point, I don't think I'll spend more time on numerical experiments related to it. Instead, I'm continuing to look into the derivation of the HRUA algorithm. Once I understand that better, it should be clear whether or not the mistake would introduce any kind of bias. If the only effect is to make the code slower than it could be, then there is no urgent need to fix it in numpy, and hypergeometric can remain in its petrified state. :)

WarrenWeckesser · 2018-05-23T14:24:45Z

A bit more reading confirms what I observed. The mistake makes the shape parameter of the "table mountain hat" function used in the ratio-of-uniforms method bigger than necessary, so it doesn't cover the hypergeometric distribution's scaled PMF as tightly as it could. And that means that the rejection method is not as efficient as it could be. The variates that are generated, however, are not wrong.

So there is no urgent need to merge this pull request. It can wait until numpy implements some form of versioning for the random module.

WarrenWeckesser · 2018-07-02T15:06:42Z

I'm closing this PR. When NEP 019 is implemented (assuming that it will be accepted), we can fix this in the RandomGenerator class.

Summary of the changes: * Move the functions random_hypergeometric_hyp, random_hypergeometric_hrua and random_hypergeometric from distributions.c to legacy-distributions.c. These are now the legacy implementation of hypergeometric. * Add the files logfactorial.c and logfactorial.h, containing the function logfactorial(int64_t k). * Add the files random_hypergeometric.c and random_hypergeometric.h, containing the function random_hypergeometric (the new implementation of the hypergeometric distribution). See more details below. * Fix two tests in numpy/random/tests/test_generator_mt19937.py that used values returned by the hypergeometric distribution. The new implementation changes the stream, so those tests needed to be updated. * Remove another test obviated by an added constraint on the arguments of hypergeometric. Details of the rewrite: If you carefully step through the old function rk_hypergeometric_hyp(), you'll see that the end result is basically the same as the new function hypergeometric_sample(), but the new function accomplishes the result with just integers. The floating point calculations in the old code caused problems when the arguments were extremely large (explained in more detail in the unmerged pull request numpy#9834). The new version of hypergeometric_hrua() is a new translation of Stadlober's ratio-of-uniforms algorithm for the hypergeometric distribution. It fixes a mistake in the old implementation that made the method less efficient than it could be (see the details in the unmerged pull request numpy#11138), and uses a faster function for computing log(k!). The HRUA algorithm suffers from loss of floating point precision when the arguments are *extremely* large (see the comments in github issue 11443). To avoid these problems, the arguments `ngood` and `nbad` of hypergeometric must be less than 10**9. This constraint obviates an existing regression test that was run on systems with 64 bit long integers, so that test was removed.

WarrenWeckesser mentioned this pull request May 22, 2018

ENH: random: Add multivariate_hypergeometric function. #8056

Closed

charris added 00 - Bug component: numpy.random labels May 23, 2018

WarrenWeckesser closed this Jul 2, 2018

WarrenWeckesser mentioned this pull request Jun 12, 2019

MAINT: random: Rewrite the hypergeometric distribution. #13761

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

BUG: random: Fix formula in rk_hypergeometric_hrua#11138

BUG: random: Fix formula in rk_hypergeometric_hrua#11138
WarrenWeckesser wants to merge 1 commit intonumpy:masterfrom
WarrenWeckesser:fix-hypergeom-hrua

WarrenWeckesser commented May 22, 2018

Uh oh!

WarrenWeckesser commented May 22, 2018

Uh oh!

bashtage commented May 22, 2018

Uh oh!

WarrenWeckesser commented May 22, 2018 •

edited

Loading

Uh oh!

WarrenWeckesser commented May 23, 2018 •

edited

Loading

Uh oh!

WarrenWeckesser commented Jul 2, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

WarrenWeckesser commented May 22, 2018

Uh oh!

WarrenWeckesser commented May 22, 2018

Uh oh!

bashtage commented May 22, 2018

Uh oh!

WarrenWeckesser commented May 22, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

WarrenWeckesser commented May 23, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

WarrenWeckesser commented Jul 2, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

WarrenWeckesser commented May 22, 2018 •

edited

Loading

WarrenWeckesser commented May 23, 2018 •

edited

Loading