BUG: random: Fix formula in rk_hypergeometric_hrua#11138
BUG: random: Fix formula in rk_hypergeometric_hrua#11138WarrenWeckesser wants to merge 1 commit intonumpy:masterfrom
Conversation
The HRUA algorithm is described in Stadlober's thesis "Sampling from
Poisson, binomial and hypergeometric distributions: ratio of uniforms
as a simple and fast alternative." The pseudo-code for the algorithm
begins on page 82, where the parameter c is defined to be
c = sqrt((N - n)*n*p*q/(N - 1) + 1/2)
`N` is the population size, `n` is the sample size, `p` = `M/N` is the
fraction of the first type (i.e. the fraction of "good" values or
successes) and `q` is `1 - p`.
The C code takes advantage of the symmetry and uses
m = min(sample, popsize - sample);
as the sample size to be computed, where `popsize` is (not surprisingly)
the population size (i.e. `good + bad`), and `sample` is the requested
sample size.
The formula for `c` above is translated in the C code as
d7 = sqrt((double)(popsize - m) * sample * d4 * d5 / (popsize - 1) + 0.5);
where `d4` and `d5` are `p` and `q`, resp.
The problem is that `sample` should have been replaced by `m` in that
formula. It should be
d7 = sqrt((double)(popsize - m) * m * d4 * d5 / (popsize - 1) + 0.5);
I haven't thoroughly analyzed the implications of this mistake. It will
only have an effect when `sample` is greater than half of `popsize`.
(When `sample` is less than or equal to half of `popsize`, `m == sample`.)
A few tests of the expected values of the distribution don't show
any obvious biases, but I haven't run it through a rigorous statistical
testing procedure.
It appears that it simply means that the bound for the rejection
method is too big, so the acceptance ratio of the method is much
lower than it should be. This can be seen in a test such as:
In [9]: %timeit np.random.hypergeometric(10, 190, 189, size=2500000)
2.07 s ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [10]: %timeit np.random.hypergeometric(10, 190, 11, size=2500000)
1.14 s ± 1.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Those two computations should take the same amount of time.
After the fixing the formula, they do:
In [4]: %timeit np.random.hypergeometric(10, 190, 189, size=2500000)
1.14 s ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [5]: %timeit np.random.hypergeometric(10, 190, 11, size=2500000)
1.14 s ± 694 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
|
This changes the stream of values produced by |
|
Can you verify the problem using a Q-Q or similar plot? If this is indeed a bug then it should be fixable despite breaking the stream. |
|
The mistake is pretty clear; at this point, I don't think I'll spend more time on numerical experiments related to it. Instead, I'm continuing to look into the derivation of the HRUA algorithm. Once I understand that better, it should be clear whether or not the mistake would introduce any kind of bias. If the only effect is to make the code slower than it could be, then there is no urgent need to fix it in numpy, and |
|
A bit more reading confirms what I observed. The mistake makes the shape parameter of the "table mountain hat" function used in the ratio-of-uniforms method bigger than necessary, so it doesn't cover the hypergeometric distribution's scaled PMF as tightly as it could. And that means that the rejection method is not as efficient as it could be. The variates that are generated, however, are not wrong. So there is no urgent need to merge this pull request. It can wait until numpy implements some form of versioning for the random module. |
|
I'm closing this PR. When NEP 019 is implemented (assuming that it will be accepted), we can fix this in the |
Summary of the changes: * Move the functions random_hypergeometric_hyp, random_hypergeometric_hrua and random_hypergeometric from distributions.c to legacy-distributions.c. These are now the legacy implementation of hypergeometric. * Add the files logfactorial.c and logfactorial.h, containing the function logfactorial(int64_t k). * Add the files random_hypergeometric.c and random_hypergeometric.h, containing the function random_hypergeometric (the new implementation of the hypergeometric distribution). See more details below. * Fix two tests in numpy/random/tests/test_generator_mt19937.py that used values returned by the hypergeometric distribution. The new implementation changes the stream, so those tests needed to be updated. * Remove another test obviated by an added constraint on the arguments of hypergeometric. Details of the rewrite: If you carefully step through the old function rk_hypergeometric_hyp(), you'll see that the end result is basically the same as the new function hypergeometric_sample(), but the new function accomplishes the result with just integers. The floating point calculations in the old code caused problems when the arguments were extremely large (explained in more detail in the unmerged pull request numpy#9834). The new version of hypergeometric_hrua() is a new translation of Stadlober's ratio-of-uniforms algorithm for the hypergeometric distribution. It fixes a mistake in the old implementation that made the method less efficient than it could be (see the details in the unmerged pull request numpy#11138), and uses a faster function for computing log(k!). The HRUA algorithm suffers from loss of floating point precision when the arguments are *extremely* large (see the comments in github issue 11443). To avoid these problems, the arguments `ngood` and `nbad` of hypergeometric must be less than 10**9. This constraint obviates an existing regression test that was run on systems with 64 bit long integers, so that test was removed.
The HRUA algorithm is described in Stadlober's thesis "Sampling from
Poisson, binomial and hypergeometric distributions: ratio of uniforms
as a simple and fast alternative." The pseudo-code for the algorithm
begins on page 82, where the parameter c is defined to be
Nis the population size,nis the sample size,p=M/Nis thefraction of the first type (i.e. the fraction of "good" values or
successes) and
qis1 - p.The C code takes advantage of the symmetry and uses
as the sample size to be computed, where
popsizeis (not surprisingly)the population size (i.e.
good + bad), andsampleis the requestedsample size.
The formula for
cabove is translated in the C code aswhere
d4andd5arepandq, resp.The problem is that
sampleshould have been replaced bymin thatformula. It should be
I haven't thoroughly analyzed the implications of this mistake. It will
only have an effect when
sampleis greater than half ofpopsize.(When
sampleis less than or equal to half ofpopsize,m == sample.)A few tests of the expected values of the distribution don't show
any obvious biases, but I haven't run it through a rigorous statistical
testing procedure.
It appears that it simply means that the bound for the rejection
method is too big, so the acceptance ratio of the method is much
lower than it should be. This can be seen in a test such as:
Those two computations should take the same amount of time.
After the fixing the formula, they do: