ENH: Faster algorithm for np.random.choice.#10124
ENH: Faster algorithm for np.random.choice.#10124jeremycg wants to merge 1 commit intonumpy:masterfrom
np.random.choice.#10124Conversation
|
See also #9855 |
|
Thanks @eric-wieser , that makes sense. It can't be incorporated due to "stream-backwards compatibility", in that the random numbers will not match the previous versions with the same seed, and this is required for backwards compatibility. Reading the PRs, it seems like there are multiple pull requests that are hanging around due to this same issue - should this be closed, or kept open for a point when a decision is made about how to incorporate versioned seeds? |
|
How we have been managing this is by adding a |
np.random.
np.random.np.random.choice.
|
I really think a wholesale new approach is required to allow progress in this module. Most approaches lead to lots of long-run code complexity and most people will never benefit from improved algorithms since they will never become the default. I haven't found a good way to accomplish this within the constraints of Cython (since new can't really be used, which might allow some clever tricks, although this approach is also probably not right either). |
|
This PR should be part of the random number improvement NEP 19, specifically it should be part of the bashtage/randomgen project which will eventually be vendored or copied into NumPy. |
|
@bashtage, it seems this idea was not incorporated into the new |
|
Not verbatim, but in principle. The block starting here: https://github.com/numpy/numpy/blob/master/numpy/random/generator.pyx#L680 |
|
Well, maybe that is a different issue - perhaps leave this open until we know for sure. |
|
@jeremycg can you take a look? |
|
It's not a replacement for the mentioned block, but for the chunk starting at line 657, which has not been changed: https://github.com/numpy/numpy/blob/master/numpy/random/generator.pyx#L657 The change is specific to the weighted random choice without replacement. I'm not up to date with the update to the rules for "stream-backwards compatibility": will review what's happened and update |
|
In Generator there are no guarantees at all at the moment, not even deprecation cycles, so this is a great time to make these kinds of performance-driven changes. |
|
Does this method rely on argpartition providing a determinstic order for reproducibility? Curious since the docstring disclaims this . |
|
It does depend on argpartition, depending on how we define reproducibility. The actual elements chosen will be repeatable, but the order in which they are returned depends on argpartitions order. We are getting n samples without replacement, weighted by p I'm not sure what the desired output here would be: if we care about repeated order, we could sort by our derived key, at a small cost of time, or sort and then scramble our original indices. |
|
Does this method have performance problems when the sample size is large but the required size is small, e.g., |
|
Since there were a few questions and this needs to be redone in the new random number API, I will close this. Not to stop it in its track; I actually hope for a new PR against the new random number generator API, where we can make such changes! |
Replacing the Random Choice with replacement with a more efficient version taken from here: https://stackoverflow.com/questions/15113650/faster-weighted-sampling-without-replacement/15205104#15205104
And published here:
http://www.sciencedirect.com/science/article/pii/S002001900500298X#
The basic idea is we determine a key for each item to be sampled from, based on 'a random uniform number and raising it to the power of one over the weight for each item'. Here we do it in log space, for numerical stability. Then we take the
sizehighest, by a partition sort.For me, it is roughly double as fast, on both 50,20 and 50000, 20000 for size and samples:
It fails on one unit test, test_choice_nonuniform_noreplace, as the randomly chosen numbers are different, due to the different algorithm. The rest pass.
I am not 100% sure on the
size -1in the argpartition, review is more than welcome.