Skip to content

ENH: Faster algorithm for np.random.choice.#10124

Closed
jeremycg wants to merge 1 commit intonumpy:masterfrom
jeremycg:reserviorchoice
Closed

ENH: Faster algorithm for np.random.choice.#10124
jeremycg wants to merge 1 commit intonumpy:masterfrom
jeremycg:reserviorchoice

Conversation

@jeremycg
Copy link

@jeremycg jeremycg commented Nov 29, 2017

Replacing the Random Choice with replacement with a more efficient version taken from here: https://stackoverflow.com/questions/15113650/faster-weighted-sampling-without-replacement/15205104#15205104

And published here:
http://www.sciencedirect.com/science/article/pii/S002001900500298X#

The basic idea is we determine a key for each item to be sampled from, based on 'a random uniform number and raising it to the power of one over the weight for each item'. Here we do it in log space, for numerical stability. Then we take the size highest, by a partition sort.

For me, it is roughly double as fast, on both 50,20 and 50000, 20000 for size and samples:

sizeofsample, nsample = 50, 20
p = np.random.uniform(size = sizeofsample)
p = p/sum(p)
data = np.arange(sizeofsample)

%%timeit
np.random.choice(data, p = p, replace = False, size = nsample)

It fails on one unit test, test_choice_nonuniform_noreplace, as the randomly chosen numbers are different, due to the different algorithm. The rest pass.

I am not 100% sure on the size -1 in the argpartition, review is more than welcome.

@eric-wieser
Copy link
Member

See also #9855

@jeremycg
Copy link
Author

Thanks @eric-wieser , that makes sense. It can't be incorporated due to "stream-backwards compatibility", in that the random numbers will not match the previous versions with the same seed, and this is required for backwards compatibility.

Reading the PRs, it seems like there are multiple pull requests that are hanging around due to this same issue - should this be closed, or kept open for a point when a decision is made about how to incorporate versioned seeds?

@charris
Copy link
Member

charris commented Nov 30, 2017

How we have been managing this is by adding a method keyword to the calls, although eventually that will end up creating a mess of alternatives. We are also locked into using a single rng. There is work ongoing to fix some of these problems. @bashtage Thoughts?

@charris charris changed the title adding changes to np random choice ENH: Faster algorithm for np.random. Nov 30, 2017
@charris charris changed the title ENH: Faster algorithm for np.random. ENH: Faster algorithm for np.random.choice. Nov 30, 2017
@bashtage
Copy link
Contributor

I really think a wholesale new approach is required to allow progress in this module. Most approaches lead to lots of long-run code complexity and most people will never benefit from improved algorithms since they will never become the default. I haven't found a good way to accomplish this within the constraints of Cython (since new can't really be used, which might allow some clever tricks, although this approach is also probably not right either).

@mattip
Copy link
Member

mattip commented Oct 16, 2018

This PR should be part of the random number improvement NEP 19, specifically it should be part of the bashtage/randomgen project which will eventually be vendored or copied into NumPy.

@mattip mattip added the 57 - Close? Issues which may be closable unless discussion continued label Oct 16, 2018
@mattip
Copy link
Member

mattip commented May 29, 2019

@bashtage, it seems this idea was not incorporated into the new generator.pyx code. Any thoughts?

@bashtage
Copy link
Contributor

Not verbatim, but in principle.

The block starting here:

https://github.com/numpy/numpy/blob/master/numpy/random/generator.pyx#L680

@bashtage
Copy link
Contributor

Well, maybe that is a different issue - perhaps leave this open until we know for sure.

@mattip
Copy link
Member

mattip commented May 29, 2019

@jeremycg can you take a look?

@jeremycg
Copy link
Author

It's not a replacement for the mentioned block, but for the chunk starting at line 657, which has not been changed:

https://github.com/numpy/numpy/blob/master/numpy/random/generator.pyx#L657

The change is specific to the weighted random choice without replacement.

I'm not up to date with the update to the rules for "stream-backwards compatibility": will review what's happened and update

@bashtage
Copy link
Contributor

In Generator there are no guarantees at all at the moment, not even deprecation cycles, so this is a great time to make these kinds of performance-driven changes.

@bashtage
Copy link
Contributor

Does this method rely on argpartition providing a determinstic order for reproducibility? Curious since the docstring disclaims this .

@jeremycg
Copy link
Author

It does depend on argpartition, depending on how we define reproducibility.

The actual elements chosen will be repeatable, but the order in which they are returned depends on argpartitions order.

We are getting n samples without replacement, weighted by p
We find keys, based on random deviates from the exponential distribution, and weight by our p
We then use arg partition to find the n largest keys
We return these elements from our original array

I'm not sure what the desired output here would be: if we care about repeated order, we could sort by our derived key, at a small cost of time, or sort and then scramble our original indices.

@bashtage
Copy link
Contributor

Does this method have performance problems when the sample size is large but the required size is small, e.g.,

x = np.arange(10000000.)
p = 1./np.sqrt((x+1))
p ./ p.sum()
np.random.choice(x, p=p, replace = False, size = 20)

@seberg
Copy link
Member

seberg commented Apr 10, 2020

Since there were a few questions and this needs to be redone in the new random number API, I will close this. Not to stop it in its track; I actually hope for a new PR against the new random number generator API, where we can make such changes!

@seberg seberg closed this Apr 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

01 - Enhancement 57 - Close? Issues which may be closable unless discussion continued component: numpy.random

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants