
ENH: More efficient algorithm for unweighted random choice without replacement#5158

Closed
staticd-growthecommons wants to merge 4 commits intonumpy:masterfrom
staticd-growthecommons:choice-unweighted-no-replace

Conversation

@staticd-growthecommons

  1. The present algorithm applies a Knuth shuffle to the entire population and then truncates the result to the requested size. The same result can be achieved more efficiently by not shuffling the elements that are never seen by the end user. This is especially relevant when choosing small samples from a large population.

  2. The present shuffling code is very general-purpose. Since what is being shuffled is an array of indices, specifying the type in the Cython code gives significant speed benefits.
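
The partial-shuffle idea in point 1 can be sketched in pure Python/NumPy roughly as follows (a sketch of the technique, not the actual patch; `partial_fisher_yates` is a hypothetical helper name, and it uses the modern `default_rng` API rather than the 2014-era internals):

```python
import numpy as np

def partial_fisher_yates(n, size, seed=None):
    # Hypothetical sketch: draw `size` distinct indices from range(n) by
    # performing only the first `size` steps of a Fisher-Yates shuffle,
    # leaving the rest of the population untouched.
    rng = np.random.default_rng(seed)
    idx = np.arange(n)
    for i in range(size):
        j = rng.integers(i, n)           # uniform pick from the unshuffled tail
        idx[i], idx[j] = idx[j], idx[i]  # move the pick into position i
    return idx[:size]
```

This does O(size) swaps instead of O(n), which is where the large-population speedups in the table come from.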

| input size | output size | original | new |
| --- | --- | --- | --- |
| 10 | 1 | 25.3 µs | 24.6 µs |
| 10 | 3 | 25.5 µs | 24.7 µs |
| 10 | 10 | 25.5 µs | 25 µs |
| 100 | 10 | 40.6 µs | 25.2 µs |
| 100 | 30 | 40.7 µs | 26 µs |
| 100 | 100 | 40.8 µs | 28.2 µs |
| 1000 | 100 | 190 µs | 29.7 µs |
| 1000 | 300 | 192 µs | 35.7 µs |
| 1000 | 1000 | 192 µs | 58.9 µs |
| 10000 | 1000 | 1.68 ms | 76.4 µs |
| 10000 | 3000 | 1.69 ms | 140 µs |
| 10000 | 10000 | 1.71 ms | 363 µs |
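
Timings like these can be gathered with `timeit`; a rough sketch for one (input, output) pair (absolute numbers are machine- and version-dependent):

```python
import timeit
import numpy as np

# Time one (input size, output size) pair from the table above.
rs = np.random.RandomState(0)
per_call = timeit.timeit(
    lambda: rs.choice(1000, 100, replace=False), number=100) / 100
print(f"{per_call * 1e6:.1f} us per call")
```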

@staticd-growthecommons staticd-growthecommons changed the title More efficient algorithm for unweighted random choice without replacement ENH: More efficient algorithm for unweighted random choice without replacement Oct 7, 2014
@staticd-growthecommons
Author

Oh, and if anyone wants to verify that the code is working properly, you can run the following test:

    import numpy as np

    def foo(sz, osz, t):
        a = np.arange(sz)
        f = np.zeros((osz, sz))
        for trial in range(t):
            c = np.random.choice(a, osz, replace=False)
            for i in range(osz):
                f[i, c[i]] += 1
        f = f * sz / t
        print(f)

    for i in range(1, 11):
        foo(10, i, 1000000)

It prints the population size times the empirical probability of finding a given input element in a given output slot. If the sampling is uniform, all of the printed values must be close to 1.

It won't pass the present

@seberg
Member

seberg commented Oct 7, 2014

I know the old implementation was not great, and I never liked that implication, but unfortunately the exact stream of random numbers must be reliable, so changing this is a bit more involved. Or does the stream not change with your algorithm?

@arjoly

arjoly commented Oct 7, 2014

For information, scikit-learn has some nice code for doing sampling without replacement:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/_random.pyx

@seberg
Member

seberg commented Oct 7, 2014

It's great, but unfortunately we were not smart enough to at least take the last N values, which would have allowed us to not shuffle the whole thing... As it is, to change this you would probably need a transitional keyword argument, some FutureWarnings, etc.

@charris
Member

charris commented Oct 7, 2014

I wonder if we couldn't start adding a version/method keyword to some of the generators if we want to implement faster algorithms? The normal distribution is another one that can be sped up.

@staticd-growthecommons
Author

The stream is not identical because the original algorithm processed the array from the end to the beginning and then returned the first N values. This behaviour cannot be replicated without processing the entire population.

The output from the new algorithm satisfies the conditions:
(1) no number appears twice in the output (see test below)
(2) at every position of the output, it is equally likely for any of the remaining input numbers to appear (see test given in previous post)

In [1]: import numpy as np

In [2]: def norepeats(x, maxv):
   ...:     c = np.zeros(maxv, dtype=int)
   ...:     for i in range(len(x)):
   ...:         c[x[i]] += 1
   ...:     if np.any(c > 1):
   ...:         print("error", c)
   ...:

In [3]: for i in range(10):
   ...:     for j in range(1, i):
   ...:         for t in range(1000):
   ...:             norepeats(np.random.choice(i, j, replace=False), i)

@seberg Does numpy guarantee that the outputs will have the same sequence for a given seed across generations?

@charris That is something that would be extremely useful. I have alternative algorithms for each of the choice methods, some of which are slower than the original for some parameters while up to 9x faster for other parameter values. Allowing the user to choose the algorithm would probably be quite useful, especially when they want to make a speed/memory trade-off.

@charris
Member

charris commented Oct 7, 2014

Does numpy guarantee that the outputs will have the same sequence for a given seed across generations?

Yes, it is needed for reproducibility.
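
That guarantee is easy to observe directly; a minimal check using the legacy `RandomState` interface under discussion here:

```python
import numpy as np

# Two generators seeded identically must produce the identical stream of
# draws; this is the constraint on changing choice()'s algorithm in place.
a = np.random.RandomState(1234).choice(1000, 10, replace=False)
b = np.random.RandomState(1234).choice(1000, 10, replace=False)
print((a == b).all())  # True
```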

@staticd-growthecommons
Author

Oh, if it doesn't come with a bunch of hidden development costs, the idea of an algorithm keyword sounds very tempting.

@seberg
Member

seberg commented Oct 7, 2014

I am +1 for that. In general, I think we could even put algorithm=None in place with a FutureWarning if (and only if) the random state was set explicitly. That way we would have some way of changing the default while bothering as few people as possible (and possibly even switching to the faster algorithm sooner for those who don't care about reproducibility).

@seberg
Member

seberg commented Oct 7, 2014

There is always some overhead, especially since we should try to switch the default at some point. But it is probably quite manageable (of course it will need hopefully simple follow-ups later). So if you have some time, maybe you can give it a shot...

@staticd-growthecommons
Author

How can one check if the random state was set explicitly?

@seberg
Member

seberg commented Oct 7, 2014

I don't think we have that yet, but since we have a function to set it, it might be easy?

@njsmith
Member

njsmith commented Oct 8, 2014

Currently the object is initialized from a seed= argument. Make it so that it's instead initialized from seed=, version= arguments, where version defaults to 0, version=1 gives fancier algorithms, and version=-1 gives the fanciest supported (just like pickle). If seed= isn't given at all then I guess default to version=-1, because the output won't be reproducible anyway...
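
One way the proposed defaulting could look (purely illustrative names, not actual NumPy API):

```python
LATEST_VERSION = 1  # hypothetical: highest algorithm version implemented

def resolve_version(seed=None, version=None):
    """Normalize the proposed (seed, version) pair at construction time."""
    if version is None:
        # Without an explicit seed the stream is irreproducible anyway,
        # so default to the fanciest algorithms; otherwise stay compatible.
        version = -1 if seed is None else 0
    if version == -1:  # "fanciest supported", like pickle's protocol=-1
        version = LATEST_VERSION
    return version
```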


@charris
Member

charris commented Oct 8, 2014

The trick with a version argument is making it backward compatible with pickled generators. We will need to deal with the same problem if we add new RNGs, so we might as well consider both problems together.

@rkern
Member

rkern commented Oct 8, 2014

Handling the unpickling is easy enough. The version number is just another bit of state recorded by __getstate__() and read by __setstate__(). __setstate__() simply takes the absence of the version number as meaning version=0.
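
A minimal sketch of that idea (an illustrative class, not the actual `RandomState` code):

```python
import pickle

class VersionedRNG:
    # Illustrative: carry an optional algorithm version through pickling.
    def __init__(self, seed=None, version=0):
        self.seed = seed
        self.version = version

    def __getstate__(self):
        # Record the version as just another bit of state.
        return {'seed': self.seed, 'version': self.version}

    def __setstate__(self, state):
        self.seed = state['seed']
        # Old pickles have no 'version' entry; treat absence as version=0.
        self.version = state.get('version', 0)

g = pickle.loads(pickle.dumps(VersionedRNG(seed=7, version=1)))
old = VersionedRNG.__new__(VersionedRNG)
old.__setstate__({'seed': 3})  # simulate unpickling a pre-version pickle
```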

@charris
Member

charris commented Oct 8, 2014

@rkern I was concerned about whether we would be able to detect the absence of the version number. Is that easy to do?

@rkern
Member

rkern commented Oct 8, 2014

@bashtage
Contributor

xref #13164. This issue is fixed in randomgen.

@anirudh2290
Member

According to @bashtage's last comment, this issue is fixed by the randomgen merge. I just did some basic runs to verify the performance.

On my machine

| Input Size | Output Size | Before (1.16.2) | After (master) |
| --- | --- | --- | --- |
| 1000 | 1000 | 192 us | 69.9 us |
| 10000 | 1000 | 478 us | 268 us |
| 10000 | 3000 | 482 us | 270 us |
| 10000 | 10000 | 492 us | 290 us |

So there is close to a 2x to 3x speedup, depending on input and output size.

Can we close this for now or does this PR still have potential for speed improvements? Thoughts @bashtage @staticd-growthecommons

@seberg
Member

seberg commented May 20, 2020

Let's close it. If the new API can be further improved, please open a new PR. I think there was a possible speedup which returns the chosen elements themselves unshuffled (i.e., it's a random subset, but the subset may be ordered).
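
The subset-without-shuffling idea can be sketched with Floyd's sampling algorithm (a sketch of the general technique being alluded to, not necessarily what the new API implements):

```python
import random

def floyd_sample(n, k, rng=random):
    # Floyd's algorithm: pick k distinct values from range(n) in O(k)
    # draws, without touching (or shuffling) the other n - k elements.
    # The result is a uniform random subset, returned unordered.
    chosen = set()
    for j in range(n - k, n):
        t = rng.randrange(j + 1)
        chosen.add(t if t not in chosen else j)
    return chosen
```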

@seberg seberg closed this May 20, 2020