
ENH: More efficient algorithm for unweighted random choice without replacement#5158

Closed
staticd-growthecommons wants to merge 4 commits intonumpy:masterfrom
staticd-growthecommons:choice-unweighted-no-replace

Conversation

@staticd-growthecommons

  1. The present algorithm applies a Knuth shuffle to the entire population and then truncates the result to the requested size. The same result can be achieved more efficiently by not shuffling the elements that are never seen by the end user. This is especially relevant when choosing small samples from a large population.

  2. The present shuffling code is very general-purpose. Since what is being shuffled is an array of indices, specifying the type in the Cython code gives significant speed benefits.
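
The partial-shuffle idea in point 1 can be sketched in pure Python/NumPy roughly as follows (a sketch of the technique, not the actual patch; `partial_fisher_yates` is a hypothetical helper name, and it uses the modern `default_rng` API rather than the 2014-era internals):

```python
import numpy as np

def partial_fisher_yates(n, size, seed=None):
    # Hypothetical sketch: draw `size` distinct indices from range(n) by
    # performing only the first `size` steps of a Fisher-Yates shuffle,
    # leaving the rest of the population untouched.
    rng = np.random.default_rng(seed)
    idx = np.arange(n)
    for i in range(size):
        j = rng.integers(i, n)           # uniform pick from the unshuffled tail
        idx[i], idx[j] = idx[j], idx[i]  # move the pick into position i
    return idx[:size]
```

This does O(size) swaps instead of O(n), which is where the large-population speedups in the table come from.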

| input size | output size | original | new |
| --- | --- | --- | --- |
| 10 | 1 | 25.3 µs | 24.6 µs |
| 10 | 3 | 25.5 µs | 24.7 µs |
| 10 | 10 | 25.5 µs | 25 µs |
| 100 | 10 | 40.6 µs | 25.2 µs |
| 100 | 30 | 40.7 µs | 26 µs |
| 100 | 100 | 40.8 µs | 28.2 µs |
| 1000 | 100 | 190 µs | 29.7 µs |
| 1000 | 300 | 192 µs | 35.7 µs |
| 1000 | 1000 | 192 µs | 58.9 µs |
| 10000 | 1000 | 1.68 ms | 76.4 µs |
| 10000 | 3000 | 1.69 ms | 140 µs |
| 10000 | 10000 | 1.71 ms | 363 µs |
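
Timings like these can be gathered with `timeit`; a rough sketch for one (input, output) pair (absolute numbers are machine- and version-dependent):

```python
import timeit
import numpy as np

# Time one (input size, output size) pair from the table above.
rs = np.random.RandomState(0)
per_call = timeit.timeit(
    lambda: rs.choice(1000, 100, replace=False), number=100) / 100
print(f"{per_call * 1e6:.1f} us per call")
```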

@staticd-growthecommons staticd-growthecommons changed the title More efficient algorithm for unweighted random choice without replacement ENH: More efficient algorithm for unweighted random choice without replacement Oct 7, 2014
@staticd-growthecommons
Author

Oh, and if anyone wants to verify that the code is working properly, you can run the following test:

    import numpy as np

    def foo(sz, osz, t):
        a = np.arange(sz)
        f = np.zeros((osz, sz))
        for trial in range(t):
            c = np.random.choice(a, osz, replace=False)
            for i in range(osz):
                f[i, c[i]] += 1
        f = f * sz / t
        print(f)

    for i in range(1, 11):
        foo(10, i, 1000000)

It prints the population size times the empirical probability of finding a given input element in a given output slot. If the sampling is uniform, all of the printed values must be close to 1.

It won't pass the present

@seberg
Member

seberg commented Oct 7, 2014

I know the old implementation was not great, and I never liked that implication, but unfortunately the exact stream of random numbers must be reliable, so changing this is a bit more involved. Or does the stream not change with your algorithm?

@arjoly

arjoly commented Oct 7, 2014

For information, scikit-learn has some nice code for doing sampling without replacement:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/_random.pyx

@seberg
Member

seberg commented Oct 7, 2014

It's great, but unfortunately we were not smart enough to at least take the last N values, which would have allowed us to not shuffle the whole thing... As it is, to change this you would probably need a transitional keyword argument, some FutureWarnings, etc.

@charris
Member

charris commented Oct 7, 2014

I wonder if we couldn't start adding a version/method keyword to some of the generators if we want to implement faster algorithms? The normal distribution is another one that can be sped up.

@staticd-growthecommons
Author

The stream is not identical because the original algorithm processed the array from the end to the beginning and then returned the first N values. This behaviour cannot be replicated without processing the entire population.

The output from the new algorithm satisfies the conditions:
(1) no number appears twice in the output (see test below)
(2) at every position of the output, it is equally likely for any of the remaining input numbers to appear (see test given in previous post)

In [1]: import numpy as np

In [2]: def norepeats(x, maxv):
   ...:     c = np.zeros(maxv, dtype=int)
   ...:     for i in range(len(x)):
   ...:         c[x[i]] += 1
   ...:     if np.any(c > 1):
   ...:         print("error", c)
   ...:

In [3]: for i in range(10):
   ...:     for j in range(1, i):
   ...:         for t in range(1000):
   ...:             norepeats(np.random.choice(i, j, replace=False), i)

@seberg Does numpy guarantee that the outputs will have the same sequence for a given seed across generations?

@charris That is something that would be extremely useful. I have alternative algorithms for each of the choice methods, some of which are slower than the original for some parameters while up to 9x faster for other parameter values. Allowing the user to choose the algorithm would probably be quite useful, especially when they want to make a speed/memory trade-off.

@charris
Member

charris commented Oct 7, 2014

Does numpy guarantee that the outputs will have the same sequence for a given seed across generations?

Yes, it is needed for reproducibility.
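
That guarantee is easy to observe directly; a minimal check using the legacy `RandomState` interface under discussion here:

```python
import numpy as np

# Two generators seeded identically must produce the identical stream of
# draws; this is the constraint on changing choice()'s algorithm in place.
a = np.random.RandomState(1234).choice(1000, 10, replace=False)
b = np.random.RandomState(1234).choice(1000, 10, replace=False)
print((a == b).all())  # True
```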

@staticd-growthecommons
Author

Oh, if it doesn't come with a bunch of hidden development costs, the idea of an algorithm keyword sounds very tempting.

@seberg
Member

seberg commented Oct 7, 2014

I am +1 for that. In general, I think we could even put algorithm=None in place with a FutureWarning if (and only if) the random state was set explicitly. That way we would have some way of changing the default while bothering as few people as possible (and possibly even switching to the faster algorithm sooner for those who don't care about reproducibility).

@seberg
Member

seberg commented Oct 7, 2014

There is always some overhead, especially since we should try to switch the default at some point. But it is probably quite manageable (of course it will need hopefully simple follow-ups later). So if you have some time, maybe you can give it a shot...

@staticd-growthecommons
Author

How can one check if the random state was set explicitly?

@seberg
Member

seberg commented Oct 7, 2014

I don't think we have that yet, but since we have a function to set it, it might be easy?

@njsmith
Member

njsmith commented Oct 8, 2014

Currently the object is initialized from a seed= argument. Make it so that it's instead initialized from seed=, version= arguments, where version defaults to 0, version=1 gives fancier algorithms, and version=-1 gives the fanciest supported (just like pickle). If seed= isn't given at all then I guess default to version=-1, because the output won't be reproducible anyway...
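
One way the proposed defaulting could look (purely illustrative names, not actual NumPy API):

```python
LATEST_VERSION = 1  # hypothetical: highest algorithm version implemented

def resolve_version(seed=None, version=None):
    """Normalize the proposed (seed, version) pair at construction time."""
    if version is None:
        # Without an explicit seed the stream is irreproducible anyway,
        # so default to the fanciest algorithms; otherwise stay compatible.
        version = -1 if seed is None else 0
    if version == -1:  # "fanciest supported", like pickle's protocol=-1
        version = LATEST_VERSION
    return version
```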


@charris
Member

charris commented Oct 8, 2014

The trick with a version argument is making it backward compatible with pickled generators. We will need to deal with the same problem if we add new RNGs, so we might as well consider both problems together.

@rkern
Member

rkern commented Oct 8, 2014

Handling the unpickling is easy enough. The version number is just another bit of state recorded by __getstate__() and read by __setstate__(). __setstate__() simply takes the absence of the version number as meaning version=0.
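
A minimal sketch of that idea (an illustrative class, not the actual `RandomState` code):

```python
import pickle

class VersionedRNG:
    # Illustrative: carry an optional algorithm version through pickling.
    def __init__(self, seed=None, version=0):
        self.seed = seed
        self.version = version

    def __getstate__(self):
        # Record the version as just another bit of state.
        return {'seed': self.seed, 'version': self.version}

    def __setstate__(self, state):
        self.seed = state['seed']
        # Old pickles have no 'version' entry; treat absence as version=0.
        self.version = state.get('version', 0)

g = pickle.loads(pickle.dumps(VersionedRNG(seed=7, version=1)))
old = VersionedRNG.__new__(VersionedRNG)
old.__setstate__({'seed': 3})  # simulate unpickling a pre-version pickle
```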

@charris
Member

charris commented Oct 8, 2014

@rkern I was concerned about whether we would be able to detect the absence of the version number. Is that easy to do?

@rkern
Member

rkern commented Oct 8, 2014

@bashtage
Contributor

xref #13164. This issue is fixed in randomgen.

@anirudh2290
Member

According to @bashtage's last comment, this issue is fixed by the randomgen merge. I just did some basic runs to verify the performance.

On my machine

| Input Size | Output Size | Before (1.16.2) | After (master) |
| --- | --- | --- | --- |
| 1000 | 1000 | 192 us | 69.9 us |
| 10000 | 1000 | 478 us | 268 us |
| 10000 | 3000 | 482 us | 270 us |
| 10000 | 10000 | 492 us | 290 us |

So there is close to a 2x to 3x speedup, depending on input and output size.

Can we close this for now or does this PR still have potential for speed improvements? Thoughts @bashtage @staticd-growthecommons

@seberg
Member

seberg commented May 20, 2020

Let's close it. If the new API can be further improved, please open a new PR. I think there was a possible speedup which returns the chosen elements themselves unshuffled (i.e., it's a random subset, but the subset may be ordered).
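
The subset-without-shuffling idea can be sketched with Floyd's sampling algorithm (a sketch of the general technique being alluded to, not necessarily what the new API implements):

```python
import random

def floyd_sample(n, k, rng=random):
    # Floyd's algorithm: pick k distinct values from range(n) in O(k)
    # draws, without touching (or shuffling) the other n - k elements.
    # The result is a uniform random subset, returned unordered.
    chosen = set()
    for j in range(n - k, n):
        t = rng.randrange(j + 1)
        chosen.add(t if t not in chosen else j)
    return chosen
```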

@seberg seberg closed this May 20, 2020