ENH: More efficient algorithm for unweighted random choice without replacement#5158
Conversation
|
Oh, and if anyone wants to verify that the code is working properly you can run the following test:

```python
import numpy as np

def foo(sz, osz, t):
    a = np.arange(sz)
    f = np.zeros((osz, sz))
    for trial in range(t):
        c = np.random.choice(a, osz, replace=False)
        for i in range(osz):
            f[i, c[i]] += 1
    f = f * sz / t
    print(f)

for i in range(1, 11):
    foo(10, i, 1000000)
```

It prints the number of elements times the probability of finding a given number in a given slot. If the sampling is uniformly distributed, all the outputs must be close to 1. It won't pass the present |
|
I know the old stuff was not great, and I never liked that implication, but unfortunately the exact stream of random numbers must be reliable, so changing this is a bit more involved. Or does the stream not change with your algorithm? |
|
For information, scikit-learn has some nice code to do sampling without replacement |
|
It's great, but unfortunately we were not smart enough to at least take the last N values, which would have allowed us not to shuffle the whole thing... As is, to change this, you would probably need a transitional keyword argument and some FutureWarnings, etc. |
|
I wonder if we couldn't start adding a version/method keyword to some of the generators if we want to implement faster algorithms? The normal distribution is another that can be sped up. |
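A minimal sketch of how such a method keyword could dispatch between algorithms. All names here (`choice`, `shuffled_head`, `tail_sample`, `METHODS`) are illustrative assumptions for this thread's discussion, not NumPy's actual API:

```python
import numpy as np

def shuffled_head(pop, size, rng):
    # Original behaviour: permute everything, keep the head.
    perm = rng.permutation(len(pop))
    return pop[perm[:size]]

def tail_sample(pop, size, rng):
    # The alternative mentioned above: take the *last* N values of the
    # permutation, which a partial shuffle could produce more cheaply.
    perm = rng.permutation(len(pop))
    return pop[perm[-size:]]

METHODS = {"legacy": shuffled_head, "tail": tail_sample}

def choice(pop, size, method="legacy", rng=None):
    # The method keyword selects the algorithm; the default preserves
    # the existing stream for reproducibility.
    rng = np.random.default_rng() if rng is None else rng
    return METHODS[method](np.asarray(pop), size, rng)
```

Keeping `"legacy"` as the default means existing seeded code keeps reproducing the same stream, while new code can opt in to a faster method.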
|
The stream is not identical because the original algorithm processed it from the end to the beginning and then gave the first N values. This behaviour cannot be replicated without processing the entire population. The output from the new algorithm satisfies the conditions:

```python
import numpy as np

def norepeats(x, maxv):
    c = np.zeros(maxv, dtype=int)
    for i in range(len(x)):
        c[x[i]] += 1
    if np.any(c > 1):
        print("error", c)

for i in range(10):
    for j in range(1, i):
        for t in range(1000):
            norepeats(np.random.choice(i, j, replace=False), i)
```

@seberg Does numpy guarantee that the outputs will have the same sequence for a given seed across versions? @charris That is something that would be extremely useful. I have alternate algorithms for each of the choice methods, some of which are slower than the original for some parameters while up to 9x faster for other parameter values. Allowing the user to choose the algorithm will probably be quite useful, especially when they want to make a speed/memory tradeoff. |
Yes, it is needed for reproducibility. |
|
Oh.. if it doesn't have a bunch of hidden costs to develop, the idea of an algorithm keyword sounds very tempting |
|
I am +1 for that. In general, I think that we can even put |
|
There is always some overhead, especially since we should try to switch the default at some point. But it is probably quite manageable (of course it needs hopefully simple follow-ups later). So if you have some time, maybe you can give it a shot... |
|
How can one check if the random state was set explicitly? |
|
I don't think we have that yet, but since we have a function to set it, it might be easy? |
|
Currently the object is initialized from a seed= argument. Make it so that |
|
The trick with a version argument is making it backward compatible with pickled generators. We will need to deal with the same problem if we add new RNGs, so we might as well consider both problems together. |
|
Handling the unpickling is easy enough. The version number is just another bit of state recorded by |
|
@rkern I was concerned about whether we would be able to detect the absence of the version number. Is that easy to do? |
|
https://github.com/numpy/numpy/blob/master/numpy/random/mtrand/mtrand.pyx#L699 |
|
xref #13164. This issue is fixed in randomgen. |
|
According to @bashtage's last comment, this issue is fixed by the randomgen merge. I just tried some basic runs to verify performance. On my machine
So there is close to a 2x to 3x speedup depending on input and output size. Can we close this for now, or does this PR still have potential for speed improvements? Thoughts @bashtage @staticd-growthecommons |
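A sketch of the kind of comparison described above, assuming NumPy >= 1.17: the legacy `RandomState.choice` versus the newer `Generator.choice`. The specific sizes are arbitrary and absolute timings vary by machine; only the ratio is interesting.

```python
import timeit

import numpy as np

legacy = np.random.RandomState(0)   # old stream, full-permutation algorithm
modern = np.random.default_rng(0)   # Generator with the faster algorithm

for pop, n in [(10_000, 10), (10_000, 1_000), (1_000_000, 100)]:
    t_old = timeit.timeit(lambda: legacy.choice(pop, n, replace=False), number=20)
    t_new = timeit.timeit(lambda: modern.choice(pop, n, replace=False), number=20)
    print(f"pop={pop:>9} n={n:>6}  legacy/new = {t_old / t_new:.2f}x")
```

The gap should be largest for small samples from a large population, since the legacy path permutes the whole population regardless of the sample size.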
|
Let's close it. If the new API can be further improved, please open a new PR. I think there was a possible speedup that returns the chosen elements themselves unshuffled (i.e. it is a random subset, but the random subset may be ordered). |
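One classic way to get such an unordered random subset is Floyd's algorithm, which draws exactly `k` random numbers and never touches the rest of the population. This is a generic sketch of that technique, not the code from this PR:

```python
import random

def floyd_sample(n, k, rng=random):
    """Return a set of k distinct integers drawn uniformly from range(n).

    Floyd's algorithm: only k random draws are made, but the result is
    a set, so the order in which elements were chosen is not preserved.
    """
    chosen = set()
    for j in range(n - k, n):
        t = rng.randrange(j + 1)   # t is uniform on [0, j]
        if t in chosen:
            chosen.add(j)          # t already taken: take j instead
        else:
            chosen.add(t)
    return chosen
```

This is the speed/ordering tradeoff mentioned above: O(k) work instead of O(n), at the price of losing the random ordering of the returned sample.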
The present algorithm applies a Knuth shuffle to the entire population and then truncates the result to the requested size. This can be done more efficiently by not shuffling the elements that are never seen by the end user, which is especially relevant when choosing small samples from a large population.
The present shuffling code is very general purpose. Since what is being shuffled is an array of indices, specifying the type in the Cython code gives significant speed benefits.
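The idea can be sketched in plain Python (the PR itself does this in typed Cython): a Fisher-Yates shuffle that stops after the first `size` positions, since only those are returned. The function name is an assumption for this sketch.

```python
import numpy as np

def partial_shuffle_choice(n, size, rng=None):
    """Draw `size` distinct indices from range(n) without replacement.

    Only the first `size` steps of a Fisher-Yates shuffle are performed,
    so the cost is O(n) setup plus O(size) swaps instead of O(n) swaps.
    """
    rng = np.random.default_rng() if rng is None else rng
    idx = np.arange(n)
    for i in range(size):
        j = rng.integers(i, n)      # pick from the not-yet-fixed tail
        idx[i], idx[j] = idx[j], idx[i]
    return idx[:size]
```

For small `size` and large `n` the swap loop is negligible next to the `arange` setup, which matches the "small samples from a large population" case the description targets.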