make it possible to run multiple containers in parallel #168
erikbern merged 4 commits into new-run-june-2020
Conversation
I would be interested in the results of 8x annoy in parallel vs. a single-threaded version on an 8 vCPU EC2 instance. I don't know how much other load the EC2 instances have, but 8x parallel AVX2 distance computations might show some serious performance drops in query time.
I don't think this change should have any impact on speed, right? Each container is limited to 1 CPU, so they will each get their own CPU (as long as the parallelism is lower than the number of CPUs, of course). That being said, I just noticed running …
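For context, Docker can pin a container to a single CPU with its `--cpuset-cpus` flag (and cap memory with `--memory`). A minimal sketch of building such a command; the image name and memory limit are illustrative, not the actual ann-benchmarks configuration:

```python
def pinned_run_command(image, cpu_index, mem_limit="4g"):
    """Build a `docker run` invocation pinned to a single CPU.

    `--cpuset-cpus` and `--memory` are real Docker flags; the image
    name and memory limit here are made-up placeholders.
    """
    return [
        "docker", "run", "--rm",
        "--cpuset-cpus", str(cpu_index),  # restrict the container to one CPU
        "--memory", mem_limit,
        image,
    ]

cmd = pinned_run_command("ann-benchmarks-annoy", 3)
# launching it would be e.g. subprocess.run(cmd)
```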
I don't know which CPUs are used in the current EC2 generation, but you can always run into problems with shared cache lines and clock downscaling if there are too many heavy SIMD instructions. It's best to benchmark that :-)
It's true you might see some more cache thrashing with higher parallelism, but it feels like it's worth it in order to bring the total runtime down by 4-8x, and it should affect all the algorithms approximately equally anyway. Maybe we can also increase the number of runs to more than 2: I think something like 98% of all time is spent building the index as opposed to running the queries, so it's unlikely that more than one algorithm is running queries at any point in time, and running the queries 3-5 times would warm up the cache on that CPU. I'm very tempted to do this, as it would bring the runtime down from, say, a month to a week. It would make it easier/cheaper to re-run the benchmarks more often, with a marginal impact on the results.
6b28948 to 610ab43
Rewrote this to actually farm the work out to each CPU. I think this should work. I wiped the results for MNIST and I'm rerunning it now. Just looking at the output it looks much faster. If this works out well then I'm tempted to re-run the glove benchmarks using this as well (but maybe using, say, 3 or 4 processes, not 7).
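The scheduling idea can be sketched as a shared queue of job definitions drained by `parallelism` workers, each owning one CPU index. The actual runner launches subprocesses that start CPU-pinned containers; plain threads and placeholder job names are used here only to show the pattern:

```python
import queue
import threading

def run_all(definitions, parallelism):
    """Drain a shared job queue with one worker per CPU index.

    CPU indexes run 1..parallelism, leaving CPU 0 for the coordinator.
    Returns (cpu_index, job) pairs; the real runner would instead start
    a Docker container pinned to `cpu_index` for each job.
    """
    jobs = queue.Queue()
    for d in definitions:
        jobs.put(d)

    done = []
    lock = threading.Lock()

    def worker(cpu_index):
        while True:
            try:
                job = jobs.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            with lock:
                done.append((cpu_index, job))

    threads = [threading.Thread(target=worker, args=(i + 1,))
               for i in range(parallelism)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done

results = run_all(["annoy", "hnswlib", "faiss", "bruteforce"], parallelism=3)
```

Because workers pull from one queue, a slow algorithm ties up only its own CPU while the others keep draining jobs.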
I can also run the full benchmark on a smaller random dataset and see if there's any difference in the results with/without this change.
As I said above, I think annoy on MNIST might already show a performance issue if there is one. Maybe compare single-threaded / 4 parallel processes / 8 parallel processes / 16 parallel processes (the latter just as a sanity check, to see that it performs much worse).
Sure, I can do that. It assigns CPUs 1...args.parallelism, so I can't do more than 15 on a c5.4xlarge. I'm currently running with 7. Once it's done, let me try with 1 and see what the difference is.
... and here's the benchmark with parallelism = 1 (it took almost 24h to run, as I expected). There is a small difference, but it seems to be no more than 5-10%, and more importantly it doesn't change the relative ranking (it seems to affect all algorithms equally). My thinking is to merge this, increase the number of runs from 2 to 5, and then run all benchmarks with parallelism 3-5 in order to finish everything much faster.
Merging this for now and kicking off a glove build


This should do the trick :)
Running it on MNIST right now to see if it works