Port benches from dhardy/master#201
Conversation
Benchmarks done sampling n*10_000 indices from a 1_000_000-length sequence; they show inplace faster from approximately 40_000 indices.
src/seq.rs
Outdated
    // and a trade off could probably be made between memory/cpu, since hashmap operations
    // are slower than array index swapping.
    -        if amount >= length / 2 {
    +        if amount * 20 >= length {
I would do `if amount >= length / 20` so that there is no risk of overflow (a possibility on 32-bit systems... maybe).
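A minimal sketch of the overflow concern (function names are illustrative, not the actual rand API). The multiplication form wraps once `amount` exceeds `usize::MAX / 20`, which is realistic on 32-bit targets, while the division form cannot overflow; note the two forms can disagree slightly when `length` is not a multiple of 20, due to integer division.

```rust
// Returns true when `amount * 20` would overflow usize
// (panics in debug builds, wraps in release).
fn would_overflow(amount: usize) -> bool {
    amount.checked_mul(20).is_none()
}

// Overflow-safe variant of the proposed cutoff.
fn use_inplace(amount: usize, length: usize) -> bool {
    amount >= length / 20
}
```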
On that note, one might think there is a tradeoff here between speed and memory usage. Let's actually look at it a bit.
The memory usage of inplace is `size_of::<usize>() * length`.
The memory usage of cache is a bit more complicated. It is:

    (size_of::<usize>() * amount)                      # Vec return value
    # HashTable usage:
    # - size of the stored "hash" value
    # - size of the stored key/value pair
    + (size_of::<usize>() + size_of::<usize>() * 2) * amount

Which is obviously just `size_of::<usize>() * 4 * amount`.
This is not accounting for how much the allocated capacity exceeds the used length.
So from a memory usage perspective you might want to do
if amount * 4 >= length
But there is always a tradeoff. Basically at `amount >= length / 20` the inplace version will be using around 5x more memory than the cache implementation would. If that is acceptable (which I honestly think it is) then go for it!
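The arithmetic above can be sketched directly; the function names are illustrative, and the 3-words-per-HashMap-entry figure is the approximation used in this thread, not a measured value.

```rust
use std::mem::size_of;

// inplace allocates one index per element of the population
fn inplace_bytes(length: usize) -> usize {
    size_of::<usize>() * length
}

// cache allocates the Vec return value plus an approximate
// hash + key/value pair per sampled index
fn cache_bytes(amount: usize) -> usize {
    size_of::<usize>() * amount           // Vec return value
        + size_of::<usize>() * 3 * amount // HashTable usage
}
```

At `amount == length / 20` this gives `length` words for inplace versus `length / 5` words for cache, i.e. the 5x figure quoted above, independent of word size.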
So long as that much memory is available, I think it should be used. Unfortunately, knowing how much is available is another problem, so the crude approximation of `amount >= length / N` seems reasonable IMO, for some N between 4 and 20. Probably optimising for execution time is more important.
Another thing that occurs to me: (a) the length is specified as a `usize` due to common use (container indices), but this gives the function different capabilities on 32- and 64-bit platforms; at the same time (b) using smaller integer sizes should help significantly, and (c) if `length >= 2.pow(32)` then the user is dealing with very big numbers (unusual) and the inplace version of the function is almost certainly not applicable. So do you think `sample_indices` should accept `u64` arguments, test whether they are within `u32` range, and `sample_indices_inplace` should always use `u32` numbers? Potentially it could even be worth having a version of `sample_indices_inplace` using smaller ints (`u16`), but probably not worth it. Not sure whether `sample_indices_cache` should use `usize` or `u64` in this case, or possibly both `u32` and `u64` impls.
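The dispatch being suggested could look something like the following sketch (the `Strategy` enum and `choose_strategy` name are hypothetical, purely to illustrate the idea): accept `u64` arguments and use the `u32`-based inplace sampler only when the length fits in `u32` and the amount clears the cutoff.

```rust
#[derive(Debug, PartialEq)]
enum Strategy {
    InplaceU32, // partial Fisher-Yates with u32 indices
    Cached,     // hashmap-based sampling
}

// Pick a sampler based on whether `length` fits in u32 and
// whether `amount` is large enough to justify inplace.
fn choose_strategy(length: u64, amount: u64) -> Strategy {
    match u32::try_from(length) {
        Ok(len32) if amount >= u64::from(len32 / 20) => Strategy::InplaceU32,
        _ => Strategy::Cached,
    }
}
```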
If you see any utility in my suggestions above, might be better if you make a separate PR and I remove the changes to seq from this one.
I actually considered doing something like you suggest: splitting the method up by the maximum size to conserve memory/overhead, but left that as a future improvement. The problem is that `sample_indices` would then have to copy that data over to a new `Vec<usize>`, which is an annoying cost.
Ideally `sample_indices` would be fully generic and would just use `size_of::<T>()` to make the correct choices. `sample_slice` could then call the correct one based on `amount` or `length`.
However, I was not sure how to make something generic over arbitrary integers; it looks like I might have to use the `num` crate?
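For what it's worth, std's `TryFrom` may be enough to go generic over the index type without the `num` crate. A hedged sketch (not the actual rand implementation; `next` stands in for an RNG call returning a value in `0..bound`) of a partial Fisher-Yates shuffle that keeps the first `amount` entries:

```rust
use std::fmt::Debug;

// Hypothetical generic in-place index sampler: build 0..length as T,
// partially shuffle, and keep the first `amount` entries.
fn sample_indices_inplace<T>(
    mut next: impl FnMut(usize) -> usize,
    length: usize,
    amount: usize,
) -> Vec<T>
where
    T: TryFrom<usize> + Copy,
    <T as TryFrom<usize>>::Error: Debug,
{
    let mut indices: Vec<T> = (0..length)
        .map(|i| T::try_from(i).expect("length must fit in T"))
        .collect();
    for i in 0..amount {
        let j = i + next(length - i); // uniform in i..length
        indices.swap(i, j);
    }
    indices.truncate(amount);
    indices
}
```

The same bound could presumably serve a generic cached version too.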
I think this PR stands on its own, and making things generic isn't going to change this cutoff -- I would make the cache version generic too, and the hashmap should reap (roughly) the same benefits.
I don't think thinking TOO hard about absurdly large numbers of indices is too useful at this stage... if someone's use case is sampling 1/20th of their total memory (as indices!) then they should probably consider rolling their own code (specifically, they might want to consider a more memory-packed hashmap implementation, etc).
These numbers are very interesting, thanks for doing them! It's good to see that the cache implementation has payoff for the "general use case" of sampling a small part of a large population. That's its main use case! It's also incredible just how fast array creation and indexing is. The inplace version really does make big wins pretty quickly.
This actually has me pondering if there is possibly a third implementation for `sample_indices`. For instance, when looking at possible methods I came across this paper. They promise that:

In my (extremely amateur) assessment I thought it would not beat the cached/inplace methods, since those have the same big-O complexity. One problem is that although the sampling is random, their method actually returns the elements in their original sequential order -- so we would either have to drop our guarantee of "random" ordering or shuffle the result afterwards.

Another downside (of a new method) is that the inplace/cached implementations have identical logic, which makes it much easier to test that they are accurate (since with the same seeded rng they will generate identical samples). It also reduces the "magic" under the hood -- you will always be using the same algorithm, one version is just optimized for small samples from large populations. There may be additional methods we could leverage between this range (or even replace the cached version entirely). I'm not sure I'm the right person for the job though 😜
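To illustrate the ordering issue: a hedged sketch of sequential selection sampling in the style of Knuth's Algorithm S (not the method from the linked paper, just the simplest sequential sampler; `rand01` stands in for an RNG call returning a value in `[0, 1)`). Index `i` is kept with probability needed/remaining, so the output necessarily comes out ascending, which is why a shuffle would be needed to preserve the "random order" guarantee.

```rust
// Scan 0..length once, keeping exactly `amount` indices in
// ascending order.
fn select_in_order(mut rand01: impl FnMut() -> f64, length: usize, amount: usize) -> Vec<usize> {
    let mut out = Vec::with_capacity(amount);
    let mut needed = amount;
    for i in 0..length {
        if needed == 0 {
            break;
        }
        let remaining = (length - i) as f64;
        // keep i with probability needed / remaining
        if rand01() * remaining < needed as f64 {
            out.push(i);
            needed -= 1;
        }
    }
    out
}
```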
The lazy option would be to make the
Ya, I don't like the "smell" of that... reason being I expect that memory sizes are going to explode in the next decade-ish because of technologies like xpoint. I don't want to be tied to the ultra-slow implementation just because you have a length above `u32::MAX`.

I was intending to open an issue to investigate a generic `sample_indices`. I just opened that ticket here: #202
Some of the benchmarks take a while to run; we only need to check they work.
    script:
      - cargo doc --no-deps --all-features
      - cargo bench
      - cargo test --benches
I'm not familiar with `test --benches` -- does this automatically build in release?
or does it just "run the benches once", i.e. treat them as tests.
Yes, that's it. I don't see any value in running the actual benchmarks on CI hardware. No idea whether it builds in debug/release/hybrid mode and don't really care.
Agreed. I didn't know about that flag, it's pretty cool! 😄
Port benches from dhardy/master
This "copies" 2/3 of the updated benchmarks. The distributions benchmark was not included because the distributions API changed massively in dhardy/master.
There's also some new benchmarks for the new seq code by @vitiral and a corresponding adjustment to the factor used to select index sampling method.
I'll leave this PR open to comments for a couple of days before merging.
Full benchmark results on the index sampling:
Code: