
Port benches from dhardy/master#201

Merged
dhardy merged 7 commits into rust-random:master from dhardy:benches on Dec 7, 2017

Conversation

@dhardy (Member) commented Dec 5, 2017

This "copies" two of the three updated benchmarks. The distributions benchmark was not included because the distributions API changed massively in dhardy/master.

There are also some new benchmarks for the new seq code by @vitiral, and a corresponding adjustment to the factor used to select the index sampling method.

I'll leave this PR open to comments for a couple of days before merging.

Full benchmark results on the index sampling:

test misc_sample_indices_cache_10k_of_1M   ... bench:     638,669 ns/iter (+/- 45,200)
test misc_sample_indices_cache_20k_of_1M   ... bench:   1,321,714 ns/iter (+/- 48,029)
test misc_sample_indices_cache_30k_of_1M   ... bench:   1,962,807 ns/iter (+/- 136,196)
test misc_sample_indices_cache_40k_of_1M   ... bench:   2,749,011 ns/iter (+/- 193,221)
test misc_sample_indices_cache_50k_of_1M   ... bench:   3,747,376 ns/iter (+/- 224,058)
test misc_sample_indices_inplace_10k_of_1M ... bench:   1,932,205 ns/iter (+/- 29,327)
test misc_sample_indices_inplace_20k_of_1M ... bench:   2,177,965 ns/iter (+/- 182,303)
test misc_sample_indices_inplace_30k_of_1M ... bench:   2,424,295 ns/iter (+/- 269,591)
test misc_sample_indices_inplace_40k_of_1M ... bench:   2,656,004 ns/iter (+/- 186,875)
test misc_sample_indices_inplace_50k_of_1M ... bench:   2,905,461 ns/iter (+/- 415,676)

test misc_sample_indices_cache_100k_of_1M   ... bench:   8,117,463 ns/iter (+/- 2,811,883)
test misc_sample_indices_cache_200k_of_1M   ... bench:  17,116,134 ns/iter (+/- 2,513,516)
test misc_sample_indices_cache_300k_of_1M   ... bench:  33,816,353 ns/iter (+/- 1,446,760)
test misc_sample_indices_cache_400k_of_1M   ... bench:  46,573,604 ns/iter (+/- 863,395)
test misc_sample_indices_cache_500k_of_1M   ... bench:  72,825,164 ns/iter (+/- 862,083)
test misc_sample_indices_inplace_100k_of_1M ... bench:   4,104,628 ns/iter (+/- 171,397)
test misc_sample_indices_inplace_200k_of_1M ... bench:   6,504,177 ns/iter (+/- 242,639)
test misc_sample_indices_inplace_300k_of_1M ... bench:   8,793,716 ns/iter (+/- 326,420)
test misc_sample_indices_inplace_400k_of_1M ... bench:  10,855,500 ns/iter (+/- 563,205)
test misc_sample_indices_inplace_500k_of_1M ... bench:  12,791,521 ns/iter (+/- 604,272)
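The crossover in these numbers can be located mechanically. The following sketch (using the figures copied from the output above, not new measurements; the function name is mine) finds the first amount at which inplace beats cache:

```rust
// (amount, cache ns/iter, inplace ns/iter) from the 1M-population runs above.
const RUNS: [(u64, u64, u64); 5] = [
    (10_000, 638_669, 1_932_205),
    (20_000, 1_321_714, 2_177_965),
    (30_000, 1_962_807, 2_424_295),
    (40_000, 2_749_011, 2_656_004),
    (50_000, 3_747_376, 2_905_461),
];

// First sampled amount at which the inplace method is faster than cache.
fn crossover_amount(runs: &[(u64, u64, u64)]) -> Option<u64> {
    runs.iter()
        .find(|&&(_, cache, inplace)| inplace < cache)
        .map(|&(amount, _, _)| amount)
}

fn main() {
    // Crossover lands between 30k and 40k of 1M, i.e. near length / 20,
    // which is consistent with the amount * 20 >= length cutoff below.
    assert_eq!(crossover_amount(&RUNS), Some(40_000));
}
```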

Code:

macro_rules! sample_indices {
    ($name:ident, $method:ident, $amount:expr, $length:expr) => {
        #[bench]
        fn $name(b: &mut Bencher) {
            let mut rng = weak_rng();
            b.iter(|| {
                black_box($method(&mut rng, $length, $amount));
            })
        }
    }
}

sample_indices!(misc_sample_indices_inplace_10k_of_1M, sample_indices_inplace, 10_000, 1_000_000);
sample_indices!(misc_sample_indices_inplace_20k_of_1M, sample_indices_inplace, 20_000, 1_000_000);
sample_indices!(misc_sample_indices_inplace_30k_of_1M, sample_indices_inplace, 30_000, 1_000_000);
sample_indices!(misc_sample_indices_inplace_40k_of_1M, sample_indices_inplace, 40_000, 1_000_000);
sample_indices!(misc_sample_indices_inplace_50k_of_1M, sample_indices_inplace, 50_000, 1_000_000);

sample_indices!(misc_sample_indices_cache_10k_of_1M, sample_indices_cache, 10_000, 1_000_000);
sample_indices!(misc_sample_indices_cache_20k_of_1M, sample_indices_cache, 20_000, 1_000_000);
sample_indices!(misc_sample_indices_cache_30k_of_1M, sample_indices_cache, 30_000, 1_000_000);
sample_indices!(misc_sample_indices_cache_40k_of_1M, sample_indices_cache, 40_000, 1_000_000);
sample_indices!(misc_sample_indices_cache_50k_of_1M, sample_indices_cache, 50_000, 1_000_000);

src/seq.rs Outdated

// and a trade off could probably be made between memory/cpu, since hashmap operations
// are slower than array index swapping.
- if amount >= length / 2 {
+ if amount * 20 >= length {
@vitiral (Contributor) commented Dec 5, 2017

I would do if amount >= length / 20 so that there is no risk of overflow (a possibility on 32-bit systems... maybe).
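The overflow concern can be illustrated by modelling a 32-bit usize with u32 explicitly (a sketch; the function names are hypothetical, not code from this PR):

```rust
// On a 32-bit target, usize arithmetic behaves like u32, so amount * 20
// can wrap for large amounts, while length / 20 can never overflow.
fn cutoff_multiply(amount: u32, length: u32) -> Option<bool> {
    // Returns None once amount > u32::MAX / 20.
    amount.checked_mul(20).map(|a20| a20 >= length)
}

fn cutoff_divide(amount: u32, length: u32) -> bool {
    // Division cannot overflow, so this form is always safe
    // (it differs only by rounding when length is not a multiple of 20).
    amount >= length / 20
}

fn main() {
    let length = u32::MAX;
    let amount = length / 2; // large but valid sample size
    assert_eq!(cutoff_multiply(amount, length), None); // would have wrapped
    assert!(cutoff_divide(amount, length)); // still gives an answer
}
```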

On that note, one might think there is a tradeoff here between speed and memory usage. Let's actually look at it a bit.

The memory usage of inplace is sizeof<usize>() * length

The memory usage of cache is a bit more complicated. It is:

(sizeof<usize>() * amount)  # Vec return value
    # HashTable usage:
    # - size of the stored "hash" value
    # - size of the stored key/value pair
    + (sizeof<usize>() + sizeof<usize>() * 2) * amount 

Which is obviously just

sizeof<usize>() * 4 * amount

This does not account for how much the allocated capacity exceeds the used length.

So from a memory usage perspective you might want to do

if amount * 4 >= length 

But there is always a tradeoff. Basically, at amount >= length / 20 you will be using around 5x more memory than the cache implementation would. If that is acceptable (which I honestly think it is) then go for it!
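The estimates above can be written down directly. This is a sketch of the comment's arithmetic (not measured allocations; the function names are mine), confirming the 4x-per-sampled-index cache figure and the 5x ratio at amount = length / 20:

```rust
use std::mem::size_of;

// inplace allocates one usize per element of the population.
fn inplace_bytes(length: usize) -> usize {
    size_of::<usize>() * length
}

// cache allocates per *sampled* element: the result Vec, plus a rough
// HashMap entry (stored hash + key/value pair), per the analysis above.
fn cache_bytes(amount: usize) -> usize {
    let w = size_of::<usize>();
    w * amount             // Vec return value
        + (w + 2 * w) * amount // hash + key/value per map entry
}

fn main() {
    let (length, amount) = (1_000_000, 50_000); // amount = length / 20
    assert_eq!(cache_bytes(amount), size_of::<usize>() * 4 * amount);
    assert_eq!(inplace_bytes(length), 5 * cache_bytes(amount));
}
```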

@dhardy (Member, Author) replied:

So long as that much memory is available, I think it should be used. Unfortunately, knowing how much is available is another problem, so the crude approximation amount >= length / N seems reasonable IMO, for some N between 4 and 20. Optimising for execution time is probably more important.

Another thing that occurs to me: (a) the length is specified as a usize due to common use (container indices), but this gives the function different capabilities on 32- and 64-bit platforms; at the same time, (b) using smaller integer sizes should help significantly, and (c) if length >= 2^32 then the user is dealing with very big numbers (unusual) and the inplace version of the function is almost certainly not applicable. So do you think sample_indices should accept u64 arguments, test whether they are within u32 range, and sample_indices_inplace should always use u32 numbers? Potentially it could even be worth having a version of sample_indices_inplace using smaller ints (u16), but probably not. I'm also not sure whether sample_indices_cache should use usize or u64 in this case, or possibly both u32 and u64 impls.

If you see any utility in the suggestions above, it might be better if you make a separate PR and I remove the changes to seq from this one.
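A minimal sketch of the dispatch suggested above, under the assumption that inplace is u32-only and is chosen when the sampled fraction is large (the function name and string results are hypothetical, not the crate's API):

```rust
// Accept u64 lengths; only consider the (u32-only) inplace method when the
// population fits in u32 and the sampled fraction is at least 1/20.
fn sample_indices_dispatch(length: u64, amount: u64) -> &'static str {
    assert!(amount <= length);
    // The short-circuit on the u32 check also guards the multiplication:
    // when it runs, amount <= length <= u32::MAX, so amount * 20 < 2^37.
    if length <= u32::MAX as u64 && amount * 20 >= length {
        "inplace_u32"
    } else {
        "cache"
    }
}

fn main() {
    assert_eq!(sample_indices_dispatch(1_000_000, 100_000), "inplace_u32");
    assert_eq!(sample_indices_dispatch(1_000_000, 10_000), "cache");
    // Populations beyond u32 range always fall through to cache.
    assert_eq!(sample_indices_dispatch(1u64 << 40, 1u64 << 39), "cache");
}
```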

@vitiral (Contributor) commented Dec 5, 2017

I actually considered doing something like you suggest (splitting the method up by the maximum size to conserve memory/overhead) but left that as a future improvement. The problem is that sample_indices would then have to copy the data over to a new Vec<usize>, which is an annoying cost.

Ideally sample_indices would be fully generic and would just use sizeof<T> to make the correct choices. sample_slice could then call the correct one based on amount or length.

However, I was not sure how to make something generic over any integers, it looks like I might have to use the num crate?

I think this PR stands on its own, and making things generic isn't going to change this cutoff: I would make the cache version generic too, and the hashmap should reap (roughly) the same benefits.

I don't think thinking TOO hard about absurdly large numbers of indices is very useful at this stage... if someone's use case is sampling 1/20th of their total memory (as indices!) then they should probably consider rolling their own code (specifically, they might want a more memory-packed hashmap implementation, etc.).

@vitiral (Contributor) commented Dec 5, 2017

These numbers are very interesting, thanks for running them!

It's good to see that the cache implementation pays off for the "general use case" of sampling a small part of a large population. That's its main use case!

It's also incredible just how fast array creation and indexing is. The inplace version really does make big wins pretty quickly.

@vitiral (Contributor) commented Dec 5, 2017

This actually has me pondering whether there is possibly a third implementation for length / 20 ~< amount ~< length / 2.

For instance, when looking at possible methods I came across this paper. They promise that:

The algorithms require a constant amount of space and are short and easy to implement. The main result of this paper is the design and analysis of Algorithm D, which does the sampling in O(n) time, on the average; roughly n uniform random variates are generated, and approximately n exponentiation operations (of the form a^b, for real numbers a and b) are performed during the sampling. This solves an open problem in the literature. CPU timings on a large mainframe computer indicate that Algorithm D is significantly faster than the sampling algorithms in use today.

In my (extremely amateur) assessment I thought it would not beat the cached/inplace methods, since those have the same big-O and do not rely on exponentiation. However, on more careful (but still amateur...) analysis, it's possible there is a niche for such sampling methods within that range.

One problem is that although the sampling is random, their method actually yields the elements in their original sequential order, so we would either have to drop our guarantee of "random" ordering or call shuffle on the indices for their method.

Another downside (of a new method) is that the inplace/cached implementations have identical logic, which makes it much easier to test that they are accurate (given the same seeded RNG they generate identical samples). It also reduces the "magic" under the hood: you are always using the same algorithm, one variant just optimized for small samples from large populations.

There may be additional methods we could leverage between this range (or even replace the cached version entirely). I'm not sure I'm the right person for the job though 😜

@dhardy (Member, Author) commented Dec 5, 2017

The lazy option would be to make the inplace version u32 only and the condition length <= u32::MAX && amount <= length / 20. If you wish to investigate additional optimisation I'll leave that to you (though moderately simple code is also an advantage).

@vitiral (Contributor) commented Dec 5, 2017

Ya, I don't like the "smell" of that... the reason being that I expect memory sizes to explode in the next decade-ish because of technologies like 3D XPoint. I don't want to be tied to the ultra-slow implementation just because the length is above u32::MAX.

I was intending to open an issue to investigate a generic sample_indices. I just opened that ticket: #202

Some of the benchmarks take a while to run; we only need to check they work.
script:
- cargo doc --no-deps --all-features
- cargo bench
- cargo test --benches
@vitiral (Contributor) commented:

I'm not familiar with test --benches; does this automatically build in release?

@vitiral (Contributor) commented:

Or does it just "run the benches once", i.e. treat them as tests?

@dhardy (Member, Author) commented Dec 5, 2017

Yes, that's it. I don't see any value in running the actual benchmarks on CI hardware. No idea whether it builds in debug/release/hybrid mode and don't really care.

@vitiral (Contributor) commented:

Agreed. I didn't know about that flag; it's pretty cool! 😄

@dhardy dhardy merged commit 0e88e28 into rust-random:master Dec 7, 2017
@dhardy dhardy deleted the benches branch December 19, 2017 17:07
pitdicker pushed a commit to pitdicker/rand that referenced this pull request Apr 4, 2018
Port benches from dhardy/master