
Port benches from dhardy/master#201

Merged
dhardy merged 7 commits into rust-random:master from dhardy:benches on Dec 7, 2017

Conversation

@dhardy (Member) commented Dec 5, 2017

This "copies" two of the three updated benchmarks. The distributions benchmark was not included because the distributions API changed massively in dhardy/master.

There are also some new benchmarks for the new seq code by @vitiral, and a corresponding adjustment to the factor used to select the index sampling method.

I'll leave this PR open to comments for a couple of days before merging.

Full benchmark results on the index sampling:

test misc_sample_indices_cache_10k_of_1M   ... bench:     638,669 ns/iter (+/- 45,200)
test misc_sample_indices_cache_20k_of_1M   ... bench:   1,321,714 ns/iter (+/- 48,029)
test misc_sample_indices_cache_30k_of_1M   ... bench:   1,962,807 ns/iter (+/- 136,196)
test misc_sample_indices_cache_40k_of_1M   ... bench:   2,749,011 ns/iter (+/- 193,221)
test misc_sample_indices_cache_50k_of_1M   ... bench:   3,747,376 ns/iter (+/- 224,058)
test misc_sample_indices_inplace_10k_of_1M ... bench:   1,932,205 ns/iter (+/- 29,327)
test misc_sample_indices_inplace_20k_of_1M ... bench:   2,177,965 ns/iter (+/- 182,303)
test misc_sample_indices_inplace_30k_of_1M ... bench:   2,424,295 ns/iter (+/- 269,591)
test misc_sample_indices_inplace_40k_of_1M ... bench:   2,656,004 ns/iter (+/- 186,875)
test misc_sample_indices_inplace_50k_of_1M ... bench:   2,905,461 ns/iter (+/- 415,676)

test misc_sample_indices_cache_100k_of_1M   ... bench:   8,117,463 ns/iter (+/- 2,811,883)
test misc_sample_indices_cache_200k_of_1M   ... bench:  17,116,134 ns/iter (+/- 2,513,516)
test misc_sample_indices_cache_300k_of_1M   ... bench:  33,816,353 ns/iter (+/- 1,446,760)
test misc_sample_indices_cache_400k_of_1M   ... bench:  46,573,604 ns/iter (+/- 863,395)
test misc_sample_indices_cache_500k_of_1M   ... bench:  72,825,164 ns/iter (+/- 862,083)
test misc_sample_indices_inplace_100k_of_1M ... bench:   4,104,628 ns/iter (+/- 171,397)
test misc_sample_indices_inplace_200k_of_1M ... bench:   6,504,177 ns/iter (+/- 242,639)
test misc_sample_indices_inplace_300k_of_1M ... bench:   8,793,716 ns/iter (+/- 326,420)
test misc_sample_indices_inplace_400k_of_1M ... bench:  10,855,500 ns/iter (+/- 563,205)
test misc_sample_indices_inplace_500k_of_1M ... bench:  12,791,521 ns/iter (+/- 604,272)
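The crossover in these numbers can be located mechanically. The following sketch (using the figures copied from the output above, not new measurements; the function name is mine) finds the first amount at which inplace beats cache:

```rust
// (amount, cache ns/iter, inplace ns/iter) from the 1M-population runs above.
const RUNS: [(u64, u64, u64); 5] = [
    (10_000, 638_669, 1_932_205),
    (20_000, 1_321_714, 2_177_965),
    (30_000, 1_962_807, 2_424_295),
    (40_000, 2_749_011, 2_656_004),
    (50_000, 3_747_376, 2_905_461),
];

// First sampled amount at which the inplace method is faster than cache.
fn crossover_amount(runs: &[(u64, u64, u64)]) -> Option<u64> {
    runs.iter()
        .find(|&&(_, cache, inplace)| inplace < cache)
        .map(|&(amount, _, _)| amount)
}

fn main() {
    // Crossover lands between 30k and 40k of 1M, i.e. near length / 20,
    // which is consistent with the amount * 20 >= length cutoff below.
    assert_eq!(crossover_amount(&RUNS), Some(40_000));
}
```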

Code:

macro_rules! sample_indices {
    ($name:ident, $method:ident, $amount:expr, $length:expr) => {
        #[bench]
        fn $name(b: &mut Bencher) {
            let mut rng = weak_rng();
            b.iter(|| {
                black_box($method(&mut rng, $length, $amount));
            })
        }
    }
}

sample_indices!(misc_sample_indices_inplace_10k_of_1M, sample_indices_inplace, 10_000, 1_000_000);
sample_indices!(misc_sample_indices_inplace_20k_of_1M, sample_indices_inplace, 20_000, 1_000_000);
sample_indices!(misc_sample_indices_inplace_30k_of_1M, sample_indices_inplace, 30_000, 1_000_000);
sample_indices!(misc_sample_indices_inplace_40k_of_1M, sample_indices_inplace, 40_000, 1_000_000);
sample_indices!(misc_sample_indices_inplace_50k_of_1M, sample_indices_inplace, 50_000, 1_000_000);

sample_indices!(misc_sample_indices_cache_10k_of_1M, sample_indices_cache, 10_000, 1_000_000);
sample_indices!(misc_sample_indices_cache_20k_of_1M, sample_indices_cache, 20_000, 1_000_000);
sample_indices!(misc_sample_indices_cache_30k_of_1M, sample_indices_cache, 30_000, 1_000_000);
sample_indices!(misc_sample_indices_cache_40k_of_1M, sample_indices_cache, 40_000, 1_000_000);
sample_indices!(misc_sample_indices_cache_50k_of_1M, sample_indices_cache, 50_000, 1_000_000);

src/seq.rs Outdated

// and a trade off could probably be made between memory/cpu, since hashmap operations
// are slower than array index swapping.
- if amount >= length / 2 {
+ if amount * 20 >= length {
@vitiral (Contributor) commented Dec 5, 2017

I would do if amount >= length / 20 so that there is no risk of overflow (a possibility on 32-bit systems... maybe).
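The overflow concern can be illustrated by modelling a 32-bit usize with u32 explicitly (a sketch; the function names are hypothetical, not code from this PR):

```rust
// On a 32-bit target, usize arithmetic behaves like u32, so amount * 20
// can wrap for large amounts, while length / 20 can never overflow.
fn cutoff_multiply(amount: u32, length: u32) -> Option<bool> {
    // Returns None once amount > u32::MAX / 20.
    amount.checked_mul(20).map(|a20| a20 >= length)
}

fn cutoff_divide(amount: u32, length: u32) -> bool {
    // Division cannot overflow, so this form is always safe
    // (it differs only by rounding when length is not a multiple of 20).
    amount >= length / 20
}

fn main() {
    let length = u32::MAX;
    let amount = length / 2; // large but valid sample size
    assert_eq!(cutoff_multiply(amount, length), None); // would have wrapped
    assert!(cutoff_divide(amount, length)); // still gives an answer
}
```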

On that note, one might think there is a tradeoff here between speed and memory usage. Let's actually look at it a bit.

The memory usage of inplace is sizeof<usize>() * length

The memory usage of cache is a bit more complicated. It is:

(sizeof<usize>() * amount)  # Vec return value
    # HashTable usage:
    # - size of the stored "hash" value
    # - size of the stored key/value pair
    + (sizeof<usize>() + sizeof<usize>() * 2) * amount 

Which is obviously just

sizeof<usize>() * 4 * amount

This does not account for how much the allocated capacity exceeds the used length.

So from a memory usage perspective you might want to do

if amount * 4 >= length 

But there is always a tradeoff. Basically, at amount >= length / 20 you will be using around 5x more memory than the cache implementation would. If that is acceptable (which I honestly think it is) then go for it!
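The estimates above can be written down directly. This is a sketch of the comment's arithmetic (not measured allocations; the function names are mine), confirming the 4x-per-sampled-index cache figure and the 5x ratio at amount = length / 20:

```rust
use std::mem::size_of;

// inplace allocates one usize per element of the population.
fn inplace_bytes(length: usize) -> usize {
    size_of::<usize>() * length
}

// cache allocates per *sampled* element: the result Vec, plus a rough
// HashMap entry (stored hash + key/value pair), per the analysis above.
fn cache_bytes(amount: usize) -> usize {
    let w = size_of::<usize>();
    w * amount             // Vec return value
        + (w + 2 * w) * amount // hash + key/value per map entry
}

fn main() {
    let (length, amount) = (1_000_000, 50_000); // amount = length / 20
    assert_eq!(cache_bytes(amount), size_of::<usize>() * 4 * amount);
    assert_eq!(inplace_bytes(length), 5 * cache_bytes(amount));
}
```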

@dhardy (Member, Author) replied:

So long as that much memory is available, I think it should be used. Unfortunately, knowing how much is available is another problem, so the crude approximation amount >= length / N seems reasonable IMO, for some N between 4 and 20. Optimising for execution time is probably more important.

Another thing that occurs to me: (a) the length is specified as a usize due to common use (container indices), but this gives the function different capabilities on 32- and 64-bit platforms; at the same time, (b) using smaller integer sizes should help significantly, and (c) if length >= 2^32 then the user is dealing with very big numbers (unusual) and the inplace version of the function is almost certainly not applicable. So do you think sample_indices should accept u64 arguments, test whether they are within u32 range, and sample_indices_inplace should always use u32 numbers? Potentially it could even be worth having a version of sample_indices_inplace using smaller ints (u16), but probably not. I'm also not sure whether sample_indices_cache should use usize or u64 in this case, or possibly both u32 and u64 impls.

If you see any utility in the suggestions above, it might be better if you make a separate PR and I remove the changes to seq from this one.
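A minimal sketch of the dispatch suggested above, under the assumption that inplace is u32-only and is chosen when the sampled fraction is large (the function name and string results are hypothetical, not the crate's API):

```rust
// Accept u64 lengths; only consider the (u32-only) inplace method when the
// population fits in u32 and the sampled fraction is at least 1/20.
fn sample_indices_dispatch(length: u64, amount: u64) -> &'static str {
    assert!(amount <= length);
    // The short-circuit on the u32 check also guards the multiplication:
    // when it runs, amount <= length <= u32::MAX, so amount * 20 < 2^37.
    if length <= u32::MAX as u64 && amount * 20 >= length {
        "inplace_u32"
    } else {
        "cache"
    }
}

fn main() {
    assert_eq!(sample_indices_dispatch(1_000_000, 100_000), "inplace_u32");
    assert_eq!(sample_indices_dispatch(1_000_000, 10_000), "cache");
    // Populations beyond u32 range always fall through to cache.
    assert_eq!(sample_indices_dispatch(1u64 << 40, 1u64 << 39), "cache");
}
```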

@vitiral (Contributor) commented Dec 5, 2017

I actually considered doing something like you suggest (splitting the method up by the maximum size to conserve memory/overhead) but left that as a future improvement. The problem is that sample_indices would then have to copy the data over to a new Vec<usize>, which is an annoying cost.

Ideally sample_indices would be fully generic and would just use sizeof<T> to make the correct choices. sample_slice could then call the correct one based on amount or length.

However, I was not sure how to make something generic over any integers, it looks like I might have to use the num crate?

I think this PR stands on its own, and making things generic isn't going to change this cutoff: I would make the cache version generic too, and the hashmap should reap (roughly) the same benefits.

I don't think thinking TOO hard about absurdly large numbers of indices is very useful at this stage... if someone's use case is sampling 1/20th of their total memory (as indices!) then they should probably consider rolling their own code (specifically, they might want a more memory-packed hashmap implementation, etc.).

@vitiral (Contributor) commented Dec 5, 2017

These numbers are very interesting, thanks for running them!

It's good to see that the cache implementation pays off for the "general use case" of sampling a small part of a large population. That's its main use case!

It's also incredible just how fast array creation and indexing is. The inplace version really does make big wins pretty quickly.

@vitiral (Contributor) commented Dec 5, 2017

This actually has me pondering whether there is possibly a third implementation for length / 20 ~< amount ~< length / 2.

For instance, when looking at possible methods I came across this paper. They promise that:

The algorithms require a constant amount of space and are short and easy to implement. The main result of this paper is the design and analysis of Algorithm D, which does the sampling in O(n) time, on the average; roughly n uniform random variates are generated, and approximately n exponentiation operations (of the form a^b, for real numbers a and b) are performed during the sampling. This solves an open problem in the literature. CPU timings on a large mainframe computer indicate that Algorithm D is significantly faster than the sampling algorithms in use today.

In my (extremely amateur) assessment I thought it would not beat the cached/inplace methods, since those have the same big-O and do not rely on exponentiation. However, on more careful (but still amateur...) analysis, it's possible there is a niche for such sampling methods within that range.

One problem is that although the sampling is random, their method actually yields the elements in their original sequential order, so we would either have to drop our guarantee of "random" ordering or call shuffle on the indices for their method.

Another downside (of a new method) is that the inplace/cached implementations have identical logic, which makes it much easier to test that they are accurate (given the same seeded RNG they generate identical samples). It also reduces the "magic" under the hood: you are always using the same algorithm, one variant just optimized for small samples from large populations.

There may be additional methods we could leverage between this range (or even replace the cached version entirely). I'm not sure I'm the right person for the job though 😜

@dhardy (Member, Author) commented Dec 5, 2017

The lazy option would be to make the inplace version u32 only and the condition length <= u32::MAX && amount <= length / 20. If you wish to investigate additional optimisation I'll leave that to you (though moderately simple code is also an advantage).

@vitiral (Contributor) commented Dec 5, 2017

Ya, I don't like the "smell" of that... the reason being that I expect memory sizes to explode in the next decade-ish because of technologies like 3D XPoint. I don't want to be tied to the ultra-slow implementation just because the length is above u32::MAX.

I was intending to open an issue to investigate a generic sample_indices. I just opened that ticket: #202

Some of the benchmarks take a while to run; we only need to check they work.
script:
- cargo doc --no-deps --all-features
- cargo bench
- cargo test --benches
@vitiral (Contributor) commented:

I'm not familiar with test --benches; does this automatically build in release?

@vitiral (Contributor) commented:

Or does it just "run the benches once", i.e. treat them as tests?

@dhardy (Member, Author) commented Dec 5, 2017

Yes, that's it. I don't see any value in running the actual benchmarks on CI hardware. No idea whether it builds in debug/release/hybrid mode and don't really care.

@vitiral (Contributor) commented:

Agreed. I didn't know about that flag; it's pretty cool! 😄

@dhardy dhardy merged commit 0e88e28 into rust-random:master Dec 7, 2017
@dhardy dhardy deleted the benches branch December 19, 2017 17:07
pitdicker pushed a commit to pitdicker/rand that referenced this pull request Apr 4, 2018
Port benches from dhardy/master