Conversation
This PR adds more benchmarks so we can get an accurate idea about two
things:
- What is the cost of having to zero the buffer before calling
  `getrandom`?
- What is the performance on aligned, 32-byte buffers?
  - This is by far the most common use, as it's used to seed
    userspace CSPRNGs.
I ran the benchmarks on my system:
- CPU: AMD Ryzen 7 5700G
- OS: Linux 5.15.52-1-lts
- Rust Version: 1.62.0-nightly (ea92b0838 2022-05-07)
I got the following results:
```
test bench_large ... bench: 3,759,323 ns/iter (+/- 177,100) = 557 MB/s
test bench_large_init ... bench: 3,821,229 ns/iter (+/- 39,132) = 548 MB/s
test bench_page ... bench: 7,281 ns/iter (+/- 59) = 562 MB/s
test bench_page_init ... bench: 7,290 ns/iter (+/- 69) = 561 MB/s
test bench_seed ... bench: 206 ns/iter (+/- 3) = 155 MB/s
test bench_seed_init ... bench: 206 ns/iter (+/- 1) = 155 MB/s
```
These results were very consistent across multiple runs, and roughly
behave as we would expect:
- The throughput is highest with a buffer large enough to amortize the
  syscall overhead, but small enough to stay in the L1D cache.
- There is a _very_ small cost to zeroing the buffer beforehand.
- This cost is imperceptible in the common 32-byte use case, where the
  syscall overhead dominates.
- The cost is slightly higher (~1.6%) with multi-megabyte buffers as the
  data gets evicted from the L1 cache between the `memset` and the
  call to `getrandom`.
I would love to see results for other platforms. Could we get someone to
run this on an M1 Mac?
Signed-off-by: Joe Richey <joerichey@google.com>
Member (Author)
I also locally patched the crate to use the RDRAND implementation. Again, these results were quite stable over multiple runs, showing a small improvement from not having to initialize the buffer. For this and the above x86_64 Linux benchmark, I used `RUSTFLAGS="-C opt-level=3 -C codegen-units=1 -C embed-bitcode=yes -C lto=fat -C target-cpu=native"`.
@newpavlov anything blocking merging in these benchmarks? If we merge them in, it will be easier for people to run them on different platforms. This will, in turn, make it easier to figure out if #226 and #271 are worth it.
On another system:
- Linux implementation (default):
- RDRAND implementation (patched):

Again, the difference is detectable, but very, very small.
newpavlov approved these changes on Jul 13, 2022.
takumi-earth pushed a commit to earthlings-dev/getrandom that referenced this pull request on Jan 27, 2026.