SelfContainedChannel benchmarks and candidate optimizations#2791
SelfContainedChannel benchmarks and candidate optimizations#2791sporksmith merged 12 commits intoshadow:mainfrom
Conversation
This is a bit of a long-shot, but when I did this in shadow#2791, the microbenchmark results became much more consistent, and apparent benefits of some of the candidate optimizations went away. I think it's less likely to matter here since the channels themselves are bigger - they'll have a ShimEvent placed between them in memory - and so probably won't end up on the same cache line anyways, but it's worth checking.
|
After rebasing and adding a large alignment to the candidate Rust IPCData, it appears to have the same performance as the C version. I'm still a little curious about what these microbenchmarks look like on our benchmark runner machine, but I think we can drop it if it's a hassle. |
Sure. The benchmark machine is running a benchmark at the moment so I used a newer AMD machine instead. I can run on the benchmark machine when idle if desired. Here is a screenshot with the fancy colors. Text output below. |
|
@robgjansen thanks! Interesting but orthogonal that pinning makes a much bigger difference on that machine. It looks like removing the Reading and Writing state (and making the interfaces Since the e2e tor benchmark for migrating IPC to Rust now appears to be performance neutral with the "basic" implementation, I'm leaning towards just moving forward with that. I'll add a comment to consider revisiting that optimization in the future (and adding the checked Reader and Writer "checkouts" to make it safe again), but it doesn't seem worth it right now. |
11b7e1a to
cbe9441
Compare
|
OK, sounds good. The machine I used support hyperthreading whereas the benchmark runner does not, which might have something to do with pinning performance. |
```
$ cargo bench --manifest-path=src/Cargo.toml -p vasi-sync
Running benches/scchannel.rs (src/target/release/deps/scchannel-ae8dc0b68617fce5)
ping pong time: [4.3334 µs 4.5088 µs 4.6953 µs]
Found 5 outliers among 100 measurements (5.00%)
1 (1.00%) low severe
2 (2.00%) low mild
2 (2.00%) high mild
ping pong pinned time: [3.2115 µs 3.2179 µs 3.2259 µs]
Found 13 outliers among 100 measurements (13.00%)
7 (7.00%) high mild
6 (6.00%) high severe
```
```
$ cargo bench --manifest-path=src/Cargo.toml -p vasi-sync
Running benches/scchannel.rs (src/target/release/deps/scchannel-ae8dc0b68617fce5)
ping pong time: [4.2460 µs 4.4094 µs 4.5736 µs]
change: [-4.5442% -0.5164% +3.7840%] (p = 0.81 > 0.05)
No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
1 (1.00%) low severe
4 (4.00%) low mild
1 (1.00%) high mild
ping pong pinned time: [3.1927 µs 3.1973 µs 3.2016 µs]
change: [-1.0734% -0.8648% -0.6660%] (p = 0.00 < 0.05)
Change within noise threshold.
```
```
$ cargo bench --manifest-path=src/Cargo.toml -p vasi-sync
Running benches/scchannel.rs (src/target/release/deps/scchannel-ae8dc0b68617fce5)
ping pong time: [4.1223 µs 4.2546 µs 4.3760 µs]
change: [-8.6267% -4.6473% -0.6051%] (p = 0.03 < 0.05)
Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
3 (3.00%) low severe
4 (4.00%) low mild
ping pong pinned time: [3.1942 µs 3.1974 µs 3.2008 µs]
change: [+0.0989% +0.2742% +0.4571%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
5 (5.00%) high mild
2 (2.00%) high severe
```
This mostly works, but:
* there's a loom failure I can't figure out in one test (which
shouldn't actually matter in shadow)
* we can't safely drop the contents of the channel (if any)
after closing it. This might be fixable, but generally there's a
potential race if we try to drop the contents when closing, since it
could be getting read in parallel.
It appears to hurt performance without pinning, and be a no-op with
pinning.
```
$ cargo bench --manifest-path=src/Cargo.toml -p vasi-sync
ping pong time: [4.8376 µs 4.8934 µs 4.9665 µs]
change: [+10.053% +13.523% +17.508%] (p = 0.00 < 0.05)
Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
2 (2.00%) low mild
1 (1.00%) high mild
8 (8.00%) high severe
ping pong pinned time: [3.2015 µs 3.2051 µs 3.2094 µs]
change: [+0.5280% +0.6995% +0.8895%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
3 (3.00%) high mild
1 (1.00%) high severe
```
This reverts commit 11050ef. This doesn't appear to help performance, and appears to still have some concurrency bugs according to loom and miri. Keeping in the commit history for posterity.
This reverts commit df29378. This complicates the design and doesn't appear to help performance. Keeping in the commit history for posterity.
This reverts three commits: * Revert "Update SelfContainedChannel benchmark" * Revert "scchannel: remove Reading state" * Revert "scchannel: remove Writing state" It also adds a comment to SelfContainedMutex to consider revisiting. The reverted change significantly complicates the design, especially if we wanted to make safe APIs (e.g. by adding checked creation of `!Send` Reader and Writer ends of the channel). On my desktop machine it appears to have ~zero performance benefit. On a larger NRL machine it had a little performance benefit, but mostly only when pinning is disabled, which we expect to be rare in large simulations since that already significantly hurts performance. In case we want to consider revisiting, here is the performance benefit on one of the large NRL machines: ``` HEAD is now at 2628adb Update SelfContainedChannel benchmark Compiling vasi-sync v0.1.0 (/home/rjansen/workspace/temp/shadow/src/lib/vasi-sync) Finished bench [optimized] target(s) in 1.58s Running unittests src/lib.rs (src/target/release/deps/vasi_sync-fa33eb5cd8eff33e) running 0 tests test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s Running benches/scchannel.rs (src/target/release/deps/scchannel-24425348ef654f1a) ping pong time: [11.567 µs 11.955 µs 12.254 µs] change: [-25.107% -22.604% -19.866%] (p = 0.00 < 0.05) Performance has improved. Found 7 outliers among 100 measurements (7.00%) 5 (5.00%) low severe 1 (1.00%) low mild 1 (1.00%) high mild ping pong pinned time: [2.1631 µs 2.1662 µs 2.1693 µs] change: [-3.7961% -3.5645% -3.3038%] (p = 0.00 < 0.05) Performance has improved. Found 3 outliers among 100 measurements (3.00%) 2 (2.00%) high mild 1 (1.00%) high severe ```
cbe9441 to
efd79a1
Compare

This adds criterion-based microbenchmarks for SelfContainedChannel. On top of that it adds several candidate optimizations, with the microbenchmark results in the corresponding commit messages.
On my desktop dev machine these are all either neutral or hurt performance.
On a larger simulation machine, removing the internal "Reading" and "Writing" states has some benefit, but mostly only when pinning is disabled.
The candidate optimizations and microbenchmark results are reverted, but left in the commit history in case we want to revisit in the future.
Outdated (to be stripped from merge commit message)
@robgjansen would it be possible for you to run these on the benchmark machine or an equivalent machine? It should only take a few minutes to run.
I think it'd look something like this: