SelfContainedChannel benchmarks and candidate optimizations by sporksmith · Pull Request #2791 · shadow/shadow

sporksmith · 2023-03-16T23:20:50Z

This adds criterion-based microbenchmarks for SelfContainedChannel. On top of that it adds several candidate optimizations, with the microbenchmark results in the corresponding commit messages.

On my desktop dev machine these are all either neutral or hurt performance.

On a larger simulation machine, removing the internal "Reading" and "Writing" states has some benefit, but mostly only when pinning is disabled.

The candidate optimizations and microbenchmark results are reverted, but left in the commit history in case we want to revisit in the future.

Outdated (to be stripped from merge commit message)

@robgjansen would it be possible for you to run these on the benchmark machine or an equivalent machine? It should only take a few minutes to run.

I think it'd look something like this:

git remote add sporksmith https://github.com/sporksmith/shadow.git
git fetch sporksmith

# baseline 
git checkout 1cecce3701ab98f049bf6b49e28d51dbd1df2296
cargo bench  --manifest-path=src/Cargo.toml  -p vasi-sync

# after removing Reading and Writing states 
git checkout 2628adb473592324a9db49a66a538884b28167a5
cargo bench  --manifest-path=src/Cargo.toml  -p vasi-sync

# after moving sleeper bit to a separate atomic
git checkout df293781a5f8e06357afc9051997c2718c8d47e0
cargo bench  --manifest-path=src/Cargo.toml  -p vasi-sync

# after moving writer_closed bit to a separate atomic
git checkout 11050ef98876001dc8d6619e104cf197d205b039
cargo bench  --manifest-path=src/Cargo.toml  -p vasi-sync

This is a bit of a long-shot, but when I did this in shadow#2791, the microbenchmark results became much more consistent, and apparent benefits of some of the candidate optimizations went away. I think it's less likely to matter here since the channels themselves are bigger - they'll have a ShimEvent placed between them in memory - and so probably won't end up on the same cache line anyways, but it's worth checking.

sporksmith · 2023-03-17T15:03:47Z

After rebasing and adding a large alignment to the candidate Rust IPCData, it appears to have the same performance as the C version. I'm still a little curious about what these microbenchmarks look like on our benchmark runner machine, but I think we can drop it if it's a hassle.

robgjansen · 2023-03-17T15:10:07Z

@robgjansen would it be possible for you to run these on the benchmark machine or an equivalent machine?

Sure. The benchmark machine is running a benchmark at the moment so I used a newer AMD machine instead. I can run on the benchmark machine when idle if desired.

Here is a screenshot with the fancy colors. Text output below.

    Finished bench [optimized] target(s) in 4m 37s                                                                                                         
     Running unittests src/lib.rs (src/target/release/deps/vasi_sync-fa33eb5cd8eff33e)                                                                     

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running benches/scchannel.rs (src/target/release/deps/scchannel-24425348ef654f1a)
ping pong               time:   [14.594 µs 15.341 µs 15.913 µs]
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) low severe

ping pong pinned        time:   [2.2432 µs 2.2462 µs 2.2491 µs]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

Previous HEAD position was 1cecce37 Add a benchmark for SelfContainedChannel
HEAD is now at 2628adb4 Update SelfContainedChannel benchmark
   Compiling vasi-sync v0.1.0 (/home/rjansen/workspace/temp/shadow/src/lib/vasi-sync)
    Finished bench [optimized] target(s) in 1.58s
     Running unittests src/lib.rs (src/target/release/deps/vasi_sync-fa33eb5cd8eff33e)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running benches/scchannel.rs (src/target/release/deps/scchannel-24425348ef654f1a)
ping pong               time:   [11.567 µs 11.955 µs 12.254 µs]
                        change: [-25.107% -22.604% -19.866%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild

ping pong pinned        time:   [2.1631 µs 2.1662 µs 2.1693 µs]
                        change: [-3.7961% -3.5645% -3.3038%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

Previous HEAD position was 2628adb4 Update SelfContainedChannel benchmark
HEAD is now at df293781 scchannel: move sleeper bit to another atomic
   Compiling vasi v0.1.0 (/home/rjansen/workspace/temp/shadow/src/lib/vasi)
   Compiling vasi-sync v0.1.0 (/home/rjansen/workspace/temp/shadow/src/lib/vasi-sync)
    Finished bench [optimized] target(s) in 1.61s
     Running unittests src/lib.rs (src/target/release/deps/vasi_sync-fa33eb5cd8eff33e)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running benches/scchannel.rs (src/target/release/deps/scchannel-24425348ef654f1a)
ping pong               time:   [11.916 µs 12.481 µs 12.926 µs]
                        change: [+1.7138% +5.3248% +8.7079%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) low severe

ping pong pinned        time:   [2.1969 µs 2.2006 µs 2.2046 µs]
                        change: [+1.5706% +2.0104% +2.6089%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

Previous HEAD position was df293781 scchannel: move sleeper bit to another atomic
HEAD is now at 11050ef9 move writer_closed to a separate atomic
   Compiling vasi-sync v0.1.0 (/home/rjansen/workspace/temp/shadow/src/lib/vasi-sync)
    Finished bench [optimized] target(s) in 1.57s
     Running unittests src/lib.rs (src/target/release/deps/vasi_sync-fa33eb5cd8eff33e)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running benches/scchannel.rs (src/target/release/deps/scchannel-24425348ef654f1a)
ping pong               time:   [11.453 µs 11.988 µs 12.366 µs]
                        change: [-8.2206% -4.9541% -1.5380%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) low severe
  1 (1.00%) high mild

ping pong pinned        time:   [2.2517 µs 2.2544 µs 2.2571 µs]
                        change: [+1.6144% +2.2020% +2.6288%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

sporksmith · 2023-03-17T15:18:36Z

@robgjansen thanks!

Interesting but orthogonal that pinning makes a much bigger difference on that machine.

It looks like removing the Reading and Writing state (and making the interfaces unsafe) helps a bit on this machine, though mostly when not using pinning.

Since the e2e tor benchmark for migrating IPC to Rust now appears to be performance neutral with the "basic" implementation, I'm leaning towards just moving forward with that. I'll add a comment to consider revisiting that optimization in the future (and adding the checked Reader and Writer "checkouts" to make it safe again), but it doesn't seem worth it right now.

robgjansen · 2023-03-17T21:43:27Z

OK, sounds good. The machine I used support hyperthreading whereas the benchmark runner does not, which might have something to do with pinning performance.

``` $ cargo bench --manifest-path=src/Cargo.toml -p vasi-sync Running benches/scchannel.rs (src/target/release/deps/scchannel-ae8dc0b68617fce5) ping pong time: [4.3334 µs 4.5088 µs 4.6953 µs] Found 5 outliers among 100 measurements (5.00%) 1 (1.00%) low severe 2 (2.00%) low mild 2 (2.00%) high mild ping pong pinned time: [3.2115 µs 3.2179 µs 3.2259 µs] Found 13 outliers among 100 measurements (13.00%) 7 (7.00%) high mild 6 (6.00%) high severe ```

``` $ cargo bench --manifest-path=src/Cargo.toml -p vasi-sync Running benches/scchannel.rs (src/target/release/deps/scchannel-ae8dc0b68617fce5) ping pong time: [4.2460 µs 4.4094 µs 4.5736 µs] change: [-4.5442% -0.5164% +3.7840%] (p = 0.81 > 0.05) No change in performance detected. Found 6 outliers among 100 measurements (6.00%) 1 (1.00%) low severe 4 (4.00%) low mild 1 (1.00%) high mild ping pong pinned time: [3.1927 µs 3.1973 µs 3.2016 µs] change: [-1.0734% -0.8648% -0.6660%] (p = 0.00 < 0.05) Change within noise threshold. ```

``` $ cargo bench --manifest-path=src/Cargo.toml -p vasi-sync Running benches/scchannel.rs (src/target/release/deps/scchannel-ae8dc0b68617fce5) ping pong time: [4.1223 µs 4.2546 µs 4.3760 µs] change: [-8.6267% -4.6473% -0.6051%] (p = 0.03 < 0.05) Change within noise threshold. Found 7 outliers among 100 measurements (7.00%) 3 (3.00%) low severe 4 (4.00%) low mild ping pong pinned time: [3.1942 µs 3.1974 µs 3.2008 µs] change: [+0.0989% +0.2742% +0.4571%] (p = 0.00 < 0.05) Change within noise threshold. Found 7 outliers among 100 measurements (7.00%) 5 (5.00%) high mild 2 (2.00%) high severe ```

This mostly works, but: * there's a loom failure I can't figure out in one test (which shouldn't actually matter in shadow) * we can't safely drop the contents of the channel (if any) after closing it. This might be fixable, but generally there's a potential race if we try to drop the contents when closing, since it could be getting read in parallel. It appears to hurt performance without pinning, and be a no-op with pinning. ``` $ cargo bench --manifest-path=src/Cargo.toml -p vasi-sync ping pong time: [4.8376 µs 4.8934 µs 4.9665 µs] change: [+10.053% +13.523% +17.508%] (p = 0.00 < 0.05) Performance has regressed. Found 11 outliers among 100 measurements (11.00%) 2 (2.00%) low mild 1 (1.00%) high mild 8 (8.00%) high severe ping pong pinned time: [3.2015 µs 3.2051 µs 3.2094 µs] change: [+0.5280% +0.6995% +0.8895%] (p = 0.00 < 0.05) Change within noise threshold. Found 4 outliers among 100 measurements (4.00%) 3 (3.00%) high mild 1 (1.00%) high severe ```

This reverts commit 11050ef. This doesn't appear to help performance, and appears to still have some concurrency bugs according to loom and miri. Keeping in the commit history for posterity.

This reverts commit df29378. This complicates the design and doesn't appear to help performance. Keeping in the commit history for posterity.

This reverts three commits: * Revert "Update SelfContainedChannel benchmark" * Revert "scchannel: remove Reading state" * Revert "scchannel: remove Writing state" It also adds a comment to SelfContainedMutex to consider revisiting. The reverted change significantly complicates the design, especially if we wanted to make safe APIs (e.g. by adding checked creation of `!Send` Reader and Writer ends of the channel). On my desktop machine it appears to have ~zero performance benefit. On a larger NRL machine it had a little performance benefit, but mostly only when pinning is disabled, which we expect to be rare in large simulations since that already significantly hurts performance. In case we want to consider revisiting, here is the performance benefit on one of the large NRL machines: ``` HEAD is now at 2628adb Update SelfContainedChannel benchmark Compiling vasi-sync v0.1.0 (/home/rjansen/workspace/temp/shadow/src/lib/vasi-sync) Finished bench [optimized] target(s) in 1.58s Running unittests src/lib.rs (src/target/release/deps/vasi_sync-fa33eb5cd8eff33e) running 0 tests test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s Running benches/scchannel.rs (src/target/release/deps/scchannel-24425348ef654f1a) ping pong time: [11.567 µs 11.955 µs 12.254 µs] change: [-25.107% -22.604% -19.866%] (p = 0.00 < 0.05) Performance has improved. Found 7 outliers among 100 measurements (7.00%) 5 (5.00%) low severe 1 (1.00%) low mild 1 (1.00%) high mild ping pong pinned time: [2.1631 µs 2.1662 µs 2.1693 µs] change: [-3.7961% -3.5645% -3.3038%] (p = 0.00 < 0.05) Performance has improved. Found 3 outliers among 100 measurements (3.00%) 2 (2.00%) high mild 1 (1.00%) high severe ```

github-actions bot assigned sporksmith Mar 16, 2023

github-actions bot added Component: Build Build/install tools and dependencies Component: Libraries Support functions like LD_PRELOAD and logging labels Mar 16, 2023

This was referenced Mar 17, 2023

Unsafe scmutex #2777

Closed

Separate sleeper atomic #2778

Closed

Convert Ipc to Rust #2775

Merged

sporksmith force-pushed the scchannel-bench branch 2 times, most recently from 11b7e1a to cbe9441 Compare March 17, 2023 15:39

sporksmith added this to the Convert existing modules to Rust milestone Mar 17, 2023

sporksmith marked this pull request as ready for review March 17, 2023 15:42

sporksmith requested a review from stevenengler March 17, 2023 15:42

stevenengler approved these changes Mar 20, 2023

View reviewed changes

sporksmith added 12 commits March 20, 2023 13:56

scchannel test close without write

c2a8bca

Add criterion dev dep to vasi-sync

ea922dd

scchannel: remove Writing state

d4f1b7b

scchannel: remove Reading state

605eb63

VirtualAddressSpaceIndependent: impl for AtomicBool

4e8dc0a

Revert "move writer_closed to a separate atomic"

f7afbbe

This reverts commit 11050ef. This doesn't appear to help performance, and appears to still have some concurrency bugs according to loom and miri. Keeping in the commit history for posterity.

Revert "scchannel: move sleeper bit to another atomic"

0d087b1

This reverts commit df29378. This complicates the design and doesn't appear to help performance. Keeping in the commit history for posterity.

sporksmith force-pushed the scchannel-bench branch from cbe9441 to efd79a1 Compare March 20, 2023 19:00

sporksmith enabled auto-merge March 20, 2023 19:00

sporksmith merged commit 22bb6ac into shadow:main Mar 20, 2023

sporksmith deleted the scchannel-bench branch March 20, 2023 19:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SelfContainedChannel benchmarks and candidate optimizations#2791

SelfContainedChannel benchmarks and candidate optimizations#2791
sporksmith merged 12 commits intoshadow:mainfrom
sporksmith:scchannel-bench

sporksmith commented Mar 16, 2023 •

edited

Loading

Uh oh!

sporksmith commented Mar 17, 2023

Uh oh!

robgjansen commented Mar 17, 2023

Uh oh!

sporksmith commented Mar 17, 2023

Uh oh!

robgjansen commented Mar 17, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

sporksmith commented Mar 16, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Outdated (to be stripped from merge commit message)

Uh oh!

sporksmith commented Mar 17, 2023

Uh oh!

robgjansen commented Mar 17, 2023

Uh oh!

sporksmith commented Mar 17, 2023

Uh oh!

robgjansen commented Mar 17, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

sporksmith commented Mar 16, 2023 •

edited

Loading