
SelfContainedChannel benchmarks and candidate optimizations #2791

Merged
sporksmith merged 12 commits into shadow:main from sporksmith:scchannel-bench
Mar 20, 2023

Conversation

@sporksmith
Contributor

@sporksmith sporksmith commented Mar 16, 2023

This adds criterion-based microbenchmarks for SelfContainedChannel. On top of that it adds several candidate optimizations, with the microbenchmark results in the corresponding commit messages.

On my desktop dev machine these are all either neutral or hurt performance.

On a larger simulation machine, removing the internal "Reading" and "Writing" states has some benefit, but mostly only when pinning is disabled.

The candidate optimizations and microbenchmark results are reverted, but left in the commit history in case we want to revisit in the future.

Outdated (to be stripped from merge commit message)

@robgjansen would it be possible for you to run these on the benchmark machine or an equivalent machine? It should only take a few minutes to run.

I think it'd look something like this:

```
git remote add sporksmith https://github.com/sporksmith/shadow.git
git fetch sporksmith

# baseline
git checkout 1cecce3701ab98f049bf6b49e28d51dbd1df2296
cargo bench --manifest-path=src/Cargo.toml -p vasi-sync

# after removing Reading and Writing states
git checkout 2628adb473592324a9db49a66a538884b28167a5
cargo bench --manifest-path=src/Cargo.toml -p vasi-sync

# after moving sleeper bit to a separate atomic
git checkout df293781a5f8e06357afc9051997c2718c8d47e0
cargo bench --manifest-path=src/Cargo.toml -p vasi-sync

# after moving writer_closed bit to a separate atomic
git checkout 11050ef98876001dc8d6619e104cf197d205b039
cargo bench --manifest-path=src/Cargo.toml -p vasi-sync
```

@github-actions github-actions bot added the Component: Build and Component: Libraries labels Mar 16, 2023
sporksmith added a commit to sporksmith/shadow that referenced this pull request Mar 16, 2023
This is a bit of a long-shot, but when I did this in shadow#2791,
the microbenchmark results became much more consistent, and apparent
benefits of some of the candidate optimizations went away.

I think it's less likely to matter here since the channels themselves
are bigger - they'll have a ShimEvent placed between them in memory -
and so probably won't end up on the same cache line anyways, but it's
worth checking.
@sporksmith
Contributor Author

After rebasing and adding a large alignment to the candidate Rust IPCData, it appears to have the same performance as the C version. I'm still a little curious about what these microbenchmarks look like on our benchmark runner machine, but I think we can drop it if it's a hassle.

@robgjansen
Member

> @robgjansen would it be possible for you to run these on the benchmark machine or an equivalent machine?

Sure. The benchmark machine is running a benchmark at the moment so I used a newer AMD machine instead. I can run on the benchmark machine when idle if desired.

Here is a screenshot with the fancy colors. Text output below.


```
    Finished bench [optimized] target(s) in 4m 37s
     Running unittests src/lib.rs (src/target/release/deps/vasi_sync-fa33eb5cd8eff33e)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running benches/scchannel.rs (src/target/release/deps/scchannel-24425348ef654f1a)
ping pong               time:   [14.594 µs 15.341 µs 15.913 µs]
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) low severe

ping pong pinned        time:   [2.2432 µs 2.2462 µs 2.2491 µs]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

Previous HEAD position was 1cecce37 Add a benchmark for SelfContainedChannel
HEAD is now at 2628adb4 Update SelfContainedChannel benchmark
   Compiling vasi-sync v0.1.0 (/home/rjansen/workspace/temp/shadow/src/lib/vasi-sync)
    Finished bench [optimized] target(s) in 1.58s
     Running unittests src/lib.rs (src/target/release/deps/vasi_sync-fa33eb5cd8eff33e)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running benches/scchannel.rs (src/target/release/deps/scchannel-24425348ef654f1a)
ping pong               time:   [11.567 µs 11.955 µs 12.254 µs]
                        change: [-25.107% -22.604% -19.866%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild

ping pong pinned        time:   [2.1631 µs 2.1662 µs 2.1693 µs]
                        change: [-3.7961% -3.5645% -3.3038%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

Previous HEAD position was 2628adb4 Update SelfContainedChannel benchmark
HEAD is now at df293781 scchannel: move sleeper bit to another atomic
   Compiling vasi v0.1.0 (/home/rjansen/workspace/temp/shadow/src/lib/vasi)
   Compiling vasi-sync v0.1.0 (/home/rjansen/workspace/temp/shadow/src/lib/vasi-sync)
    Finished bench [optimized] target(s) in 1.61s
     Running unittests src/lib.rs (src/target/release/deps/vasi_sync-fa33eb5cd8eff33e)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running benches/scchannel.rs (src/target/release/deps/scchannel-24425348ef654f1a)
ping pong               time:   [11.916 µs 12.481 µs 12.926 µs]
                        change: [+1.7138% +5.3248% +8.7079%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) low severe

ping pong pinned        time:   [2.1969 µs 2.2006 µs 2.2046 µs]
                        change: [+1.5706% +2.0104% +2.6089%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

Previous HEAD position was df293781 scchannel: move sleeper bit to another atomic
HEAD is now at 11050ef9 move writer_closed to a separate atomic
   Compiling vasi-sync v0.1.0 (/home/rjansen/workspace/temp/shadow/src/lib/vasi-sync)
    Finished bench [optimized] target(s) in 1.57s
     Running unittests src/lib.rs (src/target/release/deps/vasi_sync-fa33eb5cd8eff33e)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running benches/scchannel.rs (src/target/release/deps/scchannel-24425348ef654f1a)
ping pong               time:   [11.453 µs 11.988 µs 12.366 µs]
                        change: [-8.2206% -4.9541% -1.5380%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) low severe
  1 (1.00%) high mild

ping pong pinned        time:   [2.2517 µs 2.2544 µs 2.2571 µs]
                        change: [+1.6144% +2.2020% +2.6288%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
```

@sporksmith
Contributor Author

@robgjansen thanks!

It's interesting, though orthogonal, that pinning makes a much bigger difference on that machine.

It looks like removing the Reading and Writing states (and making the interfaces unsafe) helps a bit on this machine, though mostly when not using pinning.

Since the e2e tor benchmark for migrating IPC to Rust now appears to be performance neutral with the "basic" implementation, I'm leaning towards just moving forward with that. I'll add a comment to consider revisiting that optimization in the future (and adding the checked Reader and Writer "checkouts" to make it safe again), but it doesn't seem worth it right now.

@sporksmith sporksmith force-pushed the scchannel-bench branch 2 times, most recently from 11b7e1a to cbe9441 Compare March 17, 2023 15:39
@sporksmith sporksmith marked this pull request as ready for review March 17, 2023 15:42
@sporksmith sporksmith requested a review from stevenengler March 17, 2023 15:42
@robgjansen
Member

OK, sounds good. The machine I used supports hyperthreading, whereas the benchmark runner does not, which might have something to do with the pinning performance.

```
$ cargo bench  --manifest-path=src/Cargo.toml  -p vasi-sync
     Running benches/scchannel.rs (src/target/release/deps/scchannel-ae8dc0b68617fce5)
ping pong               time:   [4.3334 µs 4.5088 µs 4.6953 µs]
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high mild

ping pong pinned        time:   [3.2115 µs 3.2179 µs 3.2259 µs]
Found 13 outliers among 100 measurements (13.00%)
  7 (7.00%) high mild
  6 (6.00%) high severe
```
```
$ cargo bench  --manifest-path=src/Cargo.toml  -p vasi-sync
     Running benches/scchannel.rs (src/target/release/deps/scchannel-ae8dc0b68617fce5)
ping pong               time:   [4.2460 µs 4.4094 µs 4.5736 µs]
                        change: [-4.5442% -0.5164% +3.7840%] (p = 0.81 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low severe
  4 (4.00%) low mild
  1 (1.00%) high mild

ping pong pinned        time:   [3.1927 µs 3.1973 µs 3.2016 µs]
                        change: [-1.0734% -0.8648% -0.6660%] (p = 0.00 < 0.05)
                        Change within noise threshold.
```
```
$ cargo bench  --manifest-path=src/Cargo.toml  -p vasi-sync
     Running benches/scchannel.rs (src/target/release/deps/scchannel-ae8dc0b68617fce5)
ping pong               time:   [4.1223 µs 4.2546 µs 4.3760 µs]
                        change: [-8.6267% -4.6473% -0.6051%] (p = 0.03 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) low severe
  4 (4.00%) low mild

ping pong pinned        time:   [3.1942 µs 3.1974 µs 3.2008 µs]
                        change: [+0.0989% +0.2742% +0.4571%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

```
This mostly works, but:

* there's a loom failure I can't figure out in one test (which
shouldn't actually matter in shadow)
* we can't safely drop the contents of the channel (if any)
after closing it. This might be fixable, but generally there's a
potential race if we try to drop the contents when closing, since it
could be getting read in parallel.

It appears to hurt performance without pinning, and be a no-op with
pinning.

```
$ cargo bench  --manifest-path=src/Cargo.toml  -p vasi-sync
ping pong               time:   [4.8376 µs 4.8934 µs 4.9665 µs]
                        change: [+10.053% +13.523% +17.508%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  8 (8.00%) high severe

ping pong pinned        time:   [3.2015 µs 3.2051 µs 3.2094 µs]
                        change: [+0.5280% +0.6995% +0.8895%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
```
This reverts commit 11050ef.

This doesn't appear to help performance, and appears to still have some
concurrency bugs according to loom and miri.

Keeping in the commit history for posterity.
This reverts commit df29378.

This complicates the design and doesn't appear to help performance.
Keeping in the commit history for posterity.
This reverts three commits:
* Revert "Update SelfContainedChannel benchmark"
* Revert "scchannel: remove Reading state"
* Revert "scchannel: remove Writing state"

It also adds a comment to SelfContainedMutex to consider revisiting.

The reverted change significantly complicates the design, especially if
we wanted to make safe APIs (e.g. by adding checked creation of `!Send`
Reader and Writer ends of the channel).

On my desktop machine it appears to have ~zero performance benefit.

On a larger NRL machine it had a little performance benefit, but mostly
only when pinning is disabled, which we expect to be rare in
large simulations since that already significantly hurts performance.

In case we want to consider revisiting, here is the performance benefit
on one of the large NRL machines:

```
HEAD is now at 2628adb Update SelfContainedChannel benchmark
   Compiling vasi-sync v0.1.0 (/home/rjansen/workspace/temp/shadow/src/lib/vasi-sync)
    Finished bench [optimized] target(s) in 1.58s
     Running unittests src/lib.rs (src/target/release/deps/vasi_sync-fa33eb5cd8eff33e)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running benches/scchannel.rs (src/target/release/deps/scchannel-24425348ef654f1a)
ping pong               time:   [11.567 µs 11.955 µs 12.254 µs]
                        change: [-25.107% -22.604% -19.866%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild

ping pong pinned        time:   [2.1631 µs 2.1662 µs 2.1693 µs]
                        change: [-3.7961% -3.5645% -3.3038%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
```
@sporksmith sporksmith enabled auto-merge March 20, 2023 19:00
@sporksmith sporksmith merged commit 22bb6ac into shadow:main Mar 20, 2023
@sporksmith sporksmith deleted the scchannel-bench branch March 20, 2023 19:16