
Convert Ipc to Rust #2775

Merged
sporksmith merged 2 commits into shadow:main from sporksmith:ipc-rust
Mar 17, 2023

Conversation

@sporksmith
Contributor

@sporksmith sporksmith commented Mar 1, 2023

No description provided.

@github-actions github-actions bot added Component: Libraries Support functions like LD_PRELOAD and logging Component: Main Composing the core Shadow executable labels Mar 1, 2023
@sporksmith
Contributor Author

Started a benchmark since this is very much in the hot path: https://github.com/shadow/benchmark/actions/runs/4306832877

Locally it seems this PR might be a slight perf gain:

main:

```
Benchmark 1: ./setup test --extra tor-minimal
  Time (mean ± σ):     14.517 s ±  0.136 s    [User: 15.578 s, System: 5.084 s]
  Range (min … max):   14.285 s … 14.741 s    10 runs
```

this PR:

```
Benchmark 1: ./setup test --extra tor-minimal
  Time (mean ± σ):     14.320 s ±  0.190 s    [User: 15.400 s, System: 4.854 s]
  Range (min … max):   13.971 s … 14.559 s    10 runs
```

@sporksmith
Contributor Author

A possible reason for a performance gain: the C++ implementation ultimately uses shadow_sem, which uses more-expensive SeqCst ordering on its atomic operations. The Rust implementation avoids that, partly with the help of loom to validate that it's still correct under the less-restrictive ordering requirements.

@github-actions github-actions bot added the Component: Documentation In-repository documentation, under docs/ label Mar 1, 2023
@sporksmith
Contributor Author

Unfortunately it looks like it got slightly slower on the full benchmark: https://github.com/shadow/benchmark-results/blob/master/tor/2023-03-01-T18-40-39/plots/run_time.png

I think the most likely suspect is SelfContainedMutex, and there is some room for optimization there.

  • We could remove the "reading" and "writing" states, which are really only needed to ensure soundness in case of multiple writers or multiple readers (which are already not supported). This is easy to do if we make the APIs unsafe. Otherwise we could add safe APIs for "taking" a non-Send Reader or Writer object from the channel, which would ensure there is only ever a single reader or writer at a time. The latter is a bit more work, though.

  • We might be able to split the 'has_sleeper' and writer_closed bits into separate atomics, similar to what shadow_sem does. This makes the correctness a little harder to reason about, which I think is why we conservatively ended up going with SeqCst operations. I think we could lean on loom to ensure we get it right, though.

For now I'd lean towards filing an issue to explore those optimizations, adding a comment in SelfContainedMutex referencing the issue, and merging the current form.

@sporksmith sporksmith mentioned this pull request Mar 2, 2023
@sporksmith
Contributor Author

I tried a couple optimizations.

  • This PR in its current state had 1.7% overhead.
  • #2777 ("unsafe scmutex") gets rid of the 'writing' and 'reading' states of the atomic, which means fewer atomic operations, but also turns parallel readers and writers into undefined behavior. This reduces the overhead compared to nightly to 1.2%.
    I think we can make this safe again with minimal additional overhead for Rust code by adding checked APIs to create !Sync Reader and Writer objects from the channel instead of calling send and recv directly on the Sync channel object.
  • #2778 ("separate sleeper atomic") extracts the "has_sleeper" bit into a separate atomic, to reduce contention when updating. This reduces the overhead compared to nightly to 1.0%.

I have a couple more things to try:

  • Using #[repr(align)] to get the atomics on different cache lines, so that writing to one doesn't invalidate the other. I'm not sure whether this will end up being a win, since it also hurts cache locality, but it's safe and easy to try.
  • Moving the closed bit to a separate atomic. I have a prototype of this working, but it gets a bit complicated. It'd be nice not to have to do this.

@sporksmith
Contributor Author

After rebasing and adding a large alignment to Ipc (to rule out artifacts due to placement within CPU cache lines), performance seems to match the C implementation: https://github.com/shadow/benchmark-results/blob/master/tor/2023-03-16-T13-26-54/plots/run_time.png

I'm on the fence about whether to keep the large alignment; maybe I'll run one more benchmark without it. In the meantime, I think this can be reviewed and merged with or without it.

@sporksmith
Contributor Author

Also see #2791 - none of the candidate optimizations appear to help in microbenchmarking.

@sporksmith sporksmith changed the title Ipc rust Convert Ipc to Rust Mar 17, 2023
@sporksmith sporksmith requested a review from stevenengler March 17, 2023 14:57
@sporksmith sporksmith marked this pull request as ready for review March 17, 2023 14:57
Notably this removes the user-facing experimental options
`preload_spin_max` and `use_explicit_block_message`, and hard-wires the
current defaults of not spinning and not using an explicit block message
(which would be redundant with spinning disabled).

We've stopped using these options, and I haven't bothered reimplementing
them in the Rust version of `IPCData`.
@sporksmith sporksmith enabled auto-merge March 17, 2023 18:15
@sporksmith sporksmith merged commit a5340f6 into shadow:main Mar 17, 2023
