Add preemption to escape pure-cpu busy-loops #3520
Conversation
Force-pushed from b806900 to b8cd997
Tor benchmark: https://github.com/shadow/benchmark/actions/runs/13932935805. (With this feature disabled, as per the default, to ensure it doesn't introduce unacceptable overhead when not actually used.)
I also started/queued a tor benchmark with preemption enabled and with 3 repetitions. Not that it will affect whether we merge this or anything, I'm just curious what the results look like. If it looks like it will take a long time, then we can cancel it.
Results: https://github.com/shadow/benchmark-results/blob/master/tor/2025-03-18-T20-32-33/plots/run_time.png Looks like a small but noticeable performance hit :(. 7318 s vs 7219 s, so about 1.3%. My prime suspect is the syscall handler, where I'm no longer skipping some code when `model-unblocked-syscall-latency` is disabled. I'll do some more experimentation tomorrow to try to confirm if this is the culprit. (I suppose I should also run multiple trials, but I think this is a big enough regression to probably not just be "unlucky")
stevenengler
left a comment
I didn't finish reviewing it yet (I haven't looked at most of the shim changes), so just leaving the comments I have so far.
I'll do some more experimentation tomorrow to try to confirm if this is the culprit. (I suppose I should also run multiple trials, but I think this is a big enough regression to probably not just be "unlucky")
If the benchmark I started is still running when you want to start another, please cancel mine, I don't want it to block anything.
Looks like a small but noticeable performance hit :(. 7318 s vs 7219 s, so about 1.3%.
My prime suspect is the syscall handler, where I'm no longer skipping some code when `model-unblocked-syscall-latency` is disabled
IMO I don't think the difference is big enough to prevent merging. But if it turns out that the performance regression can be eliminated by skipping that code path, I think it would be nice to have.
Haven't been able to reproduce the performance regression locally, possibly just because my system is too noisy. I made some tweaks to the hot path in the shim and am trying again with 3 trials: https://github.com/shadow/benchmark/actions/runs/13936903437
After that last change and running 3 trials, the performance difference looks to be in the noise. https://github.com/shadow/benchmark-results/blob/master/tor/2025-03-19-T06-28-41/plots/run_time.png Performance with the feature enabled is definitely slower (as expected), though not horrendous https://github.com/shadow/benchmark-results/blob/master/tor/2025-03-18-T22-45-17/plots/run_time.png. I think the only way to improve it substantially would be to use the per-thread
Also worth noting that, as expected, having preemption enabled doesn't appear to have changed any of the results inside the simulation https://github.com/shadow/benchmark-results/tree/master/tor/2025-03-18-T22-45-17. I expect there were probably no preemptions. When I've run the test suite locally (including the tor-minimal test) with the feature globally enabled, the only test besides the new cpu-busy-loop test that triggers it is the mmap test when initializing large chunks of memory, and even then only in debug builds.
stevenengler
left a comment
Looks good, I think this will be a nice feature to have.
(Just pasting the image into a comment so we have it even if we need to prune benchmark results from the repo in the future.)
This new feature is off by default, and enabled via the new experimental `--native-preemption-enabled` option. When enabled, the shim code uses a native timer to interrupt managed code that runs for longer than `--native-preemption-native-interval` (100 ms by default) without returning control to shadow, and moves simulated time forward by `--native-preemption-sim-interval` (10 ms by default). It is intended to be used to escape CPU-only busy loops. Fixes shadow#2066
* Hold the host shmem lock longer to avoid multiple locks and unlocks.
* Replace "fine grained" temporaries copied from shmem with temporaries to shmem itself, for clarity.
* Reset `unapplied_cpu_latency` to zero when rescheduling too, not just when moving time forward. I'm fairly certain this was a (usually mostly-innocuous) bug.
Accessing thread-local-storage is cheap, but not free, and this code is very much on the hot path.
