
Investigate performance issues in phantom #1041

@robgjansen

Description

Through some initial benchmarks, we have discovered the following issues that need further investigation to better understand and improve performance:

(rwails TODO: Update each of these issues with steps to repro.)

  • [Issue Bug in work stealing scheduler policy causing idle workers #1046 and PR Prevent idle workers in the work stealing policy #1047] Increasing the Shadow worker count doesn't always improve performance, and occasionally decreases it, even on embarrassingly parallel workloads. This issue affects both Shadow and Phantom.
  • Phantom is currently experiencing a lot of context switching and CPU migrations on large experiments. Core pinning yields counterintuitive results. For example, pinning 8 workers to 1 core does not yield an 8x slowdown on a 56-core machine.
    • General CPU pinning feature
    • [PR Pinning main #1079] optimize pinning by considering numa nodes and hyperthreads
      • We think this can be done on master using a generally good strategy, and then we get it for free on dev through a master->dev merge.
  • Phantom is scaling sublinearly with the number of phold or sleep hosts.
    • [PR Prefer vfork to fork #1059] The fork syscall appears to take a significant amount of time during experiment construction. This may be improved by having a single thread fork all of the processes (although I'm not totally sure without measuring). We may also evaluate the POSIX posix_spawn function, which may perform better than fork+exec.
    • [PR Ptrace vfork #1128] use vfork instead of fork in ptrace mode as well.
  • [PR Realtime scheduling #1118 and PR Add sched_yield calls in thread_ptrace #1127] Preliminary experiments implementing realtime scheduling show promising results. We should explore this further.
  • [issue Significant time spent in g_get_monotonic_time #1141] We're spending a significant amount of time inside g_get_monotonic_time from our syscall instrumentation. Consider creating a separate "instrumented" build type and disabling these in "release" builds.
  • Further measure and document benefits of disabling CPU vulnerability mitigations, such as spectre. In initial tests this seems to help the dev and master branches equally. (Though it's surprising that the dev branch doesn't see more benefit, due to the extra context switching). Deferred for MVP, since there doesn't seem to be a phantom-specific benefit.
  • [PR Decouple # of worker threads from max concurrency #1238] Further evaluate performance of the work-stealing scheduler. So far it appears to be slower than "host" scheduling, but this may only be because we've been using uniform workloads. We should also measure whether having to detach and reattach ptrace is a significant performance bottleneck, and if so determine whether we can mitigate it. (Idea: instead of preemptively detaching plugins, have a message queue for each worker for "please detach" requests, which can be polled in the threadptrace_resume "event loop"). While we haven't fully investigated where the overheads in the work-stealing scheduler are coming from, the alternate approach of worker-thread-stealing (instead of host stealing) implemented in Decouple # of worker threads from max concurrency #1238 seems to not have this problem (and has other benefits). Cooperative detaching to enable host-stealing could still be a performance win on its own or in combination with worker-stealing, but we're deferring it for now.
  • [issue waitpid is O(# children), resulting in O((# sim processes)^2) in thread_ptrace #1134] In ptrace-mode, avoid quadratic growth with # of hosts due to linear scan of all children and traced threads in wait_consider_task (via waitpid). Short of patching the kernel to avoid linear scans here, I think we'll need to have a non-worker thread fork the plugin threads so that they're not in the worker's "child" list, and have workers ptrace-detach inactive plugins to keep the "traced" list minimal (we already do this in the work-stealing-scheduler to facilitate handoffs).

In the bench directory, running make && shadow -n preload config.xml -w 64 --preload-spin-max=128 yields good performance on my machine. Amusingly, the machine has only 4 cores, but using 64 workers seems to produce the best run times.

    Labels

    Component: Main (composing the core Shadow executable), Priority: Critical (major issue that requires immediate attention), Tag: Performance (related to improving Shadow's run-time), Type: Bug (error or flaw producing unexpected results)
