
Investigate performance issues in phantom #1041

@robgjansen

Description

Through some initial benchmarks, we have discovered the following issues that need further investigation to better understand and improve performance:

(rwails TODO: Update each of these issues with steps to repro.)

  • [Issue Bug in work stealing scheduler policy causing idle workers #1046 and PR Prevent idle workers in the work stealing policy #1047] Increasing the Shadow worker count doesn't always improve performance, and occasionally decreases it, even on embarrassingly parallel workloads. This issue affects both Shadow and Phantom.
  • Phantom is currently experiencing a lot of context switching and CPU migrations on large experiments. Core pinning yields counterintuitive results. For example, pinning 8 workers to 1 core does not yield an 8x slowdown on a 56-core machine.
    • General CPU pinning feature
    • [PR Pinning main #1079] optimize pinning by considering numa nodes and hyperthreads
      • We think this can be done on master using a generally good strategy, and then we get it for free on dev through a master->dev merge.
  • Phantom is scaling sublinearly with the number of phold or sleep hosts.
    • [PR Prefer vfork to fork #1059] The fork syscall appears to take a significant amount of time during experiment construction. This may be improved by having a single thread fork all of the processes (although I'm not totally sure without measuring). We may also evaluate the POSIX posix_spawn function, which may perform better than fork+exec.
    • [PR Ptrace vfork #1128] use vfork instead of fork in ptrace mode as well.
  • [PR Realtime scheduling #1118 and PR Add sched_yield calls in thread_ptrace #1127] Preliminary experiments implementing realtime scheduling show promising results. We should explore this further.
  • [issue Significant time spent in g_get_monotonic_time #1141] We're spending a significant amount of time inside g_get_monotonic_time from our syscall instrumentation. Consider creating a separate "instrumented" build type and disabling these in "release" builds.
  • Further measure and document benefits of disabling CPU vulnerability mitigations, such as spectre. In initial tests this seems to help the dev and master branches equally. (Though it's surprising that the dev branch doesn't see more benefit, due to the extra context switching). Deferred for MVP, since there doesn't seem to be a phantom-specific benefit.
  • [PR Decouple # of worker threads from max concurrency #1238] Further evaluate performance of the work-stealing scheduler. So far it appears to be slower than "host" scheduling, but this may only be because we've been using uniform workloads. We should also measure whether having to detach and reattach ptrace is a significant performance bottleneck, and if so determine whether we can mitigate it. (Idea: instead of preemptively detaching plugins, have a message queue for each worker for "please detach" requests, which can be polled in the threadptrace_resume "event loop"). While we haven't fully investigated where the overheads in the work-stealing scheduler are coming from, the alternate approach of worker-thread-stealing (instead of host stealing) implemented in Decouple # of worker threads from max concurrency #1238 seems to not have this problem (and has other benefits). Cooperative detaching to enable host-stealing could still be a performance win on its own or in combination with worker-stealing, but we're deferring it for now.
  • [issue waitpid is O(# children), resulting in O((# sim processes)^2) in thread_ptrace #1134] In ptrace-mode, avoid quadratic growth with # of hosts due to linear scan of all children and traced threads in wait_consider_task (via waitpid). Short of patching the kernel to avoid linear scans here, I think we'll need to have a non-worker thread fork the plugin threads so that they're not in the worker's "child" list, and have workers ptrace-detach inactive plugins to keep the "traced" list minimal (we already do this in the work-stealing-scheduler to facilitate handoffs).

In the bench directory, running make && shadow -n preload config.xml -w 64 --preload-spin-max=128 yields good performance on my machine. Amusingly, the machine has only 4 cores, but using 64 workers seems to produce the best run times.

    Labels

    Component: Main (composing the core Shadow executable), Priority: Critical (major issue that requires immediate attention), Tag: Performance (related to improving Shadow's run-time), Type: Bug (error or flaw producing unexpected results)
