Support pinning when running parallel simulations on same hardware #1565
Closed
Labels
Type: Enhancement (New functionality or improved design)
Description
Currently Shadow's CPU-pinning logic assumes it has the entire machine to itself. In particular, if two instances of Shadow are running concurrently with pinning, they'll both pin to the same set of CPUs. It'd be good if we could avoid this, particularly when running on shared hardware where other users might be running simulations at the same time.
Some options:
- Out of band global coordination: In Shadow, use sched_getaffinity before doing any pinning, and only use the set of CPUs that were initially assigned when pinning. A user could then use a tool like taskset(1) to assign each Shadow simulation a disjoint set of CPUs to work with. This option is pretty easy to implement in Shadow, but puts the burden of global coordination on the user. EDIT: Now implemented in "Respect shadow's initial CPU affinity" #1575.
- Shadow global coordination: Each instance of Shadow could have a "pid" file that records which CPUs it's using. When Shadow starts up it would check for existing pid files, validate that those pids are still running, read them to find which CPUs are in use, choose its own set of CPUs disjoint from those in the current files, and write its own pid file. Some care would be needed to avoid race conditions (maybe just a global lock file). This removes the burden of global coordination from the user, but adds substantial complexity to Shadow, including global mutable state.
- Flexible pinning: Let the Linux scheduler choose the initial CPU to run each worker thread on, and then pin the worker and its managed thread to that CPU. We might also want to periodically unpin to give the Linux scheduler a chance to choose a different CPU. This strategy would let the scheduler handle picking idle CPUs. Potential downsides though are more CPU migrations (if/when we unpin to allow the scheduler to reassign), and giving up control over the initial assignment (which currently tries to maximize cache affinity and avoid using multiple logical CPUs on the same core).
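To make the first option concrete, here is a minimal sketch in Python of reading the initial affinity mask and restricting pinning to it. It is not Shadow's actual implementation: the helper names `allowed_cpus` and `assign_workers` are hypothetical, and the round-robin assignment is an assumption chosen for brevity (it ignores the cache-affinity and same-core concerns mentioned above). `os.sched_getaffinity` is Linux-only.

```python
import os


def allowed_cpus(pid=0):
    """Return the sorted CPUs the given process may run on (0 = this process).

    Calling this before any pinning captures the mask inherited from the
    parent, e.g. one set by taskset(1).
    """
    return sorted(os.sched_getaffinity(pid))


def assign_workers(cpus, n_workers):
    """Map each worker index to a CPU, round-robin over the allowed CPUs.

    Hypothetical policy for illustration only: it never selects a CPU
    outside `cpus`, but makes no attempt to optimize cache locality.
    """
    if not cpus:
        raise ValueError("no CPUs available for pinning")
    return {worker: cpus[worker % len(cpus)] for worker in range(n_workers)}


if __name__ == "__main__":
    initial = allowed_cpus()  # read the mask before pinning anything
    for worker, cpu in assign_workers(initial, 4).items():
        print(f"worker {worker} -> cpu {cpu}")
```

With this approach, a user who launches one simulation under `taskset -c 0-3` and another under `taskset -c 4-7` would get two disjoint pinning sets, since each process only ever sees the CPUs in its own inherited mask.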