CPU Pinning#1063

Closed
rwails wants to merge 4 commits into shadow:dev from rwails:cpu-pinning

Conversation

@rwails
Collaborator

@rwails rwails commented Jan 1, 2021

This PR adds an optional flag that enables CPU pinning for Shadow workers and plugins. Workers are pinned uniformly across the logical CPUs available on the machine. Plugin threads are pinned to the same logical CPU as the worker they belong to. This behavior is enabled with the --cpu-pin=1 flag.

Pinning can yield a significant performance improvement. On syscall dense workloads, I am measuring a 20-50% improvement in performance (using --cpu-pin=1 and --preload-spin-max=0).

Ryan Wails added 3 commits December 30, 2020 14:52
thread. Each plugin thread adopts the worker's affinity value during
`thread_new` and `thread_continue` procedure calls.

Cleanup routine in affinity.c needs to be improved.

Need to add a command line switch to selectively enable pinning.
@codecov

codecov bot commented Jan 1, 2021

Codecov Report

Merging #1063 (ec2b418) into dev (a0538d8) will increase coverage by 0.09%.
The diff coverage is 59.70%.


@@            Coverage Diff             @@
##              dev    #1063      +/-   ##
==========================================
+ Coverage   54.96%   55.05%   +0.09%     
==========================================
  Files         127      129       +2     
  Lines       19235    19276      +41     
  Branches     4590     4601      +11     
==========================================
+ Hits        10572    10613      +41     
+ Misses       5896     5885      -11     
- Partials     2767     2778      +11     
Flag Coverage Δ
tests 55.05% <59.70%> (+0.09%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
src/main/host/affinity.c 23.07% <23.07%> (ø)
src/main/core/worker.c 83.41% <75.86%> (+0.15%) ⬆️
src/main/host/affinity.h 100.00% <100.00%> (ø)
src/main/host/thread.c 60.74% <100.00%> (+1.82%) ⬆️
.../main/core/scheduler/scheduler_policy_host_steal.c 81.67% <0.00%> (-0.39%) ⬇️
src/main/core/scheduler/scheduler.c 77.40% <0.00%> (-0.34%) ⬇️
src/main/core/slave.c 68.95% <0.00%> (+0.65%) ⬆️
src/support/logger/rust_bindings/src/lib.rs 32.90% <0.00%> (+1.02%) ⬆️
... and 1 more

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a0538d8...e7ede00.

@rwails rwails linked an issue Jan 1, 2021 that may be closed by this pull request
16 tasks
@rwails rwails added the Priority: High Prioritized ahead of most other issues label Jan 1, 2021
@jtracey
Contributor

jtracey commented Jan 1, 2021

I've not done any experiments with Phantom, but on the current release version of Shadow, I've run experiments showing that pinning to the A sides of CPUs (effectively disabling hyperthreads) is noticeably faster than using all logical CPUs. Specifically, on some large experiments, setting the number of workers to the number of physical cores and pinning to the A sides was significantly faster than pinning to both A and B sides, whether the number of workers matched the physical or the logical core count. (The details will likely vary with the specific workload, and I was pinning the entire process rather than each thread, so migrations were still happening.) Pinning can be accomplished with numactl, but since A and B sides are frequently (though not always) interleaved numerically, a user who doesn't know to do that and runs Shadow with fewer workers than logical CPUs will likely end up with workers bunched up on the same physical cores when distributing via worker_thread_id % _ncpus.

It might also be worth keeping threads on the same socket when multiple are available (and, strictly speaking, binding to the memory of those NUMA nodes too, though I think Linux is smart enough to handle that step on its own?). I don't know if there's a standard way to get all this information via glibc/syscalls, but it's available in /proc/cpuinfo.

Member

@robgjansen robgjansen left a comment


I think we tested the case that @jtracey found works better for classic shadow when we were testing phantom (i.e., effectively disabling hyperthreading); could you summarize our conclusions there?

assert(_ncpus > 0 && pid >= 0);

// We can short-circuit if there's no work to do.
if (!_affinity_enabled || new_cpu_num == AFFINITY_UNINIT || new_cpu_num == old_cpu_num) {
Member


If new_cpu_num == AFFINITY_UNINIT then don't we want to "unpin" the process to allow the scheduler to run it anywhere?

Collaborator Author


I'm using AFFINITY_UNINIT to denote the case when affinity is unknown or has never been set (as opposed to the case when affinity is set to all CPUs). I think it'd be better to introduce a new value to unpin/"unset" affinity if we want that behavior.

}

if (!set_affinity_suceeded) {
warning("Could not set CPU affinity for PID %d to %d", (int)pid, new_cpu_num);
Member


If we use the cpu-pin option and we can't pin, should that be an error or a warning? IIRC, Jim thought error makes more sense, rather than continuing to run the simulation without pinning and potentially tricking the user that pinning is active when it isn't. I'm thinking that logging a warning makes sense but then refusing to run the experiment. That's more work to code because you'll have to propagate the error back, but does it give us the behavior we want?

Collaborator Author


I promoted the log message to critical so that the user has a higher likelihood of seeing it.

Crashing with an error makes me nervous because there are potentially thousands of affinity changes that could occur over the course of an experiment (during work stealing, for example). If a single one of those affinity changes doesn't succeed, I'm not sure the experiment should crash since it could otherwise complete successfully with negligible performance impact.

Regardless, if we want to make this a crash instead of a warning, I think it should be done at the call site and not inside this module.

Contributor


Crashing with an error makes me nervous because there are potentially thousands of affinity changes that could occur over the course of an experiment

Haven't looked at the code in this PR yet but when I made that comment I believe the proposed pinning was all done during startup. If something goes wrong during startup I think it makes sense to crash. I agree we want to avoid crashing in the middle of a simulation if we can.

#include <sched.h>
#include <sys/types.h>

enum { AFFINITY_UNINIT = -1 };
Member


[optional nit] A brief note describing this would be nice.

Collaborator Author


OK, I added a comment.

@rwails
Collaborator Author

rwails commented Jan 6, 2021

I think we tested the case that @jtracey found works better for classic shadow when we were testing phantom (i.e., effectively disabling hyperthreading); could you summarize our conclusions there?

Yeah, so I believe most of the performance improvement from pinning comes from reducing the cost of context switching (this only applies in the process-oriented dev branch). By pinning workers and plugins to the same core, we achieve very low-cost context switching.

I agree with Justin that there's more benefit to be gained by intelligently pinning workers to particular CPUs, for example by taking NUMA nodes into account. Some experiments we've run so far show that pinning to hyperthread cores pessimizes performance, but on the machines I'm running experiments on, the HT cores are all high-numbered, so setting -w to num_cores / 2 avoids pinning to shared cores. I'd like to leave intelligent pinning to a different PR, since it could get complicated if we're parsing /proc/cpuinfo or pulling in a dependency to examine the platform.

Thanks for the feedback @jtracey !

@rwails rwails requested a review from robgjansen January 6, 2021 01:59
}

cpu_set_t* cpu_set = NULL;
bool set_affinity_suceeded = false;


typo, 'suceeded' (I noticed the typo while reading the discussion :)

@robgjansen
Member

We decided to close this PR and open 2 new ones in order to make sure we have the pinning feature in both master (classic shadow) and dev (phantom) branches. The first PR will add the feature to master, then we'll merge master to dev, and the second PR will make the changes needed to do the pinning in dev.

@sporksmith
Contributor

sporksmith commented Jan 11, 2021

We decided to close this PR and open 2 new ones in order to make sure we have the pinning feature in both master (classic shadow) and dev (phantom) branches. The first PR will add the feature to master, then we'll merge master to dev, and the second PR will make the changes needed to do the pinning in dev.

Ok, removing myself as a reviewer on this one, then :)

@sporksmith sporksmith removed their request for review January 11, 2021 21:56
@rwails rwails closed this Feb 3, 2021
@robgjansen robgjansen added the Tag: Performance Related to improving shadow's run-time label Mar 3, 2021

Labels

Priority: High Prioritized ahead of most other issues Tag: Performance Related to improving shadow's run-time


Development

Successfully merging this pull request may close these issues.

Investigate performance issues in phantom

5 participants