
Decouple # of worker threads from max concurrency #1238

Merged
sporksmith merged 1 commit into shadow:main from sporksmith:max-concurrency-main2
Apr 2, 2021

Conversation

@sporksmith
Contributor

@sporksmith sporksmith commented Mar 30, 2021

  1. Adds a WorkerPool with a callback-style interface instead of having a
    series of coordinated barriers.

  2. Adds a --max-concurrency option that allows us to restrict the number
    of workers that can run concurrently. The WorkerPool also implements
    "worker stealing" to try to keep all --max-concurrency CPUs utilized.
    E.g., setting the # of workers to the # of hosts, max concurrency to
    the number of CPUs, and using the "host" scheduler is somewhat analogous
    to using the work-stealing scheduler.

These changes are intended primarily for the dev/phantom branch, since
ptrace has better performance with a smaller number of tracees per
thread. We're including it here in main for apples-to-apples performance
comparison between the shadow 'classic' and 'phantom' architectures.

Originally I tried to keep 1. as a separate commit from 2., but after a
series of fixup commits 1. and 2. got too tangled together.

@sporksmith sporksmith force-pushed the max-concurrency-main2 branch from 26c3efb to c3892ca Compare March 30, 2021 23:30
@sporksmith sporksmith changed the title Max concurrency main2 Decouple # of worker threads from max concurrency Mar 30, 2021
@sporksmith sporksmith force-pushed the max-concurrency-main2 branch from c3892ca to 8373a12 Compare March 30, 2021 23:42
@sporksmith sporksmith requested a review from robgjansen March 30, 2021 23:43
@sporksmith
Contributor Author

I haven't debugged the deadlock at end of simulation yet, but based on the stacks you sent it looks like a bug in process or ptrace_thread. Did you ever get to the end of a simulation without (the earlier version of) this PR?

As I was cleaning up comments, it occurred to me that something like "vCpu" would be a bit less unwieldy than "concurrency index". Otoh one might take the name to mean something like a "shadow CPU". Wdyt?

As I was looking a bit more at the "gather info" step where we find the time for the next round, I realized it's almost entirely serialized on the global mutex anyway. We could probably just change the scheduler-policy-callback that currently gives the min time for a single worker to instead iterate over its per-worker data structures and return the global minimum time, and just call that directly from the "main" thread in between rounds. Left it alone for now, though.

@sporksmith sporksmith marked this pull request as ready for review March 30, 2021 23:49
@sporksmith
Contributor Author

Did some ad-hoc testing and ran into a problem with pthread_setaffinity_np I couldn't figure out. (It kept returning EFAULT, but I verified several ways that the address being passed was valid).

Pushed a fixup to remove the new pthread-based affinity API I'd added and instead use the existing api to set it via tid.

In some ad-hoc testing on a balanced workload, it seems to be slightly faster with the new scheduler in general.

On current main, with work-stealing:

```
$ time ~/.shadow/bin/shadow --pin-cpus -w 4 ./config.xml > shadow.log
** Starting Shadow v1.13.2-692-g12117c17 2021-03-30--09:30:08 with GLib v2.56.4 and IGraph v0.7.1
** Stopping Shadow, returning code 0 (success)

real	0m13.400s
user	0m40.652s
sys	0m11.434s
```

On current main, with host scheduler:

```
$ time ~/.shadow/bin/shadow --pin-cpus -w 4 -t host ./config.xml > shadow.log
** Starting Shadow v1.13.2-692-g12117c17 2021-03-30--09:30:08 with GLib v2.56.4 and IGraph v0.7.1
** Stopping Shadow, returning code 0 (success)

real	0m12.943s
user	0m37.753s
sys	0m11.556s
```

This PR with work-stealing:

```
$ time ~/.shadow/bin/shadow --pin-cpus -w 4 ./config.xml > shadow.log
** Starting Shadow v1.13.2-694-gb3b86509 2021-03-31--10:47:44 with GLib v2.56.4 and IGraph v0.7.1
** Stopping Shadow, returning code 0 (success)

real	0m12.943s
user	0m39.014s
sys	0m11.238s
```

This PR with host scheduler:

```
$ time ~/.shadow/bin/shadow --pin-cpus -w 4 -t host ./config.xml > shadow.log
** Starting Shadow v1.13.2-694-gb3b86509 2021-03-31--10:47:44 with GLib v2.56.4 and IGraph v0.7.1
** Stopping Shadow, returning code 0 (success)

real	0m12.519s
user	0m36.446s
sys	0m11.343s
```

This PR with host scheduler and 1 host/worker:

```
$ time ~/.shadow/bin/shadow --pin-cpus -w 256 --max-concurrency=4 -t host ./config.xml > shadow.log
** Starting Shadow v1.13.2-694-gb3b86509 2021-03-31--10:47:44 with GLib v2.56.4 and IGraph v0.7.1
** Stopping Shadow, returning code 0 (success)

real	0m13.103s
user	0m37.678s
sys	0m13.259s
```

@sporksmith
Contributor Author

Reproduced the tor deadlock on dev and am debugging now. (i.e. confirmed unrelated to this PR)

@sporksmith
Contributor Author

> Reproduced the tor deadlock on dev and am debugging now. (i.e. confirmed unrelated to this PR)

Oops, built but didn't install. So far haven't reproduced the deadlock on dev. Still digging...

Member

@robgjansen robgjansen left a comment


Sorry, this is difficult to review and took me a while. I still don't think I understand everything, especially with respect to how CPU affinity is being managed across workers (pthreads) and concurrency queues (logical processors). I'm sending this feedback now so I don't hold you up any longer, and then I can review again.

> Did you ever get to the end of a simulation without (the earlier version of) this PR?

Yes, my experiments complete using phantom without the earlier max-concurrency PR.

> As I was cleaning up comments, it occurred to me that something like "vCpu" would be a bit less unwieldy than "concurrency index". Otoh one might take the name to mean something like a "shadow CPU". Wdyt?

Yes, I think the concurrency queue thing might be part of what's holding me up. I can see how vCpu might be misinterpreted too. I suggest we use logical processor to refer to these "concurrency queues". A processor isn't quite the same as a CPU, but the name also implies that it could run on different physical cores, which is how these work. "Logical processor" could be shortened to log_proc or lp, and then each worker has an assigned_lp_id. This change would require a bit of renaming, but I think it's worth it.

> the "gather info" step

I think we should just implement the suggestion you made (to do it during the round), which I outlined more fully in a comment below, and then remove the "gather info" step. It seems almost trivial to implement.

Comment on lines +39 to +53
```c
// Create a workerpool with `nThreads` threads, allowing up to `nConcurrent` to
// run at a time.
WorkerPool* workerpool_new(Manager* manager, Scheduler* scheduler, int nThreads,
                           int nConcurrent);

// Begin executing taskFn(data) on each worker thread in the pool.
void workerpool_startTaskFn(WorkerPool* pool, WorkerPoolTaskFn taskFn,
                            void* data);
// Await completion of a taskFn on every thread in the pool.
void workerpool_awaitTaskFn(WorkerPool* pool);
int workerpool_getNWorkers(WorkerPool* pool);
// Signal worker threads to exit and wait for them to do so.
void workerpool_joinAll(WorkerPool* pool);
void workerpool_free(WorkerPool* pool);
pthread_t workerpool_getThread(WorkerPool* pool, int threadId);
```
Member


Is the worker pool independent enough that it deserves its own module (file), which would then operate on opaque Worker* objects?

Not required. Maybe we should save the effort now and instead spend it on a rust version of a new scheduler.

Contributor Author


I ended up creating a LogicalProcessor object and breaking that into its own module. IMO the WorkerPool is too coupled with Worker to be worth breaking out.


```c
pthread_t nextThread = workerpool_getThread(
    scheduler->workerPool,
    workeri++ % workerpool_getNWorkers(scheduler->workerPool));
```
Member


Can we re-use the nWorkers variable from line 349 above?

Contributor Author


Done

Comment on lines +455 to +466
```c
// XXX Can we merge this into the execution task? e.g. track event times as
// we're scheduling them?
workerpool_startTaskFn(scheduler->workerPool,
                       _scheduler_collectInfoWorkerTask, scheduler);
```
Member


I think you're right here. We need two things for this to work:

  1. During every round, each worker keeps track of the minimum event time that it ever pushes to a different worker (i.e., a host that is not owned by the pushing worker). By event I mean inter-worker events (i.e., Packets); we can ignore same-worker events (i.e., Tasks).
  2. At the end of every round, each worker computes the time of the event at the head of its min-heap event queue.

At the end of a round, each worker reports both 1. and 2. to the manager and then goes to sleep. The manager tracks the global minimum across all such reports from all workers. The global minimum is then used as the next time in place of the call to `scheduler->policy->getNextTime(scheduler->policy);` on line 256 above. Then, as long as we move `shadow_logger_flushRecords(shadow_logger_getDefault(), pthread_self());` to the end of each worker's main round event loop (which I think is `_scheduler_runEventsWorkerTaskFn`), we could remove this "in between rounds" task.

Might be worth a separate PR.

Contributor Author


Agreed that we should do it, and tempting to fold it into this PR, but given this is already a pretty big change I think it'd be better to break into its own.

Comment on lines +57 to +58
```c
/* Index into concurrency arrays in WorkerPool. */
int concurrencyIdx;
```
Member


I agree that concurrency index (and concurrency queues) are a bit non-intuitive. As mentioned in my overall comments, I prefer logical processor.

Contributor Author


Done

Comment on lines +115 to +116
```c
/* Mutex for each concurrency index, parallel to the following arrays */
GMutex* qMutexes;
```
Member


I was a little confused about what each of these mutexes are covering. I think qMutexes[i] provides locked access to cpuQs[i], doneCpuQs[i], and idleTimers[i]. If that's correct, I think adding a comment to that effect would help.

Contributor Author


Hopefully clearer now

Comment on lines +117 to +119
```c
/* cpuId for each concurrency index. Read-only once workers have started, so
 * can access without mutex. */
int* cpuIds;
```
Member


Since the qMutexes don't apply to the cpuIds, I would move this definition above that of qMutexes.

Contributor Author


done

```c
Worker* nextWorker = NULL;

for (int i = 0; i < pool->nConcurrent; ++i) {
    int idx = (concurrencyIdx + i) % pool->nConcurrent;
```
Member


I would appreciate a comment above this for loop to explain the algorithm here. Is it just a linear scan without considering the next closest processor according to the affinity distance function?

Is it a TODO item to choose the next worker based on the logical processor affinity and affinity distance function?

Contributor Author


Done. It prefers a worker that last ran on the current logical processor. After that the order is somewhat arbitrary.

Comment on lines +494 to +495
```c
Worker* nextWorker = _workerpool_getWorkerForConcurrencyIdx(
    workerPool, worker->concurrencyIdx);
```
Member


Does this return the next worker for this idx, or the next best worker for this idx, which may be assigned to a different logical processor? (I think it's the latter.)

Contributor Author


The former. Renamed to `_workerpool_getNextWorkerForConcurrencyIdx`.

Comment on lines +347 to +360
```c
int oldCpuId = worker->concurrencyIdx >= 0
                   ? workerPool->cpuIds[worker->concurrencyIdx]
                   : AFFINITY_UNINIT;
worker->concurrencyIdx = concurrencyIdx;
int newCpuId = workerPool->cpuIds[concurrencyIdx];

if (workerPool->nWorkers > 0) {
    utility_assert(worker->nativeThreadID > 0);
    affinity_setProcessAffinity(worker->nativeThreadID, newCpuId, oldCpuId);
} else {
    // No pthreads exist; set via pid instead.
    affinity_setThisProcessAffinity(/*new_cpu_num=*/newCpuId,
                                    /*old_cpu_num=*/oldCpuId);
}
```
Member


This is the part that I'm having a hard time understanding. The worker (pthread) is setting its affinity to match that of the logical processor that it got assigned to?

Contributor Author


Right. Added a comment.

@robgjansen robgjansen added Component: Main Composing the core Shadow executable Tag: Performance Related to improving shadow's run-time Type: Enhancement New functionality or improved design labels Mar 31, 2021
@sporksmith sporksmith requested review from robgjansen and removed request for robgjansen April 1, 2021 23:30
@sporksmith
Contributor Author

Oops - need to merge and resolve conflicts. Marked unready again :)

@sporksmith sporksmith requested a review from robgjansen April 2, 2021 00:06
@sporksmith
Contributor Author

Now ready again

Member

@robgjansen robgjansen left a comment


Looks great! The LogicalProcessor module is a big improvement. Thanks for your hard work on the design and on cleaning this up!

```c
LogicalProcessor* lp = _idx(lps, lpi);
for (int i = 0; i < lps_n(lps); ++i) {
    // Start with workers that last ran on `lpi`; if none are available
    // steal from another in round-robin order.
```
Member


Just a note to remind myself later: I think this is where we could call some form of _cpuinfo_compare from affinity.c. I'll check with Ryan later (and that could be done in a follow-up PR).

```c
// work stealing scheduler threads spin-wait for each-other to
// finish.
error("Host stealing scheduler is incompatible with "
      "--max-concurrency > -w");
```
Member


Nit: `-w` → `--workers`
