
Decouple # of worker threads from max concurrency #1238

Merged
sporksmith merged 1 commit into shadow:main from sporksmith:max-concurrency-main2
Apr 2, 2021

Conversation

@sporksmith
Contributor

@sporksmith sporksmith commented Mar 30, 2021

  1. Adds a WorkerPool with a callback-style interface instead of having a
    series of coordinated barriers.

  2. Adds a --max-concurrency option that allows us to restrict the number
    of workers that can run concurrently. The WorkerPool also implements
    "worker stealing" to try to keep all --max-concurrency CPUs utilized.
    E.g., setting the # of workers to the # of hosts, max concurrency to
    the number of CPUs, and using the "host" scheduler is somewhat analogous
    to using the work-stealing scheduler.

These changes are intended primarily for the dev/phantom branch, since
ptrace has better performance with a smaller number of tracees per
thread. We're including it here in main for apples-to-apples performance
comparison between the shadow 'classic' and 'phantom' architectures.

Originally I tried to keep 1. as a separate commit from 2., but after a
series of fixup commits 1. and 2. got too tangled together.

@sporksmith sporksmith force-pushed the max-concurrency-main2 branch from 26c3efb to c3892ca Compare March 30, 2021 23:30
@sporksmith sporksmith changed the title Max concurrency main2 Decouple # of worker threads from max concurrency Mar 30, 2021
@sporksmith sporksmith force-pushed the max-concurrency-main2 branch from c3892ca to 8373a12 Compare March 30, 2021 23:42
@sporksmith sporksmith requested a review from robgjansen March 30, 2021 23:43
@sporksmith
Contributor Author

I haven't debugged the deadlock at end of simulation yet, but based on the stacks you sent it looks like a bug in process or ptrace_thread. Did you ever get to the end of a simulation without (the earlier version of) this PR?

As I was cleaning up comments, it occurred to me that something like "vCpu" would be a bit less unwieldy than "concurrency index". Otoh one might take the name to mean something like a "shadow CPU". Wdyt?

As I was looking a bit more at the "gather info" step where we find the time for the next round, I realized it's almost entirely serialized on the global mutex anyway. We could probably just change the scheduler-policy-callback that currently gives the min time for a single worker to instead iterate over its per-worker data structures and return the global minimum time, and just call that directly from the "main" thread in between rounds. Left it alone for now, though.

@sporksmith sporksmith marked this pull request as ready for review March 30, 2021 23:49
@sporksmith
Contributor Author

Did some ad-hoc testing and ran into a problem with pthread_setaffinity_np I couldn't figure out. (It kept returning EFAULT, but I verified several ways that the address being passed was valid).

Pushed a fixup to remove the new pthread-based affinity API I'd added and instead use the existing api to set it via tid.

In some ad-hoc testing on a balanced workload, it seems to be slightly faster with the new scheduler in general.

On current main, with work-stealing:

```
$ time ~/.shadow/bin/shadow --pin-cpus -w 4 ./config.xml > shadow.log
** Starting Shadow v1.13.2-692-g12117c17 2021-03-30--09:30:08 with GLib v2.56.4 and IGraph v0.7.1
** Stopping Shadow, returning code 0 (success)

real	0m13.400s
user	0m40.652s
sys	0m11.434s
```

On current main, with host scheduler:

```
$ time ~/.shadow/bin/shadow --pin-cpus -w 4 -t host ./config.xml > shadow.log
** Starting Shadow v1.13.2-692-g12117c17 2021-03-30--09:30:08 with GLib v2.56.4 and IGraph v0.7.1
** Stopping Shadow, returning code 0 (success)

real	0m12.943s
user	0m37.753s
sys	0m11.556s
```

This PR with work-stealing:

```
$ time ~/.shadow/bin/shadow --pin-cpus -w 4 ./config.xml > shadow.log
** Starting Shadow v1.13.2-694-gb3b86509 2021-03-31--10:47:44 with GLib v2.56.4 and IGraph v0.7.1
** Stopping Shadow, returning code 0 (success)

real	0m12.943s
user	0m39.014s
sys	0m11.238s
```

This PR with host scheduler:

```
$ time ~/.shadow/bin/shadow --pin-cpus -w 4 -t host ./config.xml > shadow.log
** Starting Shadow v1.13.2-694-gb3b86509 2021-03-31--10:47:44 with GLib v2.56.4 and IGraph v0.7.1
** Stopping Shadow, returning code 0 (success)

real	0m12.519s
user	0m36.446s
sys	0m11.343s
```

This PR with host scheduler and 1 host/worker:

```
$ time ~/.shadow/bin/shadow --pin-cpus -w 256 --max-concurrency=4 -t host ./config.xml > shadow.log
** Starting Shadow v1.13.2-694-gb3b86509 2021-03-31--10:47:44 with GLib v2.56.4 and IGraph v0.7.1
** Stopping Shadow, returning code 0 (success)

real	0m13.103s
user	0m37.678s
sys	0m13.259s
```

@sporksmith
Contributor Author

Reproduced the tor deadlock on dev and am debugging now. (i.e. confirmed unrelated to this PR)

@sporksmith
Contributor Author

> Reproduced the tor deadlock on dev and am debugging now. (i.e. confirmed unrelated to this PR)

Oops, built but didn't install. So far haven't reproduced the deadlock on dev. Still digging...

Member

@robgjansen robgjansen left a comment


Sorry, this is difficult to review and took me a while. I still don't think I understand everything, especially with respect to how CPU affinity is being managed across workers (pthreads) and concurrency queues (logical processors). I'm sending this feedback now so I don't hold you up any longer, and then I can review again.

> Did you ever get to the end of a simulation without (the earlier version of) this PR?

Yes, my experiments complete using phantom without the earlier max-concurrency PR.

> As I was cleaning up comments, it occurred to me that something like "vCpu" would be a bit less unwieldy than "concurrency index". Otoh one might take the name to mean something like a "shadow CPU". Wdyt?

Yes, I think the concurrency queue thing might be part of what's holding me up. I can see how vCpu might be misinterpreted too. I suggest we use logical processor to refer to these "concurrency queues". A processor isn't quite the same as a CPU, but the name also implies that it could run on different physical cores, which is how these work. "Logical processor" could be shortened to log_proc or lp, and then each worker has an assigned_lp_id. This change would require a bit of renaming, but I think it's worth it.

> the "gather info" step

I think we should just implement the suggestion you made (to do it during the round), which I outlined more fully in a comment below, and then remove the "gather info" step. It seems almost trivial to implement.

Comment on lines +39 to +53
```c
// Create a workerpool with `nThreads` threads, allowing up to `nConcurrent` to
// run at a time.
WorkerPool* workerpool_new(Manager* manager, Scheduler* scheduler, int nThreads,
                           int nConcurrent);

// Begin executing taskFn(data) on each worker thread in the pool.
void workerpool_startTaskFn(WorkerPool* pool, WorkerPoolTaskFn taskFn,
                            void* data);
// Await completion of a taskFn on every thread in the pool.
void workerpool_awaitTaskFn(WorkerPool* pool);
int workerpool_getNWorkers(WorkerPool* pool);
// Signal worker threads to exit and wait for them to do so.
void workerpool_joinAll(WorkerPool* pool);
void workerpool_free(WorkerPool* pool);
pthread_t workerpool_getThread(WorkerPool* pool, int threadId);
```
Member


Is the worker pool independent enough that it deserves its own module (file), which would then operate on opaque Worker* objects?

Not required. Maybe we should save the effort now and instead spend it on a rust version of a new scheduler.

Contributor Author


I ended up creating a LogicalProcessor object and breaking that into its own module. IMO the WorkerPool is too coupled with Worker to be worth breaking out.


```c
pthread_t nextThread = workerpool_getThread(
    scheduler->workerPool,
    workeri++ % workerpool_getNWorkers(scheduler->workerPool));
```
Member


Can we re-use the nWorkers variable from line 349 above?

Contributor Author


Done

Comment on lines +455 to +466
```c
// XXX Can we merge this into the execution task? e.g. track event times as
// we're scheduling them?
workerpool_startTaskFn(scheduler->workerPool,
                       _scheduler_collectInfoWorkerTask, scheduler);
```
Member


I think you're right here. We need two things for this to work:

  1. During every round, each worker keeps track of the minimum event time that it ever pushes to a different worker (i.e., a host that is not owned by the pushing worker). By event I mean inter-worker events (i.e., Packets); we can ignore same-worker events (i.e., Tasks).
  2. At the end of every round, each worker computes the time of the event at the head of its min-heap event queue.

At the end of a round, each worker reports both 1. and 2. to the manager and then goes to sleep. The manager tracks the global minimum across all such reports from all workers. The global minimum is then used as the next time in place of the call to `scheduler->policy->getNextTime(scheduler->policy);` on line 256 above. Then, as long as we move `shadow_logger_flushRecords(shadow_logger_getDefault(), pthread_self());` to the end of each worker's main round event loop (which I think is `_scheduler_runEventsWorkerTaskFn`), we could remove this "in between rounds" task.

Might be worth a separate PR.

Contributor Author


Agreed that we should do it, and tempting to fold it into this PR, but given this is already a pretty big change I think it'd be better to break into its own.

Comment on lines +57 to +58
```c
/* Index into concurrency arrays in WorkerPool. */
int concurrencyIdx;
```
Member


I agree that concurrency index (and concurrency queues) are a bit non-intuitive. As mentioned in my overall comments, I prefer logical processor.

Contributor Author


Done

Comment on lines +115 to +116
```c
/* Mutex for each concurrency index, parallel to the following arrays */
GMutex* qMutexes;
```
Member


I was a little confused about what each of these mutexes are covering. I think qMutexes[i] provides locked access to cpuQs[i], doneCpuQs[i], and idleTimers[i]. If that's correct, I think adding a comment to that effect would help.

Contributor Author


Hopefully clearer now

Comment on lines +117 to +119
```c
/* cpuId for each concurrency index. Read-only once workers have started, so
 * can access without mutex. */
int* cpuIds;
```
Member


Since the qMutexes don't apply to the cpuIds, I would move this definition above that of qMutexes.

Contributor Author


done

```c
Worker* nextWorker = NULL;

for (int i = 0; i < pool->nConcurrent; ++i) {
    int idx = (concurrencyIdx + i) % pool->nConcurrent;
```
Member


I would appreciate a comment above this for loop to explain the algorithm here. Is it just a linear scan without considering the next closest processor according to the affinity distance function?

Is it a TODO item to choose the next worker based on the logical processor affinity and affinity distance function?

Contributor Author


Done. It prefers a worker that last ran on the current logical processor. After that the order is somewhat arbitrary.

Comment on lines +494 to +495
```c
Worker* nextWorker = _workerpool_getWorkerForConcurrencyIdx(
    workerPool, worker->concurrencyIdx);
```
Member


Does this return the next worker for this idx, or the next best worker for this idx, which may be assigned to a different logical processor? (I think it's the latter.)

Contributor Author


The former. Renamed to `_workerpool_getNextWorkerForConcurrencyIdx`.

Comment on lines +347 to +360
```c
int oldCpuId = worker->concurrencyIdx >= 0
                   ? workerPool->cpuIds[worker->concurrencyIdx]
                   : AFFINITY_UNINIT;
worker->concurrencyIdx = concurrencyIdx;
int newCpuId = workerPool->cpuIds[concurrencyIdx];

if (workerPool->nWorkers > 0) {
    utility_assert(worker->nativeThreadID > 0);
    affinity_setProcessAffinity(worker->nativeThreadID, newCpuId, oldCpuId);
} else {
    // No pthreads exist; set via pid instead.
    affinity_setThisProcessAffinity(/*new_cpu_num=*/newCpuId,
                                    /*old_cpu_num=*/oldCpuId);
}
```
Member


This is the part that I'm having a hard time understanding. The worker (pthread) is setting its affinity to match that of the logical processor that it got assigned to?

Contributor Author


Right. Added a comment.

@robgjansen robgjansen added Component: Main Composing the core Shadow executable Tag: Performance Related to improving shadow's run-time Type: Enhancement New functionality or improved design labels Mar 31, 2021
@sporksmith sporksmith requested review from robgjansen and removed request for robgjansen April 1, 2021 23:30
@sporksmith
Contributor Author

Oops - need to merge and resolve conflicts. Marked unready again :)

@sporksmith sporksmith requested a review from robgjansen April 2, 2021 00:06
@sporksmith
Contributor Author

Now ready again

Member

@robgjansen robgjansen left a comment


Looks great! The LogicalProcessor module is a big improvement. Thanks for your hard work on the design and on cleaning this up!

```c
LogicalProcessor* lp = _idx(lps, lpi);
for (int i = 0; i < lps_n(lps); ++i) {
    // Start with workers that last ran on `lpi`; if none are available
    // steal from another in round-robin order.
```
Member


Just a note to remind myself later: I think this is where we could call some form of _cpuinfo_compare from affinity.c. I'll check with Ryan later (and that could be done in a follow-up PR).

```c
// work stealing scheduler threads spin-wait for each-other to
// finish.
error("Host stealing scheduler is incompatible with "
      "--max-concurrency > -w");
```
Member


Nit: `-w` → `--workers`
