Decouple # of worker threads from max concurrency #1238
sporksmith merged 1 commit into shadow:main from
Conversation
Force-pushed from 26c3efb to c3892ca
Force-pushed from c3892ca to 8373a12
I haven't debugged the deadlock at end of simulation yet, but based on the stacks you sent it looks like a bug in process or ptrace_thread. Did you ever get to the end of a simulation without (the earlier version of) this PR?

As I was cleaning up comments, it occurred to me that something like "vCpu" would be a bit less unwieldy than "concurrency index". Otoh one might take the name to mean something like a "shadow CPU". Wdyt?

As I was looking a bit more at the "gather info" step where we find the time for the next round, I realized it's almost entirely serialized on the global mutex anyway. We could probably just change the scheduler-policy callback that currently gives the min time for a single worker to instead iterate over its per-worker data structures and return the global minimum time, and call that directly from the "main" thread in between rounds. Left it alone for now, though.
Did some ad-hoc testing and ran into a problem with pthread_setaffinity_np I couldn't figure out. (It kept returning EFAULT, but I verified several ways that the address being passed was valid.) Pushed a fixup to remove the new pthread-based affinity API I'd added, and instead use the existing API to set affinity via tid.

In some ad-hoc testing on a balanced workload, it seems to be slightly faster with the new scheduler in general.

On current:
On current:
This PR with work-stealing:
This PR with host scheduler:
This PR with host scheduler and 1 host/worker:
Reproduced the tor deadlock on |
Oops, built but didn't install. So far haven't reproduced the deadlock on dev. Still digging... |
robgjansen
left a comment
Sorry, this is difficult to review and took me a while. I still don't think I understand everything, especially with respect to how CPU affinity is being managed across workers (pthreads) and concurrency queues (logical processors). I'm sending this feedback now so I don't hold you up any longer, and then I can review again.
Did you ever get to the end of a simulation without (the earlier version of) this PR?
Yes, my experiments complete using phantom without the earlier max-concurrency PR.
As I was cleaning up comments, it occurred to me that something like "vCpu" would be a bit less unwieldy than "concurrency index". Otoh one might take the name to mean something like a "shadow CPU". Wdyt?
Yes, I think the concurrency queue thing might be something holding me up. I can see how vCpu might be misinterpreted too. I suggest we use logical processor to refer to these "concurrency queues". A processor isn't quite the same as a CPU, but the term also implies that it could be run on different physical cores, which is how these work. A logical processor could be shortened to log_proc or lp, and then each worker has an assigned_lp_id. This change would require a bit of renaming, but worth it I think.
the "gather info" step
I think we should just implement the suggestion you made (to do it during the round), which I outlined more fully in a comment below, and then remove the "gather info" step. It seems almost trivial to implement.
```c
// Create a workerpool with `nThreads` threads, allowing up to `nConcurrent` to
// run at a time.
WorkerPool* workerpool_new(Manager* manager, Scheduler* scheduler, int nThreads,
                           int nConcurrent);

// Begin executing taskFn(data) on each worker thread in the pool.
void workerpool_startTaskFn(WorkerPool* pool, WorkerPoolTaskFn taskFn,
                            void* data);
// Await completion of a taskFn on every thread in the pool.
void workerpool_awaitTaskFn(WorkerPool* pool);
int workerpool_getNWorkers(WorkerPool* pool);
// Signal worker threads to exit and wait for them to do so.
void workerpool_joinAll(WorkerPool* pool);
void workerpool_free(WorkerPool* pool);
pthread_t workerpool_getThread(WorkerPool* pool, int threadId);
```
Is the worker pool independent enough that it deserves its own module (file), operating on opaque Worker* objects?
Not required. Maybe we should save the effort now and instead spend it on a Rust version of a new scheduler.
I ended up creating a LogicalProcessor object and breaking that into its own module. IMO the WorkerPool is too coupled with Worker to be worth breaking out.
src/main/core/scheduler/scheduler.c
Outdated
```c
pthread_t nextThread = workerpool_getThread(
    scheduler->workerPool,
    workeri++ % workerpool_getNWorkers(scheduler->workerPool));
```
Can we re-use the nWorkers variable from line 349 above?
```c
// XXX Can we merge this into the execution task? e.g. track event times as
// we're scheduling them?
workerpool_startTaskFn(scheduler->workerPool,
                       _scheduler_collectInfoWorkerTask, scheduler);
```
I think you're right here. We need two things for this to work:

1. During every round, each worker keeps track of the minimum event time that it ever pushes to a different worker (i.e., to a host that is not owned by the pushing worker). By event I mean inter-worker events (i.e., `Packet`s); we can ignore same-worker events (i.e., `Task`s).
2. At the end of every round, each worker computes the time of the event at the head of its min-heap event queue.

At the end of a round, each worker reports both 1. and 2. to the manager and then goes to sleep. The manager tracks the global minimum across all such reports from all workers. The global minimum is then used as the next time in place of the call to `scheduler->policy->getNextTime(scheduler->policy);` on line 256 above. Then as long as we move `shadow_logger_flushRecords(shadow_logger_getDefault(), pthread_self());` to the end of each worker's main round event loop (which I think is `_scheduler_runEventsWorkerTaskFn`), we could remove this "in between rounds" task.
Might be worth a separate PR.
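The manager-side reduction over the two per-worker minima could be sketched roughly as follows. All names here (`WorkerRoundMin`, `round_next_time`) are illustrative, not Shadow's actual API:

```c
#include <assert.h>
#include <stdint.h>

#define TIME_MAX UINT64_MAX

typedef struct {
    // 1. Min time of any event this worker pushed to a *different* worker
    //    during the round (inter-worker events, i.e. Packets).
    uint64_t minPushedToOther;
    // 2. Time of the event at the head of this worker's own min-heap queue
    //    at the end of the round.
    uint64_t minOwnQueueHead;
} WorkerRoundMin;

// Manager-side reduction across all worker reports; the result replaces the
// scheduler-policy getNextTime call.
static uint64_t round_next_time(const WorkerRoundMin* reports, int nWorkers) {
    uint64_t globalMin = TIME_MAX;
    for (int i = 0; i < nWorkers; ++i) {
        if (reports[i].minPushedToOther < globalMin)
            globalMin = reports[i].minPushedToOther;
        if (reports[i].minOwnQueueHead < globalMin)
            globalMin = reports[i].minOwnQueueHead;
    }
    return globalMin;
}
```

With this shape, the "gather info" task reduces to one pass over per-worker reports on the main thread between rounds.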
Agreed that we should do it, and tempting to fold it into this PR, but given this is already a pretty big change I think it'd be better to break into its own.
src/main/core/worker.c
Outdated
```c
/* Index into concurrency arrays in WorkerPool. */
int concurrencyIdx;
```
I agree that concurrency index (and concurrency queues) are a bit non-intuitive. As mentioned in my overall comments, I prefer logical processor.
src/main/core/worker.c
Outdated
```c
/* Mutex for each concurrency index, parallel to the following arrays */
GMutex* qMutexes;
```
I was a little confused about what each of these mutexes are covering. I think qMutexes[i] provides locked access to cpuQs[i], doneCpuQs[i], and idleTimers[i]. If that's correct, I think adding a comment to that effect would help.
Hopefully clearer now
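One reading of that grouping, sketched as a struct (illustrative only; field names follow the diff, and `pthread_mutex_t` stands in for GLib's `GMutex` so the snippet is self-contained):

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

typedef struct _WorkerPool WorkerPool;
struct _WorkerPool {
    int nConcurrent;
    // qMutexes[i] protects the i'th slot of each of the parallel arrays
    // below: cpuQs[i], doneCpuQs[i], and idleTimers[i].
    pthread_mutex_t* qMutexes;
    void** cpuQs;
    void** doneCpuQs;
    void** idleTimers;
    // cpuIds is read-only once workers have started, so it needs no mutex.
    int* cpuIds;
};
```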
src/main/core/worker.c
Outdated
```c
/* cpuId for each concurrency index. Read-only once workers have started, so
 * can access without mutex. */
int* cpuIds;
```
Since the qMutexes don't apply to the cpuIds, I would move this definition above that of qMutexes.
src/main/core/worker.c
Outdated
```c
Worker* nextWorker = NULL;

for (int i = 0; i < pool->nConcurrent; ++i) {
    int idx = (concurrencyIdx + i) % pool->nConcurrent;
```
I would appreciate a comment above this for loop to explain the algorithm here. Is it just a linear scan without considering the next closest processor according to the affinity distance function?
Is it a TODO item to choose the next worker based on the logical processor affinity and affinity distance function?
Done. It prefers a worker that last ran on the current logical processor. After that the order is somewhat arbitrary.
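That selection order could look something like this (hypothetical names like `pool_next_worker` and `readyWorkers`; the real code walks per-logical-processor queues rather than a flat array):

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int id; } Worker;

typedef struct {
    int nConcurrent;
    // readyWorkers[i]: a runnable worker that last ran on logical processor
    // i, or NULL if that processor has none ready.
    Worker** readyWorkers;
} Pool;

static Worker* pool_next_worker(Pool* pool, int concurrencyIdx) {
    // Prefer a worker that last ran on `concurrencyIdx` (i = 0); otherwise
    // steal from the other logical processors in round-robin order.
    for (int i = 0; i < pool->nConcurrent; ++i) {
        int idx = (concurrencyIdx + i) % pool->nConcurrent;
        if (pool->readyWorkers[idx] != NULL)
            return pool->readyWorkers[idx];
    }
    return NULL;
}
```

The affinity benefit comes from the i = 0 case: a worker that stays on the same logical processor keeps its CPU affinity; stealing only happens when the local queue is empty.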
src/main/core/worker.c
Outdated
```c
Worker* nextWorker = _workerpool_getWorkerForConcurrencyIdx(
    workerPool, worker->concurrencyIdx);
```
Does this return the next worker for this idx, or the next best worker for this idx, which may be assigned to a different logical processor? (I think it's the latter.)
The former. Renamed _workerpool_getNextWorkerForConcurrencyIdx.
src/main/core/worker.c
Outdated
```c
int oldCpuId = worker->concurrencyIdx >= 0
                   ? workerPool->cpuIds[worker->concurrencyIdx]
                   : AFFINITY_UNINIT;
worker->concurrencyIdx = concurrencyIdx;
int newCpuId = workerPool->cpuIds[concurrencyIdx];

if (workerPool->nWorkers > 0) {
    utility_assert(worker->nativeThreadID > 0);
    affinity_setProcessAffinity(worker->nativeThreadID, newCpuId, oldCpuId);
} else {
    // No pthreads exist; set via pid instead.
    affinity_setThisProcessAffinity(/*new_cpu_num=*/newCpuId,
                                    /*old_cpu_num=*/oldCpuId);
}
```
This is the part that I'm having a hard time understanding. The worker (pthread) is setting its affinity to match that of the logical processor that it got assigned to?
Right. Added a comment.
Oops - need to merge and resolve conflicts. Marked unready again :)
Now ready again
robgjansen
left a comment
Looks great! The LogicalProcessor module is a big improvement. Thanks for your hard work on the design and on cleaning this up!
```c
LogicalProcessor* lp = _idx(lps, lpi);
for (int i = 0; i < lps_n(lps); ++i) {
    // Start with workers that last ran on `lpi`; if none are available
    // steal from another in round-robin order.
```
Just a note to remind myself later: I think this is where we could call some form of _cpuinfo_compare from affinity.c. I'll check with Ryan later (and that could be done in a follow-up PR).
src/main/core/scheduler/scheduler.c
Outdated
```c
// work stealing scheduler threads spin-wait for each other to
// finish.
error("Host stealing scheduler is incompatible with "
      "--max-concurrency > -w");
```
1. Adds a WorkerPool with a callback-style interface instead of having a series of coordinated barriers.
2. Adds a --max-concurrency option that allows us to restrict the number of workers that can run concurrently. The WorkerPool also implements "worker stealing" to try to keep all --max-concurrency CPUs utilized. E.g. setting the # of workers to the # of hosts, max concurrency to the number of CPUs, and using the "host" scheduler is somewhat analogous to using the work-stealing scheduler.

These changes are intended primarily for the dev/phantom branch, since ptrace has better performance with a smaller number of tracees per thread. We're including it here in main for apples-to-apples performance comparison between the shadow 'classic' and 'phantom' architectures.

Originally I tried to keep 1. as a separate commit from 2., but after a series of fixup commits 1. and 2. got too tangled together.
Force-pushed from 35105f5 to 41cfc75