
Prevent idle workers in the work stealing policy#1047

Merged
robgjansen merged 3 commits into shadow:master from robgjansen:event-sched-bug on Dec 11, 2020

Conversation

@robgjansen (Member) commented on Dec 9, 2020:

This fixes two main issues:

  • Ensures that the very first round is synchronized properly by waiting until the workers finish setting up all of the host objects before advancing the master clock to scheduling round 1
  • Ensures that during every round, no thread can steal from another thread until the other thread has reset its "unprocessed" hosts queue

Closes #1046

@robgjansen added the labels Priority: Critical (Major issue that requires immediate attention), Type: Bug (Error or flaw producing unexpected results), Component: Main (Composing the core Shadow executable), and Status: In Progress (Work is currently underway) on Dec 9, 2020
@robgjansen robgjansen self-assigned this Dec 9, 2020
@robgjansen robgjansen marked this pull request as ready for review December 9, 2020 22:24
@robgjansen (Member, Author) commented:

I'm running a before/after Tor simulation on this to understand the performance implications, but the code could be reviewed in the meantime.

@robgjansen robgjansen requested a review from sporksmith December 9, 2020 22:25
@sporksmith (Contributor) left a comment:

Approving since this looks correct AFAICT, though I have some suggestions for clarity/maintainability.

if (scheduler->policyType != SP_SERIAL_GLOBAL) {
/* when everyone finishes starting the hosts, the main thread can prepare
* round 1 */
countdownlatch_countDownAwait(scheduler->finishBarrier);
@sporksmith (Contributor) commented:

I was a bit confused both about reusing finishBarrier here, and why wait on prepareRoundBarrier below doesn't already accomplish what you want.

I see now that prepareRoundBarrier is actually waited at the end of a round in scheduler_continueNextRound. Would moving it to the start of a round solve the bug without having to (re)use another barrier here, as well as being more consistent with its name?

@robgjansen (Member, Author) replied:

I had first added a new barrier, but then decided to re-use the finishBarrier since it wasn't needed at the start. But I can see how that could lead to issues later, so I will go back to using a separate one.

The prepareRoundBarrier needs to be at the end, because the state that is updated while the workers are waiting at the barrier (i.e., the clock) is what the workers use to break out of their running while loop. If I moved prepareRoundBarrier to the beginning, then we would need an extra check and break statement---which is certainly possible but also more lines of code. (Perhaps transitionRoundBarrier is a better name.)

Ultimately, I'm not really trying to refactor this admittedly ugly code right now, just trying to minimize the changes needed to get to a correct state. That being said, I do want to rewrite the scheduler in Rust (in the dev branch) at some point, because (i) I want to implement a continuous round scheduler to replace the current discrete round scheduler, and (ii) I think we can design a more efficient scheduler/scheduling policy now that we don't have to worry about migrating plugin thread-local storage anymore. But this is all for another month :)

g_rw_lock_reader_unlock(&data->lock);

/* make sure the workload has been updated for this round */
while (!__atomic_load_n(&stolenTdata->isStealable, __ATOMIC_ACQUIRE)) {
@sporksmith (Contributor) commented:

I think this looks correct, though it could potentially spin more than we'd like, and in general, using atomic operations outside of higher-level sync primitives is a bit obscure. Maybe a barrier could work instead of a raw atomic?

@sporksmith (Contributor) added:

(Or if spinning actually is what you want here, maybe a spin lock?)

@robgjansen (Member, Author) replied:

I don't think a pthread_barrier will work here, because I want the bool functionality - I only want the threads waiting for it to turn true to wait, but not the thread that sets it to true. I think pthread_barrier requires all threads that decrement the count to wait until it reaches 0.

I was planning to use a CountDownLatch but then got excited when I realized I could do it with an atomic with hopefully less synchronization overhead. (The countdown latch uses both a mutex and a condition variable.) I'm using the atomic to prevent a race condition so I don't expect the wait will be very long, so I think spinning is OK.

pthread_spin_lock could work, but it's a bit kludgy. If we are using N workers then it's possible that N-1 of them reach this point. I just want the N-1 workers to check/wait for the boolean to become true. If I use a spin lock, I would need to use the unlocked state to represent true, meaning that the N-1 workers would acquire and immediately release the lock every time they want to check if they can steal from another worker. For N workers, this happens N^2 times per round. The atomic seems simpler.

Thoughts?

@sporksmith (Contributor) replied:

> I don't think a pthread_barrier will work here
> ...
> I was planning to use a CountDownLatch

Oops, yeah CountDownLatch was what I meant.

> but then got excited when I realized I could do it with an atomic with hopefully less synchronization overhead. (The countdown latch uses both a mutex and a condition variable.) I'm using the atomic to prevent a race condition so I don't expect the wait will be very long, so I think spinning is OK.

I think the fast path without contention would be the ~same? But yeah, the main practical benefit would be avoiding a potential infinite spin. (re?)reading your comment, it sounds like spinning is probably ok.

> pthread_spin_lock could work, but it's a bit kludgy

Hmm, I see what you mean.

Maybe a semaphore (or even Ryan's spinning semaphore) would be a more appropriate abstraction?

Anyway, I'm fine with leaving it as-is. Directly using these low-level atomic functions makes the code a little more opaque, but it's not a big deal.

@robgjansen (Member, Author) replied:

OK, I'm leaving as-is (Ryan's spinning semaphore is a great idea but only available in dev branch atm).

@robgjansen (Member, Author) commented on Dec 10, 2020:

In a small 1% Tor network workload, Shadow's run time is nearly identical before and after this PR (though it might help more on other workloads, like Ryan's CPU-bound workloads).

run_time.pdf

@robgjansen robgjansen merged commit 21d5432 into shadow:master Dec 11, 2020
@robgjansen robgjansen deleted the event-sched-bug branch December 11, 2020 17:13
