Bug in work stealing scheduler policy causing idle workers #1046

@robgjansen

Description

The work-stealing scheduling policy occasionally causes some worker threads to sit idle even though there is outstanding work to perform. This is item 1 in #1041; more details below.

There is a bug in the logic that is run at the beginning of each simulation round. When the round changes, each thread moves all of its assigned hosts from its processed queue back into the unprocessed queue so that the work associated with those hosts in the new round can proceed. There is a race condition here where thread A has nothing to do (no hosts assigned) and tries to start stealing a host from thread B's unprocessed queue before thread B has repopulated it for the new round. Thread A then thinks all the work is already done and sits idle, while thread B is stuck with a full queue of hosts that it must process.

Here is a simple program to create a ~5 second CPU workload (on my machine):

#include <stdint.h>

/* Spin for roughly 5 seconds of CPU time (machine-dependent). */
int main(int argc, char* argv[]) {
    volatile uint64_t i = 0;
    while (i < 3000000000)
        i += 1;
    return 0;
}

And the shadow config:

<shadow stoptime="2">
  <topology><![CDATA[<graphml xmlns="http://graphml.graphdrawing.org/xmlns" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
  <key attr.name="packetloss" attr.type="double" for="edge" id="d4" />
  <key attr.name="latency" attr.type="double" for="edge" id="d3" />
  <key attr.name="bandwidthup" attr.type="int" for="node" id="d2" />
  <key attr.name="bandwidthdown" attr.type="int" for="node" id="d1" />
  <key attr.name="countrycode" attr.type="string" for="node" id="d0" />
  <graph edgedefault="undirected">
    <node id="poi-1">
      <data key="d0">US</data>
      <data key="d1">10240</data>
      <data key="d2">10240</data>
    </node>
    <edge source="poi-1" target="poi-1">
      <data key="d3">1.0</data>
      <data key="d4">0.0</data>
    </edge>
  </graph>
</graphml>
]]></topology>
  <plugin id="spin" path="/home/rjansen/dev/shadow/build/src/test/spin/libshadow-plugin-test-spin.so"/>
  <host id="testhost" quantity="12">
    <process plugin="spin" starttime="1" arguments=""/>
  </host>
</shadow>

If each run of main takes 5 seconds, then running 12 hosts on 12 threads should complete in 5 seconds. But sometimes it takes 2 cycles (10 seconds), because some workers sit idle while others end up processing 2 hosts instead of 1. With 12 workers I was able to observe this behavior consistently, with about 3 workers idle on average.

Metadata

Labels

- Component: Main (Composing the core Shadow executable)
- Priority: Critical (Major issue that requires immediate attention)
- Status: In Progress (Work is currently underway)
- Tag: Performance (Related to improving shadow's run-time)
- Type: Bug (Error or flaw producing unexpected results)
