Add Scheduling Topology Hints #180

csegarragonz · 2021-11-24T11:14:58Z

In this PR I add support for passing scheduling topology hints as a parameter to callFunctions. Scheduling hints affect how batches are scheduled onto the available resources. The currently supported hints are:

NORMAL: will bin-pack requests to hosts until it runs out of slots, and then will overload the master (default).
NEVER_ALONE: will never schedule a request to a host without other requests of the batch. There is one exception: master requests (first in the batch) will be allocated to a new host (even on their own) if the master is out of resources.
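
As a rough illustration of the NORMAL behaviour described above, the sketch below bin-packs a batch across hosts and hands any overflow to the master. It is a standalone simplification, not faabric's implementation: the function name `packNormal`, the plain slot-count vector, and the convention that index 0 is the master are all assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of faabric). Returns how many requests
// land on each host. slots[0] is assumed to be the master host; the
// remaining entries are the free slots on the other available hosts.
std::vector<int> packNormal(const std::vector<int>& slots, int nRequests)
{
    assert(!slots.empty() && nRequests >= 0);

    std::vector<int> placed(slots.size(), 0);
    int remaining = nRequests;

    // Fill each host up to its free slots, in order.
    for (std::size_t i = 0; i < slots.size() && remaining > 0; i++) {
        int nOnThisHost = std::min(slots[i], remaining);
        placed[i] = nOnThisHost;
        remaining -= nOnThisHost;
    }

    // Out of slots everywhere: the master takes whatever is left.
    placed[0] += remaining;

    return placed;
}
```

With two hosts of two slots each and a batch of five requests, `packNormal({2, 2}, 5)` gives `{3, 2}`: both hosts fill up and the fifth request overloads the master.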

These hints only affect the SchedulingDecision, so the recent separation between making the scheduling decision and actually calling the functions has proven extremely useful for implementation and testing. I also add tooling that makes it easy to test these hints and any we add in future, together with a number of tests.

include/faabric/scheduler/Scheduler.h

Shillaker · 2021-11-24T12:42:34Z

include/faabric/scheduler/Scheduler.h

+    faabric::util::SchedulingDecision publicMakeSchedulingDecision(
+      std::shared_ptr<faabric::BatchExecuteRequest> req,
+      bool forceLocal,
+      faabric::util::SchedulingTopologyHint topologyHint);


The need for this function isn't immediately obvious and it feels like a bit of a hack. In general if an API needs to be changed to support a test you have one of two things going on: (i) your test is too invasive and checking too much internal logic; (ii) the API doesn't expose enough information.

In this case I think it's (i). I think it can be changed relatively easily as this and callFunctions have the same signature. It might be possible to change the tests to call callFunctions instead (as they have mock mode turned on). You can then add a check to make sure the underlying function calls have been dispatched to the expected hosts.

If this isn't possible, then we need to work out what's happening in callFunctions that doesn't work properly in mock mode.

Yes, switching to callFunctions is pretty easy, with the only caveat that we need to use the TestExecutor and TestExecutorFactory classes that currently live in ./tests/test/scheduler/test_executor.cpp.

I have moved the declaration of these classes to fixtures.h and kept the definition where it is.

I also add a check for the recorded messages in the function call client.
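
The testing approach agreed in this thread (call the public entry point in mock mode, then check where the calls went) can be sketched in isolation as below. Nothing here is faabric's API: `RecordingClient`, `callWithDecision`, and `recordedHosts` are hypothetical stand-ins that only show the record-then-assert pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical mock client: records each call instead of sending it.
struct RecordingClient
{
    // (host, message id) pairs in dispatch order
    std::vector<std::pair<std::string, int>> calls;

    void dispatch(const std::string& host, int msgId)
    {
        calls.emplace_back(host, msgId);
    }
};

// Hypothetical stand-in for the code under test: sends message i to the
// host chosen for it in the scheduling decision.
void callWithDecision(RecordingClient& client,
                      const std::vector<std::string>& decisionHosts)
{
    for (std::size_t i = 0; i < decisionHosts.size(); i++) {
        client.dispatch(decisionHosts[i], static_cast<int>(i));
    }
}

// Extracts the hosts from the recorded calls so a test can compare them
// against the expected placement.
std::vector<std::string> recordedHosts(const RecordingClient& client)
{
    std::vector<std::string> hosts;
    for (const auto& call : client.calls) {
        hosts.push_back(call.first);
    }
    return hosts;
}
```

A test built this way asserts only on observable behaviour (which host each message was dispatched to), so it does not need a public wrapper around the internal decision-making function.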

include/faabric/util/scheduling.h

Shillaker · 2021-11-24T12:47:50Z

tests/test/util/test_scheduling.cpp

 namespace tests {

-TEST_CASE("Test building scheduling decisions", "[util]")
+TEST_CASE("Test building scheduling decisions", "[util][scheduling-decisions]")


I think these tags are getting a bit too granular. Admittedly the util tag is bloated, but as mentioned above, these tests are really testing the scheduler, so once they've been moved, they can be added to the [scheduler] tag. The tags are useful because they let someone quickly run the tests related to the thing they've changed; when they're this granular, though, I'm not sure it's easy to work out which ones to run.

If you're finding there are too many tests to run in a loop, Catch2 provides various selectors that should make it easy to run subsets of tests in your dev workflow, e.g. by file: https://catch2.docsforge.com/v2.13.2/running/command-line/.

I agree, thanks. For future reference, you can run all tests in tests/test/scheduler/test_scheduling_decisions.cpp by running:

faabric_tests -# [#test_scheduling_decisions]

tests/test/util/test_scheduling.cpp

Shillaker · 2021-11-24T13:18:03Z

src/scheduler/Scheduler.cpp


-                for (int i = 0; i < nOnThisHost; i++) {
-                    hosts.push_back(h);
+                // Under the pairs topology hint, we never allocate a single


My gut here would instead be:

    if (topologyHint == faabric::util::SchedulingTopologyHint::PAIRS &&
        nOnThisHost < 2) {
        // Move on if we can't colocate function with at least one other
        continue;
    }

I think this fixes the issue you mention in the PR description too.

I don't think this would behave as expected. For example:

We have 4 hosts with 4 slots each.

We want to schedule 9 requests (with the NEVER_ALONE hint).

We expect 4 requests scheduled to the first host, and 5 scheduled to the second.

However, I think your solution would schedule 5 requests on the first host and 4 on the second, as it would exhaust all possible hosts (nOnThisHost == 1 for all the remaining ones), and resort to overloading the master.

After an offline discussion, I use this change together with a change in the overloading logic that makes the issue mentioned in the description disappear.
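
The counterexample in this thread can be replayed with a small standalone simulation. It applies the suggested `nOnThisHost < 2` skip and then places the leftover requests according to an overflow policy. The thread does not describe the overloading change that was finally made, so the `LAST_USED` policy below is only an assumption: it is one way to reach the expected 4 + 5 split. The names `Overflow` and `packNeverAlone` are hypothetical, and index 0 is assumed to be the master.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Where leftover requests go once no host can take them.
// LAST_USED is an assumption for this sketch, not the confirmed fix.
enum class Overflow
{
    MASTER,
    LAST_USED
};

// Hypothetical helper (not part of faabric). slots[0] is assumed to be
// the master; returns the number of requests placed on each host.
std::vector<int> packNeverAlone(const std::vector<int>& slots,
                                int nRequests,
                                Overflow overflow)
{
    assert(!slots.empty() && nRequests >= 0);

    std::vector<int> placed(slots.size(), 0);
    int remaining = nRequests;
    std::size_t lastUsed = 0;

    for (std::size_t i = 0; i < slots.size() && remaining > 0; i++) {
        int nOnThisHost = std::min(slots[i], remaining);

        // Skip non-master hosts where a request would end up alone.
        if (i > 0 && nOnThisHost < 2) {
            continue;
        }

        if (nOnThisHost > 0) {
            placed[i] = nOnThisHost;
            lastUsed = i;
        }
        remaining -= nOnThisHost;
    }

    // Place whatever could not be scheduled according to the policy.
    std::size_t target = (overflow == Overflow::MASTER) ? 0 : lastUsed;
    placed[target] += remaining;

    return placed;
}
```

For 4 hosts with 4 slots each and 9 requests, `packNeverAlone({4, 4, 4, 4}, 9, Overflow::MASTER)` gives `{5, 4, 0, 0}`, the outcome objected to above, while `Overflow::LAST_USED` gives `{4, 5, 0, 0}`, the expected placement.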

src/scheduler/Scheduler.cpp

csegarragonz · 2021-11-24T18:39:08Z

tests/test/scheduler/test_executor.cpp

 std::atomic<int> resetCount = 0;

-class TestExecutor final : public Executor
+TestExecutor::TestExecutor(faabric::Message& msg)


The changes here move from a class declaration plus definition to just the definition, along with some forceLocal refactoring.

csegarragonz · 2021-11-24T18:40:23Z

tests/test/scheduler/test_scheduling_decisions.cpp

+        };
+    }
+
+    SECTION("Decreasing to one and increasing slot distribution")


I add this test, which previously would have failed with NEVER_ALONE.

Shillaker · 2021-11-24T19:06:27Z

@csegarragonz, I forgot to add a comment to that approval. Couple of nitpicky comments but other than that LGTM

csegarragonz self-assigned this Nov 24, 2021

csegarragonz added the enhancement and scheduler labels Nov 24, 2021

csegarragonz force-pushed the mpi-opt branch from 29f4d4e to 4b0c03c November 24, 2021 11:22

add scheduling topology hint and extensive testing

e3bc10c

csegarragonz force-pushed the mpi-opt branch from 4b0c03c to e3bc10c November 24, 2021 11:29

adding a couple more tests

b9c12e5

csegarragonz requested a review from Shillaker November 24, 2021 12:04

Shillaker requested changes Nov 24, 2021

csegarragonz added 6 commits November 24, 2021 14:55

move tests to scheduler folder and change tags

0402cef

move to callFunctions and remove publicMakeSchedulingDecision

850801e

add check for the recorded messages

941804e

refactor hint from PAIRS to NEVER_ALONE

1a91cfc

set force local as a topology hint

79862ab

change overloading logic as discussed offline

93fedb7

csegarragonz commented Nov 24, 2021

csegarragonz requested a review from Shillaker November 24, 2021 18:40

Shillaker approved these changes Nov 24, 2021

pr comments

a325bc6

csegarragonz merged commit c7454fa into master Nov 25, 2021

csegarragonz deleted the mpi-opt branch November 25, 2021 10:29

csegarragonz mentioned this pull request Nov 25, 2021

Fix race condition in distributed tests + Syntax changes for new faabric faasm/faasm#540

Merged

csegarragonz mentioned this pull request Feb 23, 2022

Add task to generate release body #233

Merged


3 participants
