-
Notifications
You must be signed in to change notification settings - Fork 14
Add Scheduling Topology Hints #180
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
29f4d4e to
4b0c03c
Compare
4b0c03c to
e3bc10c
Compare
| faabric::util::SchedulingDecision publicMakeSchedulingDecision( | ||
| std::shared_ptr<faabric::BatchExecuteRequest> req, | ||
| bool forceLocal, | ||
| faabric::util::SchedulingTopologyHint topologyHint); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The need for this function isn't immediately obvious and it feels like a bit of a hack. In general if an API needs to be changed to support a test you have one of two things going on: (i) your test is too invasive and checking too much internal logic; (ii) the API doesn't expose enough information.
In this case I think it's (i). I think it can be changed relatively easily as this and callFunctions have the same signature. It might be possible to change the tests to call callFunctions instead (as they have mock mode turned on). You can then add a check to make sure the underlying function calls have been disaptched to the expected hosts.
If this isn't possible, then we need to work out what's happening in callFunctions that doesn't work properly in mock mode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, switching to callFunctions is pretty easy with the only caveat that we need to use the TestExecutor and TestExecutorFactory classes that currently lived in ./tests/tests/scheduler/test_executor.cpp.
I have moved the declaration of these classes to fixtures.h and kept the definition where it is.
I also add a check for the recorded messages in the function call client.
tests/test/util/test_scheduling.cpp
Outdated
| namespace tests { | ||
|
|
||
| TEST_CASE("Test building scheduling decisions", "[util]") | ||
| TEST_CASE("Test building scheduling decisions", "[util][scheduling-decisions]") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these tags are getting a bit too granular. Admittedly the util tag is bloated, but as mentioned above, these tests are really testing the scheduler, so once they've been moved, they can be added to the [scheduler] tag. The usefulness of the tags is to allow someone to quickly run the tests related to the thing that they've changed, however, when they're this granular, I'm not sure it's easy to work out which ones to run.
Catch2 provides various selectors that should make it easy to run subsets of tests in your dev workflow, if you're finding there are too many tests to run in a loop e.g. by file: https://catch2.docsforge.com/v2.13.2/running/command-line/.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree thanks. For future reference, you can run all tests in tests/test/scheduler/test_scheduling_decisions.cpp by running:
faabric_tests -# [#test_scheduling_decisions]
src/scheduler/Scheduler.cpp
Outdated
|
|
||
| for (int i = 0; i < nOnThisHost; i++) { | ||
| hosts.push_back(h); | ||
| // Under the pairs topology hint, we never allocate a single |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My gut here would instead be:
if(topologyHint == faabric::util::SchedulingTopologyHint::PAIRS &&
nOnThisHost < 2) {
// Move on if we can't colocate function with at least one other
continue;
}I think this fixes the issue you mention in the PR description too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this would behave as expected. For example:
- We have 4 hosts with 4 slots each.
- We want to schedule 9 requests (with the
NEVER_ALONEhint). - We expect 4 requests scheduled to the first host, and 5 scheduled to the second.
However, I think your solution would schedule 5 requests on the first host and 4 on the second, as it would exhaust all possible hosts (nOnThisHost == 1 for all of them), and resort to overload the master.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After an offline discussion, I use this change together with a change in the overloading logic that makes the issue mentioned in the description disappear.
| std::atomic<int> resetCount = 0; | ||
|
|
||
| class TestExecutor final : public Executor | ||
| TestExecutor::TestExecutor(faabric::Message& msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes here involve moving from a class declaration + definition to just definition and some forceLocal refactor.
| }; | ||
| } | ||
|
|
||
| SECTION("Decreasing to one and increasing slot distribution") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I add this test that before would have failed with NEVER_ALONE.
|
@csegarragonz, I forgot to add a comment to that approval. Couple of nitpicky comments but other than that LGTM |
In this PR I add support to pass scheduling topology hints as a parameter to
callFunctions. Scheduling hints affect the way batches are scheduled to the available resources. The currently supported hints are:NORMAL: will bin-pack requests to hosts until it runs out of slots, and then will overload the master (default).NEVER_ALONE: will never schedule a request to a host without other requests of the batch. There is one exception: master requests (first in the batch) will be allocated to a new host (even on their own) if the master is out of resources.These hints only affect the
SchedulingDecisionso the separation recently made between making the scheduling decision and actually calling the functions has proven extremely useful for the implementation and testing. I add tooling for testing these and future hints we may add easily, together with a number of tests.