Skip to content

Conversation

@csegarragonz
Copy link
Collaborator

@csegarragonz csegarragonz commented Aug 3, 2023

In this PR, we make the planner take the scheduling decisions in a centralized manner. Intuitively, this means that the callFunctions method is moved from the Scheduler to the PlannerClient.

This PR finalizes most of the work to move from a decentralised to a centralized scheduler, and it includes some relevant changes in functionality (for the tests mostly):

  • By default, we don't overload the executor pool. This means that we need to make sure that the planner sees that the nodes have enough cores to run the requests (even in GHA).
  • Each execution request involves a request round-trip to the planner (a background thread in the tests), which means that we need to be more careful to adequately wait for results (particularly for sanitized builds).
  • Mocking multiple nodes (e.g. for remote threads) becomes harder, as we can't mock one but not the other one. So all remote thread's tests need to happen in the distributed tests, or using a lower-level interface (i.e. not callFunctions but executeBatch or executeThreads).

This PR also removes a lot of functionality from the Scheduler class. This class had become very overloaded, and now is almost unnecessary. In subsequent PRs I will separate the Executor from the Scheduler and decide exactly how to rename the remainders of the Scheduler.

Closes #275
Closes #287

@csegarragonz csegarragonz mentioned this pull request Aug 28, 2023
45 tasks
@csegarragonz csegarragonz marked this pull request as ready for review August 29, 2023 17:38
@codecov
Copy link

codecov bot commented Aug 31, 2023

Codecov Report

Merging #345 (26a90b2) into main (a4e9d23) will decrease coverage by 0.95%.
The diff coverage is 79.72%.

@@            Coverage Diff             @@
##             main     #345      +/-   ##
==========================================
- Coverage   82.35%   81.40%   -0.95%     
==========================================
  Files         103      102       -1     
  Lines        8183     7857     -326     
==========================================
- Hits         6739     6396     -343     
- Misses       1444     1461      +17     
Files Changed Coverage Δ
...clude/faabric/batch-scheduler/SchedulingDecision.h 100.00% <ø> (ø)
include/faabric/scheduler/Scheduler.h 50.00% <ø> (+7.14%) ⬆️
include/faabric/transport/PointToPointBroker.h 100.00% <ø> (ø)
src/batch-scheduler/BinPackScheduler.cpp 97.12% <ø> (ø)
src/runner/FaabricMain.cpp 93.47% <ø> (ø)
src/scheduler/FunctionCallClient.cpp 93.87% <ø> (+5.45%) ⬆️
src/util/ExecGraph.cpp 63.20% <ø> (+0.50%) ⬆️
src/transport/PointToPointBroker.cpp 75.46% <11.11%> (-3.84%) ⬇️
src/mpi/MpiWorld.cpp 80.04% <30.00%> (+2.40%) ⬆️
src/planner/PlannerClient.cpp 82.46% <66.66%> (-1.50%) ⬇️
... and 12 more

... and 7 files with indirect coverage changes

@csegarragonz csegarragonz merged commit 9ccf899 into main Sep 6, 2023
@csegarragonz csegarragonz deleted the call-batch branch September 6, 2023 07:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Migration opportunity tests fail sporadically Deadlock in distributed tests

2 participants