Latest faabric #646
Conversation
@@ -1,4 +1,4 @@
-FROM faasm/faabric-base:0.3.0
+FROM faasm/faabric-base:0.3.3
We need to bump this one for the latest Conan
}

// TODO - this method is duplicated from faabric's dist tests
void checkSchedulingFromExecGraph(
With this method we can actually check if a migration has taken place.
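For reference, a check like this could look roughly as sketched below. This is only a sketch: the helper getMpiRankHostsFromExecGraph and the Catch2 REQUIRE assertion are assumed from faabric's dist-test utilities, and the exact names and namespaces may differ from what this diff actually uses.

void checkSchedulingFromExecGraph(
  const faabric::scheduler::ExecGraph& graph,
  const std::vector<std::string>& expectedHosts)
{
    // Walk the exec graph rooted at the MPI root message and collect the
    // host each rank actually executed on (helper name is an assumption)
    std::vector<std::string> actualHosts =
      faabric::scheduler::getMpiRankHostsFromExecGraph(graph);

    // If a migration has taken place, the actual hosts differ from the
    // original scheduling decision
    REQUIRE(actualHosts == expectedHosts);
}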
Shillaker left a comment
Also pending faasm/faabric#264 as noted in the description.
// Sleep for a while to let the scheduler schedule the MPI calls, and then
// update the local slots so that a migration opportunity appears
SLEEP_MS(500);
I'm not sure I understand what this sleep is waiting for. The comment says it is waiting for the scheduler to schedule the calls, but surely that is happening in the sch.callFunctions line? What happens without this sleep?
The request we are sending in the sch.callFunctions line is a request for an MPI application, so it is just one message, with the field msg.mpiworldsize() set to the worldSize.
Once the request is scheduled and has started executing, we call MpiWorld::create, which then issues a different call to sch.callFunctions with worldSize - 1 functions.
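For illustration, building that explicit request could look roughly like this. It is a sketch only: the user/function names ("mpi"/"migration") are placeholders, and batchExecFactory/callFunctions are used as I understand faabric's API, not copied from this PR.

auto& sch = faabric::scheduler::getScheduler();

// The batch contains a single message: the MPI root function
auto req = faabric::util::batchExecFactory("mpi", "migration", 1);
faabric::Message& msg = req->mutable_messages()->at(0);

// Only the world size is set here; MpiWorld::create later issues a second
// callFunctions with worldSize - 1 messages for the remaining ranks
msg.set_mpiworldsize(worldSize);

sch.callFunctions(req);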
The way we make a migration opportunity appear in this test is by manually modifying the resources available on the main host through sch.setThisHostResources. At a high level, we do the following:
1. Set low resource availability.
2. Call sch.callFunctions with one function (which will internally call the method a second time).
3. Set high resource availability so that a migration opportunity appears.
It is important that both the explicit and the implicit calls to sch.callFunctions see the low resource availability, hence the sleep.
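Putting it together, the sequencing in the test looks roughly like the sketch below. The slot counts and the SLEEP_MS value are illustrative, and HostResources/setThisHostResources follow faabric's API as I understand it.

// 1. Advertise few slots so that both the explicit and the implicit
//    callFunctions see constrained resources
faabric::HostResources resLow;
resLow.set_slots(2);
sch.setThisHostResources(resLow);

// 2. Explicit call: one message, which internally triggers the second,
//    worldSize - 1 sized, callFunctions from MpiWorld::create
sch.callFunctions(req);

// 3. Give the implicit call time to be scheduled against the low slot
//    count before we change it
SLEEP_MS(500);

// 4. Raise the slot count so that a migration opportunity appears
faabric::HostResources resHigh;
resHigh.set_slots(32);
sch.setThisHostResources(resHigh);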
Shillaker left a comment
Once faasm/faabric#264 is merged, this is good to go.
* faabric: latest code version
* gh: bump code version
* mpi: fix migration and distributed test for migration
* faabric: bump version to fix migration plot
* faabric: bump version to fix race condition in mpi migration
* deps: bump cpp and faabric deps and fix mpi dist test accordingly
* docker: bump cpp-root docker image for the latest faabric
* faabric: latest code version
* dist-tests: add regression checks for mpi distributed tests
* dist-tests: use factored-out method in faabric instead of duplicating code
* fix: format
* dist-tests: more conservative migration check period
* cpp: submodule now points to main
* faabric: re-factor method to get rank hosts for migrated mpi functions
* nit: use whitespaces semantically as line breaks after paragraphs
* faabric: bump code version after merge
This PR is the last one needed to stabilize all the MPI-related work I tinkered with in the rush of the past few days.
It depends on two other PRs being merged first:
In this PR we fix a number of errors I encountered whilst re-running the MPI experiments, and add a couple of additional checks and tests that would now catch the regression.