Skip to content

Conversation

@csegarragonz
Copy link
Collaborator

@csegarragonz csegarragonz commented May 13, 2022

This PR is the last one to stabilize all the MPI-related stuff that I tinkered with in the rush of the past few days.

It depends on two other PRs being merged in before:

In this PR we fix a number of errors that I encountered whilst re-running the MPI experiments, and I add a couple of additional checks and tests that would catch the regression now.

@csegarragonz csegarragonz changed the title Lates faabric Latest faabric May 14, 2022
@csegarragonz csegarragonz marked this pull request as ready for review May 20, 2022 13:50
@@ -1,4 +1,4 @@
FROM faasm/faabric-base:0.3.0
FROM faasm/faabric-base:0.3.3
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to bump this one for the latest Conan

}

// TODO - this method is duplicated from faabric's dist tests
void checkSchedulingFromExecGraph(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this method we can actually check if a migration has taken place.

@csegarragonz csegarragonz self-assigned this May 20, 2022
@csegarragonz csegarragonz requested a review from Shillaker May 20, 2022 17:38
Copy link
Collaborator

@Shillaker Shillaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also pending faasm/faabric#264 as noted in the description.


// Sleep for a while to let the scheduler schedule the MPI calls, and then
// update the local slots so that a migration opportunity appears
SLEEP_MS(500);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand what this sleep is waiting for. The comment stays waiting for the scheduler to schedule the calls, but surely that's happening in the sch.callFunctions line? What happens without this sleep?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The request we are sending in the sch.callFunctions line is a request for an MPI application. Thus it is just one message, with the field msg.mpiworldsize() set to the worldSize.

Once the request is scheduled and has started executing, we will call MpiWorld::create that will then issue a different call to sch.CallFunctions with worldSize - 1 functions.

The way we make a migration opportunity appear in this test is by manually modifying the resources available in the main host by calling sch.setThisHostResources. On a high level, we do:

# Set low resource availability

# Call function with one function (will internally call the method another time)
sch.callFunctions

# Set high resource availability so that a migration opportunity appears

It is important than both the explicit (and the implicit) call to sch.callFunctions see the low resource availability. Hence the sleep.

@csegarragonz csegarragonz requested a review from Shillaker May 23, 2022 10:27
Copy link
Collaborator

@Shillaker Shillaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once faasm/faabric#264 is merged, this is good to go.

@csegarragonz csegarragonz merged commit 621c968 into main May 25, 2022
@csegarragonz csegarragonz deleted the latest-faabric branch May 25, 2022 17:21
csegarragonz added a commit that referenced this pull request May 30, 2022
* faabric: latest code version

* gh: bump code version

* mpi: fix migration and distributed test for migration

* faabric: bump version to fix migration plot

* faabric: bump version to fix race condition in mpi migration

* deps: bump cpp and faabric deps and fix mpi dist test accordingly

* docker: bump cpp-root docker image for the latest faabric

* faabric: latest code version

* dist-tests: add regression checks for mpi distributed tests

* dist-tests: use factored-out method in faabric instead of duplicating code

* fix: format

* dist-tests: more conservative migration check period

* cpp: submodule now points to main

* faabric: re-factor method to get rank hosts for migrted mpi functions

* nit: use whitespaces semantically as line breaks after paragraphs

* faabric: bump code version after merge
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants