Various MPI fixes #264
Conversation
    checkAllocationAndResult(req);
}

TEST_CASE_METHOD(MpiDistTestsFixture, "Test MPI all to all many times", "[mpi]")
This distributed test fails in the current main branch.
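For context, the snippet below is a minimal standalone sketch of the repeated all-to-all pattern that this test exercises. It is written against plain MPI rather than Faabric's MpiDistTestsFixture, and the buffer sizes and iteration count are assumptions, not taken from the actual test.

```cpp
// Repeated all-to-all exchanges, the pattern exercised by the distributed test.
// Build and run with an MPI toolchain, e.g. `mpicxx alltoall.cpp && mpirun -n 4 ./a.out`
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    int worldSize = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &worldSize);

    // One integer sent to (and received from) every rank
    std::vector<int> sendBuf(worldSize, rank);
    std::vector<int> recvBuf(worldSize, 0);

    // Repeat the collective many times, as the test does
    const int numIterations = 50;
    for (int i = 0; i < numIterations; i++) {
        MPI_Alltoall(sendBuf.data(), 1, MPI_INT,
                     recvBuf.data(), 1, MPI_INT, MPI_COMM_WORLD);
    }

    // After each all-to-all, slot i should hold rank i's value
    for (int i = 0; i < worldSize; i++) {
        if (recvBuf[i] != i) {
            std::printf("Rank %i: unexpected value %i at slot %i\n", rank, recvBuf[i], i);
        }
    }

    MPI_Finalize();
    return 0;
}
```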
    // collective communications by all ranks. At this point, all non-
    // leader ranks will be hitting a barrier, for which they don't
    // need the ranks for host map, therefore it is safe to modify it
    if (m.dsthost() == thisHost && m.msg().mpirank() < localLeader) {
We used to call initLocalRemote() here; unfortunately, that method is too coarse and cleared the rankForHost structure before filling it again.
This caused a race condition between migrated and non-migrated ranks.
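To illustrate the difference between the coarse rebuild and the narrower update described above, here is a minimal sketch; the names (rankForHost, rebuildRankForHost, updateMigratedRank) and the data structure are assumptions for illustration, not Faabric's actual code.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Illustrative map from MPI rank to the host it runs on (name is an assumption)
static std::map<int, std::string> rankForHost;

// Coarse approach: clear the whole map before repopulating it. A rank that
// reads the map concurrently (e.g. a migrated rank coming out of a barrier)
// can observe it empty or half-filled, which is the race described above.
void rebuildRankForHost(const std::vector<std::pair<int, std::string>>& allRanks)
{
    rankForHost.clear();
    for (const auto& [rank, host] : allRanks) {
        rankForHost[rank] = host;
    }
}

// Narrower approach: only overwrite the entries that actually changed, so the
// map never passes through an empty intermediate state.
void updateMigratedRank(int rank, const std::string& newHost)
{
    rankForHost[rank] = newHost;
}
```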
src/scheduler/ExecGraph.cpp
    return getMpiRankHostsFromExecGraphNode(graph.rootNode);
}

void getMigratedMpiRankHostsFromExecGraph(const ExecGraph& graph,
Moved this method here so I can use it for Faasm tests as well.
Shillaker left a comment:
More linebreaks.
In this PR I fix a couple of issues that I ran into whilst reproducing the MPI results:
- The too many open files error in ZeroMQ (#266). The current solution is only a patch, and we should address it properly in a separate PR; one common stop-gap is sketched below.
- Using the UNDERFULL topology hint in MPI to deliberately misplace MPI functions so that a migration opportunity appears.
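The PR does not spell out what the temporary patch for the file descriptor exhaustion looks like. As an illustration only, a common stop-gap for "too many open files" errors is to raise the per-process file descriptor limit; this is an assumption, not necessarily the change made in this PR.

```cpp
// Raise the per-process open file descriptor soft limit to its hard maximum.
// This is a common stop-gap for "too many open files" errors caused by heavy
// socket usage; it does not fix the underlying socket churn.
#include <sys/resource.h>
#include <cstdio>

bool raiseFdLimit()
{
    struct rlimit limits;
    if (getrlimit(RLIMIT_NOFILE, &limits) != 0) {
        std::perror("getrlimit");
        return false;
    }

    limits.rlim_cur = limits.rlim_max;
    if (setrlimit(RLIMIT_NOFILE, &limits) != 0) {
        std::perror("setrlimit");
        return false;
    }

    return true;
}
```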