Skip to content

Conversation

@csegarragonz
Copy link
Collaborator

@csegarragonz csegarragonz commented May 13, 2022

In this PR I fix a couple of issues that I ran into whilst reproducing the MPI results.

  • Failing test and workaround for Too many files open error in ZeroMQ #266. The current solution is only a patch, and we should address it properly in a separate PR.
  • Propagate UNDERFULL topology hint in MPI to deliberately missplace MPI functions for a migration opportunity to appear.
  • Fix race condition in MPI migration.

checkAllocationAndResult(req);
}

TEST_CASE_METHOD(MpiDistTestsFixture, "Test MPI all to all many times", "[mpi]")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This distributed test fails in the current main branch.

@csegarragonz csegarragonz marked this pull request as ready for review May 13, 2022 10:58
@csegarragonz csegarragonz force-pushed the too-many-files branch 2 times, most recently from 3818ec1 to df93eef Compare May 20, 2022 10:59
@csegarragonz csegarragonz changed the title Fix Too many open files problem in ZeroMQ Various MPI fixes May 20, 2022
// collective communications by all ranks. At this point, all non-
// leader ranks will be hitting a barrier, for which they don't
// need the ranks for host map, therefore it is safe to modify it
if (m.dsthost() == thisHost && m.msg().mpirank() < localLeader) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We used to call initLocalRemote() here, unfortunately that method is too coarse and was clearing the rankForHost structure before filling it again.

This was causing a race condition between migrated and non-migrated ranks.

return getMpiRankHostsFromExecGraphNode(graph.rootNode);
}

void getMigratedMpiRankHostsFromExecGraph(const ExecGraph& graph,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this method here so I can use it for Faasm tests as well.

@csegarragonz csegarragonz requested a review from Shillaker May 20, 2022 16:44
@csegarragonz csegarragonz requested a review from Shillaker May 23, 2022 09:40
Copy link
Collaborator

@Shillaker Shillaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More linebreaks.

@csegarragonz csegarragonz requested a review from Shillaker May 24, 2022 18:33
@csegarragonz csegarragonz enabled auto-merge (squash) May 25, 2022 17:10
@csegarragonz csegarragonz disabled auto-merge May 25, 2022 17:10
@csegarragonz csegarragonz enabled auto-merge (squash) May 25, 2022 17:12
@csegarragonz csegarragonz disabled auto-merge May 25, 2022 17:13
@csegarragonz csegarragonz merged commit 9541f53 into main May 25, 2022
@csegarragonz csegarragonz deleted the too-many-files branch May 25, 2022 17:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants