
Errors out when Openmpi < 2.x.x with distributed. #10015

Closed

ailzhang wants to merge 3 commits into pytorch:master from ailzhang:mpi_seg

Conversation

@ailzhang
Contributor

@ailzhang ailzhang commented Jul 30, 2018

This PR fixes #9418.
Open MPI 1.10 segfaults in MPI_Bcast with a CUDA buffer, and it is a retired Open MPI version.
I've tested on 2.1.1 and 3.0.0, and both work well.

// OMPI_* is specific to the Open MPI implementation.
// Open MPI v1.10 segfaults in MPI_Bcast with a CUDA buffer.
if (int(OMPI_MAJOR_VERSION) < 2) {
  throw std::runtime_error("Please use MPI 2.x.x and above for distributed.");
}


@facebook-github-bot facebook-github-bot left a comment


ailzhang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@avmgithub
Contributor

@soumith Did PyTorch deprecate the use of OpenMPI version 1? Does that mean that to use OpenMPI with PyTorch, it needs to be version 2? Both RHEL 7.5 and Ubuntu 16.04 still ship with version 1 as standard.

@soumith
Collaborator

soumith commented Aug 7, 2018

It looks like it; from @ailzhang's investigation, we never supported OpenMPI v1 correctly. This just enforces the error checking better.

goodlux pushed a commit to goodlux/pytorch that referenced this pull request Aug 15, 2018
Summary:
This PR fixes pytorch#9418.
Open MPI 1.10 segfaults in MPI_Bcast with a CUDA buffer, and it is a retired Open MPI version.
I've tested on 2.1.1 and 3.0.0, and both work well.
Pull Request resolved: pytorch#10015

Reviewed By: soumith

Differential Revision: D9088103

Pulled By: ailzhang

fbshipit-source-id: fc0a45e5cd016093ef0dbb9f371cbf67170d7045
@ezyang ezyang added the merged label Jun 26, 2019


Development

Successfully merging this pull request may close these issues.

Segmentation Fault using dist.broadcast() with openmpi
