Contributor

@awan-10 awan-10 commented May 27, 2022

This PR introduces the DeepSpeed Comm. Backend (v1).

Current advanced communication schemes rely on mixing Python-level communication packages (e.g. torch.distributed, plus mpi4py for 1-bit Adam). To simplify comms prototypes, we're looking to add support for custom communication backends within DeepSpeed, each built directly on top of its underlying library (e.g. NCCL, MPI).

This PR completes the first phase towards this goal by introducing:

  • The new comms interface deepspeed.comms
  • A complete wrapper around torch.distributed called TorchBackend for backwards-compatibility
  • A rough skeleton for custom backends that we can use for phase 2
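
As a rough illustration of the layering described in the list above (this is not the code in this PR), a backend interface with module-level dispatch might look like the sketch below. Every name here (`Backend`, `SingleProcessBackend`, `init_backend`, `all_reduce`) is hypothetical and chosen only for the example. A torch.distributed wrapper would implement the same methods by delegating to the corresponding torch.distributed calls, which this sketch deliberately does not do.

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """Hypothetical base class: the collectives every comm backend provides."""

    def __init__(self, name, rank=0, world_size=1):
        self.name = name
        self.rank = rank
        self.world_size = world_size

    @abstractmethod
    def all_reduce(self, values):
        """Return the element-wise sum of `values` across all ranks."""

    @abstractmethod
    def broadcast(self, values, src=0):
        """Return the copy of `values` held by rank `src`."""


class SingleProcessBackend(Backend):
    """Stand-in backend: with a single rank, collectives are identity ops."""

    def __init__(self):
        super().__init__(name="single", rank=0, world_size=1)

    def all_reduce(self, values):
        # Summing over one rank leaves the values unchanged.
        return list(values)

    def broadcast(self, values, src=0):
        if src != 0:
            raise ValueError("only rank 0 exists in a single-process backend")
        return list(values)


# Module-level handle, selected once at initialization (an assumed design).
_backend = None


def init_backend(backend):
    """Install `backend` as the target of the module-level collectives."""
    global _backend
    _backend = backend


def all_reduce(values):
    """Dispatch to whichever backend was installed by init_backend()."""
    if _backend is None:
        raise RuntimeError("call init_backend() before using collectives")
    return _backend.all_reduce(values)
```

With this shape, calling code would run `init_backend(SingleProcessBackend())` once and then use `all_reduce(...)` without knowing which backend is active. Swapping in a custom backend then only means providing another `Backend` subclass.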

Co-authored-by: Quentin Anthony qganthony@yahoo.com
Co-authored-by: Ammar Ahmad Awan ammar.awan@microsoft.com
Co-authored-by: Jeff Rasley jerasley@microsoft.com

Contributor Author

@awan-10 awan-10 left a comment

I reviewed this PR with Quentin. He will take care of minor comments. @jeffra, please review this one.

Collaborator

jeffra commented May 31, 2022

@SeanNaren can you take a look at the Lightning test failure? We're seeing it on other PRs and on master right now as well. It appears to be a protobuf issue. Have you seen this on your side before?

@Quentin-Anthony
Contributor

@jeffra and @tjruwase -- Any further comments?

@deepspeedai deepspeedai deleted a comment from rocm-mici Jun 9, 2022