
[C10D] Avoid lazily creating P2P communicators #129147

Closed
wconstab wants to merge 10 commits into gh/wconstab/309/base from gh/wconstab/309/head

Conversation

wconstab (Contributor) commented Jun 20, 2024

Stack from ghstack (oldest at bottom):

Users who opt into eager initialization (enabled by passing device_id
to init_process_group) can now reuse the process group's existing
communicator for send/recv ops, rather than creating a new 2-rank
communicator for every pair of ranks performing send/recv.

Existing users who do not pass device_id to init_process_group will now
see a warning suggesting they do so, but they keep the behavior they
have today: automatic creation of pair-wise communicators.
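The reuse-with-fallback behavior described above can be sketched as follows. This is a minimal Python sketch, not the actual C10D implementation; `comm_cache`, `eager_init`, and `get_comm_for_send_recv` are hypothetical names chosen for illustration.

```python
import warnings

# Hypothetical cache: key -> communicator (a string stands in for a NCCL comm).
comm_cache = {}

def eager_init(dev_key):
    # Eager init (device_id passed to init_process_group) pre-creates the
    # process group's communicator up front, keyed by device.
    comm_cache[dev_key] = f"comm({dev_key})"

def get_comm_for_send_recv(dev_key, rank, peer):
    # Prefer the already-initialized PG-wide communicator, if present.
    if dev_key in comm_cache:
        return comm_cache[dev_key]
    # Otherwise fall back to today's behavior: warn, then lazily create a
    # dedicated 2-rank communicator keyed by the rank pair.
    warnings.warn("consider passing device_id to init_process_group")
    p2p_key = f"{min(rank, peer)}:{max(rank, peer)}"
    return comm_cache.setdefault(p2p_key, f"comm({p2p_key})")
```

With eager init, every send/recv pair resolves to the same per-device communicator; on the lazy path each pair gets its own communicator, plus the warning.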

Fixes #129140

Test plan

I didn't figure out a good way to unit test this change (specifically, to verify that we avoid creating extra communicators on the eager-init path).

In the meantime, I've locally verified that a script issuing a send/recv gets the WARNING about the fallback path, and that if I modify the script either to pass device_id=torch.device(f"cuda:{local_rank}") to init_process_group or to issue an allreduce before the send/recv, the warning does not appear in either case.

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @d4l3k @c-p-i-o @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @penguinwu @tianyu-l @yf225 @chauhang

Differential Revision: D58842474

[ghstack-poisoned]
pytorch-bot (Bot) commented Jun 20, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/129147

Note: Links to docs will display an error until the docs builds have been completed.

❌ 6 New Failures, 3 Unrelated Failures

As of commit 0958986 (merge base could not be retrieved; please contact dev infra):

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot Bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Jun 20, 2024
@wconstab wconstab requested review from H-Huang, chipturner, eqy, fegin, kwen2501 and pavanbalaji and removed request for kwen2501 June 20, 2024 17:43
pavanbalaji (Contributor) left a comment:

LGTM!

@wconstab wconstab added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 20, 2024
wconstab (Contributor, Author):

@wconstab has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@wconstab wconstab requested a review from kwen2501 June 21, 2024 00:48
Comment thread on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (outdated)
@wconstab wconstab requested a review from shuqiangzhang June 21, 2024 23:36
wconstab (Contributor, Author) commented:

@pytorchbot merge

pytorchmergebot (Collaborator) commented:
Merge failed

Reason: Approvers from one of the following sets are needed:

  • Distributed (mrshenli, pritamdamania87, zhaojuanmao, rohan-varma, wanchaol, ...)
  • superuser (pytorch/metamates)
  • Core Reviewers (mruberry, lezcano, Skylion007, ngimel, peterbell10)
  • Core Maintainers (soumith, gchanan, ezyang, dzhulgakov, malfet)
Details for Dev Infra team: raised by workflow job.

Failing merge rule: Core Maintainers

nvcastet commented Jun 24, 2024

> Is the goal for P2P ops to overlap with other communication ops in the same PG?

Correct.

For example the megatron-lm interleaved pipeline schedule will overlap send/receive ops targeting different peers using the same PG.

pavanbalaji (Contributor) commented:

> Is the goal for P2P ops to overlap with other communication ops in the same PG?
>
> Correct.
>
> For example the megatron-lm interleaved pipeline schedule will overlap send/receive ops targeting different peers using the same PG.

You can use the same NCCL communicator (PyTorch PG) but issue different P2P operations on different streams; those won't be serialized.

wconstab (Contributor, Author) commented Jun 25, 2024

I think the issue is that c10d manages the stream used for p2p ops, and it's bundled 1:1 with the NCCL communicator today.

I amended my RFC to account for this: #129140

@nvcastet do you think this amendment would solve your issue? I can try to make a PR to do this if so.

Edit: I updated this PR to attempt to decouple the NCCL comm from the NCCL stream. It might be fairly straightforward to do this, but I need to re-examine it with fresh eyes; I assume I may have missed something.

wconstab added a commit that referenced this pull request Jun 25, 2024
Users that opt-into eager initialization (enabled by passing device_id
to init_process_group) will now be able to take advantage of reusing
the existing communicator for the processgroup for send/recv ops rather
than creating new 2-rank communicators for every pair of ranks
performing send/recv.

Existing users not passing device_id to init_process_group will now get
a warning suggesting they do so, but they will still get the
functionality they have today, automatic creation of pair-wise
communicators.

When reusing an existing communicator, a dedicated nccl stream will
still be used for each pair of P2P ranks so that pair-wise comm ops can
overlap with each other rather than being serialized on a single stream
per PG.

Fixes #129140

ghstack-source-id: 3db38c6
Pull Request resolved: #129147
nvcastet commented:
@pavanbalaji @wconstab
Unfortunately, to overlap 2 NCCL comm ops, you need at least these 2 conditions:

  • Use different NCCL communicators
  • Place ops on different CUDA streams

A NCCL communicator will serialize the ops even if they are put on different streams, because they compete for the communicator's internal resources (internal staging buffers, etc.).
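The two overlap conditions above can be captured in a tiny illustrative model. This is a hypothetical Python sketch of the stated semantics, not NCCL's API; `NcclOp` and `can_overlap` are invented names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NcclOp:
    comm: str    # NCCL communicator the op is issued on
    stream: str  # CUDA stream the op is enqueued on

def can_overlap(a: NcclOp, b: NcclOp) -> bool:
    # Per the discussion: ops sharing a communicator are serialized by the
    # communicator's internal resources, and ops on the same stream are
    # serialized by CUDA stream ordering. Overlap needs both to differ.
    return a.comm != b.comm and a.stream != b.stream
```

In this model, putting two p2p ops on different streams but the same communicator (the approach this PR initially took) still does not yield overlap.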

nvcastet commented:
So to preserve overlap behavior, we would still need to create those p2p communicators in the PG.

The only other option I see (besides the obvious one of shelving this RFE/PR for now) to avoid those extra communicators is an explicit config setting on the process group that disables the creation of p2p communicators, documenting that unbatched p2p ops on this PG will be serialized under that setting.

As a side note, the NCCL team is actively working on reducing communicator init cost, so I would not be surprised to see improvements in upcoming releases.

Review context (torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp):

```cpp
    bool isSendRecvSelf,
    std::optional<const std::string> streamKey) {
    std::optional<const std::string> streamKey,
    bool onlyCached) {
```
Review comment (Member):

Thoughts on having a getOrCreateNCCLComm and then just a getNCCLComm? It's a bit unintuitive that this function does both, and splitting the behavior might be better than adding a bool to an already complicated function signature.
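The suggested split might look roughly like this. This is a hypothetical Python sketch of the shape of the API, not the actual C++ code; `_comms`, `get_nccl_comm`, and `get_or_create_nccl_comm` are illustrative names.

```python
_comms = {}  # hypothetical cache: key -> communicator placeholder

def get_nccl_comm(key):
    # Lookup-only: returns None rather than creating anything
    # (the behavior the `onlyCached` flag would select).
    return _comms.get(key)

def get_or_create_nccl_comm(key):
    # Lookup with lazy creation: the existing default behavior.
    if key not in _comms:
        _comms[key] = f"comm({key})"
    return _comms[key]
```

Splitting the two behaviors makes each call site state its intent in the function name instead of a boolean argument.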


```cpp
// Note on keys
// devKey identifies this gpu device and is used for accessing a nccl
// Communicator for this PG per device p2pKey identifies a pair of ranks doing
```
Review comment (Member):

Missing period/new line between device and p2pKey?

wconstab (Contributor, Author) replied:

Thanks; lintrunner totally hosed me here.
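For illustration, the two kinds of keys described in that comment could be built like this. This is a hypothetical Python sketch; the real code constructs these strings in C++, and the exact formats here are assumptions.

```python
def dev_key(device_index: int) -> str:
    # Identifies this GPU device; used to look up the per-device
    # communicator for the whole process group.
    return str(device_index)

def p2p_key(rank: int, peer: int) -> str:
    # Identifies a pair of ranks doing send/recv. Order-independent so
    # both sides of the exchange compute the same key.
    low, high = sorted((rank, peer))
    return f"{low}:{high}"
```

Because the pair key is normalized to low:high, sender and receiver agree on which cached communicator (and stream) to use.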

```cpp
auto ncclStream = ncclStreams_.at(p2pKey);
// First let NCCL streams wait for input tensors allocation streams
syncStream(device, ncclEvents_[key], ncclStream);
syncStream(device, ncclEvents_[p2pKey], ncclStream);
```
Review comment (Member):

In the old logic this is conditionally the p2pKey or the devKey -- is it intentional to always use the p2p key now?

wconstab (Contributor, Author) replied:

Yes, it is intentional to always use the p2pKey for the stream, based on the (wrong) assumption that using the same comm but a different stream would allow overlap between p2p ops involving different peers.

But I suspect I missed something; I probably should have kept this as devKey for batched-p2p ops and only used p2pKey for true p2p ops.
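The correction the author describes (devKey for batched p2p, p2pKey only for true p2p) could be sketched as follows. `stream_key` is a hypothetical helper, assuming the string key formats used elsewhere in this discussion.

```python
def stream_key(device_index: int, rank: int, peer: int, batched: bool) -> str:
    # Batched p2p ops run on the per-device PG stream, so the whole batch
    # is ordered as a group; a lone send/recv gets a per-pair stream so
    # different pairs can overlap (given they also use different comms).
    if batched:
        return str(device_index)
    low, high = sorted((rank, peer))
    return f"{low}:{high}"
```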


pavanbalaji commented Jul 22, 2024

> @pavanbalaji @wconstab Unfortunately, to overlap 2 NCCL comm ops, you need at least those 2 conditions:
>
> • Use different NCCL communicators
> • Place ops on different CUDA streams
>
> NCCL communicator will serialize the ops even if they are put on different streams (because they compete for the NCCL communicator internal resources: internal staging buffers etc...)

Hi @nvcastet - we should discuss this. It's not clear why NCCL needs to serialize point-to-point operations on the same communicator. I understand that collective operations need to be serialized, but p2p operations should be independent of each other. NCCL should be able to handle internal resources correctly in such cases. Is there a technical reason for this restriction or is it just an artifact of the current implementation? If it's an artifact of the current implementation, PyTorch shouldn't be working around that. We should fix it in NCCL.

nvcastet commented Jul 22, 2024

@pavanbalaji

> It's not clear why NCCL needs to serialize point-to-point operations on the same communicator.

A NCCL communicator will serialize ungrouped ops because they share internal resources (net buffers, etc.).
For the megatron-lm use case mentioned early on, we don't group p2p ops, in order to get finer-grained overlapping.

pavanbalaji (Contributor) commented:

> > It's not clear why NCCL needs to serialize point-to-point operations on the same communicator.
>
> NCCL communicator will serialize ungrouped ops because they share internal resources (net buffers etc...). For the megatron-lm use case mentioned early on we don't group p2p ops to get finer overlapping.

Hi @nvcastet - This seems overly restrictive and is different from what other communication libraries (such as MPI) provide. Creating a new communicator for every point-to-point pair we need to talk to is very expensive with respect to the number of resources used (and to performance in some cases).

nvcastet commented:
You only need to create a new communicator for pt-to-pt if you are going to overlap it with another NCCL op.
That is the current semantics of the NCCL library, which is what we need to look at for this PR.
I would encourage you to move this discussion to the NCCL repo by opening a discussion/RFE there so that the NCCL engineers can scope your proposal.

github-actions (Bot) commented:

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions Bot added the Stale label Sep 24, 2024
@github-actions github-actions Bot closed this Oct 24, 2024
@kwen2501 kwen2501 added no-stale and removed Stale labels Oct 25, 2024
kwen2501 commented Oct 25, 2024

Hi @nvcastet, thanks for your comments. I'd like to follow up a bit.

  1. Does megatron use eager init (i.e. passing a device to the device_id of init_process_group) or lazy init?

For lazy init, we can keep the dedicated P2P comms; they will be required anyway, because we cannot assume the "whole" comm is ready at the time P2P is called. For eager init, since we know the whole comm is ready, we'd like to use it for P2P.

If megatron has been relying on lazy init (the traditional option), this change will not pose a perf regression for megatron. Does that make sense?

  2. Re relaxing the serialization in NCCL

I can understand why the serialization is needed, as you mentioned, intermediate buffers are not easy to schedule for sharing. Luckily, I think some recent NCCL advances may help to relax this serialization, in particular for P2P. Let's say zero-copy is enabled for P2P, be it network-based zero copy or GPU-GPU zero copy, these ops themselves will not need intermediate buffers, because data is directly fetched from user buffers. In this case, it would seem possible to allow multiple P2P ops to run on parallel streams? It seems to me that this may be even easier for network-based P2P because it may not even need to launch SMs in this case.

Cc: @wconstab @eqy @pavanbalaji

nvcastet commented Oct 30, 2024

> 1. Does megatron use eager init (i.e. passing a device to the device_id of init_process_group) or lazy init?

They used to have just lazy init but migrated to eager init to leverage the NCCL comm split feature.

> 2. Re relaxing the serialization in NCCL
>
> I can understand why the serialization is needed, as you mentioned, intermediate buffers are not easy to schedule for sharing. Luckily, I think some recent NCCL advances may help to relax this serialization, in particular for P2P. Let's say zero-copy is enabled for P2P, be it network-based zero copy or GPU-GPU zero copy, these ops themselves will not need intermediate buffers, because data is directly fetched from user buffers. In this case, it would seem possible to allow multiple P2P ops to run on parallel streams? It seems to me that this may be even easier for network-based P2P because it may not even need to launch SMs in this case.

Agreed: if NCCL removes the serialization it performs for p2p ops sharing the same communicator, that would solve the issue, and that would be great. For zero-copy to be beneficial while avoiding constant registrations, we would need pointer stability between iterations or the CUDA graph feature, right?


Labels

ciflow/trunk Trigger trunk jobs on your pull request no-stale oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category
