
[CudaIpc 3/3]: p2p get-Zcopy #3911

Merged
samnordmann merged 17 commits into ipc_handle_infra from cuda_p2p_gzcpy
Mar 12, 2025

Conversation

@samnordmann (Collaborator) commented Feb 17, 2025

github-actions bot commented Feb 17, 2025

Review updated until commit 289a713

Description

  • Added CUDA P2P get-Zcopy support

  • Removed Gloo support

  • Updated communicator backend to include CUDA

  • Added tests for CUDA P2P communication


Changes walkthrough 📝

Relevant files

Enhancement (6 files)

  • executor.cpp — Added CUDA P2P get-Zcopy handling (+43/-11)
  • communicator.cpp — Removed Gloo backend and added CUDA (+2/-13)
  • cuda_p2p.cpp — Implemented CUDA P2P get-Zcopy functions (+70/-0)
  • communicator.h — Updated communicator backend enum (+0/-3)
  • cuda_p2p.h — Added CUDA P2P get-Zcopy function declarations (+22/-0)
  • multidevice.h — Updated communicator backend enum (+3/-0)

Tests (1 file)

  • test_multidevice_communications.cpp — Added CUDA P2P communication test (+71/-0)

Configuration changes (3 files)

  • .gitmodules — Removed Gloo submodule (+0/-3)
  • CMakeLists.txt — Added CUDA P2P source file and removed Gloo include (+1/-1)
  • gloo — Removed Gloo submodule (+0/-1)

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Error Handling

The error handling for `communication->type()` in the `handle` method for `P2PCommunication` could be improved. Currently, it only checks for RECV when the backend is not kCuda. More comprehensive error handling covering the other communication types and backends would be beneficial.

```cpp
NVF_ERROR(
    communication->type() == P2PCommunicationType::RECV,
    "Wrong communication type");
works_[communication] = postSingleCommunication(
```
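A minimal sketch of the more exhaustive dispatch suggested above, switching over both the backend and the communication type so every unsupported pair is rejected explicitly. The enum values and the `dispatch` helper are illustrative stand-ins, not the actual nvFuser API:

```cpp
#include <stdexcept>
#include <string>

// Illustrative stand-ins for the real enums.
enum class CommunicatorBackend { kCuda, kNccl };
enum class P2PCommunicationType { SEND, RECV };

// Rejects every unsupported (backend, type) pair explicitly instead of
// only asserting RECV on the non-CUDA path.
std::string dispatch(CommunicatorBackend backend, P2PCommunicationType type) {
  switch (backend) {
    case CommunicatorBackend::kCuda:
      // Hypothetical: the CUDA path handles both directions.
      return type == P2PCommunicationType::SEND ? "cuda-send" : "cuda-recv";
    case CommunicatorBackend::kNccl:
      if (type != P2PCommunicationType::RECV) {
        throw std::invalid_argument(
            "only RECV is posted directly on this backend");
      }
      return "nccl-recv";
  }
  throw std::logic_error("unhandled backend");
}
```

An explicit switch like this turns a silently-unchecked combination into a loud failure at the call site.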
Backend Support

The removal of Gloo support might have implications for users who rely on it. Ensure that there is a clear migration path or alternative solution for users who were using Gloo.

```cpp
    auto pg_opts = c10::make_intrusive<::c10d::ProcessGroupNCCL::Options>();
    return c10::make_intrusive<::c10d::ProcessGroupNCCL>(
        store, rank, size, pg_opts);
  }
#endif

#if defined(USE_C10D_UCC) && defined(NVFUSER_BUILD_WITH_UCC)
  if (backend == CommunicatorBackend::kUcc) {
    constexpr auto timeout = std::chrono::milliseconds(30 * 60 * 1000);
    return c10d::ProcessGroupUCC::createProcessGroupUCC(
```
Semaphore Usage

The use of semaphores in recvPost, sendPost, and sendWait functions should be validated for correctness and performance. Ensure that the semaphore operations are correctly synchronized and do not introduce unnecessary overhead.

```cpp
void recvPost(const P2pIpcHandle& ipc_handles, int64_t count, CUstream stream) {
  // wait for sender to be ready
  NVFUSER_CUDA_SAFE_CALL(cuStreamWaitValue32(
      stream,
      reinterpret_cast<CUdeviceptr>(ipc_handles.local().semaphore()),
      (cuuint32_t)(IpcSemaphore::kInUse),
      CU_STREAM_WAIT_VALUE_EQ));
  // RDMA get the data from the sender
  NVFUSER_CUDA_RT_SAFE_CALL(cudaMemcpyAsync(
      ipc_handles.local().ptr(),
      ipc_handles.peer().ptr(),
      count,
      cudaMemcpyDeviceToDevice,
      stream));
  // Signals completion to self
  NVFUSER_CUDA_SAFE_CALL(cuStreamWriteValue32(
      stream,
      reinterpret_cast<CUdeviceptr>(ipc_handles.local().semaphore()),
      (cuuint32_t)(IpcSemaphore::kReady),
      CU_STREAM_WRITE_VALUE_DEFAULT));
  // Signals completion to sender
  NVFUSER_CUDA_SAFE_CALL(cuStreamWriteValue32(
      stream,
      reinterpret_cast<CUdeviceptr>(ipc_handles.peer().semaphore()),
      (cuuint32_t)(IpcSemaphore::kReady),
      CU_STREAM_WRITE_VALUE_DEFAULT));
}

void sendPost(const P2pIpcHandle& ipc_handles, CUstream stream) {
  // signal to self that transfer is in progress
  NVFUSER_CUDA_SAFE_CALL(cuStreamWriteValue32(
      stream,
      reinterpret_cast<CUdeviceptr>(ipc_handles.local().semaphore()),
      (cuuint32_t)(IpcSemaphore::kInUse),
      CU_STREAM_WRITE_VALUE_DEFAULT));
  // signal to receiver that the buffer is ready
  NVFUSER_CUDA_SAFE_CALL(cuStreamWriteValue32(
      stream,
      reinterpret_cast<CUdeviceptr>(ipc_handles.peer().semaphore()),
      (cuuint32_t)(IpcSemaphore::kInUse),
      CU_STREAM_WRITE_VALUE_DEFAULT)); // passing
                                       // CU_STREAM_WRITE_VALUE_NO_MEMORY_BARRIER
                                       // gives an error
}

void sendWait(const P2pIpcHandle& ipc_handles, CUstream stream) {
  NVFUSER_CUDA_SAFE_CALL(cuStreamWaitValue32(
      stream,
      reinterpret_cast<CUdeviceptr>(ipc_handles.local().semaphore()),
      (cuuint32_t)(IpcSemaphore::kReady),
      CU_STREAM_WAIT_VALUE_EQ));
}
```

@samnordmann changed the title from "Cuda p2p gzcpy" to "[CudaIpc 3/3]: p2p get-Zcopy" on Feb 17, 2025
samnordmann added a commit that referenced this pull request Feb 21, 2025
This PR is a small self-contained part belonging to the larger PR
- #3911

# What 

- Add the backend type as an argument to P2PCommunication*
@samnordmann (Collaborator, Author):

!test

@wujingyue (Collaborator) left a comment

Thanks for the PR! Again, I'm pretty sure this PR implements a valid solution and is a strict improvement. Many of my questions are about what the alternatives are and why certain solutions were preferred.

Comment on lines +16 to +18
```cpp
void RecvPost(const P2pIpcHandle& ipc_handles, int64_t count, CUstream stream);
void SendPost(const P2pIpcHandle& ipc_handles, CUstream stream);
void SendWait(const P2pIpcHandle& ipc_handles, CUstream stream);
```
Collaborator:

These can probably become methods of P2pIpcHandle so you can hide implementation details like local, peer, and semaphore.

Collaborator Author:

I would prefer to leave it separated here. The logic is that ipc_handle sets up the data structure (allocation, exporting/importing the semaphore) on the control path, while these functions implement a runtime primitive on the data path. We will later write other p2p and collective algorithms (e.g. put-Zcopy), in addition to compute/comms kernels, and they will all rely on the common ipc_handle data structure.

local, peer, and semaphore are roughly the minimal set of public methods of ipc_handle that allows the implementation of many different communication patterns.

```cpp
// wait for sender to be ready
NVFUSER_CUDA_SAFE_CALL(cuStreamWaitValue32(
    stream,
    reinterpret_cast<CUdeviceptr>(ipc_handles.local().semaphore()),
```
Collaborator:

Why two semaphores (local and peer)? AFAICT, a semaphore is shared between sender and receiver so both devices see the same value. Apparently, one semaphore per P2pIpcHandle indicating NOT_READY, READY, or COMPLETE ought to be enough?

Collaborator Author:

You are right that we could use only one semaphore. The question would then be where to allocate it (on the sender or receiver GPU).

The idea behind using two semaphores here is to make sure that cuStreamWaitValue32 always polls a local buffer. That is considered good practice -- we always want the listener to poll a local buffer rather than a remote one, to avoid too many network transactions. That is why the semaphore is duplicated, one on the receiver and one on the sender GPU. We pay the cost of duplicating the cuStreamWriteValue32 calls to update both semaphores each time, but that's a very minor drawback.

I have not run any performance benchmark on this, though.
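The two-semaphore handshake can be sketched on the host; below is a minimal simulation, assuming `std::atomic` stands in for the device semaphores written by cuStreamWriteValue32 and polled by cuStreamWaitValue32 (the `Peer` type, threading model, and vector copy are all illustrative, not the PR's actual code):

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Semaphore states, mirroring IpcSemaphore in the PR.
enum : unsigned { kReady = 0, kInUse = 1 };

// Each rank owns one semaphore and one buffer; each side polls ONLY its
// own (local) semaphore, which is the design rationale discussed above.
struct Peer {
  std::atomic<unsigned> semaphore{kReady};
  std::vector<int> buffer;
};

void sendPost(Peer& self, Peer& peer) {
  self.semaphore.store(kInUse);  // signal to self that transfer is in progress
  peer.semaphore.store(kInUse);  // signal to receiver that the buffer is ready
}

void recvPost(Peer& self, Peer& peer) {
  while (self.semaphore.load() != kInUse) {}  // wait for sender (local poll)
  self.buffer = peer.buffer;                  // stands in for cudaMemcpyAsync
  self.semaphore.store(kReady);               // reset own semaphore for reuse
  peer.semaphore.store(kReady);               // signal completion to sender
}

void sendWait(Peer& self) {
  while (self.semaphore.load() != kReady) {}  // again, a local poll only
}
```

Because recvPost writes kReady to both sides after the copy, waiting on the local semaphore is equivalent to waiting on the peer's, which matches the "both semaphores need to have the same value" observation in this thread.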

```cpp
    count,
    cudaMemcpyDeviceToDevice,
    stream));
// Signals completion to self
```
Collaborator:

Can you explain why this is needed in addition to the other completion signal below?

Collaborator Author:

This step is needed to reset the semaphore for future reuse.

see also #3911 (comment)

Collaborator:

Got it -- both semaphores need to have the same value. This way, waiting for local is equivalent to waiting for peer.

@samnordmann requested a review from wujingyue March 6, 2025 13:13
```cpp
    recv_peer_val,
    CommunicatorBackend::kCuda);
std::vector<P2PCommunication*> grouped_communications = {send, recv};
auto share_mem_handles = IrBuilder::create<hir::ShareMemHandles>(
```
Collaborator:

Note to myself

```cpp
      CU_STREAM_WRITE_VALUE_DEFAULT));
}

void sendPost(const P2pIpcHandle& ipc_handles, CUstream stream) {
```
@wujingyue (Collaborator) commented Mar 10, 2025:

I couldn't map this implementation to this slide, so can you clarify? I was expecting the first step of sendPost to write kReady.

@samnordmann (Collaborator, Author) commented Mar 10, 2025:

Your link sends me to https://dlrequest/GroupID/Home/Index

Collaborator Author:

> I was expecting the first step of sendPost to write kReady?

The first step is indeed to write to the semaphore. It writes kInUse to signal it is ready-to-receive, while kReady is the default semaphore state before the p2p starts. Are you asking why the name "kInUse" and not another name?

Collaborator:

Fixed the link -- the short link generated by nv/ had a . in it, confusing GitHub's markdown renderer.

Collaborator:

> the first step is indeed to write to the semaphore. It writes kInUse, to signal it is ready-to-receive, while kReady signals the default semaphore state before the p2p starts. Are you asking why the name "kInUse" and not another naming?

Yes. That answered my question. I wasn't sure about the difference between "ready-to-receive" in the text and kReady in the code. Will read the code again based on the new understanding...

@samnordmann requested a review from wujingyue March 10, 2025 12:26

```cmake
target_compile_definitions(codegen_internal PRIVATE "-DTORCH_CUDA_BUILD_MAIN_LIB")
target_include_directories(codegen_internal SYSTEM PUBLIC
    ${CMAKE_SOURCE_DIR}/third_party/gloo # TODO: guard this on usage
```

@samnordmann merged commit 677b84c into ipc_handle_infra Mar 12, 2025
5 of 9 checks passed
@samnordmann deleted the cuda_p2p_gzcpy branch March 12, 2025 22:40
samnordmann added a commit that referenced this pull request Apr 14, 2025
On top of 
- #3909

prerequisite to:
- #3911

# What
- Set up the infrastructure needed for ipc handle exchange and caching
- Add an `Expr` node `hir::ShareMemHandles` to represent this op. We
cannot embed the op in the Send/Recv semantics because we need to group
the handle exchange between matching sends and recv to avoid deadlocks

# How
Most of the implementation is in `multidevice/ipc_handle.cpp`
- Define the class `IpcHandle` representing the ipc handle that is
exchanged. This class is supplemented with a semaphore, which is a local
cuda buffer allocated on the exporter's device.
- Define `IpcHandleCache`, which handles exchanging and caching the ipc
handles. Caching is keyed on a combination of runtime and symbolic
ingredients: `(runtime peer, at::Tensor, Expr*)`. This caching allows an
arbitrary number of p2p comms between pairs of ranks.
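A minimal sketch of that caching scheme, assuming illustrative stand-in types: `Expr`, `IpcHandle`, `getOrExchange`, and the placeholder exchange step are hypothetical, not the actual classes in `multidevice/ipc_handle.cpp`:

```cpp
#include <map>
#include <tuple>

struct Expr {};           // stands in for the symbolic communication expr
struct IpcHandle {        // stands in for the exchanged handle + semaphore
  int fd = -1;
};

// Key mirrors the description above: (runtime peer, tensor, Expr*). The
// tensor is represented here by its data pointer.
using CacheKey = std::tuple<long, const void*, const Expr*>;

class IpcHandleCache {
 public:
  // Returns the cached handle, performing the exchange only on first use.
  IpcHandle& getOrExchange(long peer, const void* data, const Expr* e) {
    auto [it, inserted] = cache_.try_emplace(CacheKey{peer, data, e});
    if (inserted) {
      it->second = exchangeHandle(peer);  // one-time control-path cost
    }
    return it->second;
  }

 private:
  IpcHandle exchangeHandle(long peer) {
    return IpcHandle{static_cast<int>(peer)};  // placeholder exchange
  }
  std::map<CacheKey, IpcHandle> cache_;
};
```

Keying on all three ingredients is what lets an arbitrary number of p2p communications coexist between the same pair of ranks: two sends over different tensors, or over the same tensor from different expressions, get distinct cache entries.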