
[CudaIpc 3/3]: p2p get-Zcopy #3911

Merged
samnordmann merged 17 commits into ipc_handle_infra from cuda_p2p_gzcpy
Mar 12, 2025

Conversation

@samnordmann (Collaborator) commented Feb 17, 2025

github-actions bot commented Feb 17, 2025

Review updated until commit 289a713

Description

  • Added CUDA P2P get-Zcopy support

  • Removed Gloo support

  • Updated communicator backend to include CUDA

  • Added tests for CUDA P2P communication


Changes walkthrough 📝

Relevant files

Enhancement (6 files)

  • executor.cpp — Added CUDA P2P get-Zcopy handling (+43/-11)
  • communicator.cpp — Removed Gloo backend and added CUDA (+2/-13)
  • cuda_p2p.cpp — Implemented CUDA P2P get-Zcopy functions (+70/-0)
  • communicator.h — Updated communicator backend enum (+0/-3)
  • cuda_p2p.h — Added CUDA P2P get-Zcopy function declarations (+22/-0)
  • multidevice.h — Updated communicator backend enum (+3/-0)

Tests (1 file)

  • test_multidevice_communications.cpp — Added CUDA P2P communication test (+71/-0)

Configuration changes (3 files)

  • .gitmodules — Removed Gloo submodule (+0/-3)
  • CMakeLists.txt — Added CUDA P2P source file and removed Gloo include (+1/-1)
  • gloo — Removed Gloo submodule (+0/-1)

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Error Handling

The error handling for `communication->type()` in the `handle` method for `P2PCommunication` could be improved. Currently, it only checks for RECV when the backend is not kCuda. More comprehensive error handling covering the other communication types and backends would be beneficial.

```cpp
NVF_ERROR(
    communication->type() == P2PCommunicationType::RECV,
    "Wrong communication type");
works_[communication] = postSingleCommunication(
```
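A minimal sketch of the more exhaustive dispatch suggested above, switching over both the backend and the communication type so every unsupported pair is rejected explicitly. The enum values and the `dispatch` helper are illustrative stand-ins, not the actual nvFuser API:

```cpp
#include <stdexcept>
#include <string>

// Illustrative stand-ins for the real enums.
enum class CommunicatorBackend { kCuda, kNccl };
enum class P2PCommunicationType { SEND, RECV };

// Rejects every unsupported (backend, type) pair explicitly instead of
// only asserting RECV on the non-CUDA path.
std::string dispatch(CommunicatorBackend backend, P2PCommunicationType type) {
  switch (backend) {
    case CommunicatorBackend::kCuda:
      // Hypothetical: the CUDA path handles both directions.
      return type == P2PCommunicationType::SEND ? "cuda-send" : "cuda-recv";
    case CommunicatorBackend::kNccl:
      if (type != P2PCommunicationType::RECV) {
        throw std::invalid_argument(
            "only RECV is posted directly on this backend");
      }
      return "nccl-recv";
  }
  throw std::logic_error("unhandled backend");
}
```

An explicit switch like this turns a silently-unchecked combination into a loud failure at the call site.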
Backend Support

The removal of Gloo support might have implications for users who rely on it. Ensure that there is a clear migration path or alternative solution for users who were using Gloo.

```cpp
    auto pg_opts = c10::make_intrusive<::c10d::ProcessGroupNCCL::Options>();
    return c10::make_intrusive<::c10d::ProcessGroupNCCL>(
        store, rank, size, pg_opts);
  }
#endif

#if defined(USE_C10D_UCC) && defined(NVFUSER_BUILD_WITH_UCC)
  if (backend == CommunicatorBackend::kUcc) {
    constexpr auto timeout = std::chrono::milliseconds(30 * 60 * 1000);
    return c10d::ProcessGroupUCC::createProcessGroupUCC(
```
Semaphore Usage

The use of semaphores in recvPost, sendPost, and sendWait functions should be validated for correctness and performance. Ensure that the semaphore operations are correctly synchronized and do not introduce unnecessary overhead.

```cpp
void recvPost(const P2pIpcHandle& ipc_handles, int64_t count, CUstream stream) {
  // wait for sender to be ready
  NVFUSER_CUDA_SAFE_CALL(cuStreamWaitValue32(
      stream,
      reinterpret_cast<CUdeviceptr>(ipc_handles.local().semaphore()),
      (cuuint32_t)(IpcSemaphore::kInUse),
      CU_STREAM_WAIT_VALUE_EQ));
  // RDMA get the data from the sender
  NVFUSER_CUDA_RT_SAFE_CALL(cudaMemcpyAsync(
      ipc_handles.local().ptr(),
      ipc_handles.peer().ptr(),
      count,
      cudaMemcpyDeviceToDevice,
      stream));
  // Signals completion to self
  NVFUSER_CUDA_SAFE_CALL(cuStreamWriteValue32(
      stream,
      reinterpret_cast<CUdeviceptr>(ipc_handles.local().semaphore()),
      (cuuint32_t)(IpcSemaphore::kReady),
      CU_STREAM_WRITE_VALUE_DEFAULT));
  // Signals completion to sender
  NVFUSER_CUDA_SAFE_CALL(cuStreamWriteValue32(
      stream,
      reinterpret_cast<CUdeviceptr>(ipc_handles.peer().semaphore()),
      (cuuint32_t)(IpcSemaphore::kReady),
      CU_STREAM_WRITE_VALUE_DEFAULT));
}

void sendPost(const P2pIpcHandle& ipc_handles, CUstream stream) {
  // signal to self that transfer is in progress
  NVFUSER_CUDA_SAFE_CALL(cuStreamWriteValue32(
      stream,
      reinterpret_cast<CUdeviceptr>(ipc_handles.local().semaphore()),
      (cuuint32_t)(IpcSemaphore::kInUse),
      CU_STREAM_WRITE_VALUE_DEFAULT));
  // signal to receiver that the buffer is ready
  NVFUSER_CUDA_SAFE_CALL(cuStreamWriteValue32(
      stream,
      reinterpret_cast<CUdeviceptr>(ipc_handles.peer().semaphore()),
      (cuuint32_t)(IpcSemaphore::kInUse),
      CU_STREAM_WRITE_VALUE_DEFAULT)); // passing
                                       // CU_STREAM_WRITE_VALUE_NO_MEMORY_BARRIER
                                       // gives an error
}

void sendWait(const P2pIpcHandle& ipc_handles, CUstream stream) {
  NVFUSER_CUDA_SAFE_CALL(cuStreamWaitValue32(
      stream,
      reinterpret_cast<CUdeviceptr>(ipc_handles.local().semaphore()),
      (cuuint32_t)(IpcSemaphore::kReady),
      CU_STREAM_WAIT_VALUE_EQ));
}
```

@samnordmann changed the title from "Cuda p2p gzcpy" to "[CudaIpc 3/3]: p2p get-Zcopy" on Feb 17, 2025
samnordmann added a commit that referenced this pull request Feb 21, 2025
This PR is a small self-contained part belonging to the larger PR
- #3911

# What 

- Add the backend type as an argument to P2PCommunication*
@samnordmann (Collaborator, Author):

!test

@wujingyue (Collaborator) left a comment

Thanks for the PR! Again, I'm pretty sure this PR implements a valid solution and is a strict improvement. Many of my questions are about what the alternatives are and why certain solutions were preferred.

Comment on lines +16 to +18
```cpp
void RecvPost(const P2pIpcHandle& ipc_handles, int64_t count, CUstream stream);
void SendPost(const P2pIpcHandle& ipc_handles, CUstream stream);
void SendWait(const P2pIpcHandle& ipc_handles, CUstream stream);
```
Collaborator:

These can probably become methods of P2pIpcHandle so you can hide implementation details like local, peer, and semaphore.

Collaborator Author:

I would prefer to leave it separated here. The logic is that ipc_handle sets up the data structure (allocation, exporting/importing the semaphore) on the control path, while these functions implement a runtime primitive on the data path. We will later write other p2p and collective algorithms (e.g. put-Zcopy), in addition to compute/comms kernels, and they will all rely on the common ipc_handle data structure.

local, peer, and semaphore are roughly the minimal set of public methods of ipc_handle that allows the implementation of many different communication patterns.

```cpp
// wait for sender to be ready
NVFUSER_CUDA_SAFE_CALL(cuStreamWaitValue32(
    stream,
    reinterpret_cast<CUdeviceptr>(ipc_handles.local().semaphore()),
```
Collaborator:

Why two semaphores (local and peer)? AFAICT, a semaphore is shared between sender and receiver so both devices see the same value. Apparently, one semaphore per P2pIpcHandle indicating NOT_READY, READY, or COMPLETE ought to be enough?

Collaborator Author:

You are right that we could use only one semaphore. The question would then be where to allocate it (on the sender or receiver GPU).

The idea behind using two semaphores here is to make sure that cuStreamWaitValue32 always polls a local buffer. That is considered good practice -- we always want the listener to poll a local buffer rather than a remote one, to avoid too many network transactions. That is why the semaphore is duplicated, one on the receiver and one on the sender GPU. We pay the cost of duplicating the cuStreamWriteValue32 calls to update both semaphores each time, but that's a very minor drawback.

I have not run any performance benchmark on this, though.
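The two-semaphore handshake can be sketched on the host; below is a minimal simulation, assuming `std::atomic` stands in for the device semaphores written by cuStreamWriteValue32 and polled by cuStreamWaitValue32 (the `Peer` type, threading model, and vector copy are all illustrative, not the PR's actual code):

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Semaphore states, mirroring IpcSemaphore in the PR.
enum : unsigned { kReady = 0, kInUse = 1 };

// Each rank owns one semaphore and one buffer; each side polls ONLY its
// own (local) semaphore, which is the design rationale discussed above.
struct Peer {
  std::atomic<unsigned> semaphore{kReady};
  std::vector<int> buffer;
};

void sendPost(Peer& self, Peer& peer) {
  self.semaphore.store(kInUse);  // signal to self that transfer is in progress
  peer.semaphore.store(kInUse);  // signal to receiver that the buffer is ready
}

void recvPost(Peer& self, Peer& peer) {
  while (self.semaphore.load() != kInUse) {}  // wait for sender (local poll)
  self.buffer = peer.buffer;                  // stands in for cudaMemcpyAsync
  self.semaphore.store(kReady);               // reset own semaphore for reuse
  peer.semaphore.store(kReady);               // signal completion to sender
}

void sendWait(Peer& self) {
  while (self.semaphore.load() != kReady) {}  // again, a local poll only
}
```

Because recvPost writes kReady to both sides after the copy, waiting on the local semaphore is equivalent to waiting on the peer's, which matches the "both semaphores need to have the same value" observation in this thread.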

```cpp
    count,
    cudaMemcpyDeviceToDevice,
    stream));
// Signals completion to self
```
Collaborator:

Can you explain why this is needed in addition to the other completion signal below?

Collaborator Author:

This step is needed to reset the semaphore for future reuse.

see also #3911 (comment)

Collaborator:

Got it -- both semaphores need to have the same value. This way, waiting for local is equivalent to waiting for peer.

@samnordmann requested a review from wujingyue March 6, 2025 13:13
```cpp
    recv_peer_val,
    CommunicatorBackend::kCuda);
std::vector<P2PCommunication*> grouped_communications = {send, recv};
auto share_mem_handles = IrBuilder::create<hir::ShareMemHandles>(
```
Collaborator:

Note to myself

```cpp
      CU_STREAM_WRITE_VALUE_DEFAULT));
}

void sendPost(const P2pIpcHandle& ipc_handles, CUstream stream) {
```
@wujingyue (Collaborator) commented Mar 10, 2025:

I couldn't map this implementation to this slide, so can you clarify? I was expecting the first step of sendPost to write kReady.

@samnordmann (Collaborator, Author) commented Mar 10, 2025:

Your link sends me to https://dlrequest/GroupID/Home/Index

Collaborator Author:

> I was expecting the first step of sendPost to write kReady?

The first step is indeed to write to the semaphore. It writes kInUse to signal it is ready-to-receive, while kReady is the default semaphore state before the p2p starts. Are you asking why the name "kInUse" and not another name?

Collaborator:

Fixed the link -- the short link generated by nv/ had a . in it, confusing GitHub's markdown renderer.

Collaborator:

> the first step is indeed to write to the semaphore. It writes kInUse, to signal it is ready-to-receive, while kReady signals the default semaphore state before the p2p starts. Are you asking why the name "kInUse" and not another naming?

Yes. That answered my question. I wasn't sure about the difference between "ready-to-receive" in the text and kReady in the code. Will read the code again based on the new understanding...

@samnordmann requested a review from wujingyue March 10, 2025 12:26

```cmake
target_compile_definitions(codegen_internal PRIVATE "-DTORCH_CUDA_BUILD_MAIN_LIB")
target_include_directories(codegen_internal SYSTEM PUBLIC
    ${CMAKE_SOURCE_DIR}/third_party/gloo # TODO: guard this on usage
```

@samnordmann merged commit 677b84c into ipc_handle_infra Mar 12, 2025
5 of 9 checks passed
@samnordmann deleted the cuda_p2p_gzcpy branch March 12, 2025 22:40
samnordmann added a commit that referenced this pull request Apr 14, 2025
On top of 
- #3909

prerequisite to:
- #3911

# What
- Set up the infrastructure needed for ipc handle exchange and caching
- Add an `Expr` node `hir::ShareMemHandles` to represent this op. We
cannot embed the op in the Send/Recv semantics because we need to group
the handle exchange between matching sends and recv to avoid deadlocks

# How
Most of the implementation is in `multidevice/ipc_handle.cpp`
- Define the class `IpcHandle` representing the ipc handle that is
exchanged. This class is supplemented with a semaphore, which is a local
cuda buffer allocated on the exporter's device.
- Define `IpcHandleCache`, which handles exchanging and caching the ipc
handles. Caching is keyed on a combination of runtime and symbolic
ingredients: `(runtime peer, at::Tensor, Expr*)`. This caching allows an
arbitrary number of p2p comms between pairs of ranks.
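A minimal sketch of that caching scheme, assuming illustrative stand-in types: `Expr`, `IpcHandle`, `getOrExchange`, and the placeholder exchange step are hypothetical, not the actual classes in `multidevice/ipc_handle.cpp`:

```cpp
#include <map>
#include <tuple>

struct Expr {};           // stands in for the symbolic communication expr
struct IpcHandle {        // stands in for the exchanged handle + semaphore
  int fd = -1;
};

// Key mirrors the description above: (runtime peer, tensor, Expr*). The
// tensor is represented here by its data pointer.
using CacheKey = std::tuple<long, const void*, const Expr*>;

class IpcHandleCache {
 public:
  // Returns the cached handle, performing the exchange only on first use.
  IpcHandle& getOrExchange(long peer, const void* data, const Expr* e) {
    auto [it, inserted] = cache_.try_emplace(CacheKey{peer, data, e});
    if (inserted) {
      it->second = exchangeHandle(peer);  // one-time control-path cost
    }
    return it->second;
  }

 private:
  IpcHandle exchangeHandle(long peer) {
    return IpcHandle{static_cast<int>(peer)};  // placeholder exchange
  }
  std::map<CacheKey, IpcHandle> cache_;
};
```

Keying on all three ingredients is what lets an arbitrary number of p2p communications coexist between the same pair of ranks: two sends over different tensors, or over the same tensor from different expressions, get distinct cache entries.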