[CudaIpc 3/3]: p2p get-Zcopy#3911
Review updated until commit 289a713
This PR is a small, self-contained part of the larger PR #3911.

# What

- Add the backend type as an argument to `P2PCommunication`

!test
**wujingyue** left a comment:
Thanks for the PR! Again, I'm pretty sure this PR implements a valid solution and is a strict improvement. Many of my questions are about what the alternatives are and why certain solutions are preferred.
`csrc/multidevice/cuda_p2p.h` (outdated):

```cpp
void RecvPost(const P2pIpcHandle& ipc_handles, int64_t count, CUstream stream);
void SendPost(const P2pIpcHandle& ipc_handles, CUstream stream);
void SendWait(const P2pIpcHandle& ipc_handles, CUstream stream);
```
These could probably become methods of `P2pIpcHandle`, so you could hide implementation details like `local`, `peer`, and `semaphore`.
I would prefer to keep them separate here. The logic is that the ipc handle sets up the data structure (allocation, exporting/importing the semaphore) on the control path, while these functions implement runtime primitives on the data path. We will later write other p2p and collective algorithms (e.g., put_zcopy), in addition to compute/comms kernels, and they will all rely on the common ipc handle data structure.

`local`, `peer`, and `semaphore` are roughly the minimal set of public methods on the ipc handle that allows the implementation of many different communication patterns.
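To illustrate the separation argued above, here is a hypothetical host-side sketch (the accessor names `local()`/`peer()`/`semaphore` mirror the PR's `P2pIpcHandle`, but the types and plain-memory copy are stand-ins for the imported CUDA allocations): the handle only exposes the minimal accessors, and a data-path primitive is a free function built on top of them.

```cpp
#include <cstdint>
#include <cstring>

// Stand-in for one side's imported allocation: a data buffer plus a
// 32-bit semaphore (hypothetical layout, not the PR's actual types).
struct Buf {
  void* ptr;
  uint32_t* semaphore;
};

// Minimal handle: only local()/peer() are public, hiding everything else.
class P2pIpcHandleSketch {
 public:
  P2pIpcHandleSketch(Buf local, Buf peer) : local_(local), peer_(peer) {}
  const Buf& local() const { return local_; }
  const Buf& peer() const { return peer_; }

 private:
  Buf local_;
  Buf peer_;
};

// Data-path primitive built only on the handle's public accessors:
// copy `count` bytes from the peer's buffer into ours (a "get").
// std::memcpy stands in for cudaMemcpyAsync here.
void getCopy(const P2pIpcHandleSketch& h, int64_t count) {
  std::memcpy(h.local().ptr, h.peer().ptr, static_cast<size_t>(count));
}
```

Other primitives (put, send/recv, collectives) would be additional free functions over the same handle, which is the design the comment describes.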
```cpp
// wait for sender to be ready
NVFUSER_CUDA_SAFE_CALL(cuStreamWaitValue32(
    stream,
    reinterpret_cast<CUdeviceptr>(ipc_handles.local().semaphore()),
```
Why two semaphores (local and peer)? AFAICT, a semaphore is shared between sender and receiver so both devices see the same value. Apparently, one semaphore per P2pIpcHandle indicating NOT_READY, READY, or COMPLETE ought to be enough?
You are right that we could use only one semaphore. The question would then be where to allocate it (on the sender or the receiver GPU).

The idea behind using two semaphores here is to make sure that `cuStreamWaitValue32` always polls a local buffer. That is considered good practice: we always want the listener to poll a local buffer rather than a remote one, to avoid generating too many network transactions. That is why the semaphore is duplicated, one on the receiver and one on the sender GPU. We pay the cost of duplicating the `cuStreamWriteValue32` calls to update both semaphores each time, but that is a very minor drawback.

I have not run any performance benchmarks on this, though.
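The two-semaphore scheme can be modeled on the host (a sketch only: `std::atomic` stores stand in for `cuStreamWriteValue32`, and a spin loop stands in for `cuStreamWaitValue32`; the constants are illustrative, not the PR's actual values). The signaler writes both copies; the waiter polls only its local one.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative semaphore values (assumption, not the PR's constants).
constexpr uint32_t kNotReady = 0;
constexpr uint32_t kReady = 1;

// One side of the p2p pair; each side owns one semaphore copy.
struct Side {
  std::atomic<uint32_t> semaphore{kNotReady};
};

// Signal: the write is duplicated so both devices observe the same
// value, mimicking the doubled cuStreamWriteValue32 calls.
void signalBoth(Side& a, Side& b, uint32_t v) {
  a.semaphore.store(v, std::memory_order_release);
  b.semaphore.store(v, std::memory_order_release);
}

// Wait: poll ONLY the local semaphore, never the remote one -- the
// "listener polls local memory" rule the comment describes.
void waitLocal(Side& self, uint32_t v) {
  while (self.semaphore.load(std::memory_order_acquire) != v) {
  }
}
```

The cost of the duplicated write buys the guarantee that no wait ever generates remote traffic.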
```cpp
    count,
    cudaMemcpyDeviceToDevice,
    stream));
// Signals completion to self
```
Can you explain why this is needed in addition to the other completion signal below?
This step is needed to reset the semaphore for future reuse.

See also #3911 (comment).
Got it -- both semaphores need to have the same value. This way, waiting for local is equivalent to waiting for peer.
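A simplified host-side model of why the self-signal matters for reuse (an assumption-laden sketch: two states only, plain integers standing in for the device-visible semaphores, and an increment standing in for the `cudaMemcpyAsync`):

```cpp
#include <cstdint>

// Illustrative states (assumed names; the real protocol may differ).
enum : uint32_t { kReady = 0, kInUse = 1 };

// Both copies of the semaphore, one per GPU.
struct Semaphores {
  uint32_t local = kReady;
  uint32_t peer = kReady;
};

// One transfer: mark both copies in-use, do the copy, then restore
// both to kReady. The write to `local` is the "signals completion to
// self" step: without it, the local copy would stay at kInUse and the
// next iteration's wait would observe a stale state.
void transferOnce(Semaphores& s, int& copies) {
  s.local = kInUse;
  s.peer = kInUse;
  ++copies;          // stands in for the cudaMemcpyAsync on the data path
  s.peer = kReady;   // signal completion to the remote side
  s.local = kReady;  // signal completion to self: enables reuse
}
```

After every transfer both copies hold the same value, so waiting on the local copy remains equivalent to waiting on the peer's, exactly as the comment concludes.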
```cpp
    recv_peer_val,
    CommunicatorBackend::kCuda);
std::vector<P2PCommunication*> grouped_communications = {send, recv};
auto share_mem_handles = IrBuilder::create<hir::ShareMemHandles>(
```

```cpp
    CU_STREAM_WRITE_VALUE_DEFAULT));
}
```

```cpp
void sendPost(const P2pIpcHandle& ipc_handles, CUstream stream) {
```
I couldn't map this implementation to this slide, so can you clarify? I was expecting the first step of sendPost to write kReady?
Your link sends me to https://dlrequest/GroupID/Home/Index
> I was expecting the first step of sendPost to write kReady?

The first step is indeed to write to the semaphore. It writes `kInUse`, to signal it is ready-to-receive, while `kReady` is the default semaphore state before the p2p starts. Are you asking why the name `kInUse` rather than some other name?
Fixed the link -- the short link generated by nv/ had a . in it, confusing GitHub's markdown renderer.
> The first step is indeed to write to the semaphore. It writes `kInUse`, to signal it is ready-to-receive, while `kReady` signals the default semaphore state before the p2p starts. Are you asking why the name `kInUse` and not another naming?
Yes. That answered my question. I wasn't sure about the difference between "ready-to-receive" in the text and kReady in the code. Will read the code again based on the new understanding...
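The naming discussion above can be summarized in a hypothetical state table (the constant names `kReady`/`kInUse` come from the thread; the transition function itself is an assumption, since the real code drives the values via `cuStreamWriteValue32` on the device):

```cpp
#include <cstdint>
#include <string>

// kReady: idle/default state before a transfer starts.
// kInUse: written as the FIRST step of sendPost, announcing the
// sender side is ready; the value returns to kReady on completion.
enum class SemState : uint32_t { kReady = 0, kInUse = 1 };

// Hypothetical transition function; events not legal from the current
// state leave it unchanged.
SemState step(SemState s, const std::string& event) {
  if (s == SemState::kReady && event == "sendPost") {
    return SemState::kInUse;
  }
  if (s == SemState::kInUse && event == "complete") {
    return SemState::kReady;
  }
  return s;
}
```

So "ready-to-receive" in the prose corresponds to the `kInUse` value, not to `kReady`, which is the source of the confusion resolved above.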
```cmake
target_compile_definitions(codegen_internal PRIVATE "-DTORCH_CUDA_BUILD_MAIN_LIB")
target_include_directories(codegen_internal SYSTEM PUBLIC
    ${CMAKE_SOURCE_DIR}/third_party/gloo # TODO: guard this on usage
```
On top of #3909, prerequisite to #3911.

# What

- Set up the infrastructure needed for ipc handle exchange and caching.
- Add an `Expr` node `hir::ShareMemHandles` to represent this op. We cannot embed the op in the Send/Recv semantics because we need to group the handle exchange between matching sends and recvs to avoid deadlocks.

# How

Most of the implementation is in `multidevice/ipc_handle.cpp`:

- Define the class `IpcHandle`, representing the ipc handle that is exchanged. This class is supplemented with a semaphore, which is a local cuda buffer allocated on the exporter's device.
- Define `IpcHandleCache`, which handles exchanging and caching the ipc handles. Caching is keyed on a combination of runtime and symbolic ingredients: `(runtime peer, at::Tensor, Expr*)`. This caching allows an arbitrary number of p2p communications between pairs of ranks.
On top of:
Pending on issue: