[CudaIpc Tutorial] Minimal snippet example#3912
!test
15e4dbd to 543acd2
!test
wujingyue
left a comment
Thanks -- this is super useful to show how to use cuda IPC bare metal. I'll review the code logic later today.
CUDA_CALL(cudaIpcGetMemHandle(&ipc_handle, d_ptr));

auto store = communicator_->getTcpStore();
store->set("ipc_handle_" + std::to_string(rank), toBytes(ipc_handle));
You may want to handle endianness sooner or later. As written, the code can break when communicating across nodes with different byte orders, which is why functions like htonl exist: https://linux.die.net/man/3/htonl
I am not sure I understand how endianness comes into play here. As far as I understand, everything is safe, even across nodes, as ensured by the c10d::TCPStore implementation (which, by the way, is already used extensively in nvFuser and many other clients, e.g., to back ProcessGroups).
It's toBytes and fromBytes that are potentially problematic, not TCPStore. TCPStore sends and receives bytes and therefore follows network order. I don't have a good reference at hand for host order vs. network order, but maybe https://www.perplexity.ai/search/host-order-vs-network-order-MbDAwE1qS162Lfdm3Bcirw#0
I am not sure I understand. If we are not talking about the TCP transfer (i.e., the network) and focus only on fromBytes and toBytes, everything is in host order. Those functions are merely a recast.
Are you suggesting that the bit representation of uint8_t or other datatype can vary from host to host? I don't think that can be the case. If it were, this problem would show up any time we communicate data between processes, including for example NCCL comms, where data is transmitted as void* and recast to the right datatype on the receiver side.
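For readers following the thread, a recast-style pair of helpers might look like the sketch below. This is only an assumption about the shape of toBytes and fromBytes, not the PR's actual implementation, and FakeIpcHandle is a hypothetical stand-in for cudaIpcMemHandle_t so that the snippet builds without CUDA. It copies the object's in-memory bytes verbatim, so the result is in host order by construction.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for cudaIpcMemHandle_t (an opaque, trivially
// copyable struct), used here so the sketch compiles without CUDA.
struct FakeIpcHandle {
  char reserved[64];
};

// Assumed shape of toBytes: copy the object's in-memory representation
// into a byte vector, with no byte-order conversion.
template <typename T>
std::vector<std::uint8_t> toBytes(const T& value) {
  static_assert(std::is_trivially_copyable<T>::value,
                "toBytes requires a trivially copyable type");
  std::vector<std::uint8_t> bytes(sizeof(T));
  std::memcpy(bytes.data(), &value, sizeof(T));
  return bytes;
}

// Assumed shape of fromBytes: the inverse copy. The caller must know T.
template <typename T>
T fromBytes(const std::vector<std::uint8_t>& bytes) {
  static_assert(std::is_trivially_copyable<T>::value,
                "fromBytes requires a trivially copyable type");
  assert(bytes.size() == sizeof(T));
  T value;
  std::memcpy(&value, bytes.data(), sizeof(T));
  return value;
}
```

A round trip on the same host, fromBytes<FakeIpcHandle>(toBytes(handle)), reproduces the handle byte for byte. Whether that also holds across hosts depends on the type being copied, which is the point under discussion in this thread.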
the bit representation of uint8_t or other datatype can vary from host to host
Sort of. The in-memory representation of primitive types larger than one byte (e.g. uint64_t) can vary from host to host.
Little Endian vs. Big Endian
Endianness refers to how bytes are ordered when storing multi-byte data types (e.g., 16-bit, 32-bit, or 64-bit values) in computer memory.
1. Little Endian
- Definition: The least significant byte (LSB) is stored first (at the lowest memory address), and the most significant byte (MSB) is stored last (at the highest memory address).
- Example (32-bit number 0x12345678):
  Memory Address → 0x00 0x01 0x02 0x03
  Data (bytes)   → 0x78 0x56 0x34 0x12
- Used By:
- x86 and x86-64 architectures (Intel, AMD)
- ARM (defaults to little-endian but can switch)
2. Big Endian
- Definition: The most significant byte (MSB) is stored first (at the lowest memory address), and the least significant byte (LSB) is stored last (at the highest memory address).
- Example (32-bit number 0x12345678):
  Memory Address → 0x00 0x01 0x02 0x03
  Data (bytes)   → 0x12 0x34 0x56 0x78
- Used By:
- Network protocols (e.g., TCP/IP, IP headers)
- Older architectures (e.g., Motorola 68k, SPARC)
- Some RISC architectures (e.g., PowerPC)
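To make the byte layouts above concrete, here is a minimal sketch of serializing a 32-bit value in a fixed (big-endian) order with shifts, which gives the same bytes on any host. The helper names toBigEndian, fromBigEndian, and hostIsLittleEndian are hypothetical and not part of the PR. This is the same idea as htonl, written out by hand.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: write v most significant byte first,
// regardless of the host's own byte order.
std::array<std::uint8_t, 4> toBigEndian(std::uint32_t v) {
  return {{static_cast<std::uint8_t>(v >> 24),
           static_cast<std::uint8_t>(v >> 16),
           static_cast<std::uint8_t>(v >> 8),
           static_cast<std::uint8_t>(v)}};
}

// Hypothetical helper: rebuild the value from big-endian bytes.
std::uint32_t fromBigEndian(const std::array<std::uint8_t, 4>& b) {
  return (static_cast<std::uint32_t>(b[0]) << 24) |
         (static_cast<std::uint32_t>(b[1]) << 16) |
         (static_cast<std::uint32_t>(b[2]) << 8) |
         static_cast<std::uint32_t>(b[3]);
}

// Hypothetical probe: true when the lowest address holds the least
// significant byte, i.e. the host is little-endian.
bool hostIsLittleEndian() {
  const std::uint32_t probe = 1;
  std::uint8_t first = 0;
  std::memcpy(&first, &probe, 1);
  return first == 1;
}
```

For 0x12345678, toBigEndian yields the bytes 0x12 0x34 0x56 0x78 on every host, whereas a raw memory copy yields that order only on a big-endian machine.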
FYI, there's apparently a real error in CI: https://gitlab-master.nvidia.com/dl/pytorch/fuser-gh-mirror/-/jobs/144287043#L1416
Unfortunately, I am a bit stuck with this one. Without explicitly linking to cuda, the Driver API errors out at runtime... #3907
!test
963e302 to 1cce770
!test
!test
874a09b to bff2fad
!test
!test
This reverts commit df1af39.
Reverts #3912, which showed real errors before it was merged.
Pending on issue:
Minimal self-contained example, for reference, demonstrating the cudaIpc API. The provided tests show how to export and import IPC handles and use them to do an RDMA write, with the important caveat that the exported handle always points to the start of the allocated buffer and not to the offset pointer.
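One way to cope with the caveat above is to ship the byte offset alongside the handle and re-apply it on the importing side. The sketch below is a host-only illustration under that assumption: the names offsetFromBase and applyOffset are hypothetical, and plain host pointers stand in for device pointers. In real code the base would come from the allocation itself (or, for instance, from the driver API's cuMemGetAddressRange), and the imported base from cudaIpcOpenMemHandle.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (exporter side): distance in bytes between the start
// of the allocation and the pointer the peer should actually use.
std::size_t offsetFromBase(const void* base, const void* ptr) {
  const auto b = reinterpret_cast<std::uintptr_t>(base);
  const auto p = reinterpret_cast<std::uintptr_t>(ptr);
  assert(p >= b);  // ptr must lie at or after the allocation start
  return static_cast<std::size_t>(p - b);
}

// Hypothetical helper (importer side): the opened handle maps the start of
// the allocation, so the offset is added back to recover the intended address.
void* applyOffset(void* imported_base, std::size_t offset_bytes) {
  return static_cast<char*>(imported_base) + offset_bytes;
}
```

The exporter would publish the offset next to the handle (for example under a second key in the store), and the importer would call applyOffset on the pointer it gets back from opening the handle.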