[CudaIpc Tutorial] Minimal snippet example by samnordmann · Pull Request #3912 · NVIDIA/Fuser

samnordmann · 2025-02-17T11:50:24Z

Pending on issue:

Error with driver API's lazy load of cuStream ops #3907

Minimal self-contained example, for reference, demonstrating the cudaIpc API. The provided tests show how to export and import IPC handles and use them to do an RDMA write, with the important caveat that the exported handle always points to the start of the allocated buffer, not to the offset pointer.

github-actions · 2025-02-17T11:51:11Z

Review updated until commit 9ba278a

Description

Added CUDA IPC tests for multi-device communication
Demonstrates exporting/importing IPC handles
Includes tests for pointer arithmetic on sender/receiver sides
Added synchronization tests using CUDA driver API

Changes walkthrough 📝

Relevant files

Tests

tests/cpp/test_multidevice_ipc.cpp (+200/-0): `Add CUDA IPC memory handle tests`. Added new tests for CUDA IPC memory handle operations; included tests for pointer arithmetic on sender/receiver sides; added synchronization tests using CUDA driver API.

Configuration changes

CMakeLists.txt (+1/-0): `Update CMakeLists.txt for new test`. Added new test file to CMake configuration.

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests

⚡ Recommended focus areas for review

Pointer Arithmetic

The PR demonstrates that pointer arithmetic can be performed on the importer side but not on the exporter side. Ensure this behavior is well-documented and understood by users.

#ifdef NVFUSER_DISTRIBUTED
  // TL;DR: We CANNOT do pointer arithmetic on the exporter side! The IPC handle
  // points to the beginning of the allocated buffer.

  // Allocate GPU memory. Set up a buffer with two int64_t values.
  constexpr size_t kBufferSize = 2 * sizeof(int64_t);
  const int64_t num_devices = communicator_->size();
  const int64_t rank = communicator_->deviceId();
  const int64_t peer_rank = (rank + 1) % num_devices;
  int64_t* d_ptr;
  NVFUSER_CUDA_RT_SAFE_CALL(cudaMalloc(&d_ptr, kBufferSize));

  std::vector<int64_t> values;
  values.push_back(2 * rank);
  values.push_back(2 * rank + 1);
  NVFUSER_CUDA_RT_SAFE_CALL(
      cudaMemcpy(d_ptr, values.data(), kBufferSize, cudaMemcpyHostToDevice));

  // Export Ipc Handle
  cudaIpcMemHandle_t ipc_handle;
  NVFUSER_CUDA_RT_SAFE_CALL(cudaIpcGetMemHandle(&ipc_handle, d_ptr + 1));
  auto store = communicator_->getTcpStore();
  store->set("ipc_handle_" + std::to_string(rank), toBytes(ipc_handle));

  // Wait for all ranks to finish exporting the IPC handle
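
Since the exported handle always refers to the start of the allocation, one possible workaround is to publish the byte offset next to the handle and apply it on the importer side. The sketch below only models that bookkeeping: plain host memory stands in for the pointer that `cudaIpcOpenMemHandle` would return, and `ExportedRegion`, `exportRegion`, and `resolveOnImporter` are hypothetical names that are not part of this PR or of CUDA.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical metadata an exporter could publish alongside the IPC handle.
// The handle itself is left out: it always maps to the allocation base.
struct ExportedRegion {
  std::size_t offset_bytes; // offset of the region of interest from the base
  std::size_t size_bytes;   // total size of the allocation
};

// Exporter side: describe `ptr` relative to `base` instead of exporting `ptr`.
inline ExportedRegion exportRegion(
    const void* base,
    const void* ptr,
    std::size_t size_bytes) {
  const auto b = reinterpret_cast<std::uintptr_t>(base);
  const auto p = reinterpret_cast<std::uintptr_t>(ptr);
  assert(p >= b && p - b <= size_bytes);
  return {static_cast<std::size_t>(p - b), size_bytes};
}

// Importer side: `mapped_base` stands in for the pointer obtained by opening
// the IPC handle. The pointer arithmetic happens here, on the importer.
inline void* resolveOnImporter(void* mapped_base, const ExportedRegion& region) {
  assert(region.offset_bytes <= region.size_bytes);
  return static_cast<char*>(mapped_base) + region.offset_bytes;
}
```

With a two-element `int64_t` buffer, calling `exportRegion(buf, buf + 1, sizeof(buf))` yields an offset of `sizeof(int64_t)`, and `resolveOnImporter` applied to the importer's mapping of the same allocation then addresses the second element.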

CUDA Driver API Usage

The use of CUDA driver API functions cuStreamWriteValue32 and cuStreamWaitValue32 should be validated for compatibility and performance implications.

// cuStreamWriteValue32 and cuStreamWaitValue32 are CUDA driver API used in the
// context of synchronization in p2p communication over cudaIpcHandle
using StreamOpTest = NVFuserTest;
TEST_F(StreamOpTest, StreamWriteValue32) {
  cudaStream_t stream;
  void* buf;
  int value = 0;
  constexpr int new_value = 42;
  NVFUSER_CUDA_RT_SAFE_CALL(cudaSetDevice(0));
  NVFUSER_CUDA_RT_SAFE_CALL(cudaStreamCreate(&stream));
  NVFUSER_CUDA_RT_SAFE_CALL(cudaMalloc(&buf, sizeof(int)));
  NVFUSER_CUDA_RT_SAFE_CALL(cudaMemcpyAsync(
      buf, &value, sizeof(int), cudaMemcpyHostToDevice, stream));
  NVFUSER_CUDA_SAFE_CALL(cuStreamWriteValue32(
      stream, (CUdeviceptr)buf, new_value, CU_STREAM_WRITE_VALUE_DEFAULT));
  NVFUSER_CUDA_RT_SAFE_CALL(cudaMemcpyAsync(
      &value, buf, sizeof(int), cudaMemcpyDeviceToHost, stream));
  NVFUSER_CUDA_RT_SAFE_CALL(cudaStreamSynchronize(stream));
  EXPECT_EQ(value, new_value);
}

samnordmann · 2025-02-24T16:52:24Z

!test

samnordmann · 2025-02-24T16:55:04Z

!test

wujingyue

Thanks -- this is super useful to show how to use cuda IPC bare metal. I'll review the code logic later today.

tests/cpp/test_multidevice_gpu_comms.cpp

wujingyue · 2025-02-25T06:42:47Z

tests/cpp/test_multidevice_gpu_comms.cpp

+  CUDA_CALL(cudaIpcGetMemHandle(&ipc_handle, d_ptr));
+
+  auto store = communicator_->getTcpStore();
+  store->set("ipc_handle_" + std::to_string(rank), toBytes(ipc_handle));


You may want to handle endianness sooner or later. The code as is can be problematic when communicating across nodes with different byte orders. That is the reason for functions like https://linux.die.net/man/3/htonl

I am not sure I understand how endianness comes into play here. As far as I understand, everything is safe, even across nodes, as ensured by the c10d::TCPStore implementation (which, by the way, is already used extensively in nvFuser and by many other clients, e.g., to back ProcessGroups)

It's toBytes and fromBytes that are potentially problematic, not TCPStore. TCPStore sends and receives bytes and therefore follows network order. I don't have a good reference at hand for host order vs network order, but maybe https://www.perplexity.ai/search/host-order-vs-network-order-MbDAwE1qS162Lfdm3Bcirw#0

I am not sure I understand. If we are not talking about the TCP transfer (i.e. the network) and focus only on fromBytes and toBytes, there is only host order. Those functions are merely a recast.
Are you suggesting that the bit representation of uint8_t or other datatype can vary from host to host? I don't think that can be the case -- if it were, this problem would show up any time we communicate data between processes, including for example NCCL comms, where data is transmitted as void* and recast back to the right datatype on the receiver side
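
To make the "merely a recast" point concrete, here is a minimal sketch of what such helpers could look like. These `toBytes` and `fromBytes` are hypothetical stand-ins, not nvFuser's actual implementations, and `OpaqueHandle` only mimics `cudaIpcMemHandle_t`, which CUDA declares as an opaque 64-byte struct.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Stand-in for cudaIpcMemHandle_t (assumed layout: 64 opaque bytes).
struct OpaqueHandle {
  char reserved[64];
};

// Hypothetical toBytes: copies the object's in-memory bytes as they are,
// i.e. in host byte order, with no conversion.
template <typename T>
std::vector<std::uint8_t> toBytes(const T& value) {
  static_assert(
      std::is_trivially_copyable<T>::value,
      "raw byte copy requires a trivially copyable type");
  std::vector<std::uint8_t> bytes(sizeof(T));
  std::memcpy(bytes.data(), &value, sizeof(T));
  return bytes;
}

// Hypothetical fromBytes: the inverse raw copy, again in host byte order.
template <typename T>
T fromBytes(const std::vector<std::uint8_t>& bytes) {
  static_assert(
      std::is_trivially_copyable<T>::value,
      "raw byte copy requires a trivially copyable type");
  assert(bytes.size() == sizeof(T));
  T value;
  std::memcpy(&value, bytes.data(), sizeof(T));
  return value;
}
```

A round trip such as `fromBytes<OpaqueHandle>(toBytes(handle))` returns the same 64 bytes on any single host. The byte-order question only arises for multi-byte integers that are written on one host and read on another with a different byte order; an opaque byte array exchanged between processes on the same node is not affected.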

> the bit representation of uint8_t or other datatype can vary from host to host

Sort of. The in-memory representation of primitive types larger than one byte (e.g. uint64_t) can vary from host to host.

Little Endian vs. Big Endian

Endianness refers to how bytes are ordered when storing multi-byte data types (e.g., 16-bit, 32-bit, or 64-bit values) in computer memory.

1. Little Endian

Definition: The least significant byte (LSB) is stored first (at the lowest memory address), and the most significant byte (MSB) is stored last (at the highest memory address).

Example (32-bit number 0x12345678):
Memory Address → 0x00 0x01 0x02 0x03
Data (bytes)   → 0x78 0x56 0x34 0x12

Used By:

x86 and x86-64 architectures (Intel, AMD)

ARM (defaults to little-endian but can switch)

2. Big Endian

Definition: The most significant byte (MSB) is stored first (at the lowest memory address), and the least significant byte (LSB) is stored last (at the highest memory address).

Example (32-bit number 0x12345678):
Memory Address → 0x00 0x01 0x02 0x03
Data (bytes)   → 0x12 0x34 0x56 0x78

Used By:

Network protocols (e.g., TCP/IP, IP headers)

Older architectures (e.g., Motorola 68k, SPARC)

Some RISC architectures (e.g., PowerPC)
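
The two layouts above can be checked with a short sketch. It inspects how the current host stores 0x12345678 and serializes the value in big-endian (network) order using shifts, which gives the same bytes on any host. All function names here are illustrative assumptions; in practice `htonl` and `ntohl` cover the 32-bit case.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Returns the bytes of `value` exactly as they sit in this host's memory.
inline std::array<std::uint8_t, 4> hostBytes(std::uint32_t value) {
  std::array<std::uint8_t, 4> out{};
  std::memcpy(out.data(), &value, sizeof(value));
  return out;
}

// True when the least significant byte is stored at the lowest address.
inline bool isLittleEndianHost() {
  const std::array<std::uint8_t, 4> b = hostBytes(0x12345678u);
  // This sketch only handles pure little-endian and big-endian layouts.
  assert(b[0] == 0x78 || b[0] == 0x12);
  return b[0] == 0x78;
}

// Serializes `value` most significant byte first, independent of host order.
inline std::array<std::uint8_t, 4> toNetworkOrder(std::uint32_t value) {
  std::array<std::uint8_t, 4> out{};
  out[0] = static_cast<std::uint8_t>(value >> 24);
  out[1] = static_cast<std::uint8_t>(value >> 16);
  out[2] = static_cast<std::uint8_t>(value >> 8);
  out[3] = static_cast<std::uint8_t>(value);
  return out;
}

// Rebuilds the value from big-endian bytes, independent of host order.
inline std::uint32_t fromNetworkOrder(const std::array<std::uint8_t, 4>& b) {
  return (static_cast<std::uint32_t>(b[0]) << 24) |
      (static_cast<std::uint32_t>(b[1]) << 16) |
      (static_cast<std::uint32_t>(b[2]) << 8) |
      static_cast<std::uint32_t>(b[3]);
}
```

On a little-endian host, `hostBytes(0x12345678u)` matches the first table and `toNetworkOrder(0x12345678u)` matches the second; on a big-endian host both match the second.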

wujingyue

LGTM otherwise!

Thanks -- this clarifies #3910 a lot

wujingyue · 2025-02-25T07:01:58Z

FYI, there's apparently a real error in CI: https://gitlab-master.nvidia.com/dl/pytorch/fuser-gh-mirror/-/jobs/144287043#L1416

samnordmann · 2025-02-25T09:27:45Z

> FYI, there's apparently a real error in CI: https://gitlab-master.nvidia.com/dl/pytorch/fuser-gh-mirror/-/jobs/144287043#L1416

Unfortunately, I am a bit stuck with this one. Without explicitly linking to cuda, the Driver API errors out at runtime... #3907

samnordmann · 2025-04-10T07:17:25Z

!test

samnordmann · 2025-04-10T13:59:15Z

!test

samnordmann · 2025-04-11T11:11:04Z

!test

samnordmann · 2025-04-11T13:55:45Z

!test

samnordmann · 2025-04-14T12:31:11Z

!test

samnordmann mentioned this pull request Feb 17, 2025

[CudaIpc Tuto] Minimal snippet example #3895

Closed

samnordmann requested a review from wujingyue February 24, 2025 16:52

CudaIpc Tuto tests

543acd2

samnordmann force-pushed the cuda_ipc_tuto branch from 15e4dbd to 543acd2 February 24, 2025 16:53

wujingyue reviewed Feb 24, 2025

wujingyue reviewed Feb 25, 2025

wujingyue changed the title ~~[CudaIpc Tuto] Minimal snippet example~~ [CudaIpc Tutorial] Minimal snippet example Feb 25, 2025

samnordmann added 2 commits February 25, 2025 01:21

minor comments

5f76535

lint

7f07dc4

wujingyue approved these changes Feb 25, 2025

minor comment

36f8ee7

wujingyue mentioned this pull request Feb 28, 2025

[CudaIpc 2/3]: Ipc handle exchange #3910

Merged

samnordmann added 2 commits April 10, 2025 06:58

remove explicit linking with cuda

a469f66

Merge branch 'main' of github.com:NVIDIA/Fuser into cuda_ipc_tuto

1cce770

samnordmann force-pushed the cuda_ipc_tuto branch from 963e302 to 1cce770 April 10, 2025 13:58

samnordmann added 2 commits April 11, 2025 06:47

guard TCPStore method with NVFUSER_DISTRIBUTED

b115628

Merge branch 'main' of github.com:NVIDIA/Fuser into cuda_ipc_tuto

bff2fad

samnordmann force-pushed the cuda_ipc_tuto branch from 874a09b to bff2fad April 11, 2025 13:47

lint

e72dda5

Merge branch 'main' of github.com:NVIDIA/Fuser into cuda_ipc_tuto

9ba278a

samnordmann merged commit df1af39 into main Apr 14, 2025
35 of 38 checks passed

samnordmann deleted the cuda_ipc_tuto branch April 14, 2025 14:45

wujingyue added a commit that referenced this pull request Apr 14, 2025

Revert "[CudaIpc Tutorial] Minimal snippet example (#3912)"

09f62f8

This reverts commit df1af39.

wujingyue mentioned this pull request Apr 14, 2025

Revert "[CudaIpc Tutorial] Minimal snippet example" #4248

Merged

naoyam pushed a commit that referenced this pull request Apr 14, 2025

Revert "[CudaIpc Tutorial] Minimal snippet example" (#4248)

2bdb6d7

Reverts #3912, which showed real errors before it was merged.

samnordmann mentioned this pull request Apr 15, 2025

Fix Cuda Ipc Tuto #4251

Merged

samnordmann added a commit that referenced this pull request Apr 16, 2025

Fix Cuda Ipc Tuto (#4251)

91b1801

Fix #3912 after it has been reverted by #4248
