Skip to content

[CudaIpc Tutorial] Minimal snippet example#3912

Merged
samnordmann merged 10 commits intomainfrom
cuda_ipc_tuto
Apr 14, 2025
Merged

[CudaIpc Tutorial] Minimal snippet example#3912
samnordmann merged 10 commits intomainfrom
cuda_ipc_tuto

Conversation

@samnordmann
Copy link
Collaborator

@samnordmann samnordmann commented Feb 17, 2025

Pending on issue:

Minimal self-contained example for reference demonstrating using cudaIpc API. The provided tests show how to export/import ipc handles and use them to do RDMA write, with the important caveat that the exported handle always point to the start of the allocated buffer and not the offseted pointer

@github-actions
Copy link

github-actions bot commented Feb 17, 2025

Review updated until commit 9ba278a

Description

  • Added CUDA IPC tests for multi-device communication

  • Demonstrates exporting/importing IPC handles

  • Includes tests for pointer arithmetic on sender/receiver sides

  • Added synchronization tests using CUDA driver API


Changes walkthrough 📝

Relevant files
Tests
test_multidevice_ipc.cpp
Add CUDA IPC memory handle tests                                                 

tests/cpp/test_multidevice_ipc.cpp

  • Added new tests for CUDA IPC memory handle operations
  • Included tests for pointer arithmetic on sender/receiver sides
  • Added synchronization tests using CUDA driver API
  • +200/-0 
    Configuration changes
    CMakeLists.txt
    Update CMakeLists.txt for new test                                             

    CMakeLists.txt

    • Added new test file to CMake configuration
    +1/-0     

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review

    Pointer Arithmetic

    The PR demonstrates that pointer arithmetic can be performed on the importer side but not on the exporter side. Ensure this behavior is well-documented and understood by users.

    #ifdef NVFUSER_DISTRIBUTED
      // TL;DR: We CANNOT do pointer arithmetic on the exporter side! The IPC handle
      // points to the beginning of the allocated buffer.
    
      // Allocate GPU memory. Set up a buffer with two int values.
      constexpr size_t kBufferSize = 2 * sizeof(int64_t);
      const int64_t num_devices = communicator_->size();
      const int64_t rank = communicator_->deviceId();
      const int64_t peer_rank = (rank + 1) % num_devices;
      int64_t* d_ptr;
      NVFUSER_CUDA_RT_SAFE_CALL(cudaMalloc(&d_ptr, kBufferSize));
    
      std::vector<int64_t> values;
      values.push_back(2 * rank);
      values.push_back(2 * rank + 1);
      NVFUSER_CUDA_RT_SAFE_CALL(
          cudaMemcpy(d_ptr, values.data(), kBufferSize, cudaMemcpyHostToDevice));
    
      // Export Ipc Handle
      cudaIpcMemHandle_t ipc_handle;
      NVFUSER_CUDA_RT_SAFE_CALL(cudaIpcGetMemHandle(&ipc_handle, d_ptr + 1));
      auto store = communicator_->getTcpStore();
      store->set("ipc_handle_" + std::to_string(rank), toBytes(ipc_handle));
    
      // Wait for all ranks to finish exporting the IPC handle
    CUDA Driver API Usage

    The use of CUDA driver API functions cuStreamWriteValue32 and cuStreamWaitValue32 should be validated for compatibility and performance implications.

    // cuStreamWriteValue32 and cuStreamWaitValue32 are CUDA driver API used in the
    // context of synchronization in p2p communication over cudaIpcHandle
    using StreamOpTest = NVFuserTest;
    TEST_F(StreamOpTest, StreamWriteValue32) {
      cudaStream_t stream;
      void* buf;
      int value = 0;
      constexpr int new_value = 42;
      NVFUSER_CUDA_RT_SAFE_CALL(cudaSetDevice(0));
      NVFUSER_CUDA_RT_SAFE_CALL(cudaStreamCreate(&stream));
      NVFUSER_CUDA_RT_SAFE_CALL(cudaMalloc(&buf, sizeof(int)));
      NVFUSER_CUDA_RT_SAFE_CALL(cudaMemcpyAsync(
          buf, &value, sizeof(int), cudaMemcpyHostToDevice, stream));
      NVFUSER_CUDA_SAFE_CALL(cuStreamWriteValue32(
          stream, (CUdeviceptr)buf, new_value, CU_STREAM_WRITE_VALUE_DEFAULT));
      NVFUSER_CUDA_RT_SAFE_CALL(cudaMemcpyAsync(
          &value, buf, sizeof(int), cudaMemcpyDeviceToHost, stream));
      NVFUSER_CUDA_RT_SAFE_CALL(cudaStreamSynchronize(stream));
      EXPECT_EQ(value, new_value);
    }

    @samnordmann
    Copy link
    Collaborator Author

    !test

    @samnordmann
    Copy link
    Collaborator Author

    !test

    Copy link
    Collaborator

    @wujingyue wujingyue left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    Thanks -- this is super useful to show how to use cuda IPC bare metal. I'll review the code logic later today.

    CUDA_CALL(cudaIpcGetMemHandle(&ipc_handle, d_ptr));

    auto store = communicator_->getTcpStore();
    store->set("ipc_handle_" + std::to_string(rank), toBytes(ipc_handle));
    Copy link
    Collaborator

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    You may want to handle endianness sooner or later. Code as is can be problematic when communicating across nodes with different byte orders. Thus the reason for functions like https://linux.die.net/man/3/htonl

    Copy link
    Collaborator Author

    @samnordmann samnordmann Feb 25, 2025

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    I am not sure to understand how you suggest endianness comes into play here. As far as I understand, everything is safe, even accross nodes, as ensured by the c10d::TCPStore implementation (which btw is already extensively used in nvFuser and so many clients, e.g., to back ProcessGroups)

    Copy link
    Collaborator

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    It's toBytes and fromBytes that are potentially problematic; not TCPStore. TCPStore sends/receives bytes and therefore follow network order. I don't have a good reference at hand for host order vs network order, but maybe https://www.perplexity.ai/search/host-order-vs-network-order-MbDAwE1qS162Lfdm3Bcirw#0

    Copy link
    Collaborator Author

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    I am not sure to understand. If we are not talking about the TCP transfer (i.e. the network), and focus only on fromBytes and toBytes, it is only host order. Those functions are merely a recast.
    Are you suggesting that the bit representation of uint8_t or other datatype can vary from host to host? I don't think that can be the case -- if that would be, this problem would show up anytime we communicate data between processes, including for example NCCL comms, where data is transmitted as void* and recasted back to the right datatype on the receiver side

    Copy link
    Collaborator

    @wujingyue wujingyue Feb 27, 2025

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    the bit representation of uint8_t or other datatype can vary from host to host

    Sort of. The in-memory representation of primitive types larger than one byte (e.g. uint64_t) can vary from host to host.

    Little Endian vs. Big Endian

    Endianness refers to how bytes are ordered when storing multi-byte data types (e.g., 16-bit, 32-bit, or 64-bit values) in computer memory.


    1. Little Endian

    • Definition: The least significant byte (LSB) is stored first (at the lowest memory address), and the most significant byte (MSB) is stored last (at the highest memory address).
    • Example (32-bit number 0x12345678):
      Memory Address  →  0x00   0x01   0x02   0x03
      Data (bytes)   →  0x78   0x56   0x34   0x12
      
    • Used By:
      • x86 and x86-64 architectures (Intel, AMD)
      • ARM (defaults to little-endian but can switch)

    2. Big Endian

    • Definition: The most significant byte (MSB) is stored first (at the lowest memory address), and the least significant byte (LSB) is stored last (at the highest memory address).
    • Example (32-bit number 0x12345678):
      Memory Address  →  0x00   0x01   0x02   0x03
      Data (bytes)   →  0x12   0x34   0x56   0x78
      
    • Used By:
      • Network protocols (e.g., TCP/IP, IP headers)
      • Older architectures (e.g., Motorola 68k, SPARC)
      • Some RISC architectures (e.g., PowerPC)

    Copy link
    Collaborator

    @wujingyue wujingyue left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    LGTM otherwise!

    Thanks -- this clarifies #3910 a lot

    @wujingyue
    Copy link
    Collaborator

    FYI, there's apparently a real error in CI: https://gitlab-master.nvidia.com/dl/pytorch/fuser-gh-mirror/-/jobs/144287043#L1416

    @wujingyue wujingyue changed the title [CudaIpc Tuto] Minimal snippet example [CudaIpc Tutorial] Minimal snippet example Feb 25, 2025
    @samnordmann
    Copy link
    Collaborator Author

    FYI, there's apparently a real error in CI: https://gitlab-master.nvidia.com/dl/pytorch/fuser-gh-mirror/-/jobs/144287043#L1416

    Unfortunately, I am a bit stuck with this one. Without explicitely linking to cuda, the Driver API errors out at runtime... #3907

    @samnordmann
    Copy link
    Collaborator Author

    !test

    @samnordmann
    Copy link
    Collaborator Author

    !test

    1 similar comment
    @samnordmann
    Copy link
    Collaborator Author

    !test

    @samnordmann
    Copy link
    Collaborator Author

    !test

    @samnordmann
    Copy link
    Collaborator Author

    !test

    @samnordmann samnordmann merged commit df1af39 into main Apr 14, 2025
    35 of 38 checks passed
    @samnordmann samnordmann deleted the cuda_ipc_tuto branch April 14, 2025 14:45
    wujingyue added a commit that referenced this pull request Apr 14, 2025
    naoyam pushed a commit that referenced this pull request Apr 14, 2025
    Reverts #3912, which showed real errors before it was
    merged.
    @samnordmann samnordmann mentioned this pull request Apr 15, 2025
    samnordmann added a commit that referenced this pull request Apr 16, 2025
    Fix #3912 after it has been reverted
    by #4248
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

    Labels

    None yet

    Projects

    None yet

    Development

    Successfully merging this pull request may close these issues.

    2 participants