
[Issue]: Host memory sharing through CUMEM is broken in devices without p2p access #1838

@youkaichao

Description


How is this issue impacting you?

Application crash

Share Your Debug Logs

See https://gist.github.com/youkaichao/d3e336df19a7ab0df7e5452c924aa3a8 for the verbose output log.

Steps to Reproduce the Issue

Use this docker image:

docker run -it -v /dev/shm:/dev/shm --gpus all public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:c0243db7c4db936d090d0214e308e771f8485168

and run:

cd /vllm-workspace/tests
NCCL_DEBUG_SUBSYS=ALL NCCL_DEBUG=TRACE TP_SIZE=1 DP_SIZE=2 pytest -v -s "v1/test_async_llm_dp.py::test_load[True-mp-RequestOutputKind.DELTA]"

on a 4x L4 node, it will crash with the error shown below.

NCCL Version

2.27.3+cu128

Your platform details

No response

Error Message & Behavior

The error is:

[2025-09-09 23:51:24] 9b405963aa12:774:854 [0] transport/shm.cc:590 NCCL WARN Cuda failure 217 'peer access is not supported between these two devices'

The error clearly comes from this line:

CUCHECK(cuMemImportFromShareableHandle(&handle, (void *)(uintptr_t)fd, type));

When I set NCCL_CUMEM_HOST_ENABLE=0, the test passes.
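For context, this is how the workaround looks when applied from Python before any NCCL communicator is created (a minimal sketch; `setdefault` is used so a value the user already exported is not clobbered — NCCL reads its environment variables at init time):

```python
import os

# Workaround from this report: disable NCCL's cuMem-backed host memory
# allocation. This must run before the first NCCL communicator is
# created, since NCCL reads the variable during initialization.
# setdefault() keeps any value the user has already set in the shell.
os.environ.setdefault("NCCL_CUMEM_HOST_ENABLE", "0")

print(os.environ["NCCL_CUMEM_HOST_ENABLE"])
```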

I'm not sure whether this is an NCCL bug or a driver bug. Does sharing host memory through cumem require device p2p access?

More background: after closing #1234 , I added PR vllm-project/vllm#24141 in vLLM to remove the env var override, so that NCCL can use the cumem APIs as expected. However, we then started to see CI failures, which I finally traced down to this point.

Since vLLM is preparing to release a new version soon, we have several options:

  1. Revert [distributed][rl] remove nccl cumem env var override vllm-project/vllm#24141 to keep backward compatibility. Good for vLLM users, but NCCL cannot use the advanced features that depend on cumem APIs.
  2. Keep [distributed][rl] remove nccl cumem env var override vllm-project/vllm#24141 but override NCCL_CUMEM_HOST_ENABLE=0 instead. This only affects NCCL's host memory allocation, which is good for NCCL, but it adds complexity for vLLM users (they need to be aware of this change when connecting vLLM to their training processes).
  3. The NCCL team investigates this and tells us how to fix it from the vLLM side, which is perfect for both worlds.

We hope for option 3, but if it cannot be solved in time, we will likely fall back to option 1 🙏

Note: I tried to come up with a standalone reproducer on the same node, but did not succeed. I'm not sure under what conditions

CUCHECK(cuMemImportFromShareableHandle(&handle, (void *)(uintptr_t)fd, type));

is actually invoked.
