
[Issue]: Host memory sharing through CUMEM is broken in devices without p2p access #1838

@youkaichao

Description


How is this issue impacting you?

Application crash

Share Your Debug Logs

See https://gist.github.com/youkaichao/d3e336df19a7ab0df7e5452c924aa3a8 for the verbose output log.

Steps to Reproduce the Issue

Use this docker image:

docker run -it -v /dev/shm:/dev/shm --gpus all public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:c0243db7c4db936d090d0214e308e771f8485168

and run:

cd /vllm-workspace/tests
NCCL_DEBUG_SUBSYS=ALL NCCL_DEBUG=TRACE TP_SIZE=1 DP_SIZE=2 pytest -v -s "v1/test_async_llm_dp.py::test_load[True-mp-RequestOutputKind.DELTA]"

on a 4x L4 node, it will crash with the error shown below.

NCCL Version

2.27.3+cu128

Your platform details

No response

Error Message & Behavior

The error is:

[2025-09-09 23:51:24] 9b405963aa12:774:854 [0] transport/shm.cc:590 NCCL WARN Cuda failure 217 'peer access is not supported between these two devices'

The error clearly comes from this line:

CUCHECK(cuMemImportFromShareableHandle(&handle, (void *)(uintptr_t)fd, type));

When I set NCCL_CUMEM_HOST_ENABLE=0, the test passes.
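For context, this is how the workaround looks when applied from Python before any NCCL communicator is created (a minimal sketch; `setdefault` is used so a value the user already exported is not clobbered — NCCL reads its environment variables at init time):

```python
import os

# Workaround from this report: disable NCCL's cuMem-backed host memory
# allocation. This must run before the first NCCL communicator is
# created, since NCCL reads the variable during initialization.
# setdefault() keeps any value the user has already set in the shell.
os.environ.setdefault("NCCL_CUMEM_HOST_ENABLE", "0")

print(os.environ["NCCL_CUMEM_HOST_ENABLE"])
```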

I'm not sure whether this is an NCCL bug or a driver bug. Does sharing host memory through cumem require device p2p access?

More background: after closing #1234 , I added PR vllm-project/vllm#24141 in vLLM to remove the env var override, so that NCCL can use the cumem APIs as expected. However, we then started to see CI failures, which I finally traced down to this point.

Since vLLM is preparing to release a new version soon, we have several options:

  1. Revert [distributed][rl] remove nccl cumem env var override vllm-project/vllm#24141 to keep backward compatibility. Good for vLLM users, but NCCL cannot use the advanced features that depend on cumem APIs.
  2. Keep [distributed][rl] remove nccl cumem env var override vllm-project/vllm#24141 but override NCCL_CUMEM_HOST_ENABLE=0 instead. This only affects NCCL's host memory allocation, which is good for NCCL, but it adds complexity for vLLM users (they need to be aware of this change when connecting vLLM to their training processes).
  3. The NCCL team investigates this and tells us how to fix it from the vLLM side, which is perfect for both worlds.

We hope for option 3, but if it cannot be solved in time, we will likely fall back to option 1 🙏

Note: I tried to come up with a standalone reproducer on the same node, but did not succeed. I'm not sure under what conditions

CUCHECK(cuMemImportFromShareableHandle(&handle, (void *)(uintptr_t)fd, type));

is actually invoked.
