Steps to Reproduce the Issue

Use this docker image:

docker run -it -v /dev/shm:/dev/shm --gpus all public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:c0243db7c4db936d090d0214e308e771f8485168

and run:
cd /vllm-workspace/tests
NCCL_DEBUG_SUBSYS=ALL NCCL_DEBUG=TRACE TP_SIZE=1 DP_SIZE=2 pytest -v -s "v1/test_async_llm_dp.py::test_load[True-mp-RequestOutputKind.DELTA]"
On a 4xL4 node, it will crash and give the error shown below.
NCCL Version
2.27.3+cu128
Your platform details
No response
Error Message & Behavior
The error is:
[2025-09-09 23:51:24] 9b405963aa12:774:854 [0] transport/shm.cc:590 NCCL WARN Cuda failure 217 'peer access is not supported between these two devices'
The error clearly comes from this line:
nccl/src/transport/shm.cc
Line 590 in f130899
When I set NCCL_CUMEM_HOST_ENABLE=0, the test passes.
I'm not sure whether this is an NCCL bug or a driver bug. Does sharing host memory through the cuMem APIs require device P2P access?
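To check whether device P2P access is actually available between the GPUs on a node like this, one can query the CUDA driver directly. This is a hypothetical helper (not part of vLLM or NCCL) that calls `cuDeviceCanAccessPeer` via ctypes; it returns None on machines without a CUDA driver.

```python
import ctypes
import ctypes.util

def peer_access_matrix():
    """Query cuDeviceCanAccessPeer for every device pair via the CUDA
    driver API. Hypothetical diagnostic helper: returns None when no
    CUDA driver is present (e.g. on a CPU-only machine)."""
    path = ctypes.util.find_library("cuda")
    if path is None:
        return None
    cuda = ctypes.CDLL(path)
    if cuda.cuInit(0) != 0:
        return None
    count = ctypes.c_int()
    cuda.cuDeviceGetCount(ctypes.byref(count))
    matrix = {}
    for a in range(count.value):
        for b in range(count.value):
            if a == b:
                continue
            ok = ctypes.c_int()
            # cuDeviceCanAccessPeer(int* canAccessPeer, dev, peerDev)
            cuda.cuDeviceCanAccessPeer(ctypes.byref(ok), a, b)
            matrix[(a, b)] = bool(ok.value)
    return matrix

print(peer_access_matrix())
```

On L4 GPUs (which generally lack NVLink), many pairs may report False, which would be consistent with the "peer access is not supported" failure if the cuMem host path assumes P2P.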
More background: after closing #1234, I opened vllm-project/vllm#24141 in vLLM to remove the env var override, so that NCCL can use the cuMem APIs as expected. However, we then started to see CI failures, which I eventually traced down to this.
Since vLLM is preparing to release a new version soon, we have several options:

1. Keep [distributed][rl] remove nccl cumem env var override vllm-project/vllm#24141 but override NCCL_CUMEM_HOST_ENABLE=0 instead. This only affects NCCL host memory allocation, which is good for NCCL but adds complexity for vLLM users (they need to be aware of this change when connecting vLLM with their training processes).

3. The NCCL team investigates this and tells us how to fix it from the vLLM side, which would be perfect for both worlds.
We hope to take option 3, but if it cannot be solved in time, we will probably prefer option 1 🙏
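Option 1 could be as small as the sketch below. This is an assumption about how vLLM might apply the override, not actual vLLM code: using setdefault keeps any value the user has already exported, so training processes connecting to vLLM can still opt back in.

```python
import os

# Hypothetical sketch of option 1: override only NCCL's host-memory
# knob, leaving device-side cuMem (NCCL_CUMEM_ENABLE) untouched.
# setdefault preserves any value the user already exported, so this
# acts as a default rather than a hard override.
os.environ.setdefault("NCCL_CUMEM_HOST_ENABLE", "0")
print(os.environ["NCCL_CUMEM_HOST_ENABLE"])
```

This must run before any NCCL communicator is created, since NCCL reads the variable at init time.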
Note: I tried to come up with a standalone reproducible example on the same node, but could not succeed. I'm not sure when
How is this issue impacting you?
Application crash
Share Your Debug Logs
See https://gist.github.com/youkaichao/d3e336df19a7ab0df7e5452c924aa3a8 for the verbose output log.
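Since the verbose TRACE log is long, a small filter helps surface the relevant lines. A minimal sketch (a hypothetical helper, not an NCCL tool) that extracts the message of every "NCCL WARN" line from a captured log:

```python
import re

# Pull NCCL warnings out of a verbose NCCL_DEBUG=TRACE log so that
# lines such as the shm transport "Cuda failure 217" warning are easy
# to spot among the trace output.
WARN_RE = re.compile(r"NCCL WARN (.*)")

def nccl_warnings(log_text):
    """Return the message part of every 'NCCL WARN' line."""
    return [m.group(1) for m in WARN_RE.finditer(log_text)]

sample = ("[2025-09-09 23:51:24] 9b405963aa12:774:854 [0] "
          "transport/shm.cc:590 NCCL WARN Cuda failure 217 "
          "'peer access is not supported between these two devices'")
print(nccl_warnings(sample))
```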