bugfix(nvlink): Add explicit P2P access enablement and error handling for NvlinkTransport #683
Conversation
Add static getNumDevices() function to cache CUDA device count and reuse it in supportFabricMem() to avoid multiple cudaGetDeviceCount calls. Signed-off-by: staryxchen <staryxchen@tencent.com>
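For illustration, a minimal sketch of the caching approach described above, assuming the CUDA runtime API; this is illustrative only, not the PR's exact code:

```cpp
#include <cuda_runtime.h>

// Sketch: query the CUDA device count once and cache it in a static local,
// so later calls (e.g. from supportFabricMem()) do not repeat the query.
static int getNumDevices() {
    static int num_devices = [] {
        int count = 0;
        if (cudaGetDeviceCount(&count) != cudaSuccess) {
            return 0;  // treat a failed query as "no devices"
        }
        return count;
    }();
    return num_devices;
}
```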
- Added checkCudaError helper function for consistent error handling
- Implemented enableP2PAccess function to explicitly enable bidirectional P2P access
- Modified NvlinkTransport constructor to enable P2P access between all device pairs
Signed-off-by: staryxchen <staryxchen@tencent.com>
Added checkCudaErrorReturn function and updated enableP2PAccess to properly handle CUDA errors by returning false on failure. Signed-off-by: staryxchen <staryxchen@tencent.com>
Pull Request Overview
This PR adds explicit peer-to-peer (P2P) access enablement and improved error handling to the NvlinkTransport class to ensure reliable GPU-to-GPU transfers in multi-GPU environments.
- Added helper functions for consistent CUDA error handling
- Implemented explicit bidirectional P2P access enablement between all GPU device pairs
- Enhanced error logging with detailed CUDA error codes and messages
- Remove unused checkCudaError function
- Update getNumDevices to use checkCudaErrorReturn
- Fix typo in error message ("not device" to "no device")
- Improve peer access error handling with detailed logging
Signed-off-by: staryxchen <staryxchen@tencent.com>
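For illustration, a sketch of what an error-reporting helper along the lines of checkCudaErrorReturn might look like; the name follows the commit messages above, and the body (including the glog-based logging) is an assumption, not the PR's actual implementation:

```cpp
#include <cuda_runtime.h>
#include <glog/logging.h>

// Sketch: log the CUDA error code and message, then signal failure to the
// caller so it can return false instead of continuing.
static bool checkCudaErrorReturn(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        LOG(ERROR) << what << " failed: " << cudaGetErrorString(err)
                   << " (error code " << static_cast<int>(err) << ")";
        return false;
    }
    return true;
}
```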
Force-pushed from 551bae1 to 2cf9dfd
It seems this PR breaks MNNVL usage:
[2025-09-19 09:42:44 TP0] Using KV cache dtype: torch.bfloat16
[2025-09-19 09:42:45 TP1] KV Cache is allocated. #tokens: 1745576, K size: 59.93 GB, V size: 59.93 GB
[2025-09-19 09:42:45 TP0] KV Cache is allocated. #tokens: 1745576, K size: 59.93 GB, V size: 59.93 GB
[2025-09-19 09:42:45 TP0] Memory pool end. avail mem=53.92 GB
[2025-09-19 09:42:45 TP0] max_total_num_tokens=1745576, chunked_prefill_size=393216, max_prefill_tokens=16384, max_running_requests=4096, context_len=40960, available_gpu_mem=53.25 GB
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20250919 09:42:46.478288 80263 transfer_engine.cpp:422] Metrics reporting is disabled (set MC_TE_METRIC=1 to enable)
I20250919 09:42:46.478338 80263 transfer_engine.cpp:44] Transfer Engine starting. Server: 192.168.3.226, Metadata: P2PHANDSHAKE, ip_or_host_name: , rpc_port: 0
I20250919 09:42:46.478355 80263 transfer_engine.cpp:63] Transfer Engine parseHostNameWithPort. server_name: 192.168.3.226 port: 12001
I20250919 09:42:46.478390 80263 transfer_engine.cpp:114] Transfer Engine RPC using P2P handshake, listening on 192.168.3.226:16145
I20250919 09:42:46.478504 80263 transfer_engine.cpp:138] Auto-discovering topology...
W20250919 09:42:46.478768 80263 topology.cpp:58] No RDMA devices found, check your device installation
I20250919 09:42:46.478857 80263 transfer_engine.cpp:153] Topology discovery complete. Found 0 HCAs.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20250919 09:42:46.487957 80264 transfer_engine.cpp:422] Metrics reporting is disabled (set MC_TE_METRIC=1 to enable)
I20250919 09:42:46.488008 80264 transfer_engine.cpp:44] Transfer Engine starting. Server: 192.168.3.226, Metadata: P2PHANDSHAKE, ip_or_host_name: , rpc_port: 0
I20250919 09:42:46.488024 80264 transfer_engine.cpp:63] Transfer Engine parseHostNameWithPort. server_name: 192.168.3.226 port: 12001
I20250919 09:42:46.488061 80264 transfer_engine.cpp:114] Transfer Engine RPC using P2P handshake, listening on 192.168.3.226:15225
I20250919 09:42:46.488163 80264 transfer_engine.cpp:138] Auto-discovering topology...
W20250919 09:42:46.488391 80264 topology.cpp:58] No RDMA devices found, check your device installation
I20250919 09:42:46.488485 80264 transfer_engine.cpp:153] Topology discovery complete. Found 0 HCAs.
W20250919 09:42:47.287942 80264 nvlink_transport.cpp:375] Memory region 0x4989e940 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.288010 80264 nvlink_transport.cpp:375] Memory region 0x49947ac0 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.288023 80264 nvlink_transport.cpp:375] Memory region 0x49b24640 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.288034 80264 nvlink_transport.cpp:375] Memory region 0x492de900 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.288043 80264 nvlink_transport.cpp:375] Memory region 0x49ce5900 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.288049 80264 nvlink_transport.cpp:375] Memory region 0xe42bb3ff0040 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.303099 80263 nvlink_transport.cpp:375] Memory region 0x47a8c3c0 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.303155 80263 nvlink_transport.cpp:375] Memory region 0x4304d680 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.303170 80263 nvlink_transport.cpp:375] Memory region 0x4781fb80 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.303179 80263 nvlink_transport.cpp:375] Memory region 0x432031c0 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.303186 80263 nvlink_transport.cpp:375] Memory region 0x491a3880 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.303194 80263 nvlink_transport.cpp:375] Memory region 0xfa174bff0040 is not allocated by cuMemCreate, but it can be used as local buffer
[2025-09-19 09:42:47] INFO: Started server process [79873]
[2025-09-19 09:42:47] INFO: Waiting for application startup.
[2025-09-19 09:42:47] INFO: Application startup complete.
[2025-09-19 09:42:47] INFO: Uvicorn running on http://192.168.3.226:30000 (Press CTRL+C to quit)
[2025-09-19 09:42:48] INFO: 192.168.3.226:42042 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-09-19 09:42:48] Start of pd disaggregation warmup ...
[2025-09-19 09:42:48 TP0] Prefill batch. #new-seq: 1, #new-token: 4, #cached-token: 0, token usage: 0.00, #unbootstrapped-req: 0, #queue-req: 0, #transferring-req: 0, input throughput (token/s): 0.00,
[2025-09-19 09:42:48 TP1] TpModelWorkerClient hit an exception: Traceback (most recent call last):
File "/rootprimary_synced/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 143, in forward_thread_func
self.forward_thread_func_()
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/rootprimary_synced/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 172, in forward_thread_func_
resolve_future_token_ids(input_ids, self.future_token_ids_map)
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 736, in compile_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 1495, in __call__
return self._torchdynamo_orig_callable(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 1272, in __call__
result = self._inner_convert(
^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 629, in __call__
return _compile(
^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 1111, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_utils_internal.py", line 97, in wrapper_function
return function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 793, in compile_inner
return _compile_inner(code, one_graph, hooks, transform)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 832, in _compile_inner
out_code = transform_code_object(code, transform)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/bytecode_transformation.py", line 1424, in transform_code_object
transformations(instructions, code_options)
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 267, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 753, in transform
tracer.run()
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 3497, in run
super().run()
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 1363, in run
while self.step():
^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 1267, in step
self.dispatch_table[inst.opcode](self, inst)
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 834, in wrapper
return inner_fn(self, inst)
^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 2910, in CALL
self._call(inst)
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 2904, in _call
self.call_function(fn, args, kwargs)
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 1193, in call_function
self.push(fn.call_function(self, args, kwargs)) # type: ignore[arg-type]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/variables/lazy.py", line 201, in realize_and_forward
return getattr(self.realize(), name)(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/variables/torch.py", line 1338, in call_function
tensor_variable = wrap_fx_proxy(
^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/variables/builder.py", line 2559, in wrap_fx_proxy
return wrap_fx_proxy_cls(target_cls=TensorVariable, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/variables/builder.py", line 2625, in wrap_fx_proxy_cls
return _wrap_fx_proxy(
^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/variables/builder.py", line 2723, in _wrap_fx_proxy
example_value = get_fake_value(proxy.node, tx, allow_non_graph_fake=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/utils.py", line 3355, in get_fake_value
raise TorchRuntimeError(str(e)).with_traceback(e.__traceback__) from None
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/utils.py", line 3253, in get_fake_value
ret_val = wrap_fake_exception(
^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/utils.py", line 2753, in wrap_fake_exception
return fn()
^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/utils.py", line 3254, in <lambda>
lambda: run_node(tx.output, node, args, kwargs, nnmodule)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/utils.py", line 3462, in run_node
raise RuntimeError(make_error_message(e)).with_traceback(
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/utils.py", line 3421, in run_node
return node.target(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/utils/_stats.py", line 28, in wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_subclasses/fake_tensor.py", line 1352, in __torch_dispatch__
return self.dispatch(func, types, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_subclasses/fake_tensor.py", line 2058, in dispatch
return self._cached_dispatch_impl(func, types, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_subclasses/fake_tensor.py", line 1487, in _cached_dispatch_impl
output = self._dispatch_impl(func, types, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_subclasses/fake_tensor.py", line 2601, in _dispatch_impl
decomposition_table[func](*args, **kwargs)
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_prims_common/wrappers.py", line 309, in _fn
result = fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_compile.py", line 53, in inner
return disable_fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 929, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_prims_common/wrappers.py", line 149, in _fn
result = fn(**bound.arguments)
^^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_refs/__init__.py", line 1966, in where
utils.check_same_device(pred, a, b, allow_cpu_scalar_tensors=True)
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_prims_common/__init__.py", line 838, in check_same_device
raise RuntimeError(msg)
torch._dynamo.exc.TorchRuntimeError: Dynamo failed to run FX node with fake tensors: call_function <built-in method where of type object at 0xe4353fbb8900>(*(FakeTensor(..., device='cuda:3', size=(s72,), dtype=torch.bool), FakeTensor(..., device='cuda:1', size=(s72,), dtype=torch.int64), FakeTensor(..., device='cuda:3', size=(s72,), dtype=torch.int64)), **{}): got RuntimeError('Tensor on device cuda:1 is not on the expected device cuda:3!')
from user code:
File "/rootprimary_synced/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 52, in resolve_future_token_ids
input_ids[:] = torch.where(
[2025-09-19 09:42:48] Received sigquit from a child process. It usually means the child failed.
Killed
I will revert it first. And we can revisit and fix it later.
Hi @ShangmingCai After you revert the code, did the error disappear? It appears there are some issues with GPUs that support the CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED attribute. My test machine (H20) does not support this attribute, so my tests did not cover that path. My apologies. I reviewed some documentation and found no indication that CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED conflicts with enabling P2P access. However, based on this discussion, even if a device reports the CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED attribute, calling cuMemRetainAllocationHandle can still return CUDA_ERROR_NOT_PERMITTED. Perhaps we could print the specific error code when the issue occurs to further pinpoint the problem.
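As a sketch of the diagnostic suggested here, one could query the attribute and print the exact CUresult from a failing driver call; the function name reportFabricSupport is hypothetical, and the snippet assumes cuInit() has already been called and that the toolkit defines CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED:

```cpp
#include <cuda.h>
#include <cstdio>

// Sketch: query fabric-handle support and print the exact CUresult when a
// driver call fails, to help pinpoint issues like CUDA_ERROR_NOT_PERMITTED.
static void reportFabricSupport(CUdevice dev) {
    int fabric_supported = 0;
    CUresult rc = cuDeviceGetAttribute(
        &fabric_supported,
        CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED, dev);
    if (rc != CUDA_SUCCESS) {
        const char *msg = nullptr;
        cuGetErrorString(rc, &msg);
        std::fprintf(stderr, "cuDeviceGetAttribute failed: %d (%s)\n",
                     static_cast<int>(rc), msg ? msg : "unknown");
        return;
    }
    std::fprintf(stderr, "device %d fabric handle support: %d\n",
                 static_cast<int>(dev), fabric_supported);
}
```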
@staryxchen Yes, it works on GB200 after the revert. I think the problem is here:
torch._dynamo.exc.TorchRuntimeError: Dynamo failed to run FX node with fake tensors: call_function <built-in method where of type object at 0xe4353fbb8900>(*(FakeTensor(..., device='cuda:3', size=(s72,), dtype=torch.bool), FakeTensor(..., device='cuda:1', size=(s72,), dtype=torch.int64), FakeTensor(..., device='cuda:3', size=(s72,), dtype=torch.int64)), **{}): got RuntimeError('Tensor on device cuda:1 is not on the expected device cuda:3!')
bugfix(nvlink): Add explicit P2P access enablement and error handling for NvlinkTransport (kvcache-ai#683)
* refactor(nvlink_transport): cache device count to avoid repeated queries. Add static getNumDevices() function to cache CUDA device count and reuse it in supportFabricMem() to avoid multiple cudaGetDeviceCount calls.
* bugfix(nvlink): add explicit P2P access enablement and error handling.
  - Added checkCudaError helper function for consistent error handling
  - Implemented enableP2PAccess function to explicitly enable bidirectional P2P access
  - Modified NvlinkTransport constructor to enable P2P access between all device pairs
* Added checkCudaErrorReturn function and updated enableP2PAccess to properly handle CUDA errors by returning false on failure.
* update something according to comment
  - Remove unused checkCudaError function
  - Update getNumDevices to use checkCudaErrorReturn
  - Fix typo in error message ("not device" to "no device")
  - Improve peer access error handling with detailed logging
Signed-off-by: staryxchen <staryxchen@tencent.com>
* Fix nvlink_transport bug: revert kvcache-ai#683
* fix lint
Summary
This PR adds explicit peer-to-peer (P2P) access enablement and improved error handling to the NvlinkTransport class. The changes ensure that bidirectional P2P access is properly established between all GPU device pairs during transport initialization.
Changes
Added
- checkCudaError() helper function for consistent CUDA error handling
- enableP2PAccess() function to explicitly enable bidirectional P2P access between device pairs
Modified
- getNumDevices() now uses the new checkCudaError() helper for consistent error handling
- NvlinkTransport constructor now iterates through all device pairs and enables P2P access
Technical Details
The implementation follows NVIDIA's recommended practices for P2P access:
- Uses cudaDeviceCanAccessPeer() to check if devices support P2P access
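As a rough sketch of that practice (not the PR's actual code), the pairwise enablement might look like the following; the enableP2PAccess name follows the PR description, while the skip-on-unsupported and early-return behavior are assumptions:

```cpp
#include <cuda_runtime.h>

// Sketch: enable bidirectional P2P access between every pair of visible
// devices, skipping pairs the driver reports as unsupported.
static bool enableP2PAccess(int num_devices) {
    for (int src = 0; src < num_devices; ++src) {
        for (int dst = 0; dst < num_devices; ++dst) {
            if (src == dst) continue;
            int can_access = 0;
            if (cudaDeviceCanAccessPeer(&can_access, src, dst) != cudaSuccess ||
                !can_access)
                continue;  // this pair does not support P2P, skip it
            if (cudaSetDevice(src) != cudaSuccess) return false;
            cudaError_t err = cudaDeviceEnablePeerAccess(dst, 0);
            if (err != cudaSuccess &&
                err != cudaErrorPeerAccessAlreadyEnabled)
                return false;  // genuine failure, let the caller handle it
        }
    }
    return true;
}
```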
Impact
This fix resolves potential issues where P2P access might not be properly established, leading to runtime errors during GPU-to-GPU transfers. The explicit enablement ensures reliable P2P communication in multi-GPU environments.