bugfix(nvlink): Add explicit P2P access enablement and error handling for NvlinkTransport #683
Conversation
Add static getNumDevices() function to cache CUDA device count and reuse it in supportFabricMem() to avoid multiple cudaGetDeviceCount calls. Signed-off-by: staryxchen <staryxchen@tencent.com>
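For illustration, a minimal sketch of the caching approach described above, assuming the CUDA runtime API; this is illustrative only, not the PR's exact code:

```cpp
#include <cuda_runtime.h>

// Sketch: query the CUDA device count once and cache it in a static local,
// so later calls (e.g. from supportFabricMem()) do not repeat the query.
static int getNumDevices() {
    static int num_devices = [] {
        int count = 0;
        if (cudaGetDeviceCount(&count) != cudaSuccess) {
            return 0;  // treat a failed query as "no devices"
        }
        return count;
    }();
    return num_devices;
}
```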
- Added checkCudaError helper function for consistent error handling
- Implemented enableP2PAccess function to explicitly enable bidirectional P2P access
- Modified NvlinkTransport constructor to enable P2P access between all device pairs
Signed-off-by: staryxchen <staryxchen@tencent.com>
Added checkCudaErrorReturn function and updated enableP2PAccess to properly handle CUDA errors by returning false on failure. Signed-off-by: staryxchen <staryxchen@tencent.com>
Pull Request Overview
This PR adds explicit peer-to-peer (P2P) access enablement and improved error handling to the NvlinkTransport class to ensure reliable GPU-to-GPU transfers in multi-GPU environments.
- Added helper functions for consistent CUDA error handling
- Implemented explicit bidirectional P2P access enablement between all GPU device pairs
- Enhanced error logging with detailed CUDA error codes and messages
- Remove unused checkCudaError function
- Update getNumDevices to use checkCudaErrorReturn
- Fix typo in error message ("not device" to "no device")
- Improve peer access error handling with detailed logging
Signed-off-by: staryxchen <staryxchen@tencent.com>
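For illustration, a sketch of what an error-reporting helper along the lines of checkCudaErrorReturn might look like; the name follows the commit messages above, and the body (including the glog-based logging) is an assumption, not the PR's actual implementation:

```cpp
#include <cuda_runtime.h>
#include <glog/logging.h>

// Sketch: log the CUDA error code and message, then signal failure to the
// caller so it can return false instead of continuing.
static bool checkCudaErrorReturn(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        LOG(ERROR) << what << " failed: " << cudaGetErrorString(err)
                   << " (error code " << static_cast<int>(err) << ")";
        return false;
    }
    return true;
}
```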
Force-pushed from 551bae1 to 2cf9dfd
It seems this PR breaks MNNVL usage:
[2025-09-19 09:42:44 TP0] Using KV cache dtype: torch.bfloat16
[2025-09-19 09:42:45 TP1] KV Cache is allocated. #tokens: 1745576, K size: 59.93 GB, V size: 59.93 GB
[2025-09-19 09:42:45 TP0] KV Cache is allocated. #tokens: 1745576, K size: 59.93 GB, V size: 59.93 GB
[2025-09-19 09:42:45 TP0] Memory pool end. avail mem=53.92 GB
[2025-09-19 09:42:45 TP0] max_total_num_tokens=1745576, chunked_prefill_size=393216, max_prefill_tokens=16384, max_running_requests=4096, context_len=40960, available_gpu_mem=53.25 GB
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20250919 09:42:46.478288 80263 transfer_engine.cpp:422] Metrics reporting is disabled (set MC_TE_METRIC=1 to enable)
I20250919 09:42:46.478338 80263 transfer_engine.cpp:44] Transfer Engine starting. Server: 192.168.3.226, Metadata: P2PHANDSHAKE, ip_or_host_name: , rpc_port: 0
I20250919 09:42:46.478355 80263 transfer_engine.cpp:63] Transfer Engine parseHostNameWithPort. server_name: 192.168.3.226 port: 12001
I20250919 09:42:46.478390 80263 transfer_engine.cpp:114] Transfer Engine RPC using P2P handshake, listening on 192.168.3.226:16145
I20250919 09:42:46.478504 80263 transfer_engine.cpp:138] Auto-discovering topology...
W20250919 09:42:46.478768 80263 topology.cpp:58] No RDMA devices found, check your device installation
I20250919 09:42:46.478857 80263 transfer_engine.cpp:153] Topology discovery complete. Found 0 HCAs.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20250919 09:42:46.487957 80264 transfer_engine.cpp:422] Metrics reporting is disabled (set MC_TE_METRIC=1 to enable)
I20250919 09:42:46.488008 80264 transfer_engine.cpp:44] Transfer Engine starting. Server: 192.168.3.226, Metadata: P2PHANDSHAKE, ip_or_host_name: , rpc_port: 0
I20250919 09:42:46.488024 80264 transfer_engine.cpp:63] Transfer Engine parseHostNameWithPort. server_name: 192.168.3.226 port: 12001
I20250919 09:42:46.488061 80264 transfer_engine.cpp:114] Transfer Engine RPC using P2P handshake, listening on 192.168.3.226:15225
I20250919 09:42:46.488163 80264 transfer_engine.cpp:138] Auto-discovering topology...
W20250919 09:42:46.488391 80264 topology.cpp:58] No RDMA devices found, check your device installation
I20250919 09:42:46.488485 80264 transfer_engine.cpp:153] Topology discovery complete. Found 0 HCAs.
W20250919 09:42:47.287942 80264 nvlink_transport.cpp:375] Memory region 0x4989e940 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.288010 80264 nvlink_transport.cpp:375] Memory region 0x49947ac0 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.288023 80264 nvlink_transport.cpp:375] Memory region 0x49b24640 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.288034 80264 nvlink_transport.cpp:375] Memory region 0x492de900 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.288043 80264 nvlink_transport.cpp:375] Memory region 0x49ce5900 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.288049 80264 nvlink_transport.cpp:375] Memory region 0xe42bb3ff0040 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.303099 80263 nvlink_transport.cpp:375] Memory region 0x47a8c3c0 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.303155 80263 nvlink_transport.cpp:375] Memory region 0x4304d680 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.303170 80263 nvlink_transport.cpp:375] Memory region 0x4781fb80 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.303179 80263 nvlink_transport.cpp:375] Memory region 0x432031c0 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.303186 80263 nvlink_transport.cpp:375] Memory region 0x491a3880 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.303194 80263 nvlink_transport.cpp:375] Memory region 0xfa174bff0040 is not allocated by cuMemCreate, but it can be used as local buffer
[2025-09-19 09:42:47] INFO: Started server process [79873]
[2025-09-19 09:42:47] INFO: Waiting for application startup.
[2025-09-19 09:42:47] INFO: Application startup complete.
[2025-09-19 09:42:47] INFO: Uvicorn running on http://192.168.3.226:30000 (Press CTRL+C to quit)
[2025-09-19 09:42:48] INFO: 192.168.3.226:42042 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-09-19 09:42:48] Start of pd disaggregation warmup ...
[2025-09-19 09:42:48 TP0] Prefill batch. #new-seq: 1, #new-token: 4, #cached-token: 0, token usage: 0.00, #unbootstrapped-req: 0, #queue-req: 0, #transferring-req: 0, input throughput (token/s): 0.00,
[2025-09-19 09:42:48 TP1] TpModelWorkerClient hit an exception: Traceback (most recent call last):
File "/rootprimary_synced/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 143, in forward_thread_func
self.forward_thread_func_()
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/rootprimary_synced/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 172, in forward_thread_func_
resolve_future_token_ids(input_ids, self.future_token_ids_map)
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 736, in compile_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 1495, in __call__
return self._torchdynamo_orig_callable(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 1272, in __call__
result = self._inner_convert(
^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 629, in __call__
return _compile(
^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 1111, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_utils_internal.py", line 97, in wrapper_function
return function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 793, in compile_inner
return _compile_inner(code, one_graph, hooks, transform)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 832, in _compile_inner
out_code = transform_code_object(code, transform)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/bytecode_transformation.py", line 1424, in transform_code_object
transformations(instructions, code_options)
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 267, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 753, in transform
tracer.run()
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 3497, in run
super().run()
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 1363, in run
while self.step():
^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 1267, in step
self.dispatch_table[inst.opcode](self, inst)
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 834, in wrapper
return inner_fn(self, inst)
^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 2910, in CALL
self._call(inst)
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 2904, in _call
self.call_function(fn, args, kwargs)
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 1193, in call_function
self.push(fn.call_function(self, args, kwargs)) # type: ignore[arg-type]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/variables/lazy.py", line 201, in realize_and_forward
return getattr(self.realize(), name)(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/variables/torch.py", line 1338, in call_function
tensor_variable = wrap_fx_proxy(
^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/variables/builder.py", line 2559, in wrap_fx_proxy
return wrap_fx_proxy_cls(target_cls=TensorVariable, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/variables/builder.py", line 2625, in wrap_fx_proxy_cls
return _wrap_fx_proxy(
^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/variables/builder.py", line 2723, in _wrap_fx_proxy
example_value = get_fake_value(proxy.node, tx, allow_non_graph_fake=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/utils.py", line 3355, in get_fake_value
raise TorchRuntimeError(str(e)).with_traceback(e.__traceback__) from None
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/utils.py", line 3253, in get_fake_value
ret_val = wrap_fake_exception(
^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/utils.py", line 2753, in wrap_fake_exception
return fn()
^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/utils.py", line 3254, in <lambda>
lambda: run_node(tx.output, node, args, kwargs, nnmodule)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/utils.py", line 3462, in run_node
raise RuntimeError(make_error_message(e)).with_traceback(
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/utils.py", line 3421, in run_node
return node.target(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/utils/_stats.py", line 28, in wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_subclasses/fake_tensor.py", line 1352, in __torch_dispatch__
return self.dispatch(func, types, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_subclasses/fake_tensor.py", line 2058, in dispatch
return self._cached_dispatch_impl(func, types, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_subclasses/fake_tensor.py", line 1487, in _cached_dispatch_impl
output = self._dispatch_impl(func, types, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_subclasses/fake_tensor.py", line 2601, in _dispatch_impl
decomposition_table[func](*args, **kwargs)
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_prims_common/wrappers.py", line 309, in _fn
result = fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_compile.py", line 53, in inner
return disable_fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 929, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_prims_common/wrappers.py", line 149, in _fn
result = fn(**bound.arguments)
^^^^^^^^^^^^^^^^^^^^^
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_refs/__init__.py", line 1966, in where
utils.check_same_device(pred, a, b, allow_cpu_scalar_tensors=True)
File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_prims_common/__init__.py", line 838, in check_same_device
raise RuntimeError(msg)
torch._dynamo.exc.TorchRuntimeError: Dynamo failed to run FX node with fake tensors: call_function <built-in method where of type object at 0xe4353fbb8900>(*(FakeTensor(..., device='cuda:3', size=(s72,), dtype=torch.bool), FakeTensor(..., device='cuda:1', size=(s72,), dtype=torch.int64), FakeTensor(..., device='cuda:3', size=(s72,), dtype=torch.int64)), **{}): got RuntimeError('Tensor on device cuda:1 is not on the expected device cuda:3!')
from user code:
File "/rootprimary_synced/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 52, in resolve_future_token_ids
input_ids[:] = torch.where(
[2025-09-19 09:42:48] Received sigquit from a child process. It usually means the child failed.
Killed
I will revert it first. And we can revisit and fix it later.
Hi @ShangmingCai After you revert the code, did the error disappear? It appears there are some issues with GPUs that support the CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED attribute. My test machine (H20) does not support this attribute, so my tests did not cover that path. My apologies. I reviewed some documentation and found no indication that CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED conflicts with enabling P2P access. However, based on this discussion, even if a device reports the CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED attribute, calling cuMemRetainAllocationHandle can still return CUDA_ERROR_NOT_PERMITTED. Perhaps we could print the specific error code when the issue occurs to further pinpoint the problem.
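As a sketch of the diagnostic suggested here, one could query the attribute and print the exact CUresult from a failing driver call; the function name reportFabricSupport is hypothetical, and the snippet assumes cuInit() has already been called and that the toolkit defines CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED:

```cpp
#include <cuda.h>
#include <cstdio>

// Sketch: query fabric-handle support and print the exact CUresult when a
// driver call fails, to help pinpoint issues like CUDA_ERROR_NOT_PERMITTED.
static void reportFabricSupport(CUdevice dev) {
    int fabric_supported = 0;
    CUresult rc = cuDeviceGetAttribute(
        &fabric_supported,
        CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED, dev);
    if (rc != CUDA_SUCCESS) {
        const char *msg = nullptr;
        cuGetErrorString(rc, &msg);
        std::fprintf(stderr, "cuDeviceGetAttribute failed: %d (%s)\n",
                     static_cast<int>(rc), msg ? msg : "unknown");
        return;
    }
    std::fprintf(stderr, "device %d fabric handle support: %d\n",
                 static_cast<int>(dev), fabric_supported);
}
```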
@staryxchen Yes, it works on GB200 after the revert. I think the problem is here:
torch._dynamo.exc.TorchRuntimeError: Dynamo failed to run FX node with fake tensors: call_function <built-in method where of type object at 0xe4353fbb8900>(*(FakeTensor(..., device='cuda:3', size=(s72,), dtype=torch.bool), FakeTensor(..., device='cuda:1', size=(s72,), dtype=torch.int64), FakeTensor(..., device='cuda:3', size=(s72,), dtype=torch.int64)), **{}): got RuntimeError('Tensor on device cuda:1 is not on the expected device cuda:3!')
bugfix(nvlink): Add explicit P2P access enablement and error handling for NvlinkTransport (kvcache-ai#683)
* refactor(nvlink_transport): cache device count to avoid repeated queries. Add static getNumDevices() function to cache CUDA device count and reuse it in supportFabricMem() to avoid multiple cudaGetDeviceCount calls.
* bugfix(nvlink): add explicit P2P access enablement and error handling.
  - Added checkCudaError helper function for consistent error handling
  - Implemented enableP2PAccess function to explicitly enable bidirectional P2P access
  - Modified NvlinkTransport constructor to enable P2P access between all device pairs
* Added checkCudaErrorReturn function and updated enableP2PAccess to properly handle CUDA errors by returning false on failure.
* update something according to comment
  - Remove unused checkCudaError function
  - Update getNumDevices to use checkCudaErrorReturn
  - Fix typo in error message ("not device" to "no device")
  - Improve peer access error handling with detailed logging
Signed-off-by: staryxchen <staryxchen@tencent.com>
* Fix nvlink_transport bug: revert kvcache-ai#683
* fix lint
Summary
This PR adds explicit peer-to-peer (P2P) access enablement and improved error handling to the NvlinkTransport class. The changes ensure that bidirectional P2P access is properly established between all GPU device pairs during transport initialization.
Changes
Added
- checkCudaError() helper function for consistent CUDA error handling
- enableP2PAccess() function to explicitly enable bidirectional P2P access between device pairs
Modified
- getNumDevices() now uses the new checkCudaError() helper for consistent error handling
- NvlinkTransport constructor now iterates through all device pairs and enables P2P access
Technical Details
The implementation follows NVIDIA's recommended practices for P2P access:
- Uses cudaDeviceCanAccessPeer() to check if devices support P2P access
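As a rough sketch of that practice (not the PR's actual code), the pairwise enablement might look like the following; the enableP2PAccess name follows the PR description, while the skip-on-unsupported and early-return behavior are assumptions:

```cpp
#include <cuda_runtime.h>

// Sketch: enable bidirectional P2P access between every pair of visible
// devices, skipping pairs the driver reports as unsupported.
static bool enableP2PAccess(int num_devices) {
    for (int src = 0; src < num_devices; ++src) {
        for (int dst = 0; dst < num_devices; ++dst) {
            if (src == dst) continue;
            int can_access = 0;
            if (cudaDeviceCanAccessPeer(&can_access, src, dst) != cudaSuccess ||
                !can_access)
                continue;  // this pair does not support P2P, skip it
            if (cudaSetDevice(src) != cudaSuccess) return false;
            cudaError_t err = cudaDeviceEnablePeerAccess(dst, 0);
            if (err != cudaSuccess &&
                err != cudaErrorPeerAccessAlreadyEnabled)
                return false;  // genuine failure, let the caller handle it
        }
    }
    return true;
}
```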
Impact
This fix resolves potential issues where P2P access might not be properly established, leading to runtime errors during GPU-to-GPU transfers. The explicit enablement ensures reliable P2P communication in multi-GPU environments.