
bugfix(nvlink): Add explicit P2P access enablement and error handling for NvlinkTransport #683

Merged

alogfans merged 4 commits into kvcache-ai:main from staryxchen:bugfix/nvlink on Aug 1, 2025
Conversation

@staryxchen (Collaborator)

Summary

This PR adds explicit peer-to-peer (P2P) access enablement and improved error handling to the NvlinkTransport class. The changes ensure that bidirectional P2P access is properly established between all GPU device pairs during transport initialization.

Changes

Added

  • checkCudaError() helper function for consistent CUDA error handling
  • enableP2PAccess() function to explicitly enable bidirectional P2P access between device pairs
  • Enhanced constructor logic to enable P2P access between all available GPU devices
  • Improved error logging with detailed CUDA error codes and messages

Modified

  • getNumDevices() now uses the new checkCudaError() helper for consistent error handling
  • NvlinkTransport constructor now iterates through all device pairs and enables P2P access
  • Added trace logging for P2P access enablement status

Technical Details

The implementation follows NVIDIA's recommended practices for P2P access:

  1. Peer Access Query: Uses cudaDeviceCanAccessPeer() to check if devices support P2P access
  2. Bidirectional Enablement: Enables P2P access in both directions (src→dst and dst→src)
  3. Error Handling: Comprehensive error checking with detailed error messages
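
The three steps above can be sketched with the CUDA runtime API. This is a hypothetical reconstruction from the PR description, not the actual code in nvlink_transport.cpp; the helper names (`checkCudaErrorReturn`, `enableP2PAccess`) follow the commit messages, but signatures and logging may differ in the merged change.

```cpp
// Sketch of the P2P enablement described above (assumed names/signatures).
#include <cuda_runtime.h>
#include <iostream>

// Log a CUDA error with its code and message; return false on failure.
static bool checkCudaErrorReturn(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        std::cerr << what << " failed: " << cudaGetErrorName(err)
                  << " (" << cudaGetErrorString(err) << ")" << std::endl;
        return false;
    }
    return true;
}

// Enable P2P access from src to dst if the hardware supports it.
static bool enableP2PAccess(int src, int dst) {
    int can_access = 0;
    if (!checkCudaErrorReturn(cudaDeviceCanAccessPeer(&can_access, src, dst),
                              "cudaDeviceCanAccessPeer"))
        return false;
    if (!can_access) return true;  // no P2P path between this pair; not an error
    if (!checkCudaErrorReturn(cudaSetDevice(src), "cudaSetDevice"))
        return false;
    cudaError_t err = cudaDeviceEnablePeerAccess(dst, /*flags=*/0);
    if (err == cudaErrorPeerAccessAlreadyEnabled) {
        cudaGetLastError();  // clear the sticky error and treat as success
        return true;
    }
    return checkCudaErrorReturn(err, "cudaDeviceEnablePeerAccess");
}

int main() {
    int n = 0;
    if (!checkCudaErrorReturn(cudaGetDeviceCount(&n), "cudaGetDeviceCount"))
        return 1;
    // Bidirectional enablement: both src->dst and dst->src for every pair.
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j)
            if (!enableP2PAccess(i, j) || !enableP2PAccess(j, i))
                std::cerr << "P2P enable failed for pair (" << i << ", " << j
                          << ")" << std::endl;
    return 0;
}
```

Note that `cudaDeviceEnablePeerAccess` is directional and must run with the accessing device current, which is why the sketch calls `cudaSetDevice(src)` before enabling access to `dst`.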

Impact

This fix resolves potential issues where P2P access might not be properly established, leading to runtime errors during GPU-to-GPU transfers. The explicit enablement ensures reliable P2P communication in multi-GPU environments.

Add static getNumDevices() function to cache CUDA device count and reuse it in supportFabricMem() to avoid multiple cudaGetDeviceCount calls.

Signed-off-by: staryxchen <staryxchen@tencent.com>
- Added checkCudaError helper function for consistent error handling
- Implemented enableP2PAccess function to explicitly enable bidirectional P2P access
- Modified NvlinkTransport constructor to enable P2P access between all device pairs

Signed-off-by: staryxchen <staryxchen@tencent.com>
Added checkCudaErrorReturn function and updated enableP2PAccess to properly handle CUDA errors by returning false on failure.

Signed-off-by: staryxchen <staryxchen@tencent.com>
@stmatengss stmatengss requested a review from Copilot July 30, 2025 03:27
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds explicit peer-to-peer (P2P) access enablement and improved error handling to the NvlinkTransport class to ensure reliable GPU-to-GPU transfers in multi-GPU environments.

  • Added helper functions for consistent CUDA error handling
  • Implemented explicit bidirectional P2P access enablement between all GPU device pairs
  • Enhanced error logging with detailed CUDA error codes and messages

Four review comment threads on mooncake-transfer-engine/src/transport/nvlink_transport/nvlink_transport.cpp (all outdated)
- Remove unused checkCudaError function
- Update getNumDevices to use checkCudaErrorReturn
- Fix typo in error message ("not device" to "no device")
- Improve peer access error handling with detailed logging

Signed-off-by: staryxchen <staryxchen@tencent.com>
@alogfans (Collaborator) left a comment


LGTM

@alogfans alogfans merged commit bdba25b into kvcache-ai:main Aug 1, 2025
10 checks passed
@staryxchen staryxchen deleted the bugfix/nvlink branch August 1, 2025 08:01
@ShangmingCai (Collaborator)

Seems like this PR breaks MNNVL usage:

[2025-09-19 09:42:44 TP0] Using KV cache dtype: torch.bfloat16
[2025-09-19 09:42:45 TP1] KV Cache is allocated. #tokens: 1745576, K size: 59.93 GB, V size: 59.93 GB
[2025-09-19 09:42:45 TP0] KV Cache is allocated. #tokens: 1745576, K size: 59.93 GB, V size: 59.93 GB
[2025-09-19 09:42:45 TP0] Memory pool end. avail mem=53.92 GB
[2025-09-19 09:42:45 TP0] max_total_num_tokens=1745576, chunked_prefill_size=393216, max_prefill_tokens=16384, max_running_requests=4096, context_len=40960, available_gpu_mem=53.25 GB
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20250919 09:42:46.478288 80263 transfer_engine.cpp:422] Metrics reporting is disabled (set MC_TE_METRIC=1 to enable)
I20250919 09:42:46.478338 80263 transfer_engine.cpp:44] Transfer Engine starting. Server: 192.168.3.226, Metadata: P2PHANDSHAKE, ip_or_host_name: , rpc_port: 0
I20250919 09:42:46.478355 80263 transfer_engine.cpp:63] Transfer Engine parseHostNameWithPort. server_name: 192.168.3.226 port: 12001
I20250919 09:42:46.478390 80263 transfer_engine.cpp:114] Transfer Engine RPC using P2P handshake, listening on 192.168.3.226:16145
I20250919 09:42:46.478504 80263 transfer_engine.cpp:138] Auto-discovering topology...
W20250919 09:42:46.478768 80263 topology.cpp:58] No RDMA devices found, check your device installation
I20250919 09:42:46.478857 80263 transfer_engine.cpp:153] Topology discovery complete. Found 0 HCAs.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20250919 09:42:46.487957 80264 transfer_engine.cpp:422] Metrics reporting is disabled (set MC_TE_METRIC=1 to enable)
I20250919 09:42:46.488008 80264 transfer_engine.cpp:44] Transfer Engine starting. Server: 192.168.3.226, Metadata: P2PHANDSHAKE, ip_or_host_name: , rpc_port: 0
I20250919 09:42:46.488024 80264 transfer_engine.cpp:63] Transfer Engine parseHostNameWithPort. server_name: 192.168.3.226 port: 12001
I20250919 09:42:46.488061 80264 transfer_engine.cpp:114] Transfer Engine RPC using P2P handshake, listening on 192.168.3.226:15225
I20250919 09:42:46.488163 80264 transfer_engine.cpp:138] Auto-discovering topology...
W20250919 09:42:46.488391 80264 topology.cpp:58] No RDMA devices found, check your device installation
I20250919 09:42:46.488485 80264 transfer_engine.cpp:153] Topology discovery complete. Found 0 HCAs.
W20250919 09:42:47.287942 80264 nvlink_transport.cpp:375] Memory region 0x4989e940 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.288010 80264 nvlink_transport.cpp:375] Memory region 0x49947ac0 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.288023 80264 nvlink_transport.cpp:375] Memory region 0x49b24640 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.288034 80264 nvlink_transport.cpp:375] Memory region 0x492de900 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.288043 80264 nvlink_transport.cpp:375] Memory region 0x49ce5900 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.288049 80264 nvlink_transport.cpp:375] Memory region 0xe42bb3ff0040 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.303099 80263 nvlink_transport.cpp:375] Memory region 0x47a8c3c0 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.303155 80263 nvlink_transport.cpp:375] Memory region 0x4304d680 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.303170 80263 nvlink_transport.cpp:375] Memory region 0x4781fb80 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.303179 80263 nvlink_transport.cpp:375] Memory region 0x432031c0 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.303186 80263 nvlink_transport.cpp:375] Memory region 0x491a3880 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.303194 80263 nvlink_transport.cpp:375] Memory region 0xfa174bff0040 is not allocated by cuMemCreate, but it can be used as local buffer
[2025-09-19 09:42:47] INFO:     Started server process [79873]
[2025-09-19 09:42:47] INFO:     Waiting for application startup.
[2025-09-19 09:42:47] INFO:     Application startup complete.
[2025-09-19 09:42:47] INFO:     Uvicorn running on http://192.168.3.226:30000 (Press CTRL+C to quit)
[2025-09-19 09:42:48] INFO:     192.168.3.226:42042 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-09-19 09:42:48] Start of pd disaggregation warmup ...
[2025-09-19 09:42:48 TP0] Prefill batch. #new-seq: 1, #new-token: 4, #cached-token: 0, token usage: 0.00, #unbootstrapped-req: 0, #queue-req: 0, #transferring-req: 0, input throughput (token/s): 0.00,
[2025-09-19 09:42:48 TP1] TpModelWorkerClient hit an exception: Traceback (most recent call last):
  File "/rootprimary_synced/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 143, in forward_thread_func
    self.forward_thread_func_()
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/rootprimary_synced/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 172, in forward_thread_func_
    resolve_future_token_ids(input_ids, self.future_token_ids_map)
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 736, in compile_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 1495, in __call__
    return self._torchdynamo_orig_callable(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 1272, in __call__
    result = self._inner_convert(
             ^^^^^^^^^^^^^^^^^^^^
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 629, in __call__
    return _compile(
           ^^^^^^^^^
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 1111, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_utils_internal.py", line 97, in wrapper_function
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 793, in compile_inner
    return _compile_inner(code, one_graph, hooks, transform)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 832, in _compile_inner
    out_code = transform_code_object(code, transform)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/bytecode_transformation.py", line 1424, in transform_code_object
    transformations(instructions, code_options)
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 267, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 753, in transform
    tracer.run()
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 3497, in run
    super().run()
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 1363, in run
    while self.step():
          ^^^^^^^^^^^
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 1267, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 834, in wrapper
    return inner_fn(self, inst)
           ^^^^^^^^^^^^^^^^^^^^
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 2910, in CALL
    self._call(inst)
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 2904, in _call
    self.call_function(fn, args, kwargs)
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 1193, in call_function
    self.push(fn.call_function(self, args, kwargs))  # type: ignore[arg-type]
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/variables/lazy.py", line 201, in realize_and_forward
    return getattr(self.realize(), name)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/variables/torch.py", line 1338, in call_function
    tensor_variable = wrap_fx_proxy(
                      ^^^^^^^^^^^^^^
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/variables/builder.py", line 2559, in wrap_fx_proxy
    return wrap_fx_proxy_cls(target_cls=TensorVariable, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/variables/builder.py", line 2625, in wrap_fx_proxy_cls
    return _wrap_fx_proxy(
           ^^^^^^^^^^^^^^^
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/variables/builder.py", line 2723, in _wrap_fx_proxy
    example_value = get_fake_value(proxy.node, tx, allow_non_graph_fake=True)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/utils.py", line 3355, in get_fake_value
    raise TorchRuntimeError(str(e)).with_traceback(e.__traceback__) from None
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/utils.py", line 3253, in get_fake_value
    ret_val = wrap_fake_exception(
              ^^^^^^^^^^^^^^^^^^^^
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/utils.py", line 2753, in wrap_fake_exception
    return fn()
           ^^^^
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/utils.py", line 3254, in <lambda>
    lambda: run_node(tx.output, node, args, kwargs, nnmodule)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/utils.py", line 3462, in run_node
    raise RuntimeError(make_error_message(e)).with_traceback(
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/utils.py", line 3421, in run_node
    return node.target(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/utils/_stats.py", line 28, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_subclasses/fake_tensor.py", line 1352, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_subclasses/fake_tensor.py", line 2058, in dispatch
    return self._cached_dispatch_impl(func, types, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_subclasses/fake_tensor.py", line 1487, in _cached_dispatch_impl
    output = self._dispatch_impl(func, types, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_subclasses/fake_tensor.py", line 2601, in _dispatch_impl
    decomposition_table[func](*args, **kwargs)
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_prims_common/wrappers.py", line 309, in _fn
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_compile.py", line 53, in inner
    return disable_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 929, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_prims_common/wrappers.py", line 149, in _fn
    result = fn(**bound.arguments)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_refs/__init__.py", line 1966, in where
    utils.check_same_device(pred, a, b, allow_cpu_scalar_tensors=True)
  File "/rootvenvs/sgl/lib/python3.12/site-packages/torch/_prims_common/__init__.py", line 838, in check_same_device
    raise RuntimeError(msg)
torch._dynamo.exc.TorchRuntimeError: Dynamo failed to run FX node with fake tensors: call_function <built-in method where of type object at 0xe4353fbb8900>(*(FakeTensor(..., device='cuda:3', size=(s72,), dtype=torch.bool), FakeTensor(..., device='cuda:1', size=(s72,), dtype=torch.int64), FakeTensor(..., device='cuda:3', size=(s72,), dtype=torch.int64)), **{}): got RuntimeError('Tensor on device cuda:1 is not on the expected device cuda:3!')

from user code:
   File "/rootprimary_synced/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 52, in resolve_future_token_ids
    input_ids[:] = torch.where(


[2025-09-19 09:42:48] Received sigquit from a child process. It usually means the child failed.
Killed

I will revert it first. And we can revisit and fix it later.

ShangmingCai added a commit that referenced this pull request Sep 19, 2025
alogfans pushed a commit that referenced this pull request Sep 19, 2025
* Fix nvlink_transport bug: revert #683

* fix lint
@staryxchen (Collaborator, Author)

Hi @ShangmingCai
Are these the critical error logs?

W20250919 09:42:47.287942 80264 nvlink_transport.cpp:375] Memory region 0x4989e940 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.288010 80264 nvlink_transport.cpp:375] Memory region 0x49947ac0 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.288023 80264 nvlink_transport.cpp:375] Memory region 0x49b24640 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.288034 80264 nvlink_transport.cpp:375] Memory region 0x492de900 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.288043 80264 nvlink_transport.cpp:375] Memory region 0x49ce5900 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.288049 80264 nvlink_transport.cpp:375] Memory region 0xe42bb3ff0040 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.303099 80263 nvlink_transport.cpp:375] Memory region 0x47a8c3c0 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.303155 80263 nvlink_transport.cpp:375] Memory region 0x4304d680 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.303170 80263 nvlink_transport.cpp:375] Memory region 0x4781fb80 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.303179 80263 nvlink_transport.cpp:375] Memory region 0x432031c0 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.303186 80263 nvlink_transport.cpp:375] Memory region 0x491a3880 is not allocated by cuMemCreate, but it can be used as local buffer
W20250919 09:42:47.303194 80263 nvlink_transport.cpp:375] Memory region 0xfa174bff0040 is not allocated by cuMemCreate, but it can be used as local buffer

After you revert the code, did the error disappear? It appears there are some issues with GPUs that support the CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED attribute. My test machine (H20) does not support this attribute, so this path was not covered by my tests. My apologies.

I reviewed some documentation and found no indication that CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED conflicts with enabling P2P access. However, based on this discussion, even if the device reports CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED, calling cuMemRetainAllocationHandle will still return CUDA_ERROR_NOT_PERMITTED. Perhaps we could print the specific error code when the issue occurs to further pinpoint the problem.
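
One way to print the specific error, as suggested above, is a small diagnostic against the CUDA driver API. This is only a hypothetical sketch of how the warning site in nvlink_transport.cpp could log the exact failure; `probeFabric` and its call site are invented names, not code from this repository.

```cpp
// Hypothetical diagnostic: report the device's fabric-handle support and the
// exact CUresult from cuMemRetainAllocationHandle, instead of the generic
// "not allocated by cuMemCreate" warning seen in the logs above.
#include <cuda.h>
#include <iostream>

static const char *errName(CUresult r) {
    const char *s = nullptr;
    cuGetErrorName(r, &s);
    return s ? s : "UNKNOWN";
}

static void probeFabric(CUdevice dev, void *ptr) {
    int fabric = 0;
    cuDeviceGetAttribute(&fabric,
        CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED, dev);
    std::cerr << "fabric handle supported: " << fabric << std::endl;

    CUmemGenericAllocationHandle handle;
    CUresult r = cuMemRetainAllocationHandle(&handle, ptr);
    if (r != CUDA_SUCCESS) {
        // Per the discussion above, on fabric-capable machines (e.g. GB200)
        // this may be CUDA_ERROR_NOT_PERMITTED even though the attribute is set.
        std::cerr << "cuMemRetainAllocationHandle: " << errName(r) << std::endl;
    } else {
        cuMemRelease(handle);  // balance the retain on success
    }
}
```

Logging the `CUresult` name at the warning site would distinguish "pointer not from cuMemCreate" from "retain forbidden on this platform", which is the ambiguity discussed in this thread.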

@ShangmingCai (Collaborator)

> Hi @ShangmingCai Are these the critical error logs? […] After you revert the code, did the error disappear?

@staryxchen Yes, it works on GB200 after the revert.

I think the problem is here:

torch._dynamo.exc.TorchRuntimeError: Dynamo failed to run FX node with fake tensors: call_function <built-in method where of type object at 0xe4353fbb8900>(*(FakeTensor(..., device='cuda:3', size=(s72,), dtype=torch.bool), FakeTensor(..., device='cuda:1', size=(s72,), dtype=torch.int64), FakeTensor(..., device='cuda:3', size=(s72,), dtype=torch.int64)), **{}): got RuntimeError('Tensor on device cuda:1 is not on the expected device cuda:3!')

@15050188022 mentioned this pull request Oct 28, 2025
wanyue-wy pushed a commit to wanyue-wy/Mooncake that referenced this pull request Dec 14, 2025
bugfix(nvlink): Add explicit P2P access enablement and error handling for NvlinkTransport (kvcache-ai#683)

* refactor(nvlink_transport): cache device count to avoid repeated queries

Add static getNumDevices() function to cache CUDA device count and reuse it in supportFabricMem() to avoid multiple cudaGetDeviceCount calls.

Signed-off-by: staryxchen <staryxchen@tencent.com>

* bugfix(nvlink): add explicit P2P access enablement and error handling

- Added checkCudaError helper function for consistent error handling
- Implemented enableP2PAccess function to explicitly enable bidirectional P2P access
- Modified NvlinkTransport constructor to enable P2P access between all device pairs

Signed-off-by: staryxchen <staryxchen@tencent.com>

* Added checkCudaErrorReturn function and updated enableP2PAccess to properly handle CUDA errors by returning false on failure.

Signed-off-by: staryxchen <staryxchen@tencent.com>

* update something according to comment

- Remove unused checkCudaError function
- Update getNumDevices to use checkCudaErrorReturn
- Fix typo in error message ("not device" to "no device")
- Improve peer access error handling with detailed logging

Signed-off-by: staryxchen <staryxchen@tencent.com>

---------

Signed-off-by: staryxchen <staryxchen@tencent.com>
wanyue-wy pushed a commit to wanyue-wy/Mooncake that referenced this pull request Dec 14, 2025
JasonZhang517 pushed a commit to JasonZhang517/Mooncake that referenced this pull request Feb 9, 2026
JasonZhang517 pushed a commit to JasonZhang517/Mooncake that referenced this pull request Feb 9, 2026