Skip to content

[ROCm][MP] Fix HIP invalid-argument on lazy host buffer past 2 GB#3079

Merged
ApostaC merged 3 commits intoLMCache:devfrom
Shaoting-Feng:fix/amd-lazy-memcpy-2gb-overflow
Apr 20, 2026
Merged

[ROCm][MP] Fix HIP invalid-argument on lazy host buffer past 2 GB#3079
ApostaC merged 3 commits intoLMCache:devfrom
Shaoting-Feng:fix/amd-lazy-memcpy-2gb-overflow

Conversation

@Shaoting-Feng
Copy link
Copy Markdown
Contributor

@Shaoting-Feng Shaoting-Feng commented Apr 19, 2026

What this PR does / why we need it:

On AMD/ROCm in MP mode, the second long request fails with
HIP error: invalid argument once the LazyMemoryAllocator starts
handing out addresses at or past the 2 GB virtual-offset boundary.

Root cause: lmcache_memcpy_async in csrc/mem_kernels.cu uses an
unqualified min(size_t, size_t). HIP's overload set resolves this to
an int-returning variant, silently narrowing the arguments. Once
host_buffer_offset + nbytes >= 2 GB, the intermediate value crosses
INT_MIN as a signed int, real_end is cast back to size_t as a
huge garbage number, and hipMemcpyAsync rejects the call. CUDA's
resolution picks a wider overload, which is why CUDA users never hit
this.

Fix: csrc/mem_kernels.cu is updated to use std::min<size_t> so a
future rebuild of the kernel is correct on HIP.

If applicable:

  • this PR contains user facing changes - docs added
  • this PR contains unit tests

On AMD, the second long MP-mode request fails with
  HIP error: invalid argument
once the LazyMemoryAllocator hands out a memory_obj whose virtual offset
reaches the 2 GB mark. Root cause is the unqualified `min` in
lmcache_memcpy_async: HIP's overload set picks an int-returning variant,
silently narrowing the size_t arguments so real_end wraps to a huge
garbage value and hipMemcpyAsync rejects the call. CUDA's resolution
picks a wider overload, which is why the CUDA path was unaffected.

Fix: in lmcache/v1/gpu_connector/gpu_ops.py, split the transfer at
PIN_CHUNK_SIZE boundaries in Python and use torch.Tensor.copy_ with
non_blocking=True instead of calling the buggy kernel. Python ints are
arbitrary-precision, so there is no overflow, and the per-chunk split
is still required because HIP cannot cross independently-registered
pin regions in a single async memcpy. csrc/mem_kernels.cu is also
updated to use std::min<size_t> so a future rebuild of the kernel is
correct on HIP.

Adds tests/v1/gpu_connector/test_gpu_ops_lazy.py that exercises the
entry points across and beyond the 2 GB boundary; these fail on the
unfixed build and pass with the fix.
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request fixes a bug on AMD/ROCm where memory transfers exceeding 2 GB were truncated due to silent narrowing in the C++ kernel. The solution includes using std::min<size_t> in the kernel and moving the chunk-splitting logic to Python to leverage arbitrary-precision integers. Review feedback suggests ensuring 1-D tensor views during copies for better robustness and addresses several style guide violations in the new regression tests, such as private member access, missing type hints, and hardcoded device identifiers.

Comment thread lmcache/v1/gpu_connector/gpu_ops.py Outdated
Comment thread lmcache/v1/gpu_connector/gpu_ops.py Outdated
Comment thread tests/v1/gpu_connector/test_gpu_ops_lazy.py Outdated
Comment thread tests/v1/gpu_connector/test_gpu_ops_lazy.py Outdated
Comment thread tests/v1/gpu_connector/test_gpu_ops_lazy.py Outdated
Comment thread tests/v1/gpu_connector/test_gpu_ops_lazy.py Outdated
Shaoting-Feng and others added 2 commits April 20, 2026 21:13
The Python-side chunk split in gpu_ops.py is no longer needed once the
C++ kernel is rebuilt with std::min<size_t>. Revert gpu_ops.py so the
LazyMemoryAllocator path goes back to calling lmc_ops.lmcache_memcpy_async
directly. The regression tests in tests/v1/gpu_connector/test_gpu_ops_lazy.py
are kept — they exercise the high-level entry points across the 2 GB
boundary and will catch any future narrowing bug in the kernel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The test requires a real GPU and 6 GB of pinned host memory, so it
would be skipped in typical CI and only runs on a ROCm dev box. The
kernel fix (std::min<size_t>) is a one-character change whose intent
is documented in the surrounding comment, so a dedicated regression
test is not worth the complexity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Shaoting-Feng Shaoting-Feng marked this pull request as ready for review April 20, 2026 21:38
@Shaoting-Feng
Copy link
Copy Markdown
Contributor Author

Tested by sending two long requests consecutively, and it did not fail this time.

Copy link
Copy Markdown
Contributor

@ApostaC ApostaC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

@ApostaC ApostaC merged commit 0b44377 into LMCache:dev Apr 20, 2026
29 of 30 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants