[ROCm][MP] Fix HIP invalid-argument on lazy host buffer past 2 GB by Shaoting-Feng · Pull Request #3079 · LMCache/LMCache

Shaoting-Feng · 2026-04-19T08:36:11Z

What this PR does / why we need it:

On AMD/ROCm in MP mode, the second long request fails with
HIP error: invalid argument once the LazyMemoryAllocator starts
handing out addresses at or past the 2 GB virtual-offset boundary.

Root cause: lmcache_memcpy_async in csrc/mem_kernels.cu uses an
unqualified min(size_t, size_t). HIP's overload set resolves this to
an int-returning variant, silently narrowing the arguments. Once
host_buffer_offset + nbytes >= 2 GB, the intermediate value exceeds
INT_MAX and wraps to a negative signed int, real_end is cast back to
size_t as a huge garbage number, and hipMemcpyAsync rejects the call.
CUDA's resolution picks a wider overload, which is why CUDA users
never hit this.
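
The narrowing arithmetic described above can be reproduced outside the kernel. The following is a minimal Python sketch, not the code in csrc/mem_kernels.cu: it emulates a 32-bit signed `min` with `ctypes`, and the helper names and the example offsets are hypothetical, chosen only to place the end of the copy just past 2 GB.

```python
import ctypes


def narrowed_min(a: int, b: int) -> int:
    """Emulate min() on size_t values routed through an int-returning overload.

    This is an illustration of the arithmetic only, not the HIP implementation.
    """
    # Truncate each argument to 32 bits and reinterpret it as a signed int.
    a32 = ctypes.c_int32(a).value
    b32 = ctypes.c_int32(b).value
    result = min(a32, b32)
    # Converting a negative int back to a 64-bit size_t sign-extends it,
    # producing a value close to 2**64.
    return ctypes.c_uint64(result).value


def wide_min(a: int, b: int) -> int:
    """What a size_t-typed min (for example std::min<size_t>) computes."""
    return min(a, b)


# Hypothetical values: a copy that ends 4 KiB past the 2 GB mark.
host_buffer_offset = (1 << 31) - 4096
nbytes = 8192
end = host_buffer_offset + nbytes
chunk_end = 3 << 30  # assumed upper bound passed as the other min() argument

print(hex(wide_min(end, chunk_end)))      # the intended end offset
print(hex(narrowed_min(end, chunk_end)))  # a huge value with the top bits set
```

Below 2 GB both helpers agree, which matches the observation that only the second long request (the one that lands at or past the 2 GB offset) fails.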

Fix: csrc/mem_kernels.cu is updated to use std::min<size_t> so a
future rebuild of the kernel is correct on HIP.

If applicable:

this PR contains user facing changes - docs added
this PR contains unit tests

On AMD, the second long MP-mode request fails with HIP error: invalid argument once the LazyMemoryAllocator hands out a memory_obj whose virtual offset reaches the 2 GB mark.

Root cause is the unqualified `min` in lmcache_memcpy_async: HIP's overload set picks an int-returning variant, silently narrowing the size_t arguments so real_end wraps to a huge garbage value and hipMemcpyAsync rejects the call. CUDA's resolution picks a wider overload, which is why the CUDA path was unaffected.

Fix: in lmcache/v1/gpu_connector/gpu_ops.py, split the transfer at PIN_CHUNK_SIZE boundaries in Python and use torch.Tensor.copy_ with non_blocking=True instead of calling the buggy kernel. Python ints are arbitrary-precision, so there is no overflow, and the per-chunk split is still required because HIP cannot cross independently-registered pin regions in a single async memcpy. csrc/mem_kernels.cu is also updated to use std::min<size_t> so a future rebuild of the kernel is correct on HIP.

Adds tests/v1/gpu_connector/test_gpu_ops_lazy.py that exercises the entry points across and beyond the 2 GB boundary; these fail on the unfixed build and pass with the fix.
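
The chunk-boundary split described in this commit message (reverted by a later commit in this PR) can be sketched as follows. This is an illustrative sketch, not the code in gpu_ops.py: the function name is hypothetical, and the chunk size is passed as a parameter because the actual value of PIN_CHUNK_SIZE is not given here.

```python
from typing import Iterator, Tuple


def split_at_chunk_boundaries(
    offset: int, nbytes: int, chunk_size: int
) -> Iterator[Tuple[int, int]]:
    """Yield (start, length) pieces of [offset, offset + nbytes).

    No piece crosses a multiple of chunk_size, so each piece stays inside
    one chunk. Python ints are arbitrary-precision, so offsets past 2 GB
    cannot overflow here.
    """
    if offset < 0 or nbytes < 0 or chunk_size <= 0:
        raise ValueError("offset and nbytes must be >= 0, chunk_size > 0")
    end = offset + nbytes
    cur = offset
    while cur < end:
        # First chunk boundary strictly after the current position.
        chunk_end = (cur // chunk_size + 1) * chunk_size
        piece_end = min(end, chunk_end)
        yield cur, piece_end - cur
        cur = piece_end


# Example with an assumed 1 GiB chunk size: an 8 KiB copy straddling 2 GB
# becomes two 4 KiB pieces, one on each side of the boundary.
for start, length in split_at_chunk_boundaries((1 << 31) - 4096, 8192, 1 << 30):
    print(start, length)
```

Each (start, length) pair would then drive one copy of the corresponding slice, for example with torch.Tensor.copy_ and non_blocking=True as the commit message describes.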

gemini-code-assist

Code Review

This pull request fixes a bug on AMD/ROCm where memory transfers exceeding 2 GB were truncated due to silent narrowing in the C++ kernel. The solution includes using std::min<size_t> in the kernel and moving the chunk-splitting logic to Python to leverage arbitrary-precision integers. Review feedback suggests ensuring 1-D tensor views during copies for better robustness and addresses several style guide violations in the new regression tests, such as private member access, missing type hints, and hardcoded device identifiers.

The Python-side chunk split in gpu_ops.py is no longer needed once the C++ kernel is rebuilt with std::min<size_t>. Revert gpu_ops.py so the LazyMemoryAllocator path goes back to calling lmc_ops.lmcache_memcpy_async directly. The regression tests in tests/v1/gpu_connector/test_gpu_ops_lazy.py are kept: they exercise the high-level entry points across the 2 GB boundary and will catch any future narrowing bug in the kernel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The test requires a real GPU and 6 GB of pinned host memory, so it would be skipped in typical CI and only runs on a ROCm dev box. The kernel fix (std::min<size_t>) is a one-character change whose intent is documented in the surrounding comment, so a dedicated regression test is not worth the complexity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Shaoting-Feng · 2026-04-20T21:39:57Z

Tested by sending two long requests consecutively, and it did not fail this time.

ApostaC

Good catch!

Shaoting-Feng requested review from ApostaC, YaoJiayi, deng451e, hickeyma and sammshen as code owners April 19, 2026 08:36

Shaoting-Feng marked this pull request as draft April 19, 2026 08:40

gemini-code-assist Bot reviewed Apr 19, 2026

Shaoting-Feng and others added 2 commits April 20, 2026 21:13

Shaoting-Feng marked this pull request as ready for review April 20, 2026 21:38

ApostaC approved these changes Apr 20, 2026

ApostaC merged commit 0b44377 into LMCache:dev Apr 20, 2026
29 of 30 checks passed

ApostaC merged 3 commits into LMCache:dev from Shaoting-Feng:fix/amd-lazy-memcpy-2gb-overflow
