[ROCm][MP] Fix HIP invalid-argument on lazy host buffer past 2 GB#3079
Merged
ApostaC merged 3 commits into LMCache:dev on Apr 20, 2026
Conversation
On AMD, the second long MP-mode request fails with HIP error: invalid argument once the LazyMemoryAllocator hands out a memory_obj whose virtual offset reaches the 2 GB mark. Root cause is the unqualified `min` in lmcache_memcpy_async: HIP's overload set picks an int-returning variant, silently narrowing the size_t arguments so real_end wraps to a huge garbage value and hipMemcpyAsync rejects the call. CUDA's resolution picks a wider overload, which is why the CUDA path was unaffected. Fix: in lmcache/v1/gpu_connector/gpu_ops.py, split the transfer at PIN_CHUNK_SIZE boundaries in Python and use torch.Tensor.copy_ with non_blocking=True instead of calling the buggy kernel. Python ints are arbitrary-precision, so there is no overflow, and the per-chunk split is still required because HIP cannot cross independently-registered pin regions in a single async memcpy. csrc/mem_kernels.cu is also updated to use std::min<size_t> so a future rebuild of the kernel is correct on HIP. Adds tests/v1/gpu_connector/test_gpu_ops_lazy.py that exercises the entry points across and beyond the 2 GB boundary; these fail on the unfixed build and pass with the fix.
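The Python-side split described above can be sketched roughly as follows. This is an illustrative model of the approach, not the actual gpu_ops.py code: `PIN_CHUNK_SIZE` and the helper name are assumptions.

```python
# Hypothetical sketch of splitting a host-buffer transfer at pinned-region
# boundaries, per the PR description. PIN_CHUNK_SIZE is an assumed value.
PIN_CHUNK_SIZE = 256 * 1024 * 1024  # illustrative pin-region size


def split_at_pin_boundaries(offset: int, nbytes: int):
    """Yield (start, length) pieces that never cross a PIN_CHUNK_SIZE boundary.

    Python ints are arbitrary-precision, so offsets past 2 GB cannot wrap
    here the way the narrowed C++ `min` result did.
    """
    end = offset + nbytes
    while offset < end:
        next_boundary = (offset // PIN_CHUNK_SIZE + 1) * PIN_CHUNK_SIZE
        piece_end = min(end, next_boundary)
        yield offset, piece_end - offset
        offset = piece_end


# Each (start, length) piece would then be copied via torch.Tensor.copy_
# with non_blocking=True on the corresponding tensor views.
```

A piece that would straddle a boundary is cut in two, so no single async copy ever crosses an independently-registered pin region.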
Contributor
Code Review
This pull request fixes a bug on AMD/ROCm where memory transfers exceeding 2 GB were truncated due to silent narrowing in the C++ kernel. The solution includes using std::min<size_t> in the kernel and moving the chunk-splitting logic to Python to leverage arbitrary-precision integers. Review feedback suggests ensuring 1-D tensor views during copies for better robustness and addresses several style guide violations in the new regression tests, such as private member access, missing type hints, and hardcoded device identifiers.
The Python-side chunk split in gpu_ops.py is no longer needed once the C++ kernel is rebuilt with std::min<size_t>. Revert gpu_ops.py so the LazyMemoryAllocator path goes back to calling lmc_ops.lmcache_memcpy_async directly. The regression tests in tests/v1/gpu_connector/test_gpu_ops_lazy.py are kept — they exercise the high-level entry points across the 2 GB boundary and will catch any future narrowing bug in the kernel. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The test requires a real GPU and 6 GB of pinned host memory, so it would be skipped in typical CI and only runs on a ROCm dev box. The kernel fix (std::min<size_t>) is a one-character change whose intent is documented in the surrounding comment, so a dedicated regression test is not worth the complexity. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
Author
Tested by sending two long requests consecutively, and it did not fail this time.
What this PR does / why we need it:
On AMD/ROCm in MP mode, the second long request fails with HIP error: invalid argument once the LazyMemoryAllocator starts handing out addresses at or past the 2 GB virtual-offset boundary.
Root cause:
lmcache_memcpy_async in csrc/mem_kernels.cu uses an unqualified min(size_t, size_t). HIP's overload set resolves this to an int-returning variant, silently narrowing the arguments. Once host_buffer_offset + nbytes >= 2 GB, the intermediate value crosses INT_MIN as a signed int, real_end is cast back to size_t as a huge garbage number, and hipMemcpyAsync rejects the call. CUDA's resolution picks a wider overload, which is why CUDA users never hit this.
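The wraparound can be reproduced in plain Python by emulating 32-bit signed narrowing followed by the conversion back to a 64-bit size_t. This is a numeric model of the C++ behaviour described above, not the kernel code itself:

```python
def as_int32(v: int) -> int:
    """Emulate C++ narrowing of an integer to a 32-bit signed int."""
    v &= 0xFFFFFFFF
    return v - 0x100000000 if v >= 0x80000000 else v


def as_size_t(v: int) -> int:
    """Emulate converting a (possibly negative) int back to a 64-bit size_t."""
    return v & 0xFFFFFFFFFFFFFFFF


two_gb = 2 * 1024**3
nbytes = 4096

# Buggy path: the int-returning `min` narrows its operand, so the
# 2 GB + nbytes end offset wraps negative, then converts back to a
# huge size_t that hipMemcpyAsync rejects.
real_end_buggy = as_size_t(as_int32(two_gb + nbytes))

# Fixed path: std::min<size_t> keeps the arithmetic in size_t throughout.
real_end_fixed = two_gb + nbytes  # 2147487744
```

With these definitions, real_end_buggy lands near the top of the 64-bit range (well above 2**63), which is why the resulting copy size is rejected as an invalid argument.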
Fix:
csrc/mem_kernels.cu is updated to use std::min<size_t> so a future rebuild of the kernel is correct on HIP.