[Bugfix] Respect VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY in prefetch offloader#37699
Conversation
[Bugfix] Respect VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY in prefetch offloader

The prefetch CPU offloader unconditionally allocates pinned memory, ignoring the `VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY` env var. On unified-memory systems like GH200, where pinned memory eats into the GPU memory pool, this causes OOM during model loading. The UVA offloader already checks this env var (added in vllm-project#32993). Apply the same check to the three places in the prefetch backend that allocate or assert pinned memory:

- `_offload_to_cpu_internal()`: initial CPU storage allocation
- `_update_cpu_storage_from_param()`: re-pinning after weight processing
- `start_onload_to_static()`: pinned memory assertion

Fixes vllm-project#37672

Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
Code Review
This pull request correctly extends the handling of the VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY environment variable to the prefetch offloader, aligning its behavior with the UVA offloader. The changes prevent unconditional pinned memory allocation, which can cause out-of-memory errors on unified-memory systems. While the logic is correct, I've identified a code duplication issue across three locations. I've added a high-severity comment suggesting a refactoring to a shared helper function to improve maintainability and prevent future bugs.
```python
pin_memory = (is_pin_memory_available()
              and not envs.VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY)
```
The logic to decide whether to use pinned memory, `is_pin_memory_available() and not envs.VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY`, is now duplicated in three places within this file. This creates a maintenance risk, as future changes to this logic might not be applied consistently across all locations, potentially re-introducing bugs.
To improve maintainability, I recommend extracting this logic into a single, module-level helper function.
For example:
```python
def _should_pin_memory() -> bool:
    """Returns True if memory should be pinned for offloading."""
    return (is_pin_memory_available()
            and not envs.VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY)
```

This function can then be used in all three modified locations:
- In `_CpuParamOffloader._offload_to_cpu_internal` (this location): `pin_memory = _should_pin_memory()`
- In `_CpuParamOffloader._update_cpu_storage_from_param`: `if _should_pin_memory() and not param.data.is_pinned(): ...`
- In `PrefetchOffloader.start_onload_to_static`: `assert not _should_pin_memory() or cpu_storage.is_pinned(), ( ... )`
This refactoring will make the code cleaner, less error-prone, and easier to understand and modify in the future.
@gemini-code-assist This is not just limited to `prefetch.py`. You need to take a look at all files in `vllm/model_executor/offloader`.
You are absolutely correct, @he-yufeng. My apologies for the oversight. I will expand the scope of the refactoring to include all relevant files within the vllm/model_executor/offloader directory to ensure consistency in handling the VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY environment variable.
Good point, thanks for the catch. I went through every file in `vllm/model_executor/offloader/`:

- `base.py` — added a shared `should_pin_memory()` helper that consolidates the `is_pin_memory_available() and not envs.VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY` check
- `prefetch.py` — all 3 call sites now use the helper
- `uva.py` — switched to the same helper (was already correct, just DRY'd it up)
- `prefetch_ops.py` — no pin memory usage, nothing to change
- `__init__.py` — just exports, added `should_pin_memory`
Also noted there's one more instance in `model_loader/utils.py` (outside the offloader dir) that could be migrated to the helper in a follow-up if desired.
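For reference, a minimal sketch of the consolidated helper, assuming the `vllm.utils` import path for `is_pin_memory_available` (the diff itself isn't reproduced in this thread):

```python
# Sketch of the helper described for base.py; the vllm.utils import
# path for is_pin_memory_available is an assumption, not from the diff.
import vllm.envs as envs
from vllm.utils import is_pin_memory_available


def should_pin_memory() -> bool:
    """True if offloaded CPU weight buffers should use pinned memory.

    Pinning is skipped when the platform has no pinned memory available,
    or when VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY is set -- needed on
    unified-memory systems like GH200, where pinned allocations come out
    of the same pool as GPU memory.
    """
    return (is_pin_memory_available()
            and not envs.VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY)
```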
@gemini-code-assist Change the above review to not just be limited to …
Thanks for the clarification, @ehfd. I've reviewed the changes in …
@gemini-code-assist Then, the solution is to retain the current implementation, since it's exactly the same way that works on …
I have noted your comment regarding the implementation's consistency with …
Thanks for the fix!

- `prefetch_offloader.py`, line 84 (concern): Also is it safe to read …
- `tests/worker/test_prefetch_offloader.py` (concern): …
- `prefetch_offloader.py`, line 12 (nit): …

Once the test is added and naming cleaned up this looks good to go, nice catch on the prefetch path being overlooked.
Extract the duplicated `is_pin_memory_available() and not envs.VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY` pattern into a shared `should_pin_memory()` helper in `base.py`, used by both `prefetch.py` and `uva.py`. This addresses the review feedback to check all files in `vllm/model_executor/offloader/`, not just `prefetch.py`.

Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
Force-pushed from 3d48a5b to 355aea9
Ran …
ping on this — GH200 users are still hitting the OOM.
It is definitely needed. Otherwise, GH200 doesn't work properly with unified memory at all.
[Bugfix] Respect VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY in prefetch offloader (vllm-project#37699)

Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: zengxian <xiangdong.zeng@intel.com>

[Bugfix] Respect VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY in prefetch offloader (vllm-project#37699)

Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>

[Bugfix] Respect VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY in prefetch offloader (vllm-project#37699)

Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>
Purpose
Fixes #37672
The prefetch CPU offloader (`--offload-backend prefetch`) ignores `VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY` and unconditionally allocates pinned memory. On unified-memory systems like GH200, pinned memory consumes GPU memory, causing OOM during model loading even when the env var is set to `1`.

The UVA offloader already handles this correctly (added in #32993). This PR applies the same pattern to the prefetch backend and consolidates the duplicated check into a shared helper.
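A minimal usage sketch for affected GH200 users, assuming the flag is set before vLLM reads its environment configuration (`vllm.envs` evaluates variables on access, so setting it up front is the safe pattern):

```python
# Usage sketch (assumption: set the flag before vLLM allocates any
# offload buffers; vllm.envs reads the process environment on access).
import os

os.environ["VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY"] = "1"

import vllm.envs as envs

# The offloaders consult this flag when deciding whether CPU weight
# buffers are allocated as pinned memory.
assert envs.VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY
```

On the CLI, the equivalent is exporting the variable in the shell before launching `vllm serve` with the prefetch offload backend.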
Changes
- `base.py` — new `should_pin_memory()` helper that combines `is_pin_memory_available()` with the env var check, eliminating duplication across files
- `prefetch.py` — three places in `_CpuParamOffloader` now use the shared helper (see the condensed sketch after this list):
  - `_offload_to_cpu_internal()` — `pin_memory` flag when allocating initial CPU storage
  - `_update_cpu_storage_from_param()` — skips re-pinning when the env var is set
  - `start_onload_to_static()` assertion — no longer fires when pinning is intentionally disabled
- `uva.py` — `self.pin_memory` assignment now uses the same shared helper (functionally identical, just DRY)
- `__init__.py` — exports `should_pin_memory` for external use

Other files in `vllm/model_executor/offloader/` (`prefetch_ops.py`) don't deal with pin memory at all, so no changes needed.
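A condensed sketch of the three call sites. The standalone function names and tensor handling below are illustrative stand-ins for the real `_CpuParamOffloader` / `PrefetchOffloader` methods, not the verbatim diff:

```python
# Illustrative sketch only: function names and tensor handling are
# simplified stand-ins for the actual offloader methods.
import torch

from vllm.model_executor.offloader import should_pin_memory


def offload_to_cpu(param: torch.nn.Parameter) -> torch.Tensor:
    # _offload_to_cpu_internal(): gate the initial CPU allocation.
    return torch.empty(
        param.shape, dtype=param.dtype, device="cpu",
        pin_memory=should_pin_memory(),
    )


def update_cpu_storage(param: torch.nn.Parameter) -> None:
    # _update_cpu_storage_from_param(): only re-pin after weight
    # processing when pinning is enabled.
    if should_pin_memory() and not param.data.is_pinned():
        param.data = param.data.pin_memory()


def check_onload_precondition(cpu_storage: torch.Tensor) -> None:
    # start_onload_to_static(): the assertion is skipped when pinning
    # is intentionally disabled via the env var.
    assert not should_pin_memory() or cpu_storage.is_pinned(), (
        "CPU storage must be pinned when pinned offloading is enabled"
    )
```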
Test Plan

- `tests/basic_correctness/test_cpu_offload.py` covers `VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY` for the UVA backend
- The prefetch backend now gates `pin_memory` on the same boolean via the shared helper
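Not part of the PR's stated test plan, but a hypothetical regression test for the helper could look like this (assumes pytest, and that `vllm.envs` reads the process environment lazily on attribute access):

```python
# Hypothetical test sketch, not taken from the PR.
import pytest

from vllm.model_executor.offloader import should_pin_memory


def test_env_var_disables_pinning(monkeypatch: pytest.MonkeyPatch) -> None:
    # vllm.envs evaluates env vars on access, so patching the process
    # environment before calling the helper takes effect (assumption).
    monkeypatch.setenv("VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY", "1")
    assert not should_pin_memory()
```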