[Bugfix] Respect VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY in prefetch offloader#37699
Conversation
[Bugfix] Respect VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY in prefetch offloader

The prefetch CPU offloader unconditionally allocates pinned memory, ignoring the `VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY` env var. On unified-memory systems like GH200, where pinned memory eats into the GPU memory pool, this causes OOM during model loading. The UVA offloader already checks this env var (added in vllm-project#32993). Apply the same check to the three places in the prefetch backend that allocate or assert pinned memory:

- `_offload_to_cpu_internal()`: initial CPU storage allocation
- `_update_cpu_storage_from_param()`: re-pinning after weight processing
- `start_onload_to_static()`: pinned memory assertion

Fixes vllm-project#37672

Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
Code Review
This pull request correctly extends the handling of the VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY environment variable to the prefetch offloader, aligning its behavior with the UVA offloader. The changes prevent unconditional pinned memory allocation, which can cause out-of-memory errors on unified-memory systems. While the logic is correct, I've identified a code duplication issue across three locations. I've added a high-severity comment suggesting a refactoring to a shared helper function to improve maintainability and prevent future bugs.
```python
pin_memory = (is_pin_memory_available()
              and not envs.VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY)
```
The logic to decide whether to use pinned memory, `is_pin_memory_available() and not envs.VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY`, is now duplicated in three places within this file. This creates a maintenance risk, as future changes to this logic might not be applied consistently across all locations, potentially re-introducing bugs.
To improve maintainability, I recommend extracting this logic into a single, module-level helper function.
For example:
```python
def _should_pin_memory() -> bool:
    """Returns True if memory should be pinned for offloading."""
    return (is_pin_memory_available()
            and not envs.VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY)
```

This function can then be used in all three modified locations:
- In `_CpuParamOffloader._offload_to_cpu_internal` (this location): `pin_memory = _should_pin_memory()`
- In `_CpuParamOffloader._update_cpu_storage_from_param`: `if _should_pin_memory() and not param.data.is_pinned(): ...`
- In `PrefetchOffloader.start_onload_to_static`: `assert not _should_pin_memory() or cpu_storage.is_pinned(), ( ... )`
This refactoring will make the code cleaner, less error-prone, and easier to understand and modify in the future.
@gemini-code-assist This is not just limited to `prefetch.py`. You need to take a look at all files in `vllm/model_executor/offloader`.
You are absolutely correct, @he-yufeng. My apologies for the oversight. I will expand the scope of the refactoring to include all relevant files within the vllm/model_executor/offloader directory to ensure consistency in handling the VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY environment variable.
Good point, thanks for the catch. I went through every file in `vllm/model_executor/offloader/`:

- `base.py` — added a shared `should_pin_memory()` helper that consolidates the `is_pin_memory_available() and not envs.VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY` check
- `prefetch.py` — all 3 call sites now use the helper
- `uva.py` — switched to the same helper (was already correct, just DRY'd it up)
- `prefetch_ops.py` — no pin memory usage, nothing to change
- `__init__.py` — just exports, added `should_pin_memory`
Also noted there's one more instance in `model_loader/utils.py` (outside the offloader dir) that could be migrated to the helper in a follow-up if desired.
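For reference, a minimal sketch of the consolidated helper, assuming the `vllm.utils` import path for `is_pin_memory_available` (the diff itself isn't reproduced in this thread):

```python
# Sketch of the helper described for base.py; the vllm.utils import
# path for is_pin_memory_available is an assumption, not from the diff.
import vllm.envs as envs
from vllm.utils import is_pin_memory_available


def should_pin_memory() -> bool:
    """True if offloaded CPU weight buffers should use pinned memory.

    Pinning is skipped when the platform has no pinned memory available,
    or when VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY is set -- needed on
    unified-memory systems like GH200, where pinned allocations come out
    of the same pool as GPU memory.
    """
    return (is_pin_memory_available()
            and not envs.VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY)
```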
@gemini-code-assist Change the above review to not just be limited to …
Thanks for the clarification, @ehfd. I've reviewed the changes in …
@gemini-code-assist Then, the solution is to retain the current implementation, since it's exactly the same way that works on …
I have noted your comment regarding the implementation's consistency with …
Thanks for the fix!

- `prefetch_offloader.py`, line 84 (concern): Also is it safe to read …
- `tests/worker/test_prefetch_offloader.py` (concern): …
- `prefetch_offloader.py`, line 12 (nit): …

Once the test is added and naming cleaned up this looks good to go, nice catch on the prefetch path being overlooked.
Extract the duplicated `is_pin_memory_available() and not envs.VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY` pattern into a shared `should_pin_memory()` helper in `base.py`, used by both `prefetch.py` and `uva.py`. This addresses the review feedback to check all files in `vllm/model_executor/offloader/`, not just `prefetch.py`.

Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
Force-pushed from 3d48a5b to 355aea9
Ran …
ping on this — GH200 users are still hitting the OOM.
It is definitely needed. Otherwise, GH200 doesn't work properly with unified memory at all.
[Bugfix] Respect VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY in prefetch offloader (vllm-project#37699)

Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: zengxian <xiangdong.zeng@intel.com>

[Bugfix] Respect VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY in prefetch offloader (vllm-project#37699)

Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>

[Bugfix] Respect VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY in prefetch offloader (vllm-project#37699)

Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>
Purpose
Fixes #37672
The prefetch CPU offloader (`--offload-backend prefetch`) ignores `VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY` and unconditionally allocates pinned memory. On unified-memory systems like GH200, pinned memory consumes GPU memory, causing OOM during model loading even when the env var is set to `1`.

The UVA offloader already handles this correctly (added in #32993). This PR applies the same pattern to the prefetch backend and consolidates the duplicated check into a shared helper.
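A minimal usage sketch for affected GH200 users, assuming the flag is set before vLLM reads its environment configuration (`vllm.envs` evaluates variables on access, so setting it up front is the safe pattern):

```python
# Usage sketch (assumption: set the flag before vLLM allocates any
# offload buffers; vllm.envs reads the process environment on access).
import os

os.environ["VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY"] = "1"

import vllm.envs as envs

# The offloaders consult this flag when deciding whether CPU weight
# buffers are allocated as pinned memory.
assert envs.VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY
```

On the CLI, the equivalent is exporting the variable in the shell before launching `vllm serve` with the prefetch offload backend.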
Changes
- `base.py` — new `should_pin_memory()` helper that combines `is_pin_memory_available()` with the env var check, eliminating duplication across files
- `prefetch.py` — three places in `_CpuParamOffloader` now use the shared helper (see the condensed sketch after this list):
  - `_offload_to_cpu_internal()` — `pin_memory` flag when allocating initial CPU storage
  - `_update_cpu_storage_from_param()` — skips re-pinning when the env var is set
  - `start_onload_to_static()` assertion — no longer fires when pinning is intentionally disabled
- `uva.py` — `self.pin_memory` assignment now uses the same shared helper (functionally identical, just DRY)
- `__init__.py` — exports `should_pin_memory` for external use

Other files in `vllm/model_executor/offloader/` (`prefetch_ops.py`) don't deal with pin memory at all, so no changes needed.
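A condensed sketch of the three call sites. The standalone function names and tensor handling below are illustrative stand-ins for the real `_CpuParamOffloader` / `PrefetchOffloader` methods, not the verbatim diff:

```python
# Illustrative sketch only: function names and tensor handling are
# simplified stand-ins for the actual offloader methods.
import torch

from vllm.model_executor.offloader import should_pin_memory


def offload_to_cpu(param: torch.nn.Parameter) -> torch.Tensor:
    # _offload_to_cpu_internal(): gate the initial CPU allocation.
    return torch.empty(
        param.shape, dtype=param.dtype, device="cpu",
        pin_memory=should_pin_memory(),
    )


def update_cpu_storage(param: torch.nn.Parameter) -> None:
    # _update_cpu_storage_from_param(): only re-pin after weight
    # processing when pinning is enabled.
    if should_pin_memory() and not param.data.is_pinned():
        param.data = param.data.pin_memory()


def check_onload_precondition(cpu_storage: torch.Tensor) -> None:
    # start_onload_to_static(): the assertion is skipped when pinning
    # is intentionally disabled via the env var.
    assert not should_pin_memory() or cpu_storage.is_pinned(), (
        "CPU storage must be pinned when pinned offloading is enabled"
    )
```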
Test Plan

- `tests/basic_correctness/test_cpu_offload.py` covers `VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY` for the UVA backend
- The prefetch backend now gates `pin_memory` on the same boolean via the shared helper
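Not part of the PR's stated test plan, but a hypothetical regression test for the helper could look like this (assumes pytest, and that `vllm.envs` reads the process environment lazily on attribute access):

```python
# Hypothetical test sketch, not taken from the PR.
import pytest

from vllm.model_executor.offloader import should_pin_memory


def test_env_var_disables_pinning(monkeypatch: pytest.MonkeyPatch) -> None:
    # vllm.envs evaluates env vars on access, so patching the process
    # environment before calling the helper takes effect (assumption).
    monkeypatch.setenv("VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY", "1")
    assert not should_pin_memory()
```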