[caching] Add enable_prompt_embeds and cpu_offload_gb to compile hashes. #29435
Merged
zou3519 merged 1 commit into vllm-project:main on Nov 25, 2025
Conversation
Contributor
Code Review
This pull request correctly adds enable_prompt_embeds and cpu_offload_gb to the compilation hash to ensure proper cache invalidation. The changes are straightforward and well-justified. I've identified a related issue where multimodal configuration parameters that affect the computation graph are also being ignored in the hash calculation, and I've provided a suggestion to address this to improve caching robustness for multimodal models.
Summary:
This is a reland of vllm-project#27285, which regressed in the vllm trunk recently.

`enable_prompt_embeds` causes the `input_ids` argument to be None instead of a tensor, which must invalidate the compile cache at the vllm level. Previously this wasn't an issue because inductor has its own caching validation that serves as the last line of defence. Now that AOT compilation is enabled, the dynamo bytecode is also cached, so it needs to be guarded against input type changes (e.g. Tensor -> None here).

There are two ways to do this:
1. Use dynamo guards, so this is guarded at the torch.compile level.
2. Add `enable_prompt_embeds` to `compute_hash`, so this is guarded at the vllm level.

In the short term, option 2 seems to be the better approach, because vllm already throws away all the guards from dynamo and enabling them would be a non-trivial change to the existing code.

`cpu_offload_gb` also affects model inputs, since different offloading configs produce different graphs.

Signed-off-by: zhxchen17 <zhxchen17@fb.com>
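The guarding pattern chosen in option 2 can be sketched roughly as follows. This is a minimal illustration of folding graph-affecting config values into a cache key; `compile_cache_key` is a hypothetical stand-in, not vllm's actual `compute_hash` implementation:

```python
import hashlib

def compile_cache_key(enable_prompt_embeds: bool, cpu_offload_gb: float) -> str:
    # Every config value that can change the compiled artifact must be
    # folded into the key: enable_prompt_embeds flips input_ids between
    # a Tensor and None, and cpu_offload_gb changes the traced graph.
    factors = [enable_prompt_embeds, cpu_offload_gb]
    return hashlib.sha256(repr(factors).encode()).hexdigest()

# Different settings now map to different cache entries, so a cached
# artifact compiled for one configuration is never reused for the other.
key_on = compile_cache_key(True, 0.0)
key_off = compile_cache_key(False, 0.0)
```

Because the key changes whenever either factor changes, stale dynamo bytecode is never served across configurations, without needing dynamo guards at all.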
Force-pushed 054e9d6 to 3498f8a (Compare)
zou3519 approved these changes on Nov 25, 2025
zhxchen17 added a commit to zhxchen17/vllm that referenced this pull request on Nov 28, 2025
Summary:
`enable_sleep_mode` introduces a new allocation context that subtly changes dynamo compilation results, so it should be included in the caching factors (similar to vllm-project#29435).

Test Plan:
First run test_cumem.py: `pytest tests/basic_correctness/test_cumem.py`
Then run test_cpu_offload.py: `pytest tests/basic_correctness/test_cpu_offload.py`
The second run fails without including `enable_sleep_mode` in the caching factors; after adding `enable_sleep_mode`, both tests pass.

Signed-off-by: zhxchen17 <zhxchen17@fb.com>
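The failure mode that test plan exercises can be illustrated with the same hash-factor pattern. `cache_key` and the factor lists below are hypothetical, for illustration only, not vllm's real caching code:

```python
import hashlib

def cache_key(factors: list) -> str:
    # Hash the list of graph-affecting config values into a cache key.
    return hashlib.sha256(repr(factors).encode()).hexdigest()

# If enable_sleep_mode is omitted from the factors, a run with sleep
# mode on hashes identically to one with it off and silently reuses
# the other run's cached compilation artifact:
stale = cache_key([2.0]) == cache_key([2.0])
# Adding enable_sleep_mode as a factor separates the two configurations:
fixed = cache_key([2.0, True]) != cache_key([2.0, False])
```

This is why the second test run fails before the fix: the hash cannot distinguish the two configurations, so the wrong cached artifact is loaded.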
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request on Nov 29, 2025
…es. (vllm-project#29435) Signed-off-by: zhxchen17 <zhxchen17@fb.com>
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request on Dec 1, 2025
…es. (vllm-project#29435) Signed-off-by: zhxchen17 <zhxchen17@fb.com>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request on Jan 21, 2026
…es. (vllm-project#29435) Signed-off-by: zhxchen17 <zhxchen17@fb.com> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>