[caching] Add enable_prompt_embeds and cpu_offload_gb to compile hashes.#29435

Merged
zou3519 merged 1 commit into vllm-project:main from zhxchen17:zhxchen17/caching/fix1
Nov 25, 2025

Conversation

@zhxchen17 (Contributor) commented on Nov 25, 2025

Summary:

This is a re-apply of #27285, which recently regressed on vLLM trunk.

enable_prompt_embeds causes the input_ids argument to be None instead of a tensor, which should invalidate the compile cache at the vLLM level. Previously this wasn't an issue because Inductor has its own cache validation that serves as the last line of defence.

Now that we have enabled AOT compilation, the dynamo bytecode is also cached, so we need to guard it against input type changes (e.g. Tensor -> None here).
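
As an illustration (plain Python, not vLLM's actual cache code), a compiled-artifact cache keyed only by config silently reuses a stale artifact when the input flips from a tensor to None; folding the input's None-ness into the key plays the role of the discarded dynamo guard:

```python
# Sketch only: a toy artifact cache standing in for the compiled-code cache.
compiled_cache: dict = {}

def get_compiled(config_hash: str, input_ids):
    # Keying on the input's None-ness mimics a dynamo guard on `input_ids is None`.
    key = (config_hash, input_ids is None)
    if key not in compiled_cache:
        # Stand-in for compilation; a real cache would store dynamo bytecode here.
        kind = "embeds" if input_ids is None else "ids"
        compiled_cache[key] = f"artifact-{config_hash}-{kind}"
    return compiled_cache[key]
```

Without the `input_ids is None` component in the key, both call patterns would hit the same entry, which is exactly the staleness the PR guards against.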

There are two ways to do this:

1. Use dynamo guards, so this is guarded at the torch.compile level.
2. Add enable_prompt_embeds to compute_hash, so this is guarded at the vLLM level.

In the short term, option 2 seems to be the better approach, because vLLM already throws away all the guards from dynamo, and enabling them would be a non-trivial change to the existing code.

cpu_offload_gb also affects the model inputs, since different offloading configs produce different graphs.
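
A minimal sketch of option 2 (hypothetical factor values; vLLM's real compute_hash lives on its config classes and folds in many more factors):

```python
import hashlib

def compute_hash(factors: list) -> str:
    # Fold each config factor into a single cache key; any change to a
    # listed factor yields a different hash and forces recompilation.
    hasher = hashlib.sha256()
    for factor in factors:
        hasher.update(str(factor).encode("utf-8"))
    return hasher.hexdigest()

# Hypothetical config values for illustration:
# [model name, enable_prompt_embeds, cpu_offload_gb]
key_with_embeds = compute_hash(["model-x", True, 0.0])
key_without = compute_hash(["model-x", False, 0.0])
```

Flipping either enable_prompt_embeds or cpu_offload_gb changes the hash, so the stale compiled artifact can no longer be picked up.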

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

Purpose

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly adds enable_prompt_embeds and cpu_offload_gb to the compilation hash to ensure proper cache invalidation. The changes are straightforward and well-justified. I've identified a related issue where multimodal configuration parameters that affect the computation graph are also being ignored in the hash calculation, and I've provided a suggestion to address this to improve caching robustness for multimodal models.

@zhxchen17 force-pushed the zhxchen17/caching/fix1 branch from 054e9d6 to 3498f8a on November 25, 2025 19:27
@zou3519 added the `ready` label (ONLY add when PR is ready to merge/full CI is needed) on Nov 25, 2025
@zou3519 enabled auto-merge (squash) on November 25, 2025 19:34
@zou3519 merged commit 0abc794 into vllm-project:main on Nov 25, 2025
49 checks passed
zhxchen17 added a commit to zhxchen17/vllm that referenced this pull request Nov 28, 2025
Summary:

`enable_sleep_mode` will introduce a new allocation context, which subtly changes dynamo compilation results.
Therefore we should include it in the caching factors (similar to vllm-project#29435).

Test Plan:

First run test_cumem.py
pytest tests/basic_correctness/test_cumem.py

Then run test_cpu_offload.py
pytest tests/basic_correctness/test_cpu_offload.py

This fails without including `enable_sleep_mode` in the caching factors.
After adding `enable_sleep_mode`, both tests pass.
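
The collision can be sketched in plain Python (hypothetical factor names, not vLLM's actual config code): leaving `enable_sleep_mode` out of the hashed factors makes the two test configurations share one cache entry.

```python
import hashlib

def cache_key(config: dict, hashed_factors: tuple) -> str:
    # Only the listed factors contribute to the key.
    data = "|".join(f"{name}={config[name]}" for name in hashed_factors)
    return hashlib.sha256(data.encode("utf-8")).hexdigest()

sleep_cfg = {"model": "m", "enable_sleep_mode": True}
plain_cfg = {"model": "m", "enable_sleep_mode": False}

# Bug: with the flag excluded, both configs map to the same compiled artifact.
collide = cache_key(sleep_cfg, ("model",)) == cache_key(plain_cfg, ("model",))
# Fix: including the flag separates the keys, so each config recompiles.
fixed = (cache_key(sleep_cfg, ("model", "enable_sleep_mode"))
         != cache_key(plain_cfg, ("model", "enable_sleep_mode")))
```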

Reviewers:

Subscribers:

Tasks:

Tags:

Signed-off-by: zhxchen17 <zhxchen17@fb.com>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
…es. (vllm-project#29435)

Signed-off-by: zhxchen17 <zhxchen17@fb.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>

Labels

ready ONLY add when PR is ready to merge/full CI is needed
