[caching] Add enable_prompt_embeds and cpu_offload_gb to compile hashes. #29435
Merged
zou3519 merged 1 commit into vllm-project:main on Nov 25, 2025
Conversation
Contributor
Code Review
This pull request correctly adds enable_prompt_embeds and cpu_offload_gb to the compilation hash to ensure proper cache invalidation. The changes are straightforward and well-justified. I've identified a related issue where multimodal configuration parameters that affect the computation graph are also being ignored in the hash calculation, and I've provided a suggestion to address this to improve caching robustness for multimodal models.
Summary:
This is a reland of vllm-project#27285, which regressed in the vllm trunk recently.

`enable_prompt_embeds` causes the `input_ids` argument to be None instead of a tensor, which must invalidate the compile cache at the vllm level. Previously this wasn't an issue because inductor has its own caching validation that serves as the last line of defence. Now that AOT compilation is enabled, the dynamo bytecode is also cached, so it needs to be guarded against input type changes (e.g. Tensor -> None here).

There are two ways to do this:
1. Use dynamo guards, so this is guarded at the torch.compile level.
2. Add `enable_prompt_embeds` to `compute_hash`, so this is guarded at the vllm level.

In the short term, option 2 seems to be the better approach, because vllm already throws away all the guards from dynamo and enabling them would be a non-trivial change to the existing code.

`cpu_offload_gb` also affects model inputs, since different offloading configs produce different graphs.

Signed-off-by: zhxchen17 <zhxchen17@fb.com>
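The guarding pattern chosen in option 2 can be sketched roughly as follows. This is a minimal illustration of folding graph-affecting config values into a cache key; `compile_cache_key` is a hypothetical stand-in, not vllm's actual `compute_hash` implementation:

```python
import hashlib

def compile_cache_key(enable_prompt_embeds: bool, cpu_offload_gb: float) -> str:
    # Every config value that can change the compiled artifact must be
    # folded into the key: enable_prompt_embeds flips input_ids between
    # a Tensor and None, and cpu_offload_gb changes the traced graph.
    factors = [enable_prompt_embeds, cpu_offload_gb]
    return hashlib.sha256(repr(factors).encode()).hexdigest()

# Different settings now map to different cache entries, so a cached
# artifact compiled for one configuration is never reused for the other.
key_on = compile_cache_key(True, 0.0)
key_off = compile_cache_key(False, 0.0)
```

Because the key changes whenever either factor changes, stale dynamo bytecode is never served across configurations, without needing dynamo guards at all.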
Force-pushed 054e9d6 to 3498f8a (Compare)
zou3519 approved these changes on Nov 25, 2025
zhxchen17 added a commit to zhxchen17/vllm that referenced this pull request on Nov 28, 2025
Summary:
`enable_sleep_mode` introduces a new allocation context that subtly changes dynamo compilation results, so it should be included in the caching factors (similar to vllm-project#29435).

Test Plan:
First run test_cumem.py: `pytest tests/basic_correctness/test_cumem.py`
Then run test_cpu_offload.py: `pytest tests/basic_correctness/test_cpu_offload.py`
The second run fails without including `enable_sleep_mode` in the caching factors; after adding `enable_sleep_mode`, both tests pass.

Signed-off-by: zhxchen17 <zhxchen17@fb.com>
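The failure mode that test plan exercises can be illustrated with the same hash-factor pattern. `cache_key` and the factor lists below are hypothetical, for illustration only, not vllm's real caching code:

```python
import hashlib

def cache_key(factors: list) -> str:
    # Hash the list of graph-affecting config values into a cache key.
    return hashlib.sha256(repr(factors).encode()).hexdigest()

# If enable_sleep_mode is omitted from the factors, a run with sleep
# mode on hashes identically to one with it off and silently reuses
# the other run's cached compilation artifact:
stale = cache_key([2.0]) == cache_key([2.0])
# Adding enable_sleep_mode as a factor separates the two configurations:
fixed = cache_key([2.0, True]) != cache_key([2.0, False])
```

This is why the second test run fails before the fix: the hash cannot distinguish the two configurations, so the wrong cached artifact is loaded.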
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request on Nov 29, 2025
…es. (vllm-project#29435) Signed-off-by: zhxchen17 <zhxchen17@fb.com>
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request on Dec 1, 2025
…es. (vllm-project#29435) Signed-off-by: zhxchen17 <zhxchen17@fb.com>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request on Jan 21, 2026
…es. (vllm-project#29435) Signed-off-by: zhxchen17 <zhxchen17@fb.com> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>