[compile] Add enable_prompt_embeds to compile hash. #27285
DarkLight1337 merged 2 commits into vllm-project:main from
Conversation
Code Review
This pull request correctly addresses a potential cache invalidation issue by including enable_prompt_embeds in the compilation hash. This ensures that changes to this flag, which can alter the input types to the model, will properly trigger a re-compilation. I've added one suggestion to also include runner_type and convert_type in the hash, as they also seem to have a significant impact on the computation graph and could lead to similar caching problems if not included. Overall, this is a good fix.
@@ -339,6 +339,7 @@ def compute_hash(self) -> str:
     factors.append(self.rope_scaling)
     factors.append(self.rope_theta)
     factors.append(self.video_pruning_rate)
+    factors.append(self.enable_prompt_embeds)
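To make the mechanism concrete, here is a minimal sketch (an illustration, not vLLM's actual implementation) of how a list of config "factors" can be folded into a deterministic cache key, so that flipping any factor, such as `enable_prompt_embeds`, produces a different hash and forces recompilation:

```python
# Illustrative sketch only: a stable digest over config factors.
# The factor values below are made-up placeholders, not real vLLM defaults.
import hashlib

def compute_hash(factors: list) -> str:
    # repr() gives a stable textual encoding for simple config values
    return hashlib.sha256(repr(factors).encode()).hexdigest()

key_a = compute_hash([True, 10000.0, 0.0])   # e.g. enable_prompt_embeds=True
key_b = compute_hash([False, 10000.0, 0.0])  # e.g. enable_prompt_embeds=False
assert key_a != key_b  # flipping the flag invalidates the cached artifact
```

Any factor left out of this list simply cannot influence the key, which is exactly the failure mode this PR closes for `enable_prompt_embeds`.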
Good catch adding enable_prompt_embeds to the compilation hash.
While reviewing this, I noticed that runner_type and convert_type also seem to affect the computation graph but are not currently included in the hash. These fields can determine which model implementation is used (e.g., for generation vs. pooling) or whether a model adapter is applied, both of which are significant changes to the graph.
To prevent potential cache collisions when switching between runners or converters for the same base model, it would be safer to include them in the hash factors. What do you think about adding them here?
Suggested change:
     factors.append(self.enable_prompt_embeds)
+    factors.append(self.runner_type)
+    factors.append(self.convert_type)
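The collision risk described above can be shown with a toy example (hypothetical config values, same illustrative hashing as before, not vLLM code): two configs that differ only in `runner_type` collide when that field is omitted from the factors, and diverge once it is included.

```python
# Hypothetical demonstration of the cache-collision risk: omitting a
# graph-affecting field from the hash factors maps two different
# configurations onto the same cache entry.
import hashlib

def key(factors: list) -> str:
    return hashlib.sha256(repr(factors).encode()).hexdigest()

generate_cfg = {"model": "m", "runner_type": "generate"}
pooling_cfg = {"model": "m", "runner_type": "pooling"}

# runner_type left out of the factors: both configs hash identically
assert key([generate_cfg["model"]]) == key([pooling_cfg["model"]])

# runner_type included: the configs get distinct cache keys
k_gen = key([generate_cfg["model"], generate_cfg["runner_type"]])
k_pool = key([pooling_cfg["model"], pooling_cfg["runner_type"]])
assert k_gen != k_pool
```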
Summary:

Fixing issue vllm-project#27283.

`enable_prompt_embeds` will make the `input_ids` argument `None` instead of a tensor type, which will invalidate the compile cache at the vLLM level. Previously this wasn't an issue because Inductor has its own caching validation that serves as the last line of defence. Now that we enabled AOT compilation, the dynamo bytecode is also cached, and therefore we need to guard it against input type changes (e.g. Tensor -> None here).

There are two ways to do this:
1. Use dynamo guards, so this is guarded at the torch.compile level.
2. Add enable_prompt_embeds to `compute_hash`, so this is guarded at the vLLM level.

In the short term, 2. seems to be the better approach because vLLM already throws away all the guards from dynamo, and enabling the guards is a non-trivial change to the existing code.

Test Plan: (with torch 2.10.dev) `pytest tests/basic_correctness/test_basic_correctness.py`

Signed-off-by: zhxchen17 <zhxchen17@fb.com>
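The failure mode above can be modeled with a toy compiled-function cache (assumptions and made-up names throughout, not vLLM or dynamo code): once compiled artifacts are cached ahead of time, a key that ignores `enable_prompt_embeds` happily reuses code specialized on `input_ids` being a tensor even after it becomes `None`.

```python
# Toy model of the bug: a compile cache whose key does not reflect
# whether input_ids is a tensor or None. All names are hypothetical.
compiled_cache = {}

def get_compiled(cache_key, input_ids):
    if cache_key not in compiled_cache:
        # "compile" a function specialized on the input type seen here
        if input_ids is None:
            compiled_cache[cache_key] = lambda ids, emb: emb       # prompt-embeds path
        else:
            compiled_cache[cache_key] = lambda ids, emb: [i * 2 for i in ids]  # token-ids path
    return compiled_cache[cache_key]

# BAD: key ignores enable_prompt_embeds -> the stale specialization is reused
f = get_compiled("model-v1", [1, 2, 3])
g = get_compiled("model-v1", None)   # cache hit, wrong code path
assert f is g

# GOOD: fold the flag into the key (what this PR does via compute_hash)
f2 = get_compiled(("model-v1", False), [1, 2, 3])
g2 = get_compiled(("model-v1", True), None)
assert f2 is not g2
```

This also explains why approach 1 (dynamo guards) would work in principle: a guard on the type of `input_ids` is just a runtime check that plays the role the cache key plays here.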
f8dc86e to
c2c379a
Signed-off-by: zhxchen17 <zhxchen17@fb.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Summary:

This is a reland of vllm-project#27285, since it regressed in the vLLM trunk recently.

`enable_prompt_embeds` will make the `input_ids` argument `None` instead of a tensor type, which will invalidate the compile cache at the vLLM level. Previously this wasn't an issue because Inductor has its own caching validation that serves as the last line of defence. Now that we enabled AOT compilation, the dynamo bytecode is also cached, and therefore we need to guard it against input type changes (e.g. Tensor -> None here).

There are two ways to do this:
1. Use dynamo guards, so this is guarded at the torch.compile level.
2. Add enable_prompt_embeds to `compute_hash`, so this is guarded at the vLLM level.

In the short term, 2. seems to be the better approach because vLLM already throws away all the guards from dynamo, and enabling the guards is a non-trivial change to the existing code.

`cpu_offload_gb` will also affect model inputs, since it produces a different graph for different offloading configs.

Signed-off-by: zhxchen17 <zhxchen17@fb.com>