[Bug] Fix torch Compilation Cache Hit Error#25093
simon-mo merged 3 commits into vllm-project:main
Conversation
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Code Review
This pull request addresses a critical bug where an insufficient cache key for compiled CUDA graphs caused cache collisions and errors when using the deepep_high_throughput backend. The fix correctly disables CUDA graphs for this specific configuration, preventing the crash. This is a solid, pragmatic solution for the immediate problem. I've added one high-severity suggestion to add a TODO comment to track the technical debt of this temporary fix, ensuring the long-term goal of re-enabling this performance feature with a more robust caching mechanism is not lost.
ProExpertProg
left a comment
I don't understand, is this a torch compile caching issue or is it a CUDAGraph issue? These are (somewhat) orthogonal features. I don't know of any cudagraph caching. Also I think we should be able to disable CUDAGraphs but keep compilation (maybe it just can't be piecewise).
@ProExpertProg This is a compile caching issue. Yeah, we can keep compilation as it is and just disable CUDA graphs; fixed now.
ProExpertProg
left a comment
This seems fine, but if we want more performance we could also just disable inductor compile caching (increases startup time but would give the best performance).
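As a sketch of the alternative suggested above: vLLM exposes an environment variable to bypass the torch compile cache, trading longer startup time for a guaranteed fresh compile (the variable name `VLLM_DISABLE_COMPILE_CACHE` is my understanding of vLLM's env config; the model name is a placeholder).

```shell
# Disable the torch compile cache so every launch recompiles from scratch:
# slower startup, but no risk of a stale or colliding cache entry.
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve <model>
```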
Yes, but it seems that piecewise CUDA graph for HT is mainly beneficial for decoding; for prefill we don't see much performance improvement.
# TODO: Piecewise Cuda graph might be enabled
# if torch compile cache key issue fixed
# See https://github.com/vllm-project/vllm/pull/25093
is this a bug? can you file an issue if so?
This is not a bug; it is just that the cache key is not strong enough to support splitting. I don't think the refactor is worth doing just to support HT piecewise CUDA graph, so let's leave it there.
Purpose
Fixes #24915
The root cause of this issue is that (runtime_shape, graph_index, backend_name) is not a strong enough cache key for compiled CUDA graphs. In a complicated situation like DeepEP HT, we split the piecewise graph into moe_forward and moe_forward_shared, which produces a wrong cache hit.

I am not sure it is a good idea to thoroughly refactor the cache key system just to support piecewise graphs for HT (perhaps not worth it), so this PR simply removes the support for the HT graph for now. If we encounter other scenarios where the cache key proves insufficient, we should revisit and redesign the cache system.
Note: a rough idea for refactoring the cache system is to add a fourth key component carrying the signature of the sub-graph. This works well locally.
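The collision and the proposed fix can be illustrated with a minimal, self-contained sketch (all names here are illustrative stand-ins, not vLLM's actual cache code):

```python
import hashlib

# A cache keyed only by (runtime_shape, graph_index, backend_name) cannot
# distinguish two different subgraphs that share those three values.
cache = {}

def get_or_compile(runtime_shape, graph_index, backend_name, subgraph_src):
    weak_key = (runtime_shape, graph_index, backend_name)
    if weak_key in cache:
        return cache[weak_key]              # may be the WRONG graph (collision)
    compiled = f"compiled<{subgraph_src}>"  # stand-in for real compilation
    cache[weak_key] = compiled
    return compiled

# Two distinct subgraphs (e.g. moe_forward vs moe_forward_shared) with the
# same shape/index/backend collide: the second lookup returns the first graph.
a = get_or_compile((8, 128), 0, "inductor", "moe_forward")
b = get_or_compile((8, 128), 0, "inductor", "moe_forward_shared")
assert a == b  # wrong cache hit: b is the compiled moe_forward graph

# Adding a fourth key component -- a hash of the subgraph's signature --
# keeps the two entries separate.
def strong_key(runtime_shape, graph_index, backend_name, subgraph_src):
    sig = hashlib.sha256(subgraph_src.encode()).hexdigest()[:16]
    return (runtime_shape, graph_index, backend_name, sig)

k1 = strong_key((8, 128), 0, "inductor", "moe_forward")
k2 = strong_key((8, 128), 0, "inductor", "moe_forward_shared")
assert k1 != k2  # distinct cache entries, no collision
```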
Test
Originally: Wrong graph cache hit in the second run
Now:
(APIServer pid=1527118) INFO: Started server process [1527118]
(APIServer pid=1527118) INFO: Waiting for application startup.
(APIServer pid=1527118) INFO: Application startup complete.