
Support piecewise cuda graph for MLA #11812

Merged: ispobock merged 19 commits into main from ke/mla-compile on Nov 10, 2025

Conversation

@ispobock (Collaborator) commented on Oct 18, 2025

Motivation

Support piecewise CUDA graph for models using MLA (Multi-head Latent Attention).

Triton

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V2-Lite --enable-piecewise-cuda-graph --piecewise-cuda-graph-max-tokens 8192 --attention-backend triton
python3 benchmark/gsm8k/bench_sglang.py --parallel 1319 --num-questions 1319

Accuracy: 0.387
Invalid: 0.005
Latency: 16.465 s
Output throughput: 9663.436 token/s

Flashinfer

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V2-Lite --enable-piecewise-cuda-graph --piecewise-cuda-graph-max-tokens 8192 --attention-backend flashinfer
python3 benchmark/gsm8k/bench_sglang.py --parallel 1319 --num-questions 1319

Accuracy: 0.384
Invalid: 0.005
Latency: 15.206 s
Output throughput: 10550.984 token/s

FA3

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V2-Lite --enable-piecewise-cuda-graph --piecewise-cuda-graph-max-tokens 8192 --attention-backend fa3
python3 benchmark/gsm8k/bench_sglang.py --parallel 1319 --num-questions 1319

Accuracy: 0.387
Invalid: 0.004
Latency: 15.164 s
Output throughput: 10469.733 token/s

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @ispobock, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly advances support for piecewise CUDA graphs, particularly for Multi-head Latent Attention (MLA) and DeepSeek models. It refines the integration of torch.compile by addressing stability issues and ensuring that model compilation and execution are properly managed within the CUDA graph context. The changes also make the attention mechanisms more flexible by allowing additional parameters and improve compatibility with various model architectures, leading to more efficient and robust model execution.

Highlights

  • Piecewise CUDA Graph Integration: Refactored and enhanced the piecewise CUDA graph context management, ensuring that model compilation and execution correctly utilize this feature for improved performance.
  • DeepSeek Model Compatibility: Added specific support for DeepSeek models by recognizing their unique attention layer attribute (attn_mqa) and adjusting attention backend behavior within the CUDA graph mode to maintain compatibility.
  • torch.compile Stability Improvements: Introduced configuration changes to torch.compile to mitigate FailOnRecompileLimitHit errors, significantly improving the robustness and reliability of graph compilation.
  • Flexible Attention Parameters: Extended the unified_attention_with_output function to accept additional attention-related parameters like RoPE (Rotary Positional Embedding) and sinks, allowing for more complex and customizable attention mechanisms.
  • Dynamic Argument Dimension Inference: Improved the inference of dynamic argument dimensions for torch.Tensor and Optional[torch.Tensor] parameters, especially when `from __future__ import annotations` turns annotations into strings, which is crucial for correct torch.compile behavior (see the sketch after this list).
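
A minimal sketch of how the dynamic-dimension inference in the last highlight might look, assuming the compile machinery inspects a wrapped function's signature to find tensor arguments; find_dynamic_tensor_args and unified_attention_example are illustrative names, not the actual sglang API:

from __future__ import annotations

import inspect
import typing
from typing import Optional

import torch


def find_dynamic_tensor_args(fn) -> list[str]:
    # With `from __future__ import annotations`, every annotation is stored as a
    # string, so it must be resolved with typing.get_type_hints() before it can
    # be compared against torch.Tensor.
    hints = typing.get_type_hints(fn)
    dynamic = []
    for name in inspect.signature(fn).parameters:
        hint = hints.get(name)
        if hint is torch.Tensor:
            dynamic.append(name)
        elif typing.get_origin(hint) is typing.Union and torch.Tensor in typing.get_args(hint):
            # Optional[torch.Tensor] resolves to Union[torch.Tensor, None].
            dynamic.append(name)
    return dynamic


def unified_attention_example(q: torch.Tensor, k: torch.Tensor,
                              sinks: Optional[torch.Tensor] = None) -> None:
    ...


# Prints ['q', 'k', 'sinks']: the arguments whose token dimension would be
# marked dynamic before torch.compile captures the piecewise graph.
print(find_dynamic_tensor_args(unified_attention_example))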

ispobock marked this pull request as draft on October 18, 2025 18:14
@gemini-code-assist (Bot) left a comment

Code Review

This pull request adds support for piecewise CUDA graphs for models with Multi-head Latent Attention (MLA), which is a valuable performance optimization. The changes correctly adapt the codebase for torch.compile: handling string-based type annotations, working around torch.compile limitations with certain tensor operations, and ensuring static code paths for graph capture. The PR also includes correctness fixes for attention layers and broadens model support. I've identified one minor inconsistency in an optimization that could be addressed for completeness.

query_pass = query[..., self.rotary_dim :]
key_pass = key[..., self.rotary_dim :]

self.cos_sin_cache: torch.Tensor = self.cos_sin_cache.to(positions.device)
@gemini-code-assist (Contributor) commented (severity: medium)

The removal of self.cos_sin_cache.to(positions.device) in forward_native is a good optimization to avoid a redundant device transfer. However, a similar redundant call still exists in the forward_npu method of the same DeepseekScalingRotaryEmbedding class (at line 825). For consistency and to apply the same optimization for NPU devices, this line should also be removed from forward_npu.
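
A minimal sketch of the pattern behind this suggestion, assuming the rotary cache can be registered as a buffer so the model's device placement handles the transfer once; the class and argument names here are illustrative, not the actual DeepseekScalingRotaryEmbedding code:

import torch
from torch import nn


class RotaryCacheSketch(nn.Module):
    def __init__(self, rotary_dim: int, max_positions: int, base: float = 10000.0):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, rotary_dim, 2, dtype=torch.float32) / rotary_dim))
        t = torch.arange(max_positions, dtype=torch.float32)
        freqs = torch.outer(t, inv_freq)
        # Registering the cache as a buffer lets model.to(device) move it once at
        # load time, so the hot path never needs a per-call .to(positions.device).
        self.register_buffer("cos_sin_cache", torch.cat([freqs.cos(), freqs.sin()], dim=-1), persistent=False)

    def forward(self, positions: torch.Tensor) -> torch.Tensor:
        # No device transfer here: the buffer already lives on the module's device,
        # which keeps the captured graph free of redundant copies.
        return self.cos_sin_cache[positions]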

ispobock marked this pull request as ready for review on November 1, 2025 09:50
ispobock added and then removed the run-ci label on Nov 1, 2025
@Oasis-Git (Collaborator) commented

LGTM. However, I suggest postponing the merge until:

  1. [PieceWise CUDA Graph] Support awq/gptq model in piecewise cudagraph #12518 is merged, since the overall modification to context control is heavy in this branch
  2. The capture problem of the MLA model is understood

not get_global_server_args().flashinfer_mla_disable_ragged
and extend_no_prefix
# Piecewise cuda graph should use paged prefill to be compatible with prefix cache
and not is_in_piecewise_cuda_graph()
@Edenzzzz (Contributor) commented

Wondering why the piecewise CUDA graph can impact attention execution?

@ispobock (Collaborator, Author) commented on Nov 8, 2025

Hi @Edenzzzz, good question! In the forward pass of the DeepSeek model (ref), there are essentially two types of extend: MHA for the no-prefix case and MLA for the with-prefix case, but we can only capture one of them in the prefill CUDA graph. We currently choose MLA since it can be used both with and without a prefix. So in the flashinfer_mla attention backend, we only use the paged prefill kernel for the MLA forward.
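
A minimal sketch of the dispatch described above, based on the condition shown in the diff hunk; flashinfer_mla_disable_ragged, extend_no_prefix, and is_in_piecewise_cuda_graph mirror the names in that snippet, while the surrounding function is purely illustrative:

def choose_extend_kernel(flashinfer_mla_disable_ragged: bool,
                         extend_no_prefix: bool,
                         in_piecewise_cuda_graph: bool) -> str:
    # Ragged (MHA-style) prefill is only valid when the request has no cached
    # prefix, and it cannot be captured in the piecewise CUDA graph because the
    # captured graph must also serve requests that do hit the prefix cache.
    use_ragged = (
        not flashinfer_mla_disable_ragged
        and extend_no_prefix
        and not in_piecewise_cuda_graph
    )
    return "ragged_mha_prefill" if use_ragged else "paged_mla_prefill"


# Inside a piecewise CUDA graph, the paged MLA path is always selected,
# regardless of whether the prefix is empty.
assert choose_extend_kernel(False, True, True) == "paged_mla_prefill"
assert choose_extend_kernel(False, True, False) == "ragged_mha_prefill"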

@Edenzzzz (Contributor) replied

Got it, ragged is MHA.

@Oasis-Git (Collaborator) commented

LGTM. I think it could be merged.

ispobock merged commit db24d34 into main on Nov 10, 2025 (81 of 83 checks passed)
ispobock deleted the ke/mla-compile branch on November 10, 2025 01:13