Support more models in piecewise CUDA graph #11745

Merged
ispobock merged 5 commits into sgl-project:main from narutolhy:reshape_for_piecewise_cuda_graph
Oct 24, 2025

Conversation

@narutolhy
Contributor

Motivation

This PR refines the execution conditions in the CUDA graph runner and relaxes strict output shape checks to improve compatibility with models that have different tensor layouts.

Specifically, some models (e.g., Gemma-3-1B-IT) produce tensors with the same total number of elements but different shapes, due to layout or internal reshape differences. The previous strict assertion, assert output.shape == ret.shape, would incorrectly fail in such cases.
In addition, CUDA graph execution should be allowed when logprobs are returned for the last token only, but not when input-token logprobs are required.

Modifications

1. piecewise_cuda_graph_runner.py

Updated the can_run() logic:
Previously, CUDA graph capture was skipped whenever forward_batch.return_logprob was True.
Now it is skipped only when input-id logprobs are requested.

for start_len, seq_len in zip(
    forward_batch.extend_logprob_start_lens_cpu,
    forward_batch.extend_seq_lens_cpu
):
    if start_len is not None and start_len < seq_len:
        return True

2. radix_attention.py

Replaced the strict shape check:

assert output.shape == ret.shape

with a relaxed element-count check:

assert output.numel() == ret.numel()
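
As a minimal illustration of why the element-count invariant suffices here (an editor's sketch with made-up shapes, not the actual radix_attention tensors):

import torch

# Same 1024 elements, different layouts (shapes are illustrative only).
output = torch.randn(1, 4, 256)
ret = torch.empty(4, 256)

assert output.numel() == ret.numel()  # the relaxed check from this PR
ret.copy_(output.reshape(ret.shape))  # safe whenever element counts match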

Accuracy Tests

Verified correctness on multiple models including:

Qwen3-0.6B

Gemma-3-1B-IT

Outputs match the baseline within FP16 tolerance.

Logprob return path remains functionally unchanged.
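
A tolerance check of this kind might look like the following sketch (tensor values and tolerances are illustrative; the PR's actual test lives in test/srt/test_piecewise_cuda_graph.py):

import torch

eager_out = torch.randn(4, 8).half()                               # baseline path
graph_out = (eager_out.float() + 1e-3 * torch.randn(4, 8)).half()  # graph path
torch.testing.assert_close(graph_out, eager_out, rtol=1e-2, atol=1e-2)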

@gemini-code-assist
Contributor

Summary of Changes

Hello @narutolhy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the compatibility and flexibility of the CUDA graph runner by addressing limitations related to tensor shape strictness and logprob handling. By relaxing output shape checks to focus on element count and refining the conditions under which CUDA graphs can be utilized with logprobs, it enables support for a wider array of models, such as Gemma-3-1B-IT, ensuring efficient execution without compromising numerical accuracy.

Highlights

  • Relaxed Output Shape Check: The strict assertion for exact tensor shape matching (output.shape == ret.shape) in radix_attention.py has been replaced with a more flexible check ensuring only the total number of elements (output.numel() == ret.numel()) is identical. This allows models with different tensor layouts but the same data size to function correctly.
  • Refined CUDA Graph can_run() Logic: The piecewise_cuda_graph_runner.py now permits CUDA graph execution even when return_logprob is enabled, provided that logprobs are not requested for input IDs (i.e., only for the last token). This enhances compatibility for models that return logprobs selectively.
  • Explicit return_logprob Handling in replay_prepare: The replay_prepare function in piecewise_cuda_graph_runner.py now explicitly sets return_logprob=False when invoking the model runner's forward pass during graph replay. This ensures consistent behavior with the updated can_run logic for logprob handling within CUDA graphs.

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces two main changes: relaxing the output shape check in radix_attention to support models with varying tensor layouts, and updating the piecewise_cuda_graph_runner to allow CUDA graph execution when only last-token logprobs are requested.

The change in radix_attention.py to check numel() instead of shape is a good improvement for broader model compatibility.

However, I've found a critical issue in piecewise_cuda_graph_runner.py. The new logic in can_run to check for input logprob requests is inverted and also doesn't handle cases where logprobs are not requested at all, which could lead to a TypeError. This would either cause a crash or result in silently dropping requested input logprobs when using CUDA graphs. Please see my detailed comment for a suggested fix.

Comment on lines +257 to +262
for start_len, seq_len in zip(
    forward_batch.extend_logprob_start_lens_cpu,
    forward_batch.extend_seq_lens_cpu,
):
    if start_len is not None and start_len < seq_len:
        return True
Contributor


critical

There are two issues with this new logic:

  1. forward_batch.extend_logprob_start_lens_cpu can be None, which will cause zip to raise a TypeError. You should check if forward_batch.return_logprob is True before iterating, similar to the old logic.
  2. The logic inside the loop is inverted. If input-id logprobs are requested (start_len < seq_len), CUDA graph execution should be skipped, so can_run should return False. The current code returns True, which would lead to incorrect behavior as the requested logprobs would not be computed in the CUDA graph path.
Suggested change

-for start_len, seq_len in zip(
-    forward_batch.extend_logprob_start_lens_cpu,
-    forward_batch.extend_seq_lens_cpu,
-):
-    if start_len is not None and start_len < seq_len:
-        return True
+if forward_batch.return_logprob:
+    for start_len, seq_len in zip(
+        forward_batch.extend_logprob_start_lens_cpu,
+        forward_batch.extend_seq_lens_cpu,
+    ):
+        if start_len is not None and start_len < seq_len:
+            return False
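
For context, here is a self-contained sketch of the suggested control flow (an editor's illustration; the function name and standalone parameters are stand-ins inferred from the snippet above, not the actual sglang signature):

from typing import List, Optional

def logprob_part_can_run(
    return_logprob: bool,
    extend_logprob_start_lens_cpu: Optional[List[Optional[int]]],
    extend_seq_lens_cpu: Optional[List[int]],
) -> bool:
    # Only inspect per-request start lengths when logprobs are requested;
    # otherwise the CPU lists may be None and zip() would raise TypeError.
    if return_logprob:
        for start_len, seq_len in zip(
            extend_logprob_start_lens_cpu, extend_seq_lens_cpu
        ):
            # start_len < seq_len means logprobs are wanted for input
            # tokens, which the piecewise CUDA graph path cannot serve,
            # so the runner must fall back to eager execution.
            if start_len is not None and start_len < seq_len:
                return False
    return True

# Last-token-only logprobs (start_len == seq_len) stay on the graph path;
# an input-token request forces the eager fallback.
assert logprob_part_can_run(True, [8], [8]) is True
assert logprob_part_can_run(True, [3], [8]) is False
assert logprob_part_can_run(False, None, None) is True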

@narutolhy
Contributor Author

Hi @Oasis-Git, thank you for providing the piecewise_cuda_graph feature; it is very helpful. I relaxed the usage conditions so that it applies to more scenarios and models. Please take a look. Thank you.

    seq_lens_sum=forward_batch.seq_lens_sum,
    encoder_lens=forward_batch.encoder_lens,
-   return_logprob=forward_batch.return_logprob,
+   return_logprob=False,
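
(This is the replay_prepare change summarized in the bot's third highlight above: during graph replay, the runner now passes return_logprob=False regardless of the batch flag.)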
Collaborator


For logprobs support, @Oasis-Git is working on that.

Contributor Author


Thanks, I see the plan. I'm only temporarily supporting returning logprobs for non-input tokens. Originally, the can_run function disallowed piecewise CUDA graphs whenever a batch requested logprobs at all. Since it seems only input-token logprobs are actually unsupported, I relaxed the condition a bit to also support returning the logprob of the last token in prefill-only scenarios.
Same as line 223: it is all False here and can be changed to True later. I am only using it temporarily, so I set it to False for now. Thanks.

@ispobock
Collaborator

@narutolhy Thanks for the contribution! We have a Slack channel #piecewise-cuda-graph to discuss this feature; feel free to join if you are interested.

@Oasis-Git
Collaborator

Hi @narutolhy

Thanks for your contribution. Could you please add the following to your PR:

  1. MMLU Test/Benchmark Output
  2. Unit Test for MMLU in file: https://github.com/sgl-project/sglang/blob/main/test/srt/test_piecewise_cuda_graph.py

@narutolhy
Contributor Author

OK, I will add them.

@narutolhy
Contributor Author

Hi @Oasis-Git, I have added the unit test for MMLU in https://github.com/sgl-project/sglang/blob/main/test/srt/test_piecewise_cuda_graph.py.
I also added tests for the Gemma model so that I can cover scenarios where the shapes differ but the total element count is the same.

There is no problem when using the unsloth/gemma-3-1b-it model, but an error occurs when using the unsloth/gemma-3-4b-it model:
AssertionError: Input addresses for cudagraphs are different during replay. Expected [140472240193024, 140472239365632, 140472239366144, 140383149409792, 140472239366656, 140466951553024, 140472239364608, 140472239365120], got [140472240193024, 140472239365632, 140472239366144, 140399219879424, 140472239366656, 140466951553024, 140472239364608, 140472239365120]

The value of the fourth pointer is inconsistent; I don't know if you've encountered this before.
That's why I used unsloth/gemma-3-1b-it for the MMLU test. Since it's a small model, the test thresholds were lowered.
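
For readers unfamiliar with this assertion: CUDA graph replay requires every captured input tensor to stay at the same device address as at capture time, so a single relocated buffer trips it. A minimal editor's sketch of that capture/replay contract (illustrative only; it does not diagnose the Gemma-3-4B issue's root cause):

import torch

static_in = torch.zeros(8, device="cuda")  # capture-time input buffer
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = static_in * 2             # captured computation

static_in.copy_(torch.arange(8.0, device="cuda"))  # reuse the SAME buffer
g.replay()                                         # OK: address unchanged

# Allocating a fresh input tensor and routing it into the captured region,
# instead of copying into static_in, is exactly the kind of address change
# that the "Input addresses for cudagraphs are different" assertion reports.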

Please review it.
Thank you

@Oasis-Git
Collaborator

Oasis-Git commented Oct 20, 2025

I left some comments in the Slack channel.

@ispobock ispobock merged commit 1801cd1 into sgl-project:main Oct 24, 2025
58 of 68 checks passed