Support more models in piecewise CUDA graph #11745
Conversation
Summary of Changes

Hello @narutolhy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request improves the compatibility and flexibility of the CUDA graph runner by addressing limitations around tensor-shape strictness and logprob handling. By relaxing output shape checks to compare element counts and by refining the conditions under which CUDA graphs can be used with logprobs, it enables support for a wider range of models, such as Gemma-3-1B-IT, ensuring efficient execution without compromising numerical accuracy.
Code Review
This pull request introduces two main changes: relaxing the output shape check in `radix_attention.py` to support models with varying tensor layouts, and updating the `piecewise_cuda_graph_runner` to allow CUDA graph execution when only last-token logprobs are requested.

The change in `radix_attention.py` to check `numel()` instead of `shape` is a good improvement for broader model compatibility.

However, I've found a critical issue in `piecewise_cuda_graph_runner.py`. The new logic in `can_run` that checks for input logprob requests is inverted, and it also doesn't handle the case where logprobs are not requested at all, which can lead to a `TypeError`. This would either cause a crash or silently drop requested input logprobs when using CUDA graphs. Please see my detailed comment for a suggested fix.
```python
for start_len, seq_len in zip(
    forward_batch.extend_logprob_start_lens_cpu,
    forward_batch.extend_seq_lens_cpu,
):
    if start_len is not None and start_len < seq_len:
        return True
```
There are two issues with this new logic:

1. `forward_batch.extend_logprob_start_lens_cpu` can be `None`, which will cause `zip` to raise a `TypeError`. You should check whether `forward_batch.return_logprob` is `True` before iterating, similar to the old logic.
2. The logic inside the loop is inverted. If input-id logprobs are requested (`start_len < seq_len`), CUDA graph execution should be skipped, so `can_run` should return `False`. The current code returns `True`, which would lead to incorrect behavior because the requested logprobs would not be computed in the CUDA graph path.
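A quick standalone illustration of the first issue (this is a toy snippet, not project code): when the batch does not request logprobs, the start-lens list is `None`, and passing it to `zip()` fails immediately.

```python
# Hypothetical values mimicking a batch that requests no logprobs:
# extend_logprob_start_lens_cpu is None in that case.
extend_logprob_start_lens_cpu = None
extend_seq_lens_cpu = [5, 7]

error_name = None
try:
    # zip() requires iterables, so iterating over None raises TypeError.
    for start_len, seq_len in zip(extend_logprob_start_lens_cpu, extend_seq_lens_cpu):
        pass
except TypeError as exc:
    error_name = type(exc).__name__

print(error_name)  # TypeError
```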
Suggested change:

```diff
-for start_len, seq_len in zip(
-    forward_batch.extend_logprob_start_lens_cpu,
-    forward_batch.extend_seq_lens_cpu,
-):
-    if start_len is not None and start_len < seq_len:
-        return True
+if forward_batch.return_logprob:
+    for start_len, seq_len in zip(
+        forward_batch.extend_logprob_start_lens_cpu,
+        forward_batch.extend_seq_lens_cpu,
+    ):
+        if start_len is not None and start_len < seq_len:
+            return False
```
Hi @Oasis-Git, thank you for providing the piecewise_cuda_graph feature; it is very helpful. I have relaxed the usage conditions to make it applicable to more scenarios and models. Please take a look. Thank you.
```diff
     seq_lens_sum=forward_batch.seq_lens_sum,
     encoder_lens=forward_batch.encoder_lens,
-    return_logprob=forward_batch.return_logprob,
+    return_logprob=False,
```
@Oasis-Git is working on logprobs support.
Thanks, I see the plan. I'm only temporarily supporting returning logprobs for non-input tokens. Originally, the `can_run` function disallowed piecewise CUDA graphs whenever the batch requested logprobs. However, it seems only input tokens were unsupported, so I relaxed the requirement a bit to also support returning the logprob of the last token in prefill-only scenarios.

Same as line 223: it is all `False` here and can be changed to `True` later. I am planning to use it temporarily, so I set it to `False` first. Thanks.
@narutolhy Thanks for the contribution! We have a Slack channel, #piecewise-cuda-graph, to discuss this feature; you are welcome to join if you are interested.
Hi @narutolhy, thanks for your contribution. Could you please add the following to your PR:
OK, I will add them.
Hi @Oasis-Git, I have added a unit test for MMLU in https://github.com/sgl-project/sglang/blob/main/test/srt/test_piecewise_cuda_graph.py. There is no problem with the unsloth/gemma-3-1b-it model, but an error occurs with the unsloth/gemma-3-4b-it model: the value of the fourth pointer is inconsistent. I don't know whether you've encountered this before. Please review it.
I left some comments in the Slack channel.
This reverts commit 5934d4d.
Motivation
This PR refines the execution conditions in the CUDA graph runner and relaxes strict output shape checks to improve compatibility with models that have different tensor layouts.
Specifically, some models (e.g., Gemma-3-1B-IT) produce tensors with the same total number of elements but different shapes due to layout or internal reshape differences. The previous strict `assert output.shape == ret.shape` would incorrectly fail in such cases.
In addition, CUDA graph execution should be allowed when logprobs are returned for the last token only, but not when input logprobs are required.
Modifications
1. `piecewise_cuda_graph_runner.py`: updated the `can_run()` logic.
   - Previously, CUDA graph capture was skipped whenever `forward_batch.return_logprob` was `True`.
   - Now, it is skipped only when input-id logprobs are requested.
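The relaxed condition can be sketched as a standalone function (`can_use_cuda_graph` is a hypothetical name; the real `can_run()` in `piecewise_cuda_graph_runner.py` checks more than logprobs, but this mirrors the logprob gating described above):

```python
def can_use_cuda_graph(return_logprob, extend_logprob_start_lens_cpu, extend_seq_lens_cpu):
    """Allow CUDA graphs unless input-id logprobs are requested.

    A request needs input-token logprobs when its logprob start offset
    falls before the end of its extended sequence (start_len < seq_len);
    batches without logprobs, or with last-token-only logprobs, stay
    eligible for CUDA graph execution.
    """
    if not return_logprob:
        # No logprobs requested at all; start lens may even be None here.
        return True
    for start_len, seq_len in zip(extend_logprob_start_lens_cpu, extend_seq_lens_cpu):
        if start_len is not None and start_len < seq_len:
            return False  # input logprobs requested: skip CUDA graph
    return True

print(can_use_cuda_graph(False, None, None))       # True: no logprobs requested
print(can_use_cuda_graph(True, [5, 7], [5, 7]))    # True: no input-id logprobs
print(can_use_cuda_graph(True, [0, 7], [5, 7]))    # False: first request wants input logprobs
```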
2. `radix_attention.py`: replaced the strict shape check (`assert output.shape == ret.shape`) with a relaxed element-count check.
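The idea behind the relaxed check, in isolation (`shapes_compatible` is an illustrative helper, not the actual code in `radix_attention.py`, which compares `numel()` on the tensors themselves):

```python
import math

def shapes_compatible(a_shape, b_shape):
    """Relaxed check: same total element count; layout may differ."""
    return math.prod(a_shape) == math.prod(b_shape)

# Strict shape equality would reject (8, 1, 256) vs (8, 256) even though
# the data is identical up to a reshape; the element-count check accepts it.
print(shapes_compatible((8, 1, 256), (8, 256)))   # True
print(shapes_compatible((8, 1, 256), (8, 128)))   # False
```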
Accuracy Tests
Verified correctness on multiple models including:
Qwen3-0.6B
Gemma-3-1B-IT
Outputs match the baseline within FP16 tolerance.
Logprob return path remains functionally unchanged.
Checklist