
Piecewise CUDA Graph Support for gpt-oss model #13045

Merged
ispobock merged 4 commits into sgl-project:main from Oasis-Git:gpt-oss on Nov 15, 2025

@Oasis-Git Oasis-Git commented Nov 11, 2025

Motivation

Support piecewise CUDA graph for the gpt-oss model series.

Modifications

  • MoE backend selection: with piecewise CUDA graph enabled, the auto backend achieves performance similar to the triton backend.
  • Move enable_piecewise_cudagraph to avoid a circular import.
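
The circular-import item above can be illustrated with a generic sketch. This is not the PR's actual change: the module names `hypo_args`, `hypo_runner`, and `hypo_flags` are hypothetical, and the sketch assumes `enable_piecewise_cudagraph` is a helper that two modules both need. It only shows the general technique of moving a shared helper into a leaf module so that neither importer depends on the other for it.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical layout with a cycle: args imports runner, runner imports args.
CYCLIC = {
    "hypo_args.py": """
        from hypo_runner import Runner

        def enable_piecewise_cudagraph():
            return True
    """,
    "hypo_runner.py": """
        from hypo_args import enable_piecewise_cudagraph

        class Runner:
            use_piecewise = enable_piecewise_cudagraph()
    """,
}

# Same code with the helper moved to a leaf module that imports nothing.
FIXED = {
    "hypo_flags.py": """
        def enable_piecewise_cudagraph():
            return True
    """,
    "hypo_args.py": """
        from hypo_flags import enable_piecewise_cudagraph
        from hypo_runner import Runner
    """,
    "hypo_runner.py": """
        from hypo_flags import enable_piecewise_cudagraph

        class Runner:
            use_piecewise = enable_piecewise_cudagraph()
    """,
}


def try_import(files, module="hypo_args"):
    """Write the modules to a temp dir, import one, and report the outcome."""
    with tempfile.TemporaryDirectory() as tmp:
        for name, src in files.items():
            (Path(tmp) / name).write_text(textwrap.dedent(src))
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            importlib.import_module(module)
            return "ok"
        except ImportError:
            return "ImportError"
        finally:
            # Leave the interpreter clean so both layouts can be tried in turn.
            sys.path.remove(tmp)
            for name in list(sys.modules):
                if name.startswith("hypo_"):
                    del sys.modules[name]


print("cyclic layout:", try_import(CYCLIC))
print("helper moved: ", try_import(FIXED))
```

In the cyclic layout, `hypo_runner` asks for the helper while `hypo_args` is still only partially initialized, so the import fails. Once the helper lives in a module with no imports of its own, the import order no longer matters.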

Accuracy Tests

See the Benchmarking and Profiling section.

Benchmarking and Profiling

For the GSM8K test:

  • Piecewise CUDA graph with auto backend
    Accuracy: 0.525
    Invalid: 0.158
    Latency: 34.185 s
    Output throughput: 16977.166 token/s
  • triton backend
    Accuracy: 0.522
    Invalid: 0.149
    Latency: 34.580 s
    Output throughput: 17012.391 token/s
  • auto backend only
    Accuracy: 0.512
    Invalid: 0.157
    Latency: 250.572 s
    Output throughput: 2301.222 token/s
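
To make the comparison explicit, the ratios between the three runs can be computed directly from the figures reported in this PR. The short script below only does that arithmetic; the dictionary keys and the `ratio` helper are names chosen for illustration.

```python
# Latency and throughput figures copied from the GSM8K results in this PR.
results = {
    "piecewise + auto": {"latency_s": 34.185, "throughput_tok_s": 16977.166},
    "triton": {"latency_s": 34.580, "throughput_tok_s": 17012.391},
    "auto only": {"latency_s": 250.572, "throughput_tok_s": 2301.222},
}


def ratio(metric, numerator, denominator):
    """Return results[numerator][metric] / results[denominator][metric]."""
    return results[numerator][metric] / results[denominator][metric]


# How much faster the auto backend gets once piecewise CUDA graph is enabled.
latency_speedup = ratio("latency_s", "auto only", "piecewise + auto")
throughput_gain = ratio("throughput_tok_s", "piecewise + auto", "auto only")

# How close piecewise + auto is to the triton backend (1.0 means identical).
latency_vs_triton = ratio("latency_s", "piecewise + auto", "triton")
throughput_vs_triton = ratio("throughput_tok_s", "piecewise + auto", "triton")

print(f"latency speedup over auto only:    {latency_speedup:.2f}x")
print(f"throughput gain over auto only:    {throughput_gain:.2f}x")
print(f"latency relative to triton:        {latency_vs_triton:.3f}")
print(f"throughput relative to triton:     {throughput_vs_triton:.3f}")
```

On these numbers, enabling piecewise CUDA graph makes the auto backend roughly 7.3x faster than auto alone, and brings it to within about 1% of the triton backend on both latency and throughput.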

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

if self.moe_runner_backend == "auto":
    if is_blackwell_supported() and is_mxfp4_quant_format:
        if self.enable_piecewise_cuda_graph:
            self.moe_runner_backend = "auto"

Repeating code?

@BBuf BBuf left a comment

LGTM.


if self.moe_runner_backend == "auto":
    if is_blackwell_supported() and is_mxfp4_quant_format:
        if self.enable_piecewise_cuda_graph:

If we enable both piecewise CUDA graph and flashinfer_mxfp4, which moe_runner_backend will be used?

I see. There is a potential problem here. Will fix it soon.
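
One way to read the question above: the excerpt only acts when `moe_runner_backend` is `"auto"`, so an explicitly requested `flashinfer_mxfp4` would pass through this branch unchanged even with piecewise CUDA graph enabled. The sketch below mirrors that guard structure in a standalone function. `HypotheticalArgs` and `resolve_moe_backend` are hypothetical names, and the two fallbacks marked ASSUMPTION are not shown in the excerpt. How the merged code finally handles the combination is not shown in this thread.

```python
from dataclasses import dataclass


@dataclass
class HypotheticalArgs:
    """Stand-in for the server arguments; not the project's real class."""

    moe_runner_backend: str = "auto"
    enable_piecewise_cuda_graph: bool = False


def resolve_moe_backend(args: HypotheticalArgs, blackwell: bool, mxfp4: bool) -> str:
    """Mirror the guard structure of the reviewed excerpt."""
    if args.moe_runner_backend != "auto":
        # An explicit choice never enters the "auto" branch, so it is
        # returned as-is regardless of the piecewise CUDA graph flag.
        return args.moe_runner_backend
    if blackwell and mxfp4:
        if args.enable_piecewise_cuda_graph:
            # From the excerpt: the backend stays "auto" in this case.
            return "auto"
        # ASSUMPTION: the non-piecewise choice is not visible in the excerpt.
        return "flashinfer_mxfp4"
    # ASSUMPTION: other hardware or formats are outside the excerpt.
    return "auto"


# The combination the reviewer asked about.
both = HypotheticalArgs(moe_runner_backend="flashinfer_mxfp4",
                        enable_piecewise_cuda_graph=True)
print(resolve_moe_backend(both, blackwell=True, mxfp4=True))

# The case the excerpt does handle.
auto_pw = HypotheticalArgs(enable_piecewise_cuda_graph=True)
print(resolve_moe_backend(auto_pw, blackwell=True, mxfp4=True))
```

Under this reading, the first call returns the explicit backend untouched, which is the unhandled combination the reviewer flagged. A fix would need to either reject that combination or override it before this branch.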

@ispobock ispobock merged commit eae59b3 into sgl-project:main Nov 15, 2025
51 of 61 checks passed
@Oasis-Git Oasis-Git deleted the gpt-oss branch November 22, 2025 06:55