Piecewise CUDA Graph set default #16331
Conversation
Summary of Changes: Hello @Oasis-Git, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly refactors the management and default behavior of the Piecewise CUDA Graph feature. The primary goal is to make Piecewise CUDA Graphs enabled by default, shifting the user interaction from an opt-in to an opt-out model.
Code Review
This pull request refactors the configuration for Piecewise CUDA Graph, enabling it by default and providing a --disable-piecewise-cuda-graph flag. The changes are mostly consistent and improve the codebase by centralizing configuration logic. However, I've identified a potential issue where some safety checks for feature compatibility were dropped during the refactoring. My review includes a suggestion to restore these checks to prevent potential runtime errors.
def _handle_piecewise_cuda_graph(self):
    # Disable piecewise cuda graph under the following conditions:
    # 1. Speculative decoding
    if self.speculative_algorithm is not None:
        self.disable_piecewise_cuda_graph = True
    # 2. DP attention
    if self.enable_dp_attention:
        self.disable_piecewise_cuda_graph = True
    # TODO: Add more conditions to disable piecewise cuda graph
It seems that some checks from the old can_run_piecewise_cuda_graph method in model_runner.py were missed when refactoring the logic into this new _handle_piecewise_cuda_graph method. The old method included checks for enable_torch_compile, pp_size > 1, and specific MoE A2A backends, which appear to be important for preventing runtime issues with unsupported feature combinations.
I recommend re-introducing these checks and adding logging to inform users when Piecewise CUDA Graph is disabled, similar to the old implementation. This will improve robustness and user experience.
def _handle_piecewise_cuda_graph(self):
    # Disable piecewise cuda graph under the following conditions:
    # 1. Speculative decoding
    if self.speculative_algorithm is not None:
        self.disable_piecewise_cuda_graph = True
        log_info_on_rank0(logger, "Disable piecewise CUDA graph because it is not compatible with speculative decoding.")
    # 2. DP attention
    if self.enable_dp_attention:
        self.disable_piecewise_cuda_graph = True
        log_info_on_rank0(logger, "Disable piecewise CUDA graph because it is not compatible with DP attention.")
    # 3. torch.compile
    if self.enable_torch_compile:
        self.disable_piecewise_cuda_graph = True
        log_info_on_rank0(logger, "Disable piecewise CUDA graph because it has a conflict with torch.compile.")
    # 4. Pipeline Parallelism
    if self.pp_size > 1:
        self.disable_piecewise_cuda_graph = True
        log_info_on_rank0(logger, "Disable piecewise CUDA graph because it does not support Pipeline Parallelism.")
    # 5. MoE A2A backends
    if self.moe_a2a_backend in ["deepep", "mooncake"]:
        self.disable_piecewise_cuda_graph = True
        log_info_on_rank0(logger, "Disable piecewise CUDA graph due to existing compilation errors with MoE A2A backends.")

Signed-off-by: Yimeng-mesh-1 <yimengteng@link.cuhk.edu.cn>
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
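Since the PR flips the default, the CLI surface changes from an opt-in switch to an opt-out one. As a minimal sketch (not SGLang's actual argument parser), an opt-out flag like `--disable-piecewise-cuda-graph` is typically wired with `store_true`, so the feature stays on unless the user explicitly disables it:

```python
# Hypothetical sketch of wiring an opt-out flag; SGLang's real server-args
# parser has many more options and its own defaults machinery.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # store_true defaults to False, so Piecewise CUDA Graph remains
    # enabled unless the flag is passed on the command line.
    parser.add_argument(
        "--disable-piecewise-cuda-graph",
        action="store_true",
        help="Opt out of the default-enabled Piecewise CUDA Graph.",
    )
    return parser

defaults = build_parser().parse_args([])
opted_out = build_parser().parse_args(["--disable-piecewise-cuda-graph"])
```

Here `defaults.disable_piecewise_cuda_graph` is `False` (feature enabled by default) and `opted_out.disable_piecewise_cuda_graph` is `True`, matching the opt-out behavior described in the review summary.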
`enable_hierarchical_cache` was incorrectly grouped with `cpu_offload_gb` in `_handle_piecewise_cuda_graph()` condition sgl-project#12 (introduced by sgl-project#16331), causing piecewise CUDA graph (PCG) to be disabled when hierarchical cache is enabled. This breaks FP8 KV cache with the flashinfer attention backend, producing completely garbled output: FP8 flashinfer decode relies on PCG's direct model.forward() execution to receive fresh dispatch plans on each decode step, whereas with regular CUDA graph replay the FP8 kernel's execution plan is frozen at capture time and becomes stale during replay.

Unlike `cpu_offload_gb`, which moves model weights to CPU during forward (changing GPU tensor addresses and breaking CUDA graph replay), hierarchical cache only performs KV cache eviction/restore at the scheduler level between forward passes. It does not affect tensor addresses or CUDA graph recording/replay in any way.

Verified on MiniMax-M2.5 (TP4, H20, flashinfer + fp8_e4m3 + hicache):
- Before fix: garbled output (PCG incorrectly disabled)
- After fix: correct output, stable for 9+ hours

Three independent ways of disabling PCG (--enable-hierarchical-cache, --disable-piecewise-cuda-graph, and --enable-dp-attention) all produce garbled FP8 output, confirming the root cause is FP8's dependency on PCG rather than any hicache-specific interaction. BF16 KV cache is unaffected because BF16 decode kernels do not depend on PCG's segmented execution mechanism.
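The fix described above amounts to splitting the grouped condition so that only `cpu_offload_gb` disables PCG. A minimal sketch of the corrected logic (hypothetical names mirroring the commit message, not the actual SGLang `ServerArgs` class):

```python
# Illustrative sketch of the corrected condition grouping. The real
# _handle_piecewise_cuda_graph() lives on SGLang's server-args object
# and contains many more conditions.
from dataclasses import dataclass

@dataclass
class ServerArgs:
    cpu_offload_gb: int = 0
    enable_hierarchical_cache: bool = False
    disable_piecewise_cuda_graph: bool = False

def handle_piecewise_cuda_graph(args: ServerArgs) -> None:
    # cpu_offload_gb moves weights to CPU during forward, which changes
    # GPU tensor addresses and breaks CUDA graph replay -> disable PCG.
    if args.cpu_offload_gb > 0:
        args.disable_piecewise_cuda_graph = True
    # Hierarchical cache only evicts/restores KV at the scheduler level
    # between forward passes, so it deliberately does NOT disable PCG
    # (grouping it with cpu_offload_gb was the bug).

hicache = ServerArgs(enable_hierarchical_cache=True)
handle_piecewise_cuda_graph(hicache)

offload = ServerArgs(cpu_offload_gb=8)
handle_piecewise_cuda_graph(offload)
```

After the fix, `hicache.disable_piecewise_cuda_graph` stays `False` (PCG remains on, keeping FP8 flashinfer decode correct), while `offload.disable_piecewise_cuda_graph` becomes `True`.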
PCG and speculative decoding operate on independent forward paths:
- PCG: prefill/extend (ForwardMode.EXTEND), captures with spec_info=None
- Speculative: draft/verify uses decode CUDA graphs or eager execution

Key safety guard: PCG's can_run() now explicitly rejects TARGET_VERIFY mode, since PCG graphs are captured with EXTEND/spec_info=None and must not be replayed for verify batches that have different spec_info and capture_hidden_mode.

Previously all speculative algorithms disabled PCG (sgl-project#16331 added this as a conservative safety measure when PCG became default-enabled). The original PCG implementation (sgl-project#10062) had no speculative restriction.

Accuracy verification (GSM8K, 50 questions per config):
- EAGLE/NEXTN: PCG=0.970, acceptance=3.44
- EAGLE3: PCG=1.000, acceptance=4.23 (verified on TP2)
- STANDALONE: PCG=0.840, acceptance=3.24

Benchmark (Qwen3.5-35B-A3B FP8, TP2, H100, extra_buffer + PCG + NEXTN):
- ITL: 22ms → 2.89ms (-87%) from MTP + overlap schedule
- TTFT (rate=5): 253ms → 147ms (-42%) from PCG prefill acceleration

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
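The TARGET_VERIFY guard in the commit message can be sketched as a small predicate. This is an illustrative reconstruction, not SGLang's actual `can_run()` implementation; the `ForwardMode` values mirror those named above:

```python
# Hypothetical sketch of the PCG can_run() safety guard described in the
# commit message. Real PCG dispatch considers batch shape, capture sizes,
# and capture_hidden_mode as well.
from enum import Enum, auto

class ForwardMode(Enum):
    EXTEND = auto()          # prefill/extend: the mode PCG captures
    DECODE = auto()          # regular decode: served by decode CUDA graphs
    TARGET_VERIFY = auto()   # speculative verify: must never replay PCG

def pcg_can_run(forward_mode: ForwardMode, spec_info) -> bool:
    # PCG graphs are captured for EXTEND with spec_info=None. Replaying
    # them for verify batches (different spec_info / capture_hidden_mode)
    # would be unsafe, so TARGET_VERIFY is rejected explicitly.
    if forward_mode is ForwardMode.TARGET_VERIFY:
        return False
    return forward_mode is ForwardMode.EXTEND and spec_info is None
```

With this guard, PCG serves only plain extend batches (`pcg_can_run(ForwardMode.EXTEND, None)` is true), while verify batches fall through to the decode CUDA graph or eager path, which is what lets speculative decoding and PCG coexist.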
Motivation
Work in progress
Modifications
Work in progress
Accuracy Tests
Work in progress
Benchmarking and Profiling
Work in progress
Checklist
Review Process
Trigger CI with the commands /tag-run-ci-label, /rerun-failed-ci, or /tag-and-rerun-ci, or contact authorized users to do so.