Piecewise CUDA Graph set default #16331
Conversation
Summary of Changes: Hello @Oasis-Git, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly refactors the management and default behavior of the Piecewise CUDA Graph feature. The primary goal is to make Piecewise CUDA Graphs enabled by default, shifting the user interaction from an opt-in to an opt-out model.
Code Review
This pull request refactors the configuration for Piecewise CUDA Graph, enabling it by default and providing a --disable-piecewise-cuda-graph flag. The changes are mostly consistent and improve the codebase by centralizing configuration logic. However, I've identified a potential issue where some safety checks for feature compatibility were dropped during the refactoring. My review includes a suggestion to restore these checks to prevent potential runtime errors.
def _handle_piecewise_cuda_graph(self):
    # Disable piecewise cuda graph under the following conditions:
    # 1. Speculative decoding
    if self.speculative_algorithm is not None:
        self.disable_piecewise_cuda_graph = True
    # 2. DP attention
    if self.enable_dp_attention:
        self.disable_piecewise_cuda_graph = True
    # TODO: Add more conditions to disable piecewise cuda graph
It seems that some checks from the old can_run_piecewise_cuda_graph method in model_runner.py were missed when refactoring the logic into this new _handle_piecewise_cuda_graph method. The old method included checks for enable_torch_compile, pp_size > 1, and specific MoE A2A backends, which appear to be important for preventing runtime issues with unsupported feature combinations.
I recommend re-introducing these checks and adding logging to inform users when Piecewise CUDA Graph is disabled, similar to the old implementation. This will improve robustness and user experience.
def _handle_piecewise_cuda_graph(self):
    # Disable piecewise cuda graph under the following conditions:
    # 1. Speculative decoding
    if self.speculative_algorithm is not None:
        self.disable_piecewise_cuda_graph = True
        log_info_on_rank0(logger, "Disable piecewise CUDA graph because it is not compatible with speculative decoding.")
    # 2. DP attention
    if self.enable_dp_attention:
        self.disable_piecewise_cuda_graph = True
        log_info_on_rank0(logger, "Disable piecewise CUDA graph because it is not compatible with DP attention.")
    # 3. torch.compile
    if self.enable_torch_compile:
        self.disable_piecewise_cuda_graph = True
        log_info_on_rank0(logger, "Disable piecewise CUDA graph because it has a conflict with torch.compile.")
    # 4. Pipeline Parallelism
    if self.pp_size > 1:
        self.disable_piecewise_cuda_graph = True
        log_info_on_rank0(logger, "Disable piecewise CUDA graph because it does not support Pipeline Parallelism.")
    # 5. MoE A2A backends
    if self.moe_a2a_backend in ["deepep", "mooncake"]:
        self.disable_piecewise_cuda_graph = True
        log_info_on_rank0(logger, "Disable piecewise CUDA graph due to existing compilation errors with MoE A2A backends.")

Signed-off-by: Yimeng-mesh-1 <yimengteng@link.cuhk.edu.cn>
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
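Since the PR flips the default, the CLI surface changes from an opt-in switch to an opt-out one. As a minimal sketch (not SGLang's actual argument parser), an opt-out flag like `--disable-piecewise-cuda-graph` is typically wired with `store_true`, so the feature stays on unless the user explicitly disables it:

```python
# Hypothetical sketch of wiring an opt-out flag; SGLang's real server-args
# parser has many more options and its own defaults machinery.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # store_true defaults to False, so Piecewise CUDA Graph remains
    # enabled unless the flag is passed on the command line.
    parser.add_argument(
        "--disable-piecewise-cuda-graph",
        action="store_true",
        help="Opt out of the default-enabled Piecewise CUDA Graph.",
    )
    return parser

defaults = build_parser().parse_args([])
opted_out = build_parser().parse_args(["--disable-piecewise-cuda-graph"])
```

Here `defaults.disable_piecewise_cuda_graph` is `False` (feature enabled by default) and `opted_out.disable_piecewise_cuda_graph` is `True`, matching the opt-out behavior described in the review summary.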
`enable_hierarchical_cache` was incorrectly grouped with `cpu_offload_gb` in `_handle_piecewise_cuda_graph()` condition sgl-project#12 (introduced by sgl-project#16331), causing piecewise CUDA graph (PCG) to be disabled when hierarchical cache is enabled. This breaks FP8 KV cache with the flashinfer attention backend, producing completely garbled output: FP8 flashinfer decode relies on PCG's direct model.forward() execution to receive fresh dispatch plans on each decode step, whereas with regular CUDA graph replay the FP8 kernel's execution plan is frozen at capture time and becomes stale during replay.

Unlike `cpu_offload_gb`, which moves model weights to CPU during forward (changing GPU tensor addresses and breaking CUDA graph replay), hierarchical cache only performs KV cache eviction/restore at the scheduler level between forward passes. It does not affect tensor addresses or CUDA graph recording/replay in any way.

Verified on MiniMax-M2.5 (TP4, H20, flashinfer + fp8_e4m3 + hicache):
- Before fix: garbled output (PCG incorrectly disabled)
- After fix: correct output, stable for 9+ hours

Three independent ways of disabling PCG (--enable-hierarchical-cache, --disable-piecewise-cuda-graph, and --enable-dp-attention) all produce garbled FP8 output, confirming the root cause is FP8's dependency on PCG rather than any hicache-specific interaction. BF16 KV cache is unaffected because BF16 decode kernels do not depend on PCG's segmented execution mechanism.
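The fix described above amounts to splitting the grouped condition so that only `cpu_offload_gb` disables PCG. A minimal sketch of the corrected logic (hypothetical names mirroring the commit message, not the actual SGLang `ServerArgs` class):

```python
# Illustrative sketch of the corrected condition grouping. The real
# _handle_piecewise_cuda_graph() lives on SGLang's server-args object
# and contains many more conditions.
from dataclasses import dataclass

@dataclass
class ServerArgs:
    cpu_offload_gb: int = 0
    enable_hierarchical_cache: bool = False
    disable_piecewise_cuda_graph: bool = False

def handle_piecewise_cuda_graph(args: ServerArgs) -> None:
    # cpu_offload_gb moves weights to CPU during forward, which changes
    # GPU tensor addresses and breaks CUDA graph replay -> disable PCG.
    if args.cpu_offload_gb > 0:
        args.disable_piecewise_cuda_graph = True
    # Hierarchical cache only evicts/restores KV at the scheduler level
    # between forward passes, so it deliberately does NOT disable PCG
    # (grouping it with cpu_offload_gb was the bug).

hicache = ServerArgs(enable_hierarchical_cache=True)
handle_piecewise_cuda_graph(hicache)

offload = ServerArgs(cpu_offload_gb=8)
handle_piecewise_cuda_graph(offload)
```

After the fix, `hicache.disable_piecewise_cuda_graph` stays `False` (PCG remains on, keeping FP8 flashinfer decode correct), while `offload.disable_piecewise_cuda_graph` becomes `True`.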
PCG and speculative decoding operate on independent forward paths:
- PCG: prefill/extend (ForwardMode.EXTEND), captures with spec_info=None
- Speculative: draft/verify uses decode CUDA graphs or eager execution

Key safety guard: PCG's can_run() now explicitly rejects TARGET_VERIFY mode, since PCG graphs are captured with EXTEND/spec_info=None and must not be replayed for verify batches that have different spec_info and capture_hidden_mode.

Previously all speculative algorithms disabled PCG (sgl-project#16331 added this as a conservative safety measure when PCG became default-enabled). The original PCG implementation (sgl-project#10062) had no speculative restriction.

Accuracy verification (GSM8K, 50 questions per config):
- EAGLE/NEXTN: PCG=0.970, acceptance=3.44
- EAGLE3: PCG=1.000, acceptance=4.23 (verified on TP2)
- STANDALONE: PCG=0.840, acceptance=3.24

Benchmark (Qwen3.5-35B-A3B FP8, TP2, H100, extra_buffer + PCG + NEXTN):
- ITL: 22ms → 2.89ms (-87%) from MTP + overlap schedule
- TTFT (rate=5): 253ms → 147ms (-42%) from PCG prefill acceleration

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
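The TARGET_VERIFY guard in the commit message can be sketched as a small predicate. This is an illustrative reconstruction, not SGLang's actual `can_run()` implementation; the `ForwardMode` values mirror those named above:

```python
# Hypothetical sketch of the PCG can_run() safety guard described in the
# commit message. Real PCG dispatch considers batch shape, capture sizes,
# and capture_hidden_mode as well.
from enum import Enum, auto

class ForwardMode(Enum):
    EXTEND = auto()          # prefill/extend: the mode PCG captures
    DECODE = auto()          # regular decode: served by decode CUDA graphs
    TARGET_VERIFY = auto()   # speculative verify: must never replay PCG

def pcg_can_run(forward_mode: ForwardMode, spec_info) -> bool:
    # PCG graphs are captured for EXTEND with spec_info=None. Replaying
    # them for verify batches (different spec_info / capture_hidden_mode)
    # would be unsafe, so TARGET_VERIFY is rejected explicitly.
    if forward_mode is ForwardMode.TARGET_VERIFY:
        return False
    return forward_mode is ForwardMode.EXTEND and spec_info is None
```

With this guard, PCG serves only plain extend batches (`pcg_can_run(ForwardMode.EXTEND, None)` is true), while verify batches fall through to the decode CUDA graph or eager path, which is what lets speculative decoding and PCG coexist.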
Motivation
Work in progress
Modifications
Work in progress
Accuracy Tests
Work in progress
Benchmarking and Profiling
Work in progress
Checklist
Review Process
Trigger CI with the commands /tag-run-ci-label, /rerun-failed-ci, or /tag-and-rerun-ci, or contact authorized users to do so.