Piecewise Cuda Graph set default #16331

Merged
ispobock merged 86 commits into sgl-project:main from Oasis-Git:default
Mar 2, 2026

Conversation

@Oasis-Git
Collaborator

Motivation

Work in progress

Modifications

Work in progress

Accuracy Tests

Work in progress

Benchmarking and Profiling

Work in progress

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments (/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci) or contact authorized users to do so.
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist
Contributor

Summary of Changes

Hello @Oasis-Git, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the management and default behavior of the Piecewise CUDA Graph feature. The primary goal is to make Piecewise CUDA Graphs enabled by default, shifting the user interaction from an opt-in --enable flag to an opt-out --disable flag. It also consolidates the conditions under which this optimization is automatically turned off due to conflicts with other features, leading to a more robust and user-friendly configuration experience while simplifying the underlying code structure.

Highlights

  • Default Piecewise CUDA Graph Behavior: The Piecewise CUDA Graph feature is now enabled by default, simplifying its activation for users. Previously, it required explicit enablement.
  • Configuration Flag Inversion: The command-line argument and internal server argument have been changed from --enable-piecewise-cuda-graph to --disable-piecewise-cuda-graph, reflecting the new opt-out approach.
  • Centralized Disablement Logic: Logic for automatically disabling Piecewise CUDA Graph under incompatible conditions (e.g., when speculative decoding or distributed attention are enabled) has been centralized within the ServerArgs initialization, removing redundant checks from other parts of the codebase.
  • Code Simplification: The set_compiled context manager has been removed, and the ForwardContext initialization has been refactored for cleaner and more direct parameter passing, streamlining the compilation and context management code.
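
The flag inversion in the highlights above can be sketched with `argparse`. This is an illustrative sketch only: the real sglang `ServerArgs` definition differs in detail, and only the flag names come from this PR.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Hypothetical sketch of the opt-in -> opt-out flag change."""
    parser = argparse.ArgumentParser()
    # Before this PR: opt-in (feature off unless requested)
    # parser.add_argument("--enable-piecewise-cuda-graph", action="store_true")
    # After this PR: opt-out (feature on unless explicitly disabled)
    parser.add_argument(
        "--disable-piecewise-cuda-graph",
        action="store_true",
        help="Turn off the piecewise CUDA graph optimization (on by default).",
    )
    return parser

args = build_parser().parse_args([])
print(args.disable_piecewise_cuda_graph)  # False: feature enabled by default
```

With an opt-out flag, omitting it leaves the feature on, so existing launch scripts pick up the optimization without changes.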


Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request refactors the configuration for Piecewise CUDA Graph, enabling it by default and providing a --disable-piecewise-cuda-graph flag. The changes are mostly consistent and improve the codebase by centralizing configuration logic. However, I've identified a potential issue where some safety checks for feature compatibility were dropped during the refactoring. My review includes a suggestion to restore these checks to prevent potential runtime errors.

Comment on lines +805 to +816
def _handle_piecewise_cuda_graph(self):
    # Disable piecewise cuda graph with following conditions:
    # 1. Speculative decoding
    if self.speculative_algorithm is not None:
        self.disable_piecewise_cuda_graph = True
    # 2. DP attention
    if self.enable_dp_attention:
        self.disable_piecewise_cuda_graph = True

    # TODO: Add more conditions to disable piecewise cuda graph




Severity: high

It seems that some checks from the old can_run_piecewise_cuda_graph method in model_runner.py were missed when refactoring the logic into this new _handle_piecewise_cuda_graph method. The old method included checks for enable_torch_compile, pp_size > 1, and specific MoE A2A backends, which appear to be important for preventing runtime issues with unsupported feature combinations.

I recommend re-introducing these checks and adding logging to inform users when Piecewise CUDA Graph is disabled, similar to the old implementation. This will improve robustness and user experience.

    def _handle_piecewise_cuda_graph(self):
        # Disable piecewise cuda graph with following conditions:
        # 1. Speculative decoding
        if self.speculative_algorithm is not None:
            self.disable_piecewise_cuda_graph = True
            log_info_on_rank0(logger, "Disable piecewise CUDA graph because it is not compatible with speculative decoding.")
        # 2. DP attention
        if self.enable_dp_attention:
            self.disable_piecewise_cuda_graph = True
            log_info_on_rank0(logger, "Disable piecewise CUDA graph because it is not compatible with DP attention.")
        # 3. torch.compile
        if self.enable_torch_compile:
            self.disable_piecewise_cuda_graph = True
            log_info_on_rank0(logger, "Disable piecewise CUDA graph because it has a conflict with torch.compile.")
        # 4. Pipeline Parallelism
        if self.pp_size > 1:
            self.disable_piecewise_cuda_graph = True
            log_info_on_rank0(logger, "Disable piecewise CUDA graph because it does not support Pipeline Parallelism.")
        # 5. MoE A2A backends
        if self.moe_a2a_backend in ["deepep", "mooncake"]:
            self.disable_piecewise_cuda_graph = True
            log_info_on_rank0(logger, "Disable piecewise CUDA graph due to existing compilation errors with MoE A2A backends.")
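
To make the centralized disablement logic above easy to reason about outside the full server, here is a self-contained harness using a stand-in dataclass. The field names mirror the review discussion, but the class itself is a hypothetical stand-in, not the real `ServerArgs`.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FakeServerArgs:
    """Minimal stand-in for ServerArgs to exercise the disablement checks."""
    speculative_algorithm: Optional[str] = None
    enable_dp_attention: bool = False
    enable_torch_compile: bool = False
    pp_size: int = 1
    moe_a2a_backend: Optional[str] = None
    disable_piecewise_cuda_graph: bool = False

    def handle_piecewise_cuda_graph(self) -> None:
        # Any incompatible feature flips the opt-out flag to True.
        if (
            self.speculative_algorithm is not None
            or self.enable_dp_attention
            or self.enable_torch_compile
            or self.pp_size > 1
            or self.moe_a2a_backend in ("deepep", "mooncake")
        ):
            self.disable_piecewise_cuda_graph = True

args = FakeServerArgs(enable_torch_compile=True)
args.handle_piecewise_cuda_graph()
print(args.disable_piecewise_cuda_graph)  # True
```

Centralizing the checks in one method means every code path that reads `disable_piecewise_cuda_graph` afterwards sees a single, consistent decision.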

Signed-off-by: Yimeng-mesh-1 <yimengteng@link.cuhk.edu.cn>
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Signed-off-by: Samuel Shen <slshen@uchicago.edu>
magicYang1573 pushed a commit to magicYang1573/sglang that referenced this pull request Mar 9, 2026
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
DarkraiHL pushed a commit to DarkraiHL/sglang that referenced this pull request Mar 22, 2026
`enable_hierarchical_cache` was incorrectly grouped with `cpu_offload_gb`
in `_handle_piecewise_cuda_graph()` condition sgl-project#12 (introduced by sgl-project#16331),
causing piecewise CUDA graph (PCG) to be disabled when hierarchical cache
is enabled.

This breaks FP8 KV cache with flashinfer attention backend, producing
completely garbled output. FP8 flashinfer decode relies on PCG's direct
model.forward() execution to receive fresh dispatch plans on each decode
step. With regular CUDA graph replay, the FP8 kernel's execution plan is
frozen at capture time and becomes stale during replay.

Unlike `cpu_offload_gb` which moves model weights to CPU during forward
(changing GPU tensor addresses and breaking CUDA graph replay),
hierarchical cache only performs KV cache eviction/restore at the
scheduler level between forward passes. It does not affect tensor
addresses or CUDA graph recording/replay in any way.

Verified on MiniMax-M2.5 (TP4, H20, flashinfer + fp8_e4m3 + hicache):
- Before fix: garbled output (PCG incorrectly disabled)
- After fix: correct output, stable for 9+ hours

Three independent methods of disabling PCG all produce garbled FP8 output:
--enable-hierarchical-cache, --disable-piecewise-cuda-graph, and
--enable-dp-attention, confirming the root cause is FP8's dependency on
PCG rather than any hicache-specific interaction.

BF16 KV cache is unaffected because BF16 decode kernels do not depend on
PCG's segmented execution mechanism.
DarkraiHL pushed a commit to DarkraiHL/sglang that referenced this pull request Mar 22, 2026
Hierarchical cache was incorrectly grouped with cpu_offload in
_handle_piecewise_cuda_graph() condition sgl-project#12 (introduced by sgl-project#16331).
This causes FP8 flashinfer decode to produce garbled output when
hicache is enabled, because FP8 decode depends on PCG.

Unlike cpu_offload (which moves model weights during forward and
breaks CUDA graph replay), hicache only does KV eviction/restore
at the scheduler level between forward passes and does not affect
CUDA graph recording or replay.
DarkraiHL added a commit to DarkraiHL/sglang that referenced this pull request Mar 22, 2026
narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Apr 5, 2026
PCG and speculative decoding operate on independent forward paths:
- PCG: prefill/extend (ForwardMode.EXTEND), captures with spec_info=None
- Speculative: draft/verify (ForwardMode.TARGET_VERIFY) via decode CUDA graphs

Previously all speculative algorithms disabled PCG (sgl-project#16331 added this
as a conservative safety measure when PCG became default-enabled).
The original PCG implementation (sgl-project#10062) had no speculative restriction.

Benchmark (Qwen3.5-35B-A3B FP8, TP2, H100, extra_buffer + PCG + NEXTN):
- ITL: 22ms → 2.89ms (-87%) from MTP + overlap schedule
- TTFT (rate=5): 253ms → 147ms (-42%) from PCG prefill acceleration
- MTP acceptance: 3.29 tokens/step

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Apr 5, 2026
PCG and speculative decoding operate on independent forward paths:
- PCG: prefill/extend (ForwardMode.EXTEND), captures with spec_info=None
- Speculative: draft/verify (ForwardMode.TARGET_VERIFY) via decode CUDA graphs

Previously all speculative algorithms disabled PCG (sgl-project#16331 added this
as a conservative safety measure when PCG became default-enabled).
The original PCG implementation (sgl-project#10062) had no speculative restriction.

Accuracy (Qwen3.5-35B-A3B FP8, TP2, H100, GSM8K 50q, --reasoning-parser qwen3):
- Baseline (extra_buffer): 0.980
- MTP alone: 0.980, acceptance=3.44
- PCG + MTP: 0.980, acceptance=3.46

Benchmark (extra_buffer + PCG + NEXTN):
- ITL: 22ms → 2.89ms (-87%)
- TTFT (rate=5): 253ms → 147ms (-42%) from PCG prefill acceleration

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Apr 5, 2026
PCG and speculative decoding operate on independent forward paths:
- PCG: prefill/extend (ForwardMode.EXTEND), captures with spec_info=None
- Speculative: draft/verify uses decode CUDA graphs or eager execution

Key safety guard: PCG's can_run() now explicitly rejects TARGET_VERIFY
mode, since PCG graphs are captured with EXTEND/spec_info=None and
must not be replayed for verify batches that have different spec_info
and capture_hidden_mode.

Previously all speculative algorithms disabled PCG (sgl-project#16331 added this
as a conservative safety measure when PCG became default-enabled).
The original PCG implementation (sgl-project#10062) had no speculative restriction.

Accuracy verification (GSM8K, 50 questions per config):
  EAGLE/NEXTN: PCG=0.980 vs no-PCG=0.980 (identical)
  EAGLE3:      PCG=1.000, acceptance=4.23
  STANDALONE:  PCG=0.380 vs no-PCG=0.400 (within noise, 1q/50)

Benchmark (Qwen3.5-35B-A3B FP8, TP2, H100, extra_buffer + PCG + NEXTN):
- ITL: 22ms → 2.89ms (-87%) from MTP + overlap schedule
- TTFT (rate=5): 253ms → 147ms (-42%) from PCG prefill acceleration

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
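
The `can_run()` safety guard described in the commit above can be illustrated as follows. The `ForwardMode` values come from the commit message, but this is a simplified stand-in for sglang's actual types and predicate.

```python
from enum import Enum, auto

class ForwardMode(Enum):
    """Simplified stand-in for sglang's forward modes."""
    EXTEND = auto()
    DECODE = auto()
    TARGET_VERIFY = auto()

def pcg_can_run(forward_mode: ForwardMode, spec_info=None) -> bool:
    # PCG graphs are captured for EXTEND with spec_info=None, so they must
    # never be replayed for speculative verify batches, whose spec_info and
    # capture_hidden_mode differ from what was captured.
    if forward_mode is ForwardMode.TARGET_VERIFY:
        return False
    return forward_mode is ForwardMode.EXTEND and spec_info is None

print(pcg_can_run(ForwardMode.EXTEND))         # True
print(pcg_can_run(ForwardMode.TARGET_VERIFY))  # False
```

With this guard in place, PCG and speculative decoding can coexist because each only serves the forward path it was built for.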
narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Apr 5, 2026
PCG and speculative decoding operate on independent forward paths:
- PCG: prefill/extend (ForwardMode.EXTEND), captures with spec_info=None
- Speculative: draft/verify uses decode CUDA graphs or eager execution

Key safety guard: PCG's can_run() now explicitly rejects TARGET_VERIFY
mode, since PCG graphs are captured with EXTEND/spec_info=None and
must not be replayed for verify batches that have different spec_info
and capture_hidden_mode.

Previously all speculative algorithms disabled PCG (sgl-project#16331 added this
as a conservative safety measure when PCG became default-enabled).
The original PCG implementation (sgl-project#10062) had no speculative restriction.

Accuracy verification (GSM8K, 50 questions per config):
  EAGLE/NEXTN: PCG=0.970, acceptance=3.44
  EAGLE3:      PCG=1.000, acceptance=4.23 (verified on TP2)
  STANDALONE:  PCG=0.840, acceptance=3.24

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
@Oasis-Git Oasis-Git assigned Oasis-Git and unassigned Oasis-Git Apr 8, 2026