
Piecewise CUDA Graph Support & Torch Compile Backend#10062

Merged
ispobock merged 27 commits into sgl-project:main from Oasis-Git:compile
Oct 12, 2025

Conversation

@Oasis-Git
Collaborator

@Oasis-Git Oasis-Git commented Sep 5, 2025

Motivation

This PR introduces support for piecewise CUDA graphs and a torch.compile backend in SGLang.

This feature is inspired by the implementation in vLLM v1, which partitions the full inference graph into smaller sub-graphs split at the attention operations. This approach offers several potential advantages over full-graph capture:

  • Improved Prefill Performance: SGLang's current native prefill can incur kernel launch overhead. By capturing operations in a piecewise graph, we can reduce this overhead and improve performance, especially for short input lengths.

  • Enables Advanced Attention Operations: This implementation allows complex or external operations within the attention mechanism, such as layerwise KV cache loading and storing during decoding, which are not possible with a full CUDA graph.

  • Prefill-Decode Mixture: With a piecewise CUDA graph, prefill and decode operations can be mixed within the same running batch and executed as a single forward pass.

Based on the discussion above, we introduce the piecewise CUDA graph as optional support for SGLang. Our implementation is primarily adapted from the vLLM open-source project, and we credit their team for the foundational work.
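
To make the partitioning concrete, here is a minimal, self-contained sketch of splitting an FX graph at attention call sites, so that each surrounding piece can be captured as a CUDA graph while attention itself runs outside the captured graphs. TinyModel and the partition callback are illustrative assumptions, not SGLang's actual code:

# Sketch: split a traced graph into pieces at attention call sites.
import torch
import torch.fx as fx
from torch.fx.passes.split_module import split_module

class TinyModel(torch.nn.Module):
    def forward(self, q, k, v):
        q = q * 2.0  # partition 0: capturable piece
        out = torch.nn.functional.scaled_dot_product_attention(q, k, v)  # partition 1: split point
        return out + 1.0  # partition 2: next capturable piece

model = TinyModel()
gm = fx.symbolic_trace(model)

state = {"part": 0}

def assign_partition(node):
    # Give each attention call its own partition; everything between two
    # attention calls stays together in one capturable piece.
    if node.op == "call_function" and "attention" in str(node.target):
        state["part"] += 1
        attn_part = state["part"]
        state["part"] += 1
        return attn_part
    return state["part"]

pieces = split_module(gm, model, assign_partition)
print(pieces.graph)  # submod_0 -> submod_1 (attention) -> submod_2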

Modifications

The main modification includes:

  • Custom torch.compile Backend: Introduced SGLangBackend to manage FX graph compilation and CUDA graph execution, located in sglang/python/srt/model_executor/compilation.

  • Piecewise CUDA Graph Runner: Added PiecewiseCudagraphRunner to handle the segmented capture and replay of CUDA graphs for model execution.

  • Radix Attention Integration: Wrapped the Radix Attention kernel and registered it as a custom torch.ops operator to make it compatible with torch.compile (see the sketch below).
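
As a hedged illustration of this last point, here is a minimal sketch using the PyTorch 2.4+ torch.library API; the sglang_demo namespace and the SDPA stand-in are assumptions for illustration, not the real Radix Attention kernel. A custom op registered this way stays opaque to torch.compile instead of being traced into:

# Sketch: wrap an attention call as a custom torch.ops operator.
# NOTE: illustrative stand-in, not SGLang's actual registration code.
import torch

@torch.library.custom_op("sglang_demo::attention", mutates_args=())
def demo_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # Stand-in body for the real Radix Attention kernel.
    return torch.nn.functional.scaled_dot_product_attention(q, k, v)

@demo_attention.register_fake
def _(q, k, v):
    # Shape/dtype inference used during tracing; no real computation.
    return torch.empty_like(q)

@torch.compile
def layer(q, k, v):
    # torch.compile keeps this call as one opaque node, which is what
    # allows the graph to be split around attention for piecewise capture.
    return torch.ops.sglang_demo.attention(q, k, v)

q = k = v = torch.randn(1, 2, 4, 8)
out = layer(q, k, v)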

Accuracy Tests

GSM8K test outcome for the Qwen3-8B model on H100

# With Piecewise Cuda Graph
$ bash gsm8k.sh 
100%|████████████████████████| 200/200 [00:07<00:00, 26.07it/s]
Accuracy: 0.950
Invalid: 0.000
Latency: 7.752 s
Output throughput: 3076.427 token/s

# Without Piecewise Cuda Graph
$ bash gsm8k.sh 
100%|████████████████████████| 200/200 [00:11<00:00, 17.67it/s]
Accuracy: 0.950
Invalid: 0.000
Latency: 11.384 s
Output throughput: 2091.467 token/s

Benchmarking and Profiling

Profiling

Here we provide the profiling outcome of the piecewise CUDA graph implementation:

[Profiler trace: PiecewiseCudaGraph]

Prefill-Only Benchmark

Here we provide the benchmark outcome with different model sizes in a prefill-only scenario on H100.

The script:

python -m sglang.bench_one_batch --model-path Qwen/Qwen3-14B \
    --batch 1 --input-len 16 32 64 128 256 512 --output-len 1 \
    --enable-piecewise-cuda-graph

[Benchmark charts: 1B / 8B / 14B]

E2E Benchmark

GSM8K results for the Qwen3-8B model on H100; identical to the run shown under Accuracy Tests above.

Checklist

Comment thread python/sglang/srt/layers/radix_attention.py
"enable_auto_functionalized_v2": False,
}

def configure_post_pass(self):
Contributor

You could drop the pass manager changes and the fix-functionalization pass (not required with auto_functionalized_v2) from this MR, and the piecewise backend would still work.

This would enable us to work in parallel. As far as the pass manager is concerned, it only needs to hook into inductor_config["post_grad_custom_post_pass"].

Ideally I would love to support the pass manager without CUDA graphs, for non-CUDA backends, so it would be good to keep these two things as isolated as possible.
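
For context, a minimal sketch of hooking into that config key follows; the pass body is a placeholder, not the actual pass manager, and on newer PyTorch versions a cache-safe pass should subclass torch._inductor.custom_graph_pass.CustomGraphPass:

# Sketch: attach a custom post-grad pass through Inductor's config hook.
import torch

def my_post_pass(graph: torch.fx.Graph) -> None:
    # Receives the post-grad FX graph and may rewrite it in place,
    # e.g. pattern-matching and replacing custom ops.
    for node in graph.nodes:
        pass  # placeholder: inspect or transform nodes here

compiled = torch.compile(
    torch.nn.Linear(8, 8),
    options={"post_grad_custom_post_pass": my_post_pass},
)
out = compiled(torch.randn(2, 8))  # first call compiles and runs the pass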

Collaborator Author


Yes. This part can be updated based on your PR.

Oasis-Git and others added 3 commits September 11, 2025 08:28
# can reuse the memory pool allocated for the large shapes.
with freeze_gc():
# Only rank 0 should print progress bar during capture
self.cudagraph_batch_sizes = [512, 256, 128, 64, 32, 16, 8, 4, 2, 1]
Contributor


Are these sizes for prefill only? AFAIK it doesn't take two graphs to mix decode and prefill; --enable-mixed-chunk does that in one forward.

@Oasis-Git Oasis-Git changed the title [WIP] Piecewise CUDA Graph Support Piecewise CUDA Graph Support & Torch Compile Backend Oct 3, 2025
@Oasis-Git Oasis-Git marked this pull request as ready for review October 3, 2025 00:56
@@ -326,10 +327,18 @@ def __init__(
self.qr_comm: Optional[QuickAllReduce] = None
if use_custom_allreduce and self.world_size > 1:
# Initialize a custom fast all-reduce implementation.
Collaborator


This change can cause an allreduce kernel performance drop, for example:

python3 -m sglang.launch_server --model meta-llama/Llama-3.3-70B-Instruct --tp 8 --port 30000 --enable-piecewise-cuda-graph --piecewise-cuda-graph-max-tokens 8192
python3 -m sglang.launch_server --model meta-llama/Llama-3.3-70B-Instruct --tp 8 

python3 -m sglang.bench_serving --backend sglang-oai  --dataset-name random --random-input-len 4096 --random-output-len 20 --random-range-ratio 1 --num-prompts 10 --max-concurrency 1 --warmup-requests 3 --profile

370us → 461us in prefill.

@Oasis-Git
Collaborator Author

Oasis-Git commented Oct 17, 2025

@Bruce-x-1997 Basically it should not. However, if you encounter any issues, please let me know, since piecewise CUDA graph support is still in an experimental period.

Thanks. I am trying to make piecewise CUDA graph support DeepSeek V3, but I am running into some graph-compilation bugs. I see you use fullgraph=True, but that hits a lot of graph breaks. When I set fullgraph=False instead, I get some data-return errors, such as the MLA k_rope being null. Do you have any idea how to address this? torch.compile with the normal CUDA graph works well, and it uses the default fullgraph=False. So I think I only need to change part of the piecewise CUDA graph procedure, but I am not familiar with the dynamic config in the piecewise CUDA graph.

Yes, DeepSeek V3 is the main model we are trying to support. We hope to support this model with the piecewise CUDA graph as soon as possible; the related code and benchmarks will be in a separate PR.
Here are some known issues we are working on to support the DS-V3 model: MLA support, EP kernel support, and some related RoPE issues. Supporting this model still needs much effort, and we hope to finish it within this week.

Could you share the related code as a patch or PR? I just use fused MoE, so MLA and MoE working is enough for me. And by the way, I think the CUDA graph might not need kernel changes. I see the CUDA graph works well at the decode stage, and I even supported MLA in the PR of 2d13b54; it could work at the prefill stage (it just could not support the radix cache).

It is not related to the CUDA graph only. Since piecewise CUDA graph support is mainly based on torch.compile, most of the issues actually fall on the torch.compile backend side. If you are interested, the PR will be released soon.

You mean you want to solve all graph breaks and still set fullgraph=True, right?

Yes, we hope to put this logic in fix_functionalization.py later.

@yansiyu550

@Zhiy-Zhang Hi, thanks for your help! Yes, the FlashInfer backend is not well tested, and this error has been reported previously. Basically, it can be fixed by providing the proper metadata in piecewise_cuda_graph. If you would like to contribute, please list the error in issue #11490 and assign yourself. We also welcome you to join the piecewise_cuda_graph channel on Slack for discussion!

I had the same error. After adding the following code, the issue was resolved for me.
[code screenshot]

@Zhiy-Zhang
Contributor

Zhiy-Zhang commented Oct 17, 2025

@yansiyu550 Actually, this modification is not correct. I tested it on the dense Qwen3-4B model and it works fine, but for the MoE model the results differ significantly from those using FlashAttention. I suspect the issue is related to the input format of the FlashInfer operators. I'm still working on the fix.

@yansiyu550

@yansiyu550 Hi, thanks for your benchmark. I think it is correct and acceptable for the following reasons:

  1. Most importantly, during torch.compile all custom_ops fall back to their native forward versions, which show an obvious performance drop compared with the specific kernel designs of the CUDA forward implementations, especially under heavy load.
  2. Your benchmark is relatively decode-heavy, so the prefill optimization contributes little.
  3. Our benchmarks show that for most models there is no improvement once the token count is >= 4096.

We are still working on better support for the piecewise CUDA graph, such as sgl_kernel support, and hopefully this feature can achieve better performance on your benchmark later.

Thanks for your reply!
Just to clarify — I did not enable --enable-torch-compile, but there was still no performance improvement when using --enable-piecewise-cuda-graph.

My benchmark is actually Prefill-heavy, with input_len=2048 and output_len=128.

In addition, when I tried TP=2 (two GPUs), the server failed to start properly.
Does this feature currently not support multi-GPU / tensor parallel configurations?

@yansiyu550

@yansiyu550 Actually, this modification is not correct. I tested it on the dense Qwen3-4B model and it works fine, but for the MoE model the results differ significantly from those using FlashAttention. I suspect the issue is related to the input format of the FlashInfer operators. I'm still working on the fix.

I ran into the same situation — this change works fine for the dense model, but it causes problems when running the MoE model.

lpc0220 pushed a commit to lpc0220/sglang that referenced this pull request Oct 29, 2025
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
@fjybiocs
Contributor

fjybiocs commented Nov 6, 2025

I noticed that P nodes in PD disaggregation automatically disable CUDA graph. But shouldn't the piecewise CUDA graph actually help with prefill performance on P nodes? Is this supported yet, or are there some technical blockers?

@EduardDurech
Contributor

@Oasis-Git for future reference https://peps.python.org/pep-0440

>>> "2.10"<"2.6"
True
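
A PEP 440-aware comparison, for example via the third-party packaging library, avoids the lexicographic pitfall:

# String comparison is lexicographic, so "2.10" < "2.6" is True.
# packaging.version parses versions per PEP 440 and compares numerically.
from packaging.version import Version

assert Version("2.10") > Version("2.6")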

Fixed in #15682

narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Apr 5, 2026
PCG and speculative decoding operate on independent forward paths:
- PCG: prefill/extend (ForwardMode.EXTEND), captures with spec_info=None
- Speculative: draft/verify uses decode CUDA graphs or eager execution

Key safety guard: PCG's can_run() now explicitly rejects TARGET_VERIFY
mode, since PCG graphs are captured with EXTEND/spec_info=None and
must not be replayed for verify batches that have different spec_info
and capture_hidden_mode.

Previously all speculative algorithms disabled PCG (sgl-project#16331 added this
as a conservative safety measure when PCG became default-enabled).
The original PCG implementation (sgl-project#10062) had no speculative restriction.

Accuracy verification (GSM8K, 50 questions per config):
  EAGLE/NEXTN: PCG=0.970, acceptance=3.44
  EAGLE3:      PCG=1.000, acceptance=4.23 (verified on TP2)
  STANDALONE:  PCG=0.840, acceptance=3.24

Benchmark (Qwen3.5-35B-A3B FP8, TP2, H100, extra_buffer + PCG + NEXTN):
- ITL: 22ms → 2.89ms (-87%) from MTP + overlap schedule
- TTFT (rate=5): 253ms → 147ms (-42%) from PCG prefill acceleration

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>