[Experimental] Breakable Piecewise Cuda Graph by Oasis-Git · Pull Request #22218 · sgl-project/sglang

Oasis-Git · 2026-04-07T01:05:43Z

Motivation

Inspired by #19102, and with credit to @cctry, we implemented a breakable piecewise CUDA graph that does not rely on the torch.compile backend.

This is still an experimental feature, intended as a simpler way to support piecewise CUDA graphs.

Usage: --enable-breakable-cuda-graph

mGSM8K Benchmark (200 questions)

Config	PCG score	PCG tput (tok/s)	PCG cap_GB	BCG score	BCG tput (tok/s)	BCG cap_GB
qwen3_8b_tp1	0.850	3352.6	1.43	0.815	3366.0	1.40
qwen3_8b_tp2	0.835	4918.5	1.85	0.825	4989.9	1.93
qwen3_32b_tp1	0.965	818.8	2.78	0.955	665.5	2.51
qwen3_32b_tp4	0.975	2267.6	2.81	0.965	2284.1	2.84
qwen3_30b_a3b_tp1	0.955	1689.8	1.37	0.955	1669.6	1.35
qwen3_30b_a3b_tp2	0.955	2634.7	1.96	0.960	2560.5	2.04
qwen3_30b_a3b_ep2	0.940	2452.3	2.06	0.950	2422.7	2.13
qwen3_235b_tp8	0.980	901.2	3.53	0.985	892.2	3.54
qwen3_235b_ep8	0.980	754.5	3.76	0.975	728.0	3.80
nemotronh_8b_tp2	0.310	3610.5	1.66	0.300	3544.4	1.86

Profiler:

Still being fixed:
BCG + MLA + RadixCache

Modifications

Accuracy Tests

Speed Tests and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

cctry · 2026-04-07T01:23:25Z

Maybe we don't need the new runner file. The intention is to make PCG work at a low level with minimal code changes (a decorator).

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

Oasis-Git · 2026-04-07T01:41:58Z

Makes sense.

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

Rename class BreakablePiecewiseCudaGraphRunner -> BreakableCudaGraphRunner and file breakable_piecewise_cuda_graph_runner.py -> breakable_cuda_graph_runner.py for consistency with the already-named breakable_cuda_graph subpackage. Also drop the unused __all__ export in bcg_attention.py — nothing uses star imports and the explicit import in radix_attention.py makes it redundant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

Give BCG its own runtime state instead of piggy-backing on PCG's:

- Add breakable_cuda_graph/context.py with enable_breakable_cuda_graph() context manager and is_in_breakable_cuda_graph() query, parallel to compilation/piecewise_context_manager.py but managed independently.
- Wrap BCG _capture_all and replay in enable_breakable_cuda_graph() and drop the enable_piecewise_cuda_graph() wrap entirely. Existing callers of is_in_piecewise_cuda_graph() across the codebase are torch.compile PCG-specific behaviors BCG doesn't need.
- RadixAttention.forward dispatches on is_in_breakable_cuda_graph() instead of get_global_server_args().enable_breakable_cuda_graph, dropping the server-args fetch from the hot path and the server_args import from the file.

Tidy up runner docstrings and log prefixes now that BCG is no longer framed as a sub-mode of PCG: "[Breakable PCG]" -> "[BCG]", drop stale "Reuse parent's ..." comments (no parent class).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
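The context-manager and query pair described in this commit could look roughly like the sketch below. Only the names enable_breakable_cuda_graph() and is_in_breakable_cuda_graph() come from the commit message; storing the flag in a ContextVar and the attention_forward stand-in for RadixAttention.forward are assumptions made for illustration, not the code in the PR.

```python
import contextlib
import contextvars

# Assumption: the flag lives in a ContextVar. The real context.py may
# simply use a module-level global.
_in_bcg = contextvars.ContextVar("in_breakable_cuda_graph", default=False)


def is_in_breakable_cuda_graph() -> bool:
    """Cheap query that hot-path code can branch on."""
    return _in_bcg.get()


@contextlib.contextmanager
def enable_breakable_cuda_graph():
    """Mark the enclosed capture or replay as running under BCG."""
    token = _in_bcg.set(True)
    try:
        yield
    finally:
        # Restore the previous value even if capture/replay raises.
        _in_bcg.reset(token)


def attention_forward(x):
    """Hypothetical stand-in for RadixAttention.forward dispatch."""
    if is_in_breakable_cuda_graph():
        return ("bcg_break_point", x)
    return ("regular", x)


outside = attention_forward(1)
with enable_breakable_cuda_graph():
    inside = attention_forward(1)
```

Dispatching on a local flag rather than on server arguments keeps the branch in the forward path to a single cheap lookup, which matches the motivation stated in the commit.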

Introduce breakable_cuda_graph/bcg_ops.py with a single factory ``make_bcg_break_point(fn)`` that lazy-wraps a PCG custom op as a BCG eager break point. Each model now declares its break points next to the PCG ``@register_split_op`` definition as a one-liner:

    bcg_unified_attention_with_output = make_bcg_break_point(
        unified_attention_with_output
    )
    breakable_nemotron_mamba2_with_output = make_bcg_break_point(
        nemotron_mamba2_with_output
    )

Delete bcg_attention.py; its contents are replaced by the single factory call in radix_attention.py. nemotron_h.py drops the 20-line lazy wrapper for the same reason. Both files now match upstream shape for their PCG custom ops.

Verified by re-running mgsm_en 200q:

- Qwen3-8B tp=1: 0.840 / 3468.6 tok/s / 1.40 GB cap
- NemotronH-8B tp=2: 0.315 / 3445.8 tok/s / 1.86 GB cap

Parity with the prior runs within sampling noise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
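A lazy-wrapping factory of the kind this commit describes might be sketched as follows. The name make_bcg_break_point is taken from the commit message; mark_as_break_point is a hypothetical stand-in for the real break-point decorator, and the four-argument signature of unified_attention_with_output is invented for the example.

```python
import functools

wrap_events = []  # records when the (stand-in) decoration actually happens


def mark_as_break_point(fn):
    """Hypothetical stand-in for the real eager break-point decorator."""
    wrap_events.append(fn.__name__)

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)

    return wrapper


def make_bcg_break_point(fn):
    """Defer the decoration until the op is first called."""
    wrapped = None

    @functools.wraps(fn)
    def lazy(*args, **kwargs):
        nonlocal wrapped
        if wrapped is None:
            wrapped = mark_as_break_point(fn)
        return wrapped(*args, **kwargs)

    return lazy


def unified_attention_with_output(q, k, v, out):
    # Illustrative signature only; writes its result into `out`.
    out.append(q + k + v)


bcg_unified_attention_with_output = make_bcg_break_point(
    unified_attention_with_output
)
events_at_definition = list(wrap_events)  # still empty: nothing wrapped yet

result = []
bcg_unified_attention_with_output(1, 2, 3, result)
```

The point of the indirection is that declaring the break point at module load costs nothing; the wrapping work happens only on first use.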

The model_runner already logs one summary line at the end of piecewise capture with total mem usage and avail mem, matching the decode CG runner's format. The 58-lines-per-startup per-size mem_delta / segments / breaks logging was useful while debugging Fix 14's per-segment blow-up but is redundant and noisy now. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

Oasis-Git · 2026-04-23T00:44:22Z

/tag-and-rerun-ci

…le_cuda_graph/

Move both BCG test files into a dedicated directory:

- test/registered/cuda_graph/test_breakable_cuda_graph.py -> test/registered/breakable_cuda_graph/test_breakable_cuda_graph.py
- test/registered/piecewise_cuda_graph/test_breakable_piecewise_cuda_graph.py -> test/registered/breakable_cuda_graph/test_breakable_piecewise_cuda_graph.py

Mirrors the src-side layout (breakable_cuda_graph subpackage) and separates BCG tests from PCG / decode-CG tests that still live under their own directories.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

- test_breakable_cuda_graph_unit_test.py: unit tests for capture/replay mechanism (was test_breakable_cuda_graph.py)
- test_breakable_cuda_graph.py: integration test for Qwen3-8B + mgsm_en (was test_breakable_piecewise_cuda_graph.py)

Also rename the test class TestBreakablePiecewiseCudaGraph -> TestBreakableCudaGraph to match the runner rename and drop the stale "breakable PCG" print string.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

Fold the integration test class (Qwen3-8B + mgsm_en) into the same file as the unit tests. Uses the large CI suite (est_time=130, was 30+100) since the server eval is the long pole; unit tests just ride along in that slot. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

merrymercy · 2026-04-23T11:27:01Z

+                seg.replay()
+                if i < len(self._break_fns):
+                    self._break_fns[i]()
        finally:


Does it throw exceptions correctly?

It should, but no error has been thrown so far.
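The question can be checked against a small model of the loop in the diff. In the sketch below, only the loop body (seg.replay() followed by self._break_fns[i]() inside a try/finally) mirrors the quoted snippet; the RecordingSegment class, the replaying flag, and what the finally block resets are assumptions, since the surrounding code is not shown in the thread.

```python
class RecordingSegment:
    """Hypothetical stand-in for one captured graph segment."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def replay(self):
        self.log.append(self.name)


class BreakableReplaySketch:
    def __init__(self, segments, break_fns):
        self._segments = segments
        self._break_fns = break_fns
        self.replaying = False  # assumed state that `finally` cleans up

    def replay(self):
        self.replaying = True
        try:
            for i, seg in enumerate(self._segments):
                seg.replay()
                # An eager break function runs between graph segments.
                if i < len(self._break_fns):
                    self._break_fns[i]()
        finally:
            # Runs on success and on failure; it does not swallow errors.
            self.replaying = False


def failing_break():
    raise RuntimeError("eager op failed")


log = []
graph = BreakableReplaySketch(
    [RecordingSegment("seg0", log), RecordingSegment("seg1", log)],
    [failing_break],
)
try:
    graph.replay()
    propagated = False
except RuntimeError:
    propagated = True
```

A try/finally without an except clause re-raises after cleanup, so an error from a segment or a break function reaches the caller and the later segments are skipped.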

eager_on_graph already behaves lazily: the decorated wrapper only touches cuda.bindings when actually capturing, and breakable_cuda_graph.py's cuda.bindings import is already try/except'd. So wrapping at module load is safe — the extra factory indirection was redundant. radix_attention.py and nemotron_h.py now apply eager_on_graph(True) directly next to the PCG custom op. bcg_ops.py deleted. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

sglang.srt.compilation.weak_ref_tensor hard-raises NotImplementedError on non-CUDA/non-NPU platforms. Since radix_attention.py now imports eager_on_graph at module level, that chain reached weak_ref_tensors and crashed CPU-only CI runners during test collection. Move the import into _weak_ref_if_tensor so it's only triggered inside an active BCG capture — which can't happen on CPU-only anyway. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
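The deferred-import pattern behind this fix can be sketched on its own. Everything below is illustrative: the module bcg_sketch_platform_only is created on the fly to imitate a helper that raises at import time on unsupported platforms, and weak_ref_if_capturing with its capturing argument is a hypothetical simplification of _weak_ref_if_tensor.

```python
import pathlib
import sys
import tempfile

# Create a throwaway module that fails as soon as it is imported,
# imitating a platform-specific helper on a CPU-only machine.
_tmp_dir = tempfile.mkdtemp()
pathlib.Path(_tmp_dir, "bcg_sketch_platform_only.py").write_text(
    "raise NotImplementedError('unsupported platform')\n"
)
sys.path.insert(0, _tmp_dir)


def weak_ref_if_capturing(obj, capturing):
    """Import the platform-specific helper only when it is needed."""
    if not capturing:
        return obj
    # Deferred import: merely importing this file never reaches this line.
    from bcg_sketch_platform_only import weak_ref_tensors

    return weak_ref_tensors(obj)


# Defining and calling the function outside of capture is safe.
cpu_result = weak_ref_if_capturing([1, 2, 3], capturing=False)
```

With the import at module level, test collection alone would trigger the failure; inside the function, it can only fire on the capture path.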

Wrap BreakableCUDAGraph.replay's inner loop in a try/except that logs the failing segment index plus exception message before re-raising. No behavior change for the success path; makes BCG-specific crash diagnosis easier on the failure path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
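A log-then-re-raise wrapper of the kind this commit describes might look like the sketch below. The logger name, the message format, and the use of plain callables as segments are assumptions; the commit message only states that the failing segment index and exception message are logged before re-raising.

```python
import logging

logger = logging.getLogger("bcg_sketch")  # hypothetical logger name


def replay_with_diagnostics(segments):
    """Replay segments in order; on failure, log the index and re-raise."""
    i = -1
    try:
        for i, seg in enumerate(segments):
            seg()
    except Exception as exc:
        # Assumed message format; the real log line may differ.
        logger.error("[BCG] replay failed at segment %d: %s", i, exc)
        raise
```

Because the bare raise preserves the original exception and traceback, callers see the same error as before; only the extra log line is new.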

This reverts commit 89828fe.

…lay_prepare The shared replay_prepare bound from PiecewiseCudaGraphRunner reads self.capture_return_pooled_hidden_states (added upstream by the Score API PR sgl-project#22427). BCG's __init__ never set it, so CI merges of this PR with main hit AttributeError on first replay. Mirror PCG's initialization: not model_runner.is_generation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

Upstream main dropped the num_tokens parameter from set_forward_context; BCG's replay still passed it, breaking post-merge. Align with the new signature — num_tokens is no longer threaded through the context. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

Oasis-Git · 2026-04-23T22:48:07Z

/rerun-failed-ci

Oasis-Git · 2026-04-23T23:53:19Z

/rerun-failed-ci

Oasis-Git · 2026-04-24T02:42:45Z

/rerun-failed-ci

Oasis-Git · 2026-04-24T05:45:21Z

/rerun-failed-ci

Analyze the refactored BCG implementation including architecture changes, new components, NPU adaptability assessment, and comparison with the original BCG implementation.

…project#22218

Oasis-Git added 10 commits April 3, 2026 20:15, each Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

- d8c294e bcg
- 24a77ea fix
- fd74b94 fix
- ba7d67c fix
- 8f3fda0 fix
- 474c8ce add fix
- 12a965d fix
- 3113267 fix
- 1510d46 lint fix
- 1452dce remove

Oasis-Git requested review from BBuf, Edwardf0t1, Fridge003, HaiShaw, Ying1123, ch-wan, hnyls2002, ispobock and merrymercy as code owners April 7, 2026 01:05

add unit test

b5b1045

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

Oasis-Git added 2 commits April 7, 2026 04:01, each Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

- c5383f0 merge bcg to pcg
- 1a0dfeb update

Oasis-Git requested a review from hebiao064 as a code owner April 7, 2026 04:18

Oasis-Git added 4 commits April 7, 2026 04:33, each Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

- 3bf1a15 revert
- 4e124d4 fix bug
- 0213016 remove
- a2b68f0 lint fix

Oasis-Git and others added 3 commits April 22, 2026 22:04

cctry reviewed Apr 22, 2026

Comment thread python/sglang/srt/model_executor/breakable_cuda_graph/bcg_attention.py Outdated

merrymercy approved these changes Apr 23, 2026

merrymercy reviewed Apr 23, 2026

Comment thread test/registered/piecewise_cuda_graph/test_breakable_piecewise_cuda_graph.py Outdated

Oasis-Git and others added 3 commits April 23, 2026 03:50

merrymercy reviewed Apr 23, 2026

merrymercy added the high priority label Apr 23, 2026

Oasis-Git and others added 7 commits April 23, 2026 19:25

Revert "breakable cuda graph: log replay failures with segment index"

5aa100f

This reverts commit 89828fe.

Merge branch 'main' into bcg

4eb5a2d

merrymercy merged commit 60bbb80 into sgl-project:main Apr 24, 2026
565 of 636 checks passed

Oasis-Git deleted the bcg branch April 24, 2026 19:06

alisonshao mentioned this pull request Apr 26, 2026

Revert #23533 (Hy3 preview) + re-enable test_nvidia_nemotron_3_nano #23758

Closed

2 tasks

syy-hw added a commit to syy-hw/sglang that referenced this pull request Apr 27, 2026

docs: add BCG beginner tutorial covering CUDA Graph basics to PR sgl-…

2c1be84

…project#22218

sunway513 mentioned this pull request Apr 30, 2026

Hand-off: DSV4-Pro Sprint 6/7 — what to use, what to skip, and how to continue multi-req debug (#37) sunway513/ATOM#60

Open

merrymercy merged 52 commits into sgl-project:main from Oasis-Git:bcg

4 participants
