[Experimental] Breakable Piecewise Cuda Graph #22218

Merged
merrymercy merged 52 commits into sgl-project:main from Oasis-Git:bcg
Apr 24, 2026
@Oasis-Git
Collaborator

@Oasis-Git Oasis-Git commented Apr 7, 2026

Motivation

Inspired by #19102, and with credit to @cctry, we implemented a breakable piecewise CUDA graph (BCG) that does not rely on the torch.compile backend.

This is still an experimental feature, intended to make piecewise CUDA graph (PCG) support simpler.

Usage: --enable-breakable-cuda-graph
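
As a minimal sketch of the usage above, the snippet below builds a server launch command with the flag enabled. The helper name `build_launch_command` and the model path `Qwen/Qwen3-8B` are assumptions for illustration; only `--enable-breakable-cuda-graph` comes from this PR.

```python
import shlex
import sys


def build_launch_command(model_path: str, tp_size: int = 1) -> list[str]:
    """Build an argv list that launches an SGLang server with BCG enabled."""
    return [
        sys.executable, "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--tp-size", str(tp_size),
        "--enable-breakable-cuda-graph",  # flag introduced by this PR
    ]


# The model path is an assumption; substitute the model you want to serve.
cmd = build_launch_command("Qwen/Qwen3-8B")
print(shlex.join(cmd))
# To actually start the server: subprocess.run(cmd, check=True)
```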

mGSM8K Benchmark (200 questions)

| Config | PCG score | PCG tput (tok/s) | PCG cap (GB) | BCG score | BCG tput (tok/s) | BCG cap (GB) |
|---|---|---|---|---|---|---|
| qwen3_8b_tp1 | 0.850 | 3352.6 | 1.43 | 0.815 | 3366.0 | 1.40 |
| qwen3_8b_tp2 | 0.835 | 4918.5 | 1.85 | 0.825 | 4989.9 | 1.93 |
| qwen3_32b_tp1 | 0.965 | 818.8 | 2.78 | 0.955 | 665.5 | 2.51 |
| qwen3_32b_tp4 | 0.975 | 2267.6 | 2.81 | 0.965 | 2284.1 | 2.84 |
| qwen3_30b_a3b_tp1 | 0.955 | 1689.8 | 1.37 | 0.955 | 1669.6 | 1.35 |
| qwen3_30b_a3b_tp2 | 0.955 | 2634.7 | 1.96 | 0.960 | 2560.5 | 2.04 |
| qwen3_30b_a3b_ep2 | 0.940 | 2452.3 | 2.06 | 0.950 | 2422.7 | 2.13 |
| qwen3_235b_tp8 | 0.980 | 901.2 | 3.53 | 0.985 | 892.2 | 3.54 |
| qwen3_235b_ep8 | 0.980 | 754.5 | 3.76 | 0.975 | 728.0 | 3.80 |
| nemotronh_8b_tp2 | 0.310 | 3610.5 | 1.66 | 0.300 | 3544.4 | 1.86 |
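
To make the throughput comparison easier to read, here is a small sketch that computes the relative change of BCG over PCG for a few rows. The numbers are copied from the table above; the helper name `relative_change` is only illustrative.

```python
# (config, PCG throughput, BCG throughput), copied from the benchmark table.
rows = [
    ("qwen3_8b_tp1", 3352.6, 3366.0),
    ("qwen3_32b_tp1", 818.8, 665.5),
    ("qwen3_235b_tp8", 901.2, 892.2),
]


def relative_change(pcg: float, bcg: float) -> float:
    """Return the BCG throughput change relative to PCG, in percent."""
    return (bcg - pcg) / pcg * 100.0


deltas = {name: relative_change(pcg, bcg) for name, pcg, bcg in rows}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}%")
```

Most configurations land within a few percent of PCG; the qwen3_32b_tp1 row is the outlier at roughly -18.7%.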

Profiler:
[Screenshot: profiler trace, 2026-04-06 at 6:04:42 PM]

Still being fixed:
BCG + MLA + RadixCache

Modifications

Accuracy Tests

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

Oasis-Git added 10 commits April 3, 2026 20:15
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

@cctry
Collaborator

cctry commented Apr 7, 2026

Maybe we don't need the new runner file. The intention is to make PCG work at a low level with minimal code changes (a decorator).

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
@Oasis-Git
Collaborator Author

Makes sense.

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
@Oasis-Git Oasis-Git requested a review from hebiao064 as a code owner April 7, 2026 04:18
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Oasis-Git and others added 3 commits April 22, 2026 22:04
Rename class BreakablePiecewiseCudaGraphRunner -> BreakableCudaGraphRunner
and file breakable_piecewise_cuda_graph_runner.py ->
breakable_cuda_graph_runner.py for consistency with the already-named
breakable_cuda_graph subpackage. Also drop the unused __all__ export in
bcg_attention.py — nothing uses star imports and the explicit import
in radix_attention.py makes it redundant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Give BCG its own runtime state instead of piggy-backing on PCG's:
- Add breakable_cuda_graph/context.py with enable_breakable_cuda_graph()
  context manager and is_in_breakable_cuda_graph() query, parallel to
  compilation/piecewise_context_manager.py but managed independently.
- Wrap BCG _capture_all and replay in enable_breakable_cuda_graph() and
  drop the enable_piecewise_cuda_graph() wrap entirely. Existing callers
  of is_in_piecewise_cuda_graph() across the codebase are torch.compile
  PCG-specific behaviors BCG doesn't need.
- RadixAttention.forward dispatches on is_in_breakable_cuda_graph()
  instead of get_global_server_args().enable_breakable_cuda_graph, dropping
  the server-args fetch from the hot path and the server_args import from
  the file.
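
The independent runtime state described in the bullets above can be sketched roughly as follows. This is a minimal illustration, assuming a module-level flag; the actual contents of `breakable_cuda_graph/context.py` are not shown in this thread, and `forward_dispatch` is a hypothetical stand-in for the dispatch in `RadixAttention.forward`.

```python
from contextlib import contextmanager

# Module-level state, kept separate from any PCG flag (assumed design).
_in_breakable_cuda_graph = False


def is_in_breakable_cuda_graph() -> bool:
    """Cheap query suitable for a hot path: no server-args lookup needed."""
    return _in_breakable_cuda_graph


@contextmanager
def enable_breakable_cuda_graph():
    """Mark the enclosed capture or replay as running under BCG."""
    global _in_breakable_cuda_graph
    previous = _in_breakable_cuda_graph
    _in_breakable_cuda_graph = True
    try:
        yield
    finally:
        # Restore the previous value even if the body raises.
        _in_breakable_cuda_graph = previous


def forward_dispatch() -> str:
    """Hypothetical stand-in for a forward() that picks the BCG path."""
    return "bcg_break_point" if is_in_breakable_cuda_graph() else "default_path"
```

Wrapping capture and replay in `with enable_breakable_cuda_graph():` then flips the dispatch only for the duration of that block.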

Tidy up runner docstrings and log prefixes now that BCG is no longer
framed as a sub-mode of PCG: "[Breakable PCG]" -> "[BCG]", drop stale
"Reuse parent's ..." comments (no parent class).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Introduce breakable_cuda_graph/bcg_ops.py with a single factory
``make_bcg_break_point(fn)`` that lazy-wraps a PCG custom op as a BCG
eager break point. Each model now declares its break points next to the
PCG ``@register_split_op`` definition as a one-liner:

    bcg_unified_attention_with_output = make_bcg_break_point(
        unified_attention_with_output
    )
    breakable_nemotron_mamba2_with_output = make_bcg_break_point(
        nemotron_mamba2_with_output
    )

Delete bcg_attention.py; its contents are replaced by the single
factory call in radix_attention.py. nemotron_h.py drops the 20-line
lazy wrapper for the same reason. Both files now match upstream shape
for their PCG custom ops.
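
For readers unfamiliar with the pattern, a lazy-wrapping factory of this kind might look like the sketch below. It is not the code from `bcg_ops.py`: `_as_eager_break_point` is a hypothetical stand-in for the real wrapping step, and the toy `unified_attention_with_output` only borrows the name from the commit message.

```python
import functools

wrap_calls = []  # records when the (hypothetical) wrapping actually happens


def _as_eager_break_point(fn):
    """Hypothetical stand-in for the real 'mark as eager break point' step."""
    wrap_calls.append(fn.__name__)

    @functools.wraps(fn)
    def eager(*args, **kwargs):
        return fn(*args, **kwargs)

    return eager


def make_bcg_break_point(fn):
    """Return a callable that wraps `fn` on first call, then reuses the result."""
    wrapped = None

    @functools.wraps(fn)
    def lazy(*args, **kwargs):
        nonlocal wrapped
        if wrapped is None:  # defer the wrapping cost until first use
            wrapped = _as_eager_break_point(fn)
        return wrapped(*args, **kwargs)

    return lazy


def unified_attention_with_output(q, k):
    """Toy placeholder for a PCG custom op."""
    return q + k


bcg_unified_attention_with_output = make_bcg_break_point(
    unified_attention_with_output
)
```

Nothing is wrapped at import time; the first call triggers the wrap and later calls reuse it.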

Verified by re-running mgsm_en 200q:
- Qwen3-8B tp=1: 0.840 / 3468.6 tok/s / 1.40 GB cap
- NemotronH-8B tp=2: 0.315 / 3445.8 tok/s / 1.86 GB cap
Parity with the prior runs within sampling noise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Comment thread python/sglang/srt/model_executor/breakable_cuda_graph/bcg_attention.py Outdated
The model_runner already logs one summary line at the end of piecewise
capture with total mem usage and avail mem, matching the decode CG runner's
format. The 58-lines-per-startup per-size mem_delta / segments / breaks
logging was useful while debugging Fix 14's per-segment blow-up but is
redundant and noisy now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
@Oasis-Git
Collaborator Author

/tag-and-rerun-ci

Comment thread test/registered/piecewise_cuda_graph/test_breakable_piecewise_cuda_graph.py Outdated
Oasis-Git and others added 3 commits April 23, 2026 03:50
…le_cuda_graph/

Move both BCG test files into a dedicated directory:
- test/registered/cuda_graph/test_breakable_cuda_graph.py ->
  test/registered/breakable_cuda_graph/test_breakable_cuda_graph.py
- test/registered/piecewise_cuda_graph/test_breakable_piecewise_cuda_graph.py ->
  test/registered/breakable_cuda_graph/test_breakable_piecewise_cuda_graph.py

Mirrors the src-side layout (breakable_cuda_graph subpackage) and
separates BCG tests from PCG / decode-CG tests that still live under
their own directories.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
- test_breakable_cuda_graph_unit_test.py: unit tests for capture/replay
  mechanism (was test_breakable_cuda_graph.py)
- test_breakable_cuda_graph.py: integration test for Qwen3-8B + mgsm_en
  (was test_breakable_piecewise_cuda_graph.py)

Also rename the test class TestBreakablePiecewiseCudaGraph ->
TestBreakableCudaGraph to match the runner rename and drop the stale
"breakable PCG" print string.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Fold the integration test class (Qwen3-8B + mgsm_en) into the same file
as the unit tests. Uses the large CI suite (est_time=130, was 30+100)
since the server eval is the long pole; unit tests just ride along in
that slot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Comment thread python/sglang/srt/model_executor/breakable_cuda_graph/bcg_ops.py Outdated
            seg.replay()
            if i < len(self._break_fns):
                self._break_fns[i]()
    finally:
Contributor

Does it throw exceptions correctly?

Collaborator Author

It should, but no error has been thrown so far.
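
On the question of exception behavior: in Python, a `try`/`finally` around the replay loop runs the cleanup and then lets the exception continue to the caller. The toy class below illustrates this under simplifying assumptions: segments are plain callables rather than graph objects with a `replay()` method, and the `replaying` flag is a hypothetical piece of state for the `finally` block to reset.

```python
class ToyBreakableGraph:
    """Toy model of interleaved graph segments and eager break functions."""

    def __init__(self, segments, break_fns):
        self._segments = segments
        self._break_fns = break_fns
        self.replaying = False

    def replay(self):
        self.replaying = True
        try:
            for i, seg in enumerate(self._segments):
                seg()  # stands in for seg.replay()
                if i < len(self._break_fns):
                    self._break_fns[i]()
        finally:
            # Runs on success and on failure; the exception is not swallowed.
            self.replaying = False


def failing_break():
    raise RuntimeError("break point failed")


log = []
graph = ToyBreakableGraph(
    segments=[lambda: log.append("seg0"), lambda: log.append("seg1")],
    break_fns=[failing_break],
)
try:
    graph.replay()
except RuntimeError as exc:
    log.append(f"caught: {exc}")
```

Here the second segment never runs, the state is reset, and the caller still sees the `RuntimeError`.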

Oasis-Git and others added 7 commits April 23, 2026 19:25
eager_on_graph already behaves lazily: the decorated wrapper only
touches cuda.bindings when actually capturing, and
breakable_cuda_graph.py's cuda.bindings import is already try/except'd.
So wrapping at module load is safe — the extra factory indirection was
redundant.

radix_attention.py and nemotron_h.py now apply eager_on_graph(True)
directly next to the PCG custom op. bcg_ops.py deleted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
sglang.srt.compilation.weak_ref_tensor hard-raises NotImplementedError
on non-CUDA/non-NPU platforms. Since radix_attention.py now imports
eager_on_graph at module level, that chain reached weak_ref_tensors and
crashed CPU-only CI runners during test collection.

Move the import into _weak_ref_if_tensor so it's only triggered inside
an active BCG capture — which can't happen on CPU-only anyway.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Wrap BreakableCUDAGraph.replay's inner loop in a try/except that logs
the failing segment index plus exception message before re-raising.
No behavior change for the success path; makes BCG-specific crash
diagnosis easier on the failure path.
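
A log-then-re-raise wrapper of the kind this commit describes might look like the sketch below. The function name `replay_segments`, the logger name, and the message format are assumptions; only the idea of logging the failing segment index and message before re-raising comes from the commit message.

```python
import logging

logger = logging.getLogger("bcg_sketch")  # hypothetical logger name


def replay_segments(segments):
    """Run each segment in order; log the failing index, then re-raise."""
    i = -1  # stays -1 if the failure happens before any segment runs
    try:
        for i, seg in enumerate(segments):
            seg()
    except Exception as exc:
        logger.error("[BCG] replay failed at segment %d: %s", i, exc)
        raise  # bare raise preserves the original exception and traceback


def ok_segment():
    return None


def bad_segment():
    raise ValueError("boom")
```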

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
…lay_prepare

The shared replay_prepare bound from PiecewiseCudaGraphRunner reads
self.capture_return_pooled_hidden_states (added upstream by the Score
API PR sgl-project#22427). BCG's __init__ never set it, so CI merges of this PR
with main hit AttributeError on first replay.

Mirror PCG's initialization: not model_runner.is_generation.
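
The failure mode here is easy to reproduce in miniature: a method borrowed from another class reads an attribute that only the original class's `__init__` sets. The toy classes below are illustrative stand-ins, not the real runners, and the `set_flag` switch exists only to show the before and after.

```python
class ToyPiecewiseRunner:
    def __init__(self, is_generation: bool):
        self.capture_return_pooled_hidden_states = not is_generation

    def replay_prepare(self) -> bool:
        # Reads an attribute that every user of this method must define.
        return self.capture_return_pooled_hidden_states


class ToyBreakableRunner:
    # Reuse the method without inheriting from ToyPiecewiseRunner.
    replay_prepare = ToyPiecewiseRunner.replay_prepare

    def __init__(self, is_generation: bool, set_flag: bool = True):
        if set_flag:
            # The fix: mirror the other runner's initialization.
            self.capture_return_pooled_hidden_states = not is_generation
```

With `set_flag=False` the borrowed method raises `AttributeError` on first use, which matches the symptom described in the commit message.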

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Upstream main dropped the num_tokens parameter from set_forward_context;
BCG's replay still passed it, breaking post-merge. Align with the new
signature — num_tokens is no longer threaded through the context.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
@Oasis-Git
Collaborator Author

/rerun-failed-ci

@merrymercy merrymercy merged commit 60bbb80 into sgl-project:main Apr 24, 2026
565 of 636 checks passed
@Oasis-Git Oasis-Git deleted the bcg branch April 24, 2026 19:06
syy-hw added a commit to syy-hw/sglang that referenced this pull request Apr 27, 2026
Analyze the refactored BCG implementation including architecture
changes, new components, NPU adaptability assessment, and comparison
with the original BCG implementation.