[Experimental] Breakable Piecewise Cuda Graph #22218
Merged
merrymercy merged 52 commits into sgl-project:main on Apr 24, 2026
Conversation
Contributor
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
Collaborator
Maybe we don't need the new runner file. The intention is to make PCG work at a low level with minimal code changes (a decorator).
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Collaborator
Author
Makes sense.
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Rename class BreakablePiecewiseCudaGraphRunner -> BreakableCudaGraphRunner and file breakable_piecewise_cuda_graph_runner.py -> breakable_cuda_graph_runner.py for consistency with the already-named breakable_cuda_graph subpackage. Also drop the unused __all__ export in bcg_attention.py — nothing uses star imports and the explicit import in radix_attention.py makes it redundant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Give BCG its own runtime state instead of piggy-backing on PCG's: - Add breakable_cuda_graph/context.py with enable_breakable_cuda_graph() context manager and is_in_breakable_cuda_graph() query, parallel to compilation/piecewise_context_manager.py but managed independently. - Wrap BCG _capture_all and replay in enable_breakable_cuda_graph() and drop the enable_piecewise_cuda_graph() wrap entirely. Existing callers of is_in_piecewise_cuda_graph() across the codebase are torch.compile PCG-specific behaviors BCG doesn't need. - RadixAttention.forward dispatches on is_in_breakable_cuda_graph() instead of get_global_server_args().enable_breakable_cuda_graph, dropping the server-args fetch from the hot path and the server_args import from the file. Tidy up runner docstrings and log prefixes now that BCG is no longer framed as a sub-mode of PCG: "[Breakable PCG]" -> "[BCG]", drop stale "Reuse parent's ..." comments (no parent class). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
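The enable_breakable_cuda_graph() / is_in_breakable_cuda_graph() pair described above can be sketched as a small thread-local context manager. This is a hypothetical illustration of the shape (the actual sglang helpers in breakable_cuda_graph/context.py may differ in detail):

```python
import contextlib
import threading

# Thread-local flag marking whether we are inside a BCG capture/replay region.
_state = threading.local()


def is_in_breakable_cuda_graph() -> bool:
    """Query used by hot paths (e.g. RadixAttention.forward) to dispatch."""
    return getattr(_state, "in_bcg", False)


@contextlib.contextmanager
def enable_breakable_cuda_graph():
    """Mark the enclosed region as BCG capture/replay; restores on exit."""
    prev = getattr(_state, "in_bcg", False)
    _state.in_bcg = True
    try:
        yield
    finally:
        # Restore the previous value even if capture/replay raises.
        _state.in_bcg = prev
```

Dispatching on this flag avoids fetching global server args on every forward call, which is the hot-path simplification the commit describes.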
Introduce breakable_cuda_graph/bcg_ops.py with a single factory
``make_bcg_break_point(fn)`` that lazy-wraps a PCG custom op as a BCG
eager break point. Each model now declares its break points next to the
PCG ``@register_split_op`` definition as a one-liner:
bcg_unified_attention_with_output = make_bcg_break_point(
unified_attention_with_output
)
breakable_nemotron_mamba2_with_output = make_bcg_break_point(
nemotron_mamba2_with_output
)
Delete bcg_attention.py; its contents are replaced by the single
factory call in radix_attention.py. nemotron_h.py drops the 20-line
lazy wrapper for the same reason. Both files now match upstream shape
for their PCG custom ops.
Verified by re-running mgsm_en 200q:
- Qwen3-8B tp=1: 0.840 / 3468.6 tok/s / 1.40 GB cap
- NemotronH-8B tp=2: 0.315 / 3445.8 tok/s / 1.86 GB cap
Parity with the prior runs within sampling noise.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
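A minimal sketch of what a ``make_bcg_break_point`` factory could look like, under the assumption that it wraps a PCG custom op and only performs BCG-specific bookkeeping when a capture is active (the capture check and break action here are stand-ins, not the real sglang internals):

```python
import functools


def make_bcg_break_point(fn, in_bcg_capture=lambda: False):
    """Lazily wrap a PCG custom op as a BCG eager break point (sketch).

    The wrapper does nothing BCG-specific unless a capture is active,
    so wrapping at module load time stays cheap and safe.
    """
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        if in_bcg_capture():
            # Record an eager break point; the real implementation would
            # end the current graph segment here.
            wrapper.break_count += 1
        return fn(*args, **kwargs)  # always execute the underlying op

    wrapper.break_count = 0
    return wrapper
```

This keeps each model's declaration a one-liner next to its ``@register_split_op`` definition, matching the usage shown above.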
cctry reviewed Apr 22, 2026
The model_runner already logs one summary line at the end of piecewise capture with total mem usage and avail mem, matching the decode CG runner's format. The 58-lines-per-startup per-size mem_delta / segments / breaks logging was useful while debugging Fix 14's per-segment blow-up but is redundant and noisy now. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Collaborator
Author
/tag-and-rerun-ci
merrymercy approved these changes Apr 23, 2026
merrymercy reviewed Apr 23, 2026
…le_cuda_graph/ Move both BCG test files into a dedicated directory: - test/registered/cuda_graph/test_breakable_cuda_graph.py -> test/registered/breakable_cuda_graph/test_breakable_cuda_graph.py - test/registered/piecewise_cuda_graph/test_breakable_piecewise_cuda_graph.py -> test/registered/breakable_cuda_graph/test_breakable_piecewise_cuda_graph.py Mirrors the src-side layout (breakable_cuda_graph subpackage) and separates BCG tests from PCG / decode-CG tests that still live under their own directories. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
- test_breakable_cuda_graph_unit_test.py: unit tests for capture/replay mechanism (was test_breakable_cuda_graph.py) - test_breakable_cuda_graph.py: integration test for Qwen3-8B + mgsm_en (was test_breakable_piecewise_cuda_graph.py) Also rename the test class TestBreakablePiecewiseCudaGraph -> TestBreakableCudaGraph to match the runner rename and drop the stale "breakable PCG" print string. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Fold the integration test class (Qwen3-8B + mgsm_en) into the same file as the unit tests. Uses the large CI suite (est_time=130, was 30+100) since the server eval is the long pole; unit tests just ride along in that slot. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
merrymercy
reviewed
Apr 23, 2026
        seg.replay()
        if i < len(self._break_fns):
            self._break_fns[i]()
    finally:
Contributor
Does it throw exceptions correctly?
Collaborator
Author
It should, but no error has been thrown so far.
eager_on_graph already behaves lazily: the decorated wrapper only touches cuda.bindings when actually capturing, and breakable_cuda_graph.py's cuda.bindings import is already try/except'd. So wrapping at module load is safe — the extra factory indirection was redundant. radix_attention.py and nemotron_h.py now apply eager_on_graph(True) directly next to the PCG custom op. bcg_ops.py deleted. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
sglang.srt.compilation.weak_ref_tensor hard-raises NotImplementedError on non-CUDA/non-NPU platforms. Since radix_attention.py now imports eager_on_graph at module level, that chain reached weak_ref_tensors and crashed CPU-only CI runners during test collection. Move the import into _weak_ref_if_tensor so it's only triggered inside an active BCG capture — which can't happen on CPU-only anyway. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Wrap BreakableCUDAGraph.replay's inner loop in a try/except that logs the failing segment index plus exception message before re-raising. No behavior change for the success path; makes BCG-specific crash diagnosis easier on the failure path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
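The replay loop with per-segment failure logging might look like the following sketch, where segments and break functions are plain callables standing in for captured graph segments:

```python
import logging

logger = logging.getLogger(__name__)


def replay(segments, break_fns):
    """Replay captured segments, running eager break fns between them.

    On failure, log the failing segment index before re-raising so that
    BCG-specific crashes are easy to localize; the success path is
    unchanged.
    """
    for i, seg in enumerate(segments):
        try:
            seg()
            if i < len(break_fns):
                break_fns[i]()
        except Exception as exc:
            logger.error("[BCG] replay failed at segment %d: %s", i, exc)
            raise
```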
This reverts commit 89828fe.
…lay_prepare The shared replay_prepare bound from PiecewiseCudaGraphRunner reads self.capture_return_pooled_hidden_states (added upstream by the Score API PR sgl-project#22427). BCG's __init__ never set it, so CI merges of this PR with main hit AttributeError on first replay. Mirror PCG's initialization: not model_runner.is_generation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Upstream main dropped the num_tokens parameter from set_forward_context; BCG's replay still passed it, breaking post-merge. Align with the new signature — num_tokens is no longer threaded through the context. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Collaborator
Author
/rerun-failed-ci
3 similar comments
syy-hw added a commit to syy-hw/sglang that referenced this pull request on Apr 27, 2026
Analyze the refactored BCG implementation including architecture changes, new components, NPU adaptability assessment, and comparison with the original BCG implementation.
syy-hw added a commit to syy-hw/sglang that referenced this pull request on Apr 27, 2026
Motivation
Inspired by #19102 and with credit to @cctry, we implemented a breakable piecewise CUDA graph that does not rely on the torch.compile backend.
This is still an experimental feature that offers simpler support for piecewise CUDA graphs.
Usage:
--enable-breakable-cuda-graph
mGSM8K Benchmark (200 questions)
Profiler:

Still under fix:
BCG + MLA + RadixCache
Modifications
Accuracy Tests
Speed Tests and Profiling
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci