[Refactor] Cuda Graph Runner/Backend Refactor#23906
Oasis-Git wants to merge 56 commits into sgl-project:main from
Conversation
…ils} packages
Sets up the empty package skeletons for the CUDA graph refactor without
changing any behavior.
- Create cuda_graph_runner/ package; relocate legacy cuda_graph_runner.py
to cuda_graph_runner/legacy.py and re-export verbatim from __init__.py
so the 31 existing import sites (model_runner, eagle_worker, lora,
memory_pool, etc.) keep working transparently.
- Create cuda_graph_backend/ package with Base/Full/Breakable/TCPiecewise
CudaGraphBackend skeleton classes (no implementations yet).
- Create cuda_graph_backend_utils/{breakable_cuda_graph,piecewise_cuda_graph}/
empty subpackages for primitives that move in Phase 1.
- Add ServerArgs.cuda_graph_mode: Optional[Dict[str, str]] = None field
for the upcoming canonical per-phase config; legacy flags still drive
behavior.
- Add cuda_graph_runner.config_resolution.resolve_cuda_graph_config()
no-op stub; real pipeline lands in Phase 1.
No code path uses the new abstractions yet. See refactor/plan.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… primitives
Splits the conflated context state in compilation/piecewise_context_manager.py
along its real seam: the cuda-graph-capture flag (used by both BCG and tcpcg)
moves to model_executor/, while the torch.compile-warmup flag (tcpcg-internal)
stays in compilation/.
Renames:
is_in_piecewise_cuda_graph -> is_in_cuda_graph_capture
enable_piecewise_cuda_graph -> enable_cuda_graph_capture
is_in_pcg_torch_compile -> is_in_torch_compile_warmup
enable_piecewise_cuda_graph_compile -> enable_torch_compile_warmup
PIECEWISE_CUDA_GRAPH_CAPTURE_FAILED_MSG -> CUDA_GRAPH_CAPTURE_FAILED_MSG
Relocations:
compilation/piecewise_context_manager.py
-> model_executor/cuda_graph_backend_utils/piecewise_cuda_graph/context_manager.py
(capture flag, ForwardContext, set/get_forward_context)
-> compilation/compile_phase.py
(warmup flag, pcg_capture_stream)
model_executor/breakable_cuda_graph/{breakable_cuda_graph,context,cuda_utils}.py
-> model_executor/cuda_graph_backend_utils/breakable_cuda_graph/{...}
The two old paths (compilation/piecewise_context_manager.py and
model_executor/breakable_cuda_graph/) are kept as transition shims that
re-export from the new homes under both old and new names. Removed in
Phase 6.
Audited 38 callsites across 16 production files. All switched to the
renamed primitives at their new import paths. Behavior preserved
mechanically (bucket A everywhere). Two bucket-C candidates flagged
with TODO comments for follow-up:
- models/nemotron_h.py:_forward_core (CUDA stream overlap path —
genuine dynamo-tracing constraint, not capture-or-replay)
- models/deepseek_common/.../forward_mla.py (non-contiguous-output
bmm form — required by dynamo, not by capture)
Verified clean: 22 audited modules import OK, identity preserved
through both shims, BCG test path still resolves.
See refactor/plan.md §6.5 for context-flag semantics + audit rule.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ig_resolution

Moves the 18-condition ``_handle_piecewise_cuda_graph`` from server_args.py
to ``cuda_graph_runner.config_resolution`` and converts the long if/if
cascade into a data table of ``_PiecewiseDisableRule(name, predicate)``
entries — easier to read, audit, and extend. Wires
``resolve_cuda_graph_config(self)`` from ``ServerArgs.__post_init__`` in
place of the old method call. The old method is removed (no callers left).

Phase 1 only implements stage 3 (compatibility checks) of the four-stage
pipeline described in plan §3. Stages 1 (parse), 2 (default), and 4
(validate) remain stubs that land in Phase 4 alongside the new CLI surface.
GPU-memory-based defaulting still lives in ``_handle_gpu_memory_settings``
until then.

Behavior parity verified against an 18-case matrix covering every rule plus
the ``enforce_piecewise_cuda_graph`` bypass path. ``--enforce`` still
overrides the entire table for testing per Q3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
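For illustration, a minimal sketch of the rule-table shape the commit above describes. Only ``_PiecewiseDisableRule(name, predicate)`` is named in the commit; the example rule and the helper around it are assumptions, not lifted from the diff:

    import logging
    from typing import Callable, List, NamedTuple

    logger = logging.getLogger(__name__)

    class _PiecewiseDisableRule(NamedTuple):
        name: str
        predicate: Callable[["ServerArgs"], bool]  # True => disable PCG

    _PIECEWISE_DISABLE_RULES: List[_PiecewiseDisableRule] = [
        # Illustrative entry only; the real table has 18 such rules.
        _PiecewiseDisableRule(
            "speculative decoding enabled",
            lambda args: args.speculative_algorithm is not None,
        ),
    ]

    def _apply_piecewise_rules(args: "ServerArgs") -> None:
        for rule in _PIECEWISE_DISABLE_RULES:
            if rule.predicate(args):
                logger.warning("Disabling piecewise CUDA graph: %s", rule.name)
                args.disable_piecewise_cuda_graph = True
                return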
Locks down the contract that backend extractions in Phases 2b–2d must
satisfy. The ABC has four methods:

    prepare(runner)                    # one-time setup
    capture_one(shape_key, forward_fn) # capture for one shape
    replay(shape_key)                  # replay artifact at shape
    cleanup()                          # default no-op

plus a class attribute ``captures_attn_metadata`` that lets the runner know
whether ``init_forward_metadata_capture_cuda_graph`` should run inside the
captured region (full-graph style) or outside on every replay (PCG/BCG
style).

Three concrete subclasses declared with the right metadata flag but
NotImplementedError bodies:
- FullCudaGraphBackend (captures_attn_metadata=True) — Phase 2b
- BreakableCudaGraphBackend (captures_attn_metadata=False) — Phase 2c
- TCPiecewiseCudaGraphBackend (captures_attn_metadata=False) — Phase 2d

Bodies will be lifted from cuda_graph_runner/legacy.py,
breakable_cuda_graph_runner.py, piecewise_cuda_graph_runner.py
respectively.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
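For reference, the contract described above in sketch form (method names and the flag are from the commit; module placement and type annotations are assumed):

    from abc import ABC, abstractmethod
    from typing import Any, Callable

    class BaseCudaGraphBackend(ABC):
        # Whether init_forward_metadata_capture_cuda_graph runs inside the
        # captured region (full-graph style) or outside on every replay
        # (PCG/BCG style).
        captures_attn_metadata: bool = False

        @abstractmethod
        def prepare(self, runner: Any) -> None:
            """One-time setup against the runner."""

        @abstractmethod
        def capture_one(self, shape_key: Any, forward_fn: Callable) -> Any:
            """Capture the artifact for one shape."""

        @abstractmethod
        def replay(self, shape_key: Any) -> Any:
            """Replay the captured artifact at a shape."""

        def cleanup(self) -> None:
            """Default no-op."""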
Phase 0 added the ``ServerArgs.cuda_graph_mode`` field but missed the
corresponding ``parser.add_argument`` registration. ``from_cli_args`` maps
every dataclass field name back from the argparse Namespace, so launching
the server crashed with ``AttributeError: 'Namespace' object has no
attribute 'cuda_graph_mode'``.

Adds a minimal ``--cuda-graph-mode`` flag that accepts a JSON object and
parses it to ``Dict[str, str]``. Validation of allowed values
(full/breakable/tcpcg/disabled) per phase lands in Phase 4; for now the
field is still unread.

Caught by trying to launch a baseline Qwen3-8B server for mgsm_en
validation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
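The registration this implies, as a self-contained sketch (exact kwargs and help text assumed):

    import argparse
    import json

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--cuda-graph-mode",
        type=json.loads,  # e.g. '{"decode": "full", "prefill": "tcpcg"}'
        default=None,
        help="Per-phase CUDA graph backend selection as a JSON object; "
        "allowed values are validated in a later phase.",
    )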
Lifts the full-CUDA-graph capture primitives out of
``legacy.CudaGraphRunner`` and into ``cuda_graph_backend/full.py`` as
two runner-coupling-free static methods:
FullCudaGraphBackend.make_graph()
-> torch.cuda.CUDAGraph()
(was the FULL branch of CudaGraphRunner._create_device_graph)
FullCudaGraphBackend.capture_into(graph, pool, stream,
device_module, memory_saver_adapter,
run_once_fn)
-> opens the appropriate graph capture context (memory-saver-aware)
and runs run_once_fn under it, returns its output
(was the FULL branch of CudaGraphRunner._capture_graph)
``legacy.CudaGraphRunner._create_device_graph`` and ``_capture_graph``
keep their env-var-driven Breakable branch inline (Phase 2c lifts that)
and delegate to the new primitives on the FULL branch. Net runtime
behavior: identical bytecode path, plus one extra Python call frame at
*startup* (per-shape capture); zero per-request cost.
The ABC methods (prepare/capture_one/replay) stay NotImplementedError —
the runner still owns the dict-based dispatch and the buffer setup;
Phase 3 wires the backend to be driven through the abstract interface
when runners get unified.
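In sketch form, with memory-saver routing elided (the signatures are from the commit; the bodies are assumptions):

    import torch

    class FullCudaGraphBackend:
        @staticmethod
        def make_graph() -> torch.cuda.CUDAGraph:
            # Was the FULL branch of CudaGraphRunner._create_device_graph.
            return torch.cuda.CUDAGraph()

        @staticmethod
        def capture_into(graph, pool, stream, device_module,
                         memory_saver_adapter, run_once_fn):
            # Was the FULL branch of CudaGraphRunner._capture_graph.
            # The real code selects a memory-saver-aware capture context
            # via device_module / memory_saver_adapter; the plain
            # torch.cuda.graph context is shown here.
            with torch.cuda.graph(graph, pool=pool, stream=stream):
                out = run_once_fn()
            return out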
Validated against the cg-refactor baseline (commit 8e79e6e,
mgsm_en N=200 on Qwen3-8B = 0.865, latency 63.18s, throughput 3439.7
tok/s):
Phase 2b: score 0.840, latency 64.22s, throughput 3418.8 tok/s
Score delta -0.025 = 1σ noise at p=0.85, N=200 (~5 samples flipped
between greedy-decoding runs, plausible from kernel-level
non-determinism). Latency/throughput within ±2% noise.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lifts the breakable-CUDA-graph capture primitives out of the env-var
branch of ``legacy.CudaGraphRunner._capture_graph`` into
``cuda_graph_backend/breakable.py`` as runner-coupling-free static
methods:
BreakableCudaGraphBackend.make_graph()
-> BreakableCUDAGraph()
(HIP guard moves into the backend)
BreakableCudaGraphBackend.capture_into(graph, pool, stream,
run_once_fn,
*, debug_eager,
memory_saver_adapter)
-> opens BreakableCUDAGraphCapture, optionally wraps with
eager_on_graph(True) for --debug-cuda-graph mode, raises on
memory-saver incompatibility.
Both branches of ``legacy.CudaGraphRunner._create_device_graph`` and
``_capture_graph`` now delegate (Full path → FullCudaGraphBackend,
Breakable env-var path → BreakableCudaGraphBackend). The runner is now a thin
dispatcher over the two backends; only the dict-based per-shape
storage and the prefill BCG class remain non-extracted, both of which
land in Phase 3.
ABC methods (prepare/capture_one/replay) stay NotImplementedError —
runner unification in Phase 3 wires them.
Validation (cg-refactor baseline = 0.865, score floor 0.80 per user):
Phase 2c default Full path: score 0.845, latency 64.79s,
throughput 3397.0 tok/s — comfortably above the floor.
The breakable env-var decode path is not directly exercised by
mgsm_en (the default config uses Full); the migration is byte-equivalent
and reuses a path that was working pre-refactor.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lifts the torch.compile setup logic out of
``PiecewiseCudaGraphRunner`` and into ``cuda_graph_backend/tcpcg.py``
as runner-coupling-free static methods:
TCPiecewiseCudaGraphBackend.build_compilation_config(server_args)
-> CompilationConfig
Validates --piecewise-cuda-graph-compiler choice, builds the
config, registers the MoE A2A split-op when DeepEP/Mooncake is
in use. Mirrors PiecewiseCudaGraphRunner.__init__ lines that
previously did this inline.
TCPiecewiseCudaGraphBackend.install_compile(language_model,
compile_config, graph_pool,
fullgraph=True,
dynamic_arg_dims=None)
-> wraps language_model with install_torch_compiled. Mirrors the
call site in PiecewiseCudaGraphRunner.capture().
PCG runner now imports from the backend (deferred to call site to avoid
import-cycle risk) instead of holding the construction logic inline.
The unused top-of-file imports (CompilationConfig, install_torch_compiled,
get_moe_a2a_backend) are removed; they're now reached only through the
backend.
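The deferred-import pattern in sketch form (the wrapper function and the exact module path are assumptions):

    def _get_backend_cls():
        # Deferred to the call site to avoid a module-level import cycle:
        # the backend module transitively imports this runner's package.
        from sglang.srt.model_executor.cuda_graph_backend.tcpcg import (
            TCPiecewiseCudaGraphBackend,
        )
        return TCPiecewiseCudaGraphBackend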
ABC methods (prepare/capture_one/replay) stay NotImplementedError —
runner unification in Phase 3 wires them.
Validation (cg-refactor baseline = 0.865, score floor 0.80 per user):
Phase 2d: score 0.850, latency 64.09s, throughput 3406.3 tok/s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Establishes the canonical phase-named runner classes; today they are
thin wrappers/factories over the legacy classes so behavior is
unchanged. Phase 3b/c will migrate bodies into them.
cuda_graph_runner/decode_runner.py:
class DecodeCudaGraphRunner(CudaGraphRunner): pass
Subclasses the legacy decode runner. New code should refer to this
name; the legacy class continues to host the implementation.
cuda_graph_runner/prefill_runner.py:
class PrefillCudaGraphRunner:
def __new__(cls, model_runner):
if model_runner.server_args.enable_breakable_cuda_graph:
return BreakableCudaGraphRunner(model_runner)
return PiecewiseCudaGraphRunner(model_runner)
Factory that selects the prefill backend (breakable vs tcpcg) by
today's server-arg flag. Phase 4 will drive the selection from the
canonical ``cuda_graph_mode`` config.
model_runner.py:
- decode path uses DecodeCudaGraphRunner instead of CudaGraphRunner
(defaultdict default; CPU/NPU paths unchanged).
- prefill path uses PrefillCudaGraphRunner factory instead of an
inline if/else.
Late imports avoid circular-dependency risk: model_executor modules
import from cuda_graph_runner; cuda_graph_runner imports the legacy
runner module which in turn imports model_executor primitives.
Localized imports inside the factory and inside model_runner sidestep
the cycle.
External readers of ``model_runner.graph_runner`` (eagle workers,
hardware stubs) continue to work since DecodeCudaGraphRunner is-a
CudaGraphRunner via inheritance.
Validation (cg-refactor baseline = 0.865, score floor 0.80 per user):
Phase 3a: score 0.835, latency 65.17s, throughput 3371.2 tok/s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements stages 1 (parse) and 4 (validate) of the resolver pipeline
in ``cuda_graph_runner.config_resolution`` and switches
``PrefillCudaGraphRunner`` to drive backend selection from the
canonical ``cuda_graph_mode`` field.
Stage 1 (``_parse_canonical``):
- Translates today's legacy flags to a canonical
``Dict[str, str]`` covering both phases:
``--disable-cuda-graph`` -> decode = "disabled"
``--disable-piecewise-cuda-graph`` -> prefill = "disabled"
``--enable-breakable-cuda-graph`` -> prefill = "breakable"
otherwise -> defaults
{decode: "full", prefill: "tcpcg"}.
- Explicit ``--cuda-graph-mode`` JSON wins per-phase (Q8); when the
JSON conflicts with a legacy convenience flag, emits a warning
naming both, per plan §6 Q8.
- Re-runs after compatibility checks so any auto-disable flips
(e.g. ``disable_piecewise_cuda_graph = True`` from the 18-rule
table) propagate into ``cuda_graph_mode``.
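The stage-1 translation in sketch form (conflict warnings and the re-run elided; function shape assumed):

    def _parse_canonical(server_args) -> dict:
        # Defaults when nothing is set.
        mode = {"decode": "full", "prefill": "tcpcg"}
        # Legacy convenience flags.
        if server_args.disable_cuda_graph:
            mode["decode"] = "disabled"
        if server_args.disable_piecewise_cuda_graph:
            mode["prefill"] = "disabled"
        elif server_args.enable_breakable_cuda_graph:
            mode["prefill"] = "breakable"
        # Explicit --cuda-graph-mode JSON wins per phase (the real code
        # warns when it conflicts with a legacy flag).
        if server_args.cuda_graph_mode:
            mode.update(server_args.cuda_graph_mode)
        return mode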
Stage 4 (``_validate_canonical``):
- Rejects unknown phases (only ``decode``/``prefill`` allowed).
- Rejects unknown backends per phase.
- Raises ``NotImplementedError`` for the (prefill, full) cell with a
pointer to use breakable/tcpcg instead — plan §6 Q1.
PrefillCudaGraphRunner factory now reads
``cuda_graph_mode["prefill"]`` instead of
``enable_breakable_cuda_graph`` directly. Decode side is unchanged
since the only available decode backend in v1 is ``full``.
Validation:
- Unit tests confirmed default + breakable + decode-disabled mappings,
JSON-vs-flag override warning, and validator rejection of
(prefill, full), unknown phase, unknown backend.
- mgsm_en N=200 on Qwen3-8B: score 0.835, latency 64.19s,
throughput 3420.8 tok/s — above 0.80 floor.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…da_graph_mode
Turns ``DecodeCudaGraphRunner`` into a factory (matching
``PrefillCudaGraphRunner``) that consults
``cuda_graph_mode["decode"]``. Three branches:
- "full" (default): returns a ``CudaGraphRunner`` instance. No
behavior change vs Phase 3a.
- "breakable" (experimental, plan §2.4): bridges to today's
``SGLANG_USE_BREAKABLE_CUDA_GRAPH`` env-var path inside
``CudaGraphRunner._capture_graph`` / ``_create_device_graph``.
Sets the env var if not already set. Phase 3b/c will replace the
env-var read with a constructor parameter.
- "tcpcg": not implemented for the decode phase in v1; logs a
one-shot warning and falls back to "full" so the server still
boots. Tracked as a Phase-3+ follow-up in refactor/progress.md.
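The three-branch factory in sketch form (class body assumed; ``CudaGraphRunner`` is the legacy decode runner, imported from the legacy module in the real code):

    import logging
    import os

    logger = logging.getLogger(__name__)

    class DecodeCudaGraphRunner:
        def __new__(cls, model_runner):
            mode = model_runner.server_args.cuda_graph_mode["decode"]
            if mode == "breakable":
                # Experimental bridge: the legacy capture path reads this
                # env var; a later phase replaces it with a constructor arg.
                os.environ.setdefault("SGLANG_USE_BREAKABLE_CUDA_GRAPH", "1")
            elif mode == "tcpcg":
                # Not implemented for decode in v1; fall back so the server
                # still boots (one-shot warning in the real code).
                logger.warning("decode tcpcg unimplemented; using full")
            return CudaGraphRunner(model_runner)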
The (prefill, full) cell continues to raise NotImplementedError from
the validator (Phase 4a, plan §6 Q1). The matrix is now:
(decode, full) — implemented (default)
(decode, breakable) — experimental, env-var bridge
(decode, tcpcg) — falls back to full + warning (TODO)
(prefill, breakable) — implemented
(prefill, tcpcg) — implemented (default)
(prefill, full) — NotImplementedError stub
Validation:
- mgsm_en N=200 on Qwen3-8B (default mode = decode:full +
prefill:tcpcg): score 0.835, latency 64.14s, throughput 3423.4
tok/s. Above 0.80 floor.
- The breakable/tcpcg decode branches are exercised at construction
time but not driven by the default eval; explicit tests for them
are deferred to a follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…i-model PCG/BCG test pass
The transition shim at
``model_executor/breakable_cuda_graph/breakable_cuda_graph.py`` had
its underscore-prefixed re-exports listed explicitly, but
``_copy_output`` was missed from the list. The PCG/BCG test suite at
``test/registered/breakable_cuda_graph/test_breakable_cuda_graph.py``
imports it directly from the legacy path, so ``TestCopyOutput.setUpClass``
errored with ImportError under the cg-refactor branch.
Adds ``_copy_output`` to the explicit-imports list so the shim
remains 1:1 with the legacy module surface. (The * import covers it
for non-underscore symbols; private symbols need explicit listing.)
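The shim pattern, for reference (paths from the relocation commit; illustrative only):

    # model_executor/breakable_cuda_graph/breakable_cuda_graph.py (shim)
    # Star-import covers the public surface...
    from sglang.srt.model_executor.cuda_graph_backend_utils.breakable_cuda_graph.breakable_cuda_graph import *  # noqa: F401,F403
    # ...but `import *` skips underscore-prefixed names, so private symbols
    # that external code imports must be re-exported explicitly:
    from sglang.srt.model_executor.cuda_graph_backend_utils.breakable_cuda_graph.breakable_cuda_graph import (  # noqa: F401
        _copy_output,
    )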
End-to-end test pass (CI-registered tests + multi-model coverage at
HEAD ``a7ae66efc`` plus this fix):
PCG suite (test/registered/piecewise_cuda_graph/):
- TestPiecewiseCudaGraphQwen25VL (Qwen2.5-VL-7B-Instruct
--enforce-piecewise-cuda-graph + --disable-radix-cache, gsm8k):
score 0.818 ≥ 0.80 ✓
- TestPiecewiseCudaGraphInternVL25 (InternVL2.5-8B same setup,
gsm8k): score 0.575 ≥ 0.54 ✓
- TestPiecewiseCudaGraphQwen25VLEmbedding (Qwen2.5-VL-3B-Instruct
embedding, enforce vs disable):
max_abs_diff 0.0078 < 1e-2 ✓
BCG suite (test/registered/breakable_cuda_graph/):
- TestBreakableCUDAGraphBasic + TestCopyOutput +
TestBreakGraphHelper (unit): 11 tests pass ✓
- TestBreakableCudaGraph (Qwen3-8B
--enable-breakable-cuda-graph, mgsm_en N=1319):
score 0.856 ≥ 0.80 ✓
Plus the multi-model spot-checks already in progress.md:
Qwen3-8B PCG default (Phase 5 commit): 0.835
Nemotron-H Mamba: 0.280 (parity with BCG-notes 0.310 base model)
Qwen3-30B-A3B MoE: 0.950
Spec-decoding PCG test (Qwen3.5-35B-A3B + NEXTN, 2-GPU, FP8) deferred
— model not cached, suite is "stage-b-test-2-gpu-large", out of scope
for this single-GPU validation pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the four convenience flags from plan §3.2 as sugar over
``--cuda-graph-mode``:
--prefill-cuda-graph-backend {full,breakable,tcpcg,disabled}
--decode-cuda-graph-backend {full,breakable,tcpcg,disabled}
--prefill-disable-cuda-graph (== ...-backend disabled)
--decode-disable-cuda-graph (== ...-backend disabled)
Each translates to a single-phase entry in the canonical
``cuda_graph_mode`` dict — no decode change when only prefill is set,
and vice versa.
ServerArgs gets four new fields with the same names; CLI registration
sits next to ``--cuda-graph-mode`` in ``add_cli_args``.
Precedence in ``_parse_canonical`` (highest first; warning emitted on
override):
1. ``--cuda-graph-mode`` JSON.
2. Per-phase convenience flags above.
3. Legacy ``--enable-breakable-cuda-graph`` /
``--disable-piecewise-cuda-graph`` / ``--disable-cuda-graph``.
4. Defaults: {decode: full, prefill: tcpcg}.
Validation:
- Unit tests covering 7 scenarios (default, single-phase
convenience, JSON-vs-convenience override warning,
convenience-vs-legacy override warning) all pass.
- mgsm_en N=200 on Qwen3-8B with ``--prefill-cuda-graph-backend
breakable``: score 0.825, latency 64.83s, throughput 3416.2 tok/s.
Above 0.80 floor; the convenience flag drives the same path that
``--enable-breakable-cuda-graph`` does (both set
``cuda_graph_mode["prefill"] = "breakable"``).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…h constructor param

Removes the hacky env-var bridge that the ``DecodeCudaGraphRunner`` factory
used to forward ``cuda_graph_mode["decode"] == "breakable"`` into
``CudaGraphRunner._capture_graph`` / ``_create_device_graph``.

``CudaGraphRunner.__init__`` now accepts ``use_breakable_capture:
Optional[bool]``. The default ``None`` keeps backwards compatibility with
users who set ``SGLANG_USE_BREAKABLE_CUDA_GRAPH=1`` directly — when the
kwarg is None, the env var is consulted as a fallback. The
DecodeCudaGraphRunner factory now passes ``use_breakable_capture=True``
when ``cuda_graph_mode["decode"] == "breakable"``; no os.environ mutation.
The (decode, tcpcg) fallback warning is unchanged.

Fixes one of the cleanup items flagged in the "Open issues" section of
``refactor/progress.md``.

Validation: mgsm_en N=200 on Qwen3-8B (default decode=full) = 0.850,
latency 65.10s, throughput 3397.9 tok/s. Above 0.80 floor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
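The tri-state default described above, sketched (surrounding ``__init__`` body assumed):

    import os
    from typing import Optional

    class CudaGraphRunner:
        def __init__(self, model_runner,
                     use_breakable_capture: Optional[bool] = None):
            if use_breakable_capture is None:
                # Caller expressed no preference: fall back to the env var
                # for backwards compatibility with direct users.
                use_breakable_capture = (
                    os.environ.get("SGLANG_USE_BREAKABLE_CUDA_GRAPH") == "1"
                )
            self.use_breakable_capture = use_breakable_capture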
…e to dynamo-only
Two model-side gates were tagged ``TODO(cg-refactor)`` because their
explicit comments named torch.compile / dynamo as the constraint, but
they were gated on the broader ``is_in_cuda_graph_capture()`` (which
fires during both capture and replay). Narrowed to
``torch.compiler.is_compiling()`` — fires only during dynamo tracing
(compile time), letting replay take the fast path.
Sites touched:
models/nemotron_h.py:_forward_core
Comment: "torch.compile cannot trace CUDA streams". The Mamba
decoder layer's stream-overlap path was disabled during both
capture and replay; now it's only disabled during compile.
models/deepseek_common/.../forward_mla.py
Comment: "torch dynamo requires out= op was called where output
tensor was non-contiguous". The non-contiguous-output bmm form
was used during both capture and replay; now it's only used
during compile.
Both gates were preserving correctness because the broader gate was a
strict superset of the dynamo-only gate. Narrowing them improves
replay performance without affecting capture-time correctness.
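The shape of the narrowing (the model-side helper methods here are hypothetical placeholders for the two real sites):

    import torch

    def _forward_core(self, *args):
        # Before: gated on is_in_cuda_graph_capture(), which fires during
        # both capture and replay, so replay also took the slow path.
        # After: gate only on dynamo tracing; replay takes the fast path.
        if torch.compiler.is_compiling():
            return self._forward_without_stream_overlap(*args)
        return self._forward_with_stream_overlap(*args)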
Validation: mgsm_en N=200 on Nemotron-H-8B-Base-8K (which exercises
the nemotron_h.py gate via the hybrid Mamba path): score 0.28,
matching the pre-audit baseline (BCG notes' 0.310 for tp2; tp1 here).
The ``forward_mla.py`` change touches the DeepSeek MLA path; full
validation against DS-Coder-V2-Lite + flashinfer backend is a Phase 6
follow-up.
Removes the imports of ``is_in_cuda_graph_capture`` from both files
since they are no longer used after the narrowing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Strips refactor-process commentary out of production code:
* Removes ``Phase N`` / ``cg-refactor`` mentions from docstrings,
comments, and ``NotImplementedError`` messages across
cuda_graph_runner, cuda_graph_backend, cuda_graph_backend_utils,
the transition shims, server_args, model_runner,
piecewise_cuda_graph_runner, and the two model-side gates in
nemotron_h / forward_mla.
* Deletes three speculative-scaffolding files that were never wired:
- ``cuda_graph_runner/base_runner.py`` (empty ``BaseCudaGraphRunner``
placeholder; only used as a TYPE_CHECKING reference in stub
methods that are now also gone)
- ``cuda_graph_runner/buffers.py`` (no contents, no users)
- ``cuda_graph_backend/base.py`` (``BaseCudaGraphBackend`` ABC with
abstract ``prepare`` / ``capture_one`` / ``replay`` slots that
nothing calls)
* Strips each backend down to the static methods that are actually
used: ``FullCudaGraphBackend.{make_graph, capture_into}``,
``BreakableCudaGraphBackend.{make_graph, capture_into}``,
``TCPiecewiseCudaGraphBackend.{build_compilation_config, install_compile}``.
Drops the ABC inheritance and the ``captures_attn_metadata`` flag
(which nothing read), the ``prepare`` / ``capture_one`` / ``replay``
``NotImplementedError`` stubs, and the ``__init__`` re-exports for
``BaseCudaGraphBackend``.
* Rewrites the docstrings of ``cuda_graph_runner/__init__.py``,
``config_resolution.py``, ``compile_phase.py``, and the
``cuda_graph_backend_utils`` package init to describe the current
architecture without referring to refactor phases.
Validation:
- 42/42 module imports clean (every audited path).
- mgsm_en N=200 on Qwen3-8B (default decode=full, prefill=tcpcg) =
0.840, latency 64.34s, throughput 3410.4 tok/s. Above 0.80 floor.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1a renamed is_in_piecewise_cuda_graph -> is_in_cuda_graph_capture
with the intent of "umbrella" semantics, but the umbrella was fiction: only
TCPCG ever sets the flag. Full-decode never sets a sglang flag, and BCG has
its own is_in_breakable_cuda_graph(). The generalized name made callsites
lie about what they were checking.

Revert all 35 callsites back to the original explicit names:

    is_in_cuda_graph_capture()       -> is_in_piecewise_cuda_graph()
    enable_cuda_graph_capture(...)   -> enable_piecewise_cuda_graph(...)
    CUDA_GRAPH_CAPTURE_FAILED_MSG    -> PIECEWISE_CUDA_GRAPH_CAPTURE_FAILED_MSG
    _in_cuda_graph_capture (private) -> _in_piecewise_cuda_graph

is_in_breakable_cuda_graph() and is_in_torch_compile_warmup() unchanged
(both names describe what they actually test). Callsites that genuinely
need umbrella semantics will use explicit inline
`is_in_piecewise_cuda_graph() or is_in_breakable_cuda_graph()` in
subsequent commits — no helper alias.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds three new module-level pieces, no behavior change. The legacy
runners still own the actual capture/replay bodies; this commit lands
the base classes that subsequent phases lift those bodies into.
cuda_graph_runner/base_runner.py BaseCudaGraphRunner ABC, freeze_gc,
get_batch_sizes_to_capture
cuda_graph_runner/buffers.py DecodeInputBuffers, PrefillInputBuffers
(dataclasses + populate_from_forward_batch
+ _grouped_foreach_copy_)
cuda_graph_backend/base.py BaseCudaGraphBackend ABC: prepare /
can_run / capture_one / replay /
cleanup
Buffer dataclass copies are duplicates for now — legacy.py and
piecewise_cuda_graph_runner.py still hold their originals; subsequent
phases switch their imports across and delete the duplicates.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…backends
Replaces the static-method backend helpers + legacy CudaGraphRunner with
a real Decode runner backed by stateful backends.
cuda_graph_backend/{full,breakable}.py → DELETED (static-method shims)
cuda_graph_backend/full_cudagraph_backend.py NEW
cuda_graph_backend/breakable_cudagraph_backend.py NEW
- Each implements BaseCudaGraphBackend's 5 methods plus capture_session()
ctx mgr. Owns _graphs[shape], _outputs[shape], pool, and (Full only)
memory_saver_adapter.
- Replay path is uniform from runner POV: backend.replay(shape, fb, **kw)
returns the captured output; static_fb is unused for these two.
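Sketch of the stateful shape (storage names from this commit; method signatures assumed, BaseCudaGraphBackend as in the ABC above):

    class FullCudaGraphBackend(BaseCudaGraphBackend):
        def __init__(self):
            self._graphs = {}   # shape -> torch.cuda.CUDAGraph
            self._outputs = {}  # shape -> captured static output

        def replay(self, shape, forward_batch, **kwargs):
            # Uniform from the runner's POV: replay the recorded graph
            # and hand back the captured output (static_fb unused here).
            self._graphs[shape].replay()
            return self._outputs[shape]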
cuda_graph_runner/legacy.py → DELETED
cuda_graph_runner/decode_runner.py → real DecodeCudaGraphRunner(BaseCudaGraphRunner)
- Lifts the entire CudaGraphRunner body (init, can_run, capture,
capture_one_batch_size, recapture_if_needed, replay_prepare, replay,
get_spec_info).
- Backend dispatched off cuda_graph_mode["decode"]: full | breakable;
tcpcg falls back to full with a one-shot warning.
cuda_graph_runner/capture_mode.py NEW — model_capture_mode + lora-variant globals
cuda_graph_runner/pool.py NEW — get/set_global_graph_memory_pool
(used by speculative-draft runners)
cuda_graph_runner/deepep_adapter.py NEW — DeepEPCudaGraphRunnerAdapter
compilation/torch_compile_decoration.py NEW — patch_model + _to_torch +
set_torch_compile_config
Speculative draft runners no longer reuse `CudaGraphRunner.capture(self)` —
each inlines its own ~25-line capture loop using the relocated freeze_gc /
graph_capture / get_tensor_model_parallel_rank helpers. Their imports
swap from `cuda_graph_runner` package re-exports to direct module paths.
`CudaGraphRunner` symbol is gone; eagle_worker / adaptive_runtime_state /
NPUGraphRunner all updated to `DecodeCudaGraphRunner`.
NPUGraphRunner override of `_create_device_graph` / `_capture_graph` is now
dead code — the new Decode runner uses backend.capture_one() and never
calls those methods. NPU support needs a follow-up NPUCudaGraphBackend;
out of scope for this PR (CUDA H100 testing only).
Validation:
Qwen3-8B mgsm_en N=200 (decode CG only, --disable-piecewise-cuda-graph):
score 0.835, latency 63.58s, throughput 3,455 tok/s
Within noise of prior commit's 0.835 baseline.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ckend
Replaces the legacy BCG + PCG runners with a single PrefillCudaGraphRunner
that picks between BreakableCudaGraphBackend and TCPiecewiseCudaGraphBackend
off cuda_graph_mode["prefill"].
cuda_graph_backend/tcpcg.py → DELETED (static-method shim)
cuda_graph_backend/tcpcg_cudagraph_backend.py NEW
- Stateful TCPiecewiseCudaGraphBackend(BaseCudaGraphBackend).
- prepare(runner): builds CompilationConfig, walks the language model's
multi-platform ops into compile mode, runs a dummy warmup, calls
install_torch_compiled, then runs the warmup_compile loop over
every shape so torch.compile finishes JIT compilation before the
cuda-graph capture session opens.
- capture_session(stream): sets enable_piecewise_cuda_graph +
set_pcg_capture_stream so the piecewise backend captures on the
right stream.
- capture_one(shape, fn, dummies): runs forward_fn twice (jit warm
+ cuda-graph capture) — both inside the capture_session.
- replay(shape, static_fb, **kw): invokes the wrapped
model_runner.model.forward (NOT language_model.model.forward —
the former is what builds LogitsProcessorOutput; storing the
latter in self._compiled_fn was a bug caught by the smoke run).
- runtime_session(): enable_piecewise_cuda_graph for replay path.
breakable_cuda_graph_runner.py → DELETED
piecewise_cuda_graph_runner.py → DELETED
cuda_graph_runner/prefill_runner.py
- Replaces the prior factory with a real
PrefillCudaGraphRunner(BaseCudaGraphRunner).
- Owns PrefillInputBuffers, capture_num_tokens, attention_layers /
moe_layers / moe_fusions snapshots, and per-bs static_* buffers
that BCG segments read at replay (allocated regardless of active
backend; trivial cost).
- can_run() enforces the BCG-prefill bs<=1 constraint via
isinstance check; rejects target_verify (tcpcg-prefill captured
with EXTEND only); per-token-count cap.
- replay() opens backend.runtime_session(), runs replay_prepare
(pad/populate/build static_forward_batch), inits attn metadata,
opens set_forward_context, calls backend.replay(num_tokens,
static_fb), slices output to raw_num_tokens.
- _run_warmup_forward(): the per-shape warmup hook that
TCPiecewiseCudaGraphBackend.prepare calls during install_compile.
cuda_graph_backend/{full,breakable,tcpcg}_cudagraph_backend.py
- All three now do their own jit warmup (2x forward_fn) inside
capture_one rather than expecting the runner to drive warmup
separately. Decode runner's explicit pre-warmup loop dropped.
- Each backend stashes self._tp_group during prepare() so the
barrier between warmup runs works without going through the
runner.
- Added BaseCudaGraphBackend.runtime_session() — default no-op for
Full; opens enable_breakable_cuda_graph for BCG; opens
enable_piecewise_cuda_graph for tcpcg. Decode/prefill runners
wrap their replay paths in backend.runtime_session() so model
code reads the correct is_in_*_cuda_graph flag.
- has_shape() added to all three backends; tcpcg's always-True
since torch.compile dispatches by tensor shape internally.
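The session hook in sketch form (contextmanager form assumed, and assuming enable_breakable_cuda_graph is itself a context manager, per "opens enable_breakable_cuda_graph" above):

    import contextlib

    class BaseCudaGraphBackend:
        @contextlib.contextmanager
        def runtime_session(self):
            # Default no-op: the Full backend needs no flag during replay.
            yield

    class BreakableCudaGraphBackend(BaseCudaGraphBackend):
        @contextlib.contextmanager
        def runtime_session(self):
            # Model code checks is_in_breakable_cuda_graph() at replay.
            with enable_breakable_cuda_graph():
                yield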
ModelRunner now imports DecodeCudaGraphRunner + PrefillCudaGraphRunner
from cuda_graph_runner package directly. set_torch_compile_config moved
to compilation/torch_compile_decoration.
Validation:
Qwen3-8B mgsm_en N=200, default cuda_graph_mode={'decode':'full', 'prefill':'tcpcg'}:
score 0.850, latency 64.82s, throughput 3,380 tok/s
Above baseline 0.835.
Bug caught + fixed during smoke run: TCPiecewise.prepare initially
stored self._compiled_fn = language_model.model.forward (the inner
torch-compiled module). replay() invokes through this directly,
skipping the outer Qwen3ForCausalLM.forward wrapper that builds
LogitsProcessorOutput — so output came back as raw hidden states and
the runner's isinstance-dispatch hit the PPProxyTensors-only fallback
(AssertionError). Fix: store runner.model_runner.model.forward (the
outer wrapper). Validated against Qwen3-8B PCG path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…l,full) reject
Field/method renames for symmetry between phases:
ModelRunner.graph_runner → decode_cuda_graph_runner
ModelRunner.piecewise_cuda_graph_runner → prefill_cuda_graph_runner
ModelRunner.init_device_graphs() → init_decode_cuda_graph()
ModelRunner.init_piecewise_cuda_graphs() → init_prefill_cuda_graph()
Speculative-worker callers (eagle_worker / eagle_worker_v2 /
multi_layer_eagle_worker_v2 / eagle_info_v2 / adaptive_runtime_state)
all updated to access ``model_runner.decode_cuda_graph_runner`` instead
of the old ``graph_runner`` field.
(prefill, full) reject:
- _downgrade_unsupported_combinations renamed to
_reject_unsupported_combinations and now raises NotImplementedError
at config-resolution time. Previous behavior silently downgraded
to (prefill, disabled) with a warning. Per refactor goal: explicit
over implicit; the user gets a clear error pointing to breakable
or tcpcg.
Shim deletion:
- python/sglang/srt/model_executor/breakable_cuda_graph/ → DELETED
(4-file shim re-exporting from cuda_graph_backend_utils/breakable_cuda_graph/)
- python/sglang/srt/compilation/piecewise_context_manager.py → DELETED
(re-exports superseded by direct imports from new homes)
- test/registered/breakable_cuda_graph/test_breakable_cuda_graph.py
repointed to the real location.
Validation:
Qwen3-8B mgsm_en N=200, default tcpcg+full:
score 0.835, latency 64.24s, throughput 3,418 tok/s
Within noise of Phase E+F's 0.850.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
test/registered/breakable_cuda_graph/ → test/registered/cuda_graph/breakable/
test/registered/piecewise_cuda_graph/ → test/registered/cuda_graph/piecewise/

No content changes; pure file moves. Test imports were already repointed to
the real (non-shim) source locations in Phase G.

Note: .github/CODEOWNERS line 47 still references the deleted file
``piecewise_cuda_graph_runner.py``; left untouched to avoid changing
governance config without an explicit ask.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1A — BaseCudaGraphRunner now owns shared init (model_runner ref,
device, parallel sizes, attn-tp coords, tbo plugin) and the
_pad_to_bucket helper used by both Decode and Prefill replay_prepare.
Decode/Prefill subclasses call super().__init__(model_runner) and skip
the redundant field assignments. The padding helper has an explicit
assert that documents the can_run/replay_prepare contract.
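The helper in sketch form (the bucket-list name is assumed; buckets sorted ascending):

    import bisect

    def _pad_to_bucket(self, raw_size: int) -> int:
        # Documents the can_run/replay_prepare contract: callers must have
        # already verified that a large-enough captured bucket exists.
        assert raw_size <= self.capture_sizes[-1]
        # Smallest captured bucket >= raw_size.
        return self.capture_sizes[bisect.bisect_left(self.capture_sizes, raw_size)]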
Phase 1B — capture_session(stream) is now declared on the
BaseCudaGraphBackend ABC. Every backend already implemented it; the
declaration just makes the contract explicit.
Phase 2D — tcpcg phase boundaries clarified. prepare() runs steps 1+2
(JIT activate + install_compile + compile-loop pass inside
enable_torch_compile_warmup); capture_one() runs steps 3+4 (per-shape
warmup forward + capture forward), matching Full/BCG's 2x warmup + 1x
record pattern. _run_warmup_forward → _run_dummy_forward (it serves
both jit-activate and compile-loop callers; "warmup" was misleading).
Phase 2L — BCG static prefill buffers (static_seq_lens, static_extend_*,
static_req_pool_indices, static_orig_seq_lens) move from
PrefillCudaGraphRunner into BreakableCudaGraphBackend. Three new
default-no-op hooks on BaseCudaGraphBackend let the runner stay
uniform: setup_prefill_state, populate_prefill_dummy_inputs,
commit_prefill_serving_inputs. The _is_breakable_backend isinstance
flag and the ad-hoc bs>1 guard both go away — the latter moves into
BCG's can_run.
Phase 2M — cuda_graph_backend.factory.{resolve_decode_backend,
resolve_prefill_backend} replaces the per-runner _resolve_*_backend
functions. Phase / backend constants (PHASE_*, BACKEND_*,
ALLOWED_BACKENDS_PER_PHASE, DEFAULT_CUDA_GRAPH_MODE) move to factory.py
and are exported from cuda_graph_backend.
Phase 3G — PrefillInputBuffers gains create() factory and
populate_from_forward_batch() method, parallel to DecodeInputBuffers.
The 50-line allocation block and 50-line population block in
prefill_runner.py shrink to ~15 + ~15 lines. swa_translator is passed
as a callback so the buffers module stays free of model_runner deps.
Phase 3H — ForwardContext is a real ``@dataclass`` — fields declared at
class level, no custom __init__, none of the five set_* setters.
set_forward_context constructs with kwargs.
Phase 4O — resolve_cuda_graph_config moves out of
cuda_graph_runner/config_resolution.py (deleted) into
ServerArgs._resolve_cuda_graph_config. Single-pass parser: compat
rules now write directly to cuda_graph_mode["prefill"] = "disabled"
instead of mutating the legacy disable_piecewise_cuda_graph flag,
which is then derived once from the resolved mode. The double-parse
hack is gone.
Phase 4Q — BACKEND_FULL is no longer in the prefill allowed set.
_validate_canonical raises with the historical NotImplementedError-
style message when (prefill, full) is requested explicitly. The
separate _reject_unsupported_combinations stage is gone (validate
covers it).
Phase 4I — NPUCudaGraphBackend (mirrors FullCudaGraphBackend but uses
torch.npu.NPUGraph + torch.npu.graph + an async NPUGraph.update path
for variable seq_lens at replay) lives in
hardware_backend/npu/graph_runner/. The factory dispatches to it when
device == "npu". NPUGraphRunner trims to a thin subclass that handles
NPU-specific patch_model monkey-patch, the int32 cache_loc dtype, the
disk-backed profile context, and the async-update replay branch — the
dead _create_device_graph / _capture_graph overrides and the
self.graphs[bs].update calls (which referenced a field that moved to
self.backend._graphs in v1) are removed. Smoke import only — no NPU
hardware available on the test box.
Phase 4 P/J — torch_compile_decoration.py docstring clarified (calls
out the duplication-by-design with tcpcg's _toggle_multi_platform_ops).
cuda_graph_runner/__init__.py docstring refreshed (PrefillCudaGraphRunner
is real now, not a "factory wrapping legacy BCG/PCG").
Phase 3R — _pad_to_bucket asserts raw_size <= max(buckets) so the
upstream can_run/replay_prepare contract is local rather than implicit.
Validation (Qwen3-8B tp1, mgsm_en N=200, --num-threads 32):
default tcpcg+full: 0.855 (v1 baseline 0.835, +0.020 ≈ 0.8σ)
BCG (run 1): 0.810 (v1 baseline 0.840, -0.030 ≈ 1.2σ)
BCG (run 2): 0.840 (v1 baseline 0.840, match)
Both within 1.5σ of historical baselines (1σ ≈ 0.026 at p=0.85, N=200).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| from sglang.srt.compilation.compilation_config import CompilationConfig
| from sglang.srt.compilation.piecewise_context_manager import is_in_piecewise_cuda_graph
| from sglang.srt.model_executor.cuda_graph_backend_utils.tcpiecewise_cuda_graph import (
|     is_in_tcpiecewise_cuda_graph,
is_in_tcpiecewise_cuda_graph -> is_in_tc_piecewise_cuda_graph
| ctx = (
|     nullcontext()
|     if not get_global_server_args().disable_piecewise_cuda_graph
|     if check_cuda_graph_backend(Phase.PREFILL, Backend.TCPIECEWISE)
Suggested change:
-     if check_cuda_graph_backend(Phase.PREFILL, Backend.TCPIECEWISE)
+     if check_cuda_graph_backend(Phase.PREFILL, Backend.TC_PIECEWISE)
| prefill_cuda_graph_backend: Optional[str] = None
| decode_cuda_graph_backend: Optional[str] = None
Suggested change:
- prefill_cuda_graph_backend: Optional[str] = None
- decode_cuda_graph_backend: Optional[str] = None
+ prefill_cuda_graph_mode: Optional[str] = None
+ decode_cuda_graph_mode: Optional[str] = None
align with the cuda_graph_mode argument.
Use either cuda_graph_mode or cuda_graph_backend. Unify all places.
| disable_piecewise_cuda_graph: bool = False
| enable_breakable_cuda_graph: bool = False
deprecate these two arguments
| DEFAULT_CUDA_GRAPH_MODE = {
|     Phase.DECODE: Backend.FULL,
|     Phase.PREFILL: Backend.TCPIECEWISE,
prefill should be breakable, right?
| enable_profile_cuda_graph: bool = False
| enable_cudagraph_gc: bool = False
| debug_cuda_graph: bool = False
| cuda_graph_mode: Optional[Dict[str, str]] = None
Rename all general definitions (e.g. server args and backend) to device graph
tcpiecewise_cudagraph_backend.py -> tc_piecewise_cudagraph_backend
| class NPUGraphRunner(CudaGraphRunner):
|     """A NPUGraphRunner runs the forward pass of a model with npu graph and torch.compile."""
| class NPUGraphRunner(DecodeCudaGraphRunner):
Why this inherited from DecodeCudaGraphRunner and not from the BaseCudaGraphRunner? Does it mean that we cannot run Graph for prefill on NPU?
Insert an underscore between "tc" and "piecewise" to make the two-word
intent explicit. Sweep covers identifiers, file/dir paths, and the
user-facing config string value.
Backend.TCPIECEWISE -> Backend.TC_PIECEWISE ("tcpiecewise" -> "tc_piecewise")
TCPiecewise... -> TcPiecewise...
is_in_tcpiecewise_* -> is_in_tc_piecewise_*
enable_tcpiecewise_* -> enable_tc_piecewise_*
tcpiecewise_cudagraph_backend.py -> tc_piecewise_cudagraph_backend.py
tcpiecewise_cuda_graph/ -> tc_piecewise_cuda_graph/
42 files touched (40 modified, 2 renamed via git mv plus 1 dir rename).
Verified: every changed file parses, full module-level imports of
ModelRunner / FlashAttention / CuteDsl / parallel_state / both
custom_all_reduce variants succeed, and Backend.TC_PIECEWISE round-trips
through cuda_graph_mode (default mode = ``{"prefill": "tc_piecewise"}``).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
main moved 228 commits since the cg-refactor fork point. Six conflicts
resolved manually:
python/sglang/srt/layers/attention/flashinfer_backend.py
- Combine HEAD's is_in_tc_piecewise rename with main's added
``not self.use_paged`` clause on the use_ragged check.
python/sglang/srt/model_executor/cuda_graph_runner/decode_runner.py
- Drop bisect/gc imports added on main (lifted into base_runner /
utils on cg-refactor); keep contextlib (used by the device_timer
wrap). Combine main's device_timer.wrap with HEAD's backend.replay
via ``with timer_ctx, self.backend.runtime_session():``.
python/sglang/srt/model_executor/model_runner.py
- Combine main's device_timer wrap with HEAD's renamed runner
attribute (self.prefill_cuda_graph_runner, not the legacy
self.piecewise_cuda_graph_runner).
python/sglang/srt/model_executor/piecewise_cuda_graph_runner.py
- Modify/delete conflict — file was deleted in cg-refactor as part
of the lift into prefill_runner.py. Stay deleted.
python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py
python/sglang/srt/speculative/eagle_draft_extend_cuda_graph_runner.py
- Drop main's legacy template methods (_create_graph,
_capture_init, _capture_graph, _replay, capture,
capture_one_batch_size). EAGLE drafts now subclass
DecodeCudaGraphRunner and override capture_one_shape directly.
Preserve main's eagle_draft device_timer.wrap by adding it
around the backend.replay() call in the existing replay() body.
Verified: every critical module imports cleanly post-merge
(server_args, model_runner, flashattention_backend, flashinfer_backend,
flashinfer_cutedsl, parallel_state, both custom_all_reduce variants,
decode/prefill runners, factory, all eagle workers).
Backend.TC_PIECEWISE = 'tc_piecewise' round-trips through cuda_graph_mode.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Apply isort + black after the main merge: import reordering in factory.py and custom_all_reduce_v2.py, blank-line cleanup in compile_phase.py / capture_mode.py / buffers.py, reformatted assertion in breakable_cudagraph_backend.py, line wrapping in eagle_info_v2.py and buffers.py. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| prefill_cuda_graph_backend: Optional[str] = None
| decode_cuda_graph_backend: Optional[str] = None
| prefill_disable_cuda_graph: bool = False
| decode_disable_cuda_graph: bool = False
also we need to clean up / unify other parameters such as capture batch size range
Motivation
#23004
[WIP]
Modifications
Accuracy Tests
Speed Tests and Profiling
Checklist
Review and Merge Process
/tag-and-rerun-ci,/tag-run-ci-label,/rerun-failed-ci