upstream-sync 2/5: SessionController port + cache architecture reconcile #13

KHAEntertainment merged 26 commits into main
Conversation
Adopts upstream's SessionController extraction (5acb45c), streaming session support with SessionAwareCache (c6cb0c9), session gating (e08ef06), and multimodal release on close (e1ee68d).

Key changes:
- Extract SessionController class from Scheduler.sessions dict
- Add SessionAwareCache wrapper (gated by --enable-streaming-session)
- Refactor handle_generate_request to 3-way routing
- Change Req.session_id → Req.session (Session object reference)
- Add streaming session guards in tokenizer communicator
- Add MultimodalInputs.release_features() method
- Design decision: Option B (post-create_req hook) for snapshot restore

Snapshot hooks remain on Scheduler and will be re-wired in Phase 07 after the cache architecture reconcile (Phase 04).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
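The 3-way routing in the commit message can be sketched as follows; the function name, return values, and session shape below are illustrative stand-ins, not the actual `handle_generate_request` signature:

```python
def route_generate_request(session_params, sessions):
    """Hypothetical 3-way routing: no session attached, continue an
    existing session, or fast-fail on an unknown session id."""
    if session_params is None:
        return "no_session"           # plain request path
    if session_params.get("id") in sessions:
        return "continue_session"     # reuse the existing Session object
    return "abort_invalid_session"    # fast-fail the request
```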
sgl-kernel 0.3.21 → sglang-kernel 0.4.0 renamed the package but the marlin repack functions were moved to JIT kernels upstream (2cdde5d). Make the import chain: JIT kernel → sgl_kernel → None, so non-AWQ-Marlin workloads (like Mamba) aren't blocked at import time. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
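The JIT kernel → sgl_kernel → None fallback chain can be sketched generically; the module names passed in below are placeholders, not the real import paths:

```python
import importlib

def resolve_symbol(symbol, module_names):
    """Return the named symbol from the first importable module in the
    list, else None, so unrelated workloads aren't blocked at import time."""
    for name in module_names:
        try:
            mod = importlib.import_module(name)
        except ImportError:
            continue
        fn = getattr(mod, symbol, None)
        if fn is not None:
            return fn
    return None

# Illustrative usage (module names are placeholders):
# awq_marlin_moe_repack = resolve_symbol(
#     "awq_marlin_moe_repack", ["jit_kernel_module", "sgl_kernel"])
```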
The original import error was caused by sglang-kernel 0.4.0 shadowing sgl-kernel 0.3.21's sgl_kernel module. The fix is to keep sgl-kernel 0.3.21 installed (it has all required symbols including awq_marlin_moe_repack and min_p_sampling_from_probs). Revert the import fallback chain and engine version check workaround. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The hybrid linear attention backend unconditionally imports the CuTe DSL GDN kernel which requires cutlass.cute.nvgpu — a module not available in all CUTLASS DSL versions. Guard with try/except so Mamba models can still serve without this optional dependency. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…refill cudagraph Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…pool init Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ase semantics Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ontroller Adds GPU↔host offload path via HybridCacheController and MambaPoolHost. Our TierManager (disk persistence) layers on top — it interacts with MambaPool directly for extract/inject, not through HybridCacheController. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nel for Mamba ops
…in Mamba SSM EAGLE verify
…states from ssd_combined
…for idle batch
…emotronH hybrid models
…iguous input perf
…r TP size mismatch
- test_mamba_radix_cache_comprehensive: create KV tensors on device (upstream _insert_helper now frees duplicate KV pages during insert)
- test_mamba_unittest: remove dangling available_and_evictable_str() standalone call, add missing method to MambaRadixCache from upstream
Walkthrough

This PR introduces a comprehensive Mamba hybrid-cache architecture with hierarchical host-tiered storage, session-aware streaming semantics, and pluggable backend dispatch. Major additions include a Mamba-specific radix cache with host LRU, session-scoped KV state lifecycle management, a Mamba selective state update backend abstraction, and enhanced memory pool plumbing for layer-wise transfer synchronization. Approximately 70 files modified with ~3500+ lines added.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Scheduler
    participant SessionController
    participant TreeCache as Prefix Cache
    participant MambaBackend as Mamba Backend<br/>(Dispatch)

    Client->>Scheduler: handle_generate_request()
    alt Session Exists
        Scheduler->>SessionController: get(session_id)
        SessionController->>TreeCache: restore KV state for session
        TreeCache-->>SessionController: MatchResult + sentinel nodes
        SessionController-->>Scheduler: Session object
    else New Session
        Scheduler->>SessionController: open(OpenSessionReqInput)
        SessionController->>TreeCache: initialize new session slot
        SessionController-->>Scheduler: Session created
    end
    Scheduler->>Scheduler: init_mamba_backend()
    Note over Scheduler: Select backend based on<br/>server_args.mamba_backend
    Scheduler->>MambaBackend: initialize_mamba_selective_state_update_backend()
    Note over Scheduler,MambaBackend: Backend ready for forward passes
    alt Session Finishes
        Scheduler->>SessionController: cache_finished_req()
        SessionController->>TreeCache: store KV state in session slot
        alt Streaming Enabled
            SessionController->>SessionController: save session state
            Note over SessionController: State restored on next<br/>request with same session_id
        end
    end
```
Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs
…sion-cache

# Conflicts:
#	python/sglang/srt/entrypoints/http_server.py
#	test/sglang/agents/test_tool_execution.py
Actionable comments posted: 5
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
python/sglang/srt/utils/common.py (1)
124-128: ⚠️ Potential issue | 🟡 Minor

Update the comment to reflect the code change.

The comment on line 124 states "with a newline at the end", but the `BAR_FORMAT` string on line 128 no longer includes a trailing `\n`. This creates a documentation mismatch that could confuse future maintainers. Since the newline was intentionally removed (as indicated by the PR), please update the comment to accurately describe the current behavior. You might also want to clarify whether removing the newline affects the ray/multiprocessing behavior mentioned in lines 126-127, or if that concern is no longer applicable.
📝 Suggested comment update
```diff
-# explicitly use pure text format, with a newline at the end
+# explicitly use pure text format, without a trailing newline
 # this makes it impossible to see the animation in the progress bar
-# but will avoid messing up with ray or multiprocessing, which wraps
-# each line of output with some prefix.
+# (previously included a newline to avoid issues with ray or multiprocessing
+# output wrapping, but this has been changed)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/sglang/srt/utils/common.py` around lines 124 - 128, Update the outdated comment above BAR_FORMAT to reflect that the format no longer appends a trailing newline (the BAR_FORMAT string has no "\n"); remove or reword the sentence that says "with a newline at the end" and clarify whether the intent to avoid ray/multiprocessing output-prefix issues still applies now that the newline was removed (i.e., state that using a pure text format without a trailing newline is intentional and whether it preserves the behavior with ray/multiprocessing). Ensure the comment references BAR_FORMAT so future maintainers understand the relationship between the comment and the format string.

python/sglang/srt/disaggregation/mooncake/conn.py (1)
739-746: ⚠️ Potential issue | 🟠 Major

Reject non-divisible TP ratios before computing writer offsets.

`writers_per_decode = self.attn_tp_size // dst_attn_tp_size` only works when prefill TP is an integer multiple of decode TP. On a ratio like `6 -> 4`, floor division makes several writers "park" in the same destination slice and silently corrupt the transferred Mamba state.

🔧 Proposed guard
```diff
 if self.attn_tp_size > dst_attn_tp_size:
+    if self.attn_tp_size % dst_attn_tp_size != 0:
+        raise RuntimeError(
+            "Mamba state slice transfer requires prefill attn_tp_size "
+            "to be an integer multiple of decode attn_tp_size when "
+            f"prefill TP > decode TP ({self.attn_tp_size} -> {dst_attn_tp_size})."
+        )
     # Multiple prefill ranks send to 1 decode rank
     # Each prefill sends all its dims to the appropriate offset in decode
     src_dim_start = 0
     num_dims_to_send = src_dim
     writers_per_decode = self.attn_tp_size // dst_attn_tp_size
+    if src_dim * writers_per_decode != dst_dim:
+        raise RuntimeError(
+            f"Mamba state slice geometry mismatch: {src_dim=} "
+            f"{writers_per_decode=} {dst_dim=}"
+        )
     local_writer_idx = local_tp_rank_in_group % writers_per_decode
     dst_dim_start = local_writer_idx * src_dim
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/sglang/srt/disaggregation/mooncake/conn.py` around lines 739 - 746, The code assumes attn_tp_size is an integer multiple of dst_attn_tp_size when computing writers_per_decode and writer offsets; add an explicit guard before computing writers_per_decode (e.g., in the same block that accesses attn_tp_size and dst_attn_tp_size) to detect non-divisible ratios and raise a clear exception (ValueError or AssertionError) with context including attn_tp_size and dst_attn_tp_size; this prevents silent corruption when calculating writers_per_decode, local_writer_idx, and dst_dim_start (referenced symbols: attn_tp_size, dst_attn_tp_size, writers_per_decode, local_tp_rank_in_group, dst_dim_start).

python/sglang/cli/main.py (1)
18-28: ⚠️ Potential issue | 🔴 Critical

CLI dispatch is broken: `generate` will crash immediately, and `serve` lacks a handler.

The `generate` subcommand will fail on line 28 with a `NameError` because `generate_parser` is never assigned: it's created on line 23 but the result is discarded. The `serve` subcommand, while it parses, has no handler bound to it, so it will crash at runtime when the dispatcher tries to call a nonexistent `func` attribute.

The missing import of the `serve` function (line 3) compounds the issue.

Required fix
```diff
 import argparse
 from sglang.cli.generate import generate
+from sglang.cli.serve import serve
 from sglang.cli.utils import get_git_commit_hash
 from sglang.version import __version__
@@
-    subparsers.add_parser(
+    serve_parser = subparsers.add_parser(
         "serve",
         help="Launch an SGLang server.",
         add_help=False,
     )
-    subparsers.add_parser(
+    serve_parser.set_defaults(func=serve)
+
+    generate_parser = subparsers.add_parser(
         "generate",
         help="Run inference on a multimodal model.",
         add_help=False,  # Defer help to the specific parser
     )
     generate_parser.set_defaults(func=generate)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/sglang/cli/main.py` around lines 18 - 28, The CLI dispatch is broken because the return value of subparsers.add_parser was not assigned and the serve handler was not imported or bound; fix by assigning the parser objects to variables (e.g., generate_parser = subparsers.add_parser(...) and serve_parser = subparsers.add_parser(...)), import the serve function at the top alongside generate, and call set_defaults(func=generate) on generate_parser and set_defaults(func=serve) on serve_parser so the dispatcher can invoke the correct handler.
🟠 Major comments (16)
.beads-1-1 (1)
1-1: ⚠️ Potential issue | 🟠 Major

Avoid committing machine-specific absolute paths in repo config.

This path hard-codes one workstation and will fail for anyone else. It's like committing your own `/etc/hosts` and expecting every machine to resolve the same way.

🔧 Proposed fix

```diff
-/home/jeanclawdai/.beads-stores/upstream-sync-2026-Q1
+.beads-stores/upstream-sync-2026-Q1
```

If `.beads` must stay local-only, prefer ignoring it and documenting a one-time setup step instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.beads at line 1, Remove the machine-specific absolute path "/home/jeanclawdai/.beads-stores/upstream-sync-2026-Q1" from the .beads file and replace it with a portable alternative (e.g., a relative path, an environment-variable placeholder like ${BEADS_STORE_PATH}, or a generic store name), or remove the file from source control; if you choose to keep it local-only, add .beads to .gitignore and add a short setup note in the repo README or CONTRIBUTING documenting the one-time step to create the user-specific store path or set BEADS_STORE_PATH so others can configure it on their machines.

docs/superpowers/plans/2026-03-31-phase-04-mamba-cache-architecture-reconcile.md-503-504 (1)
503-504: ⚠️ Potential issue | 🟠 Major

Replace hard-coded workstation path with repo-relative execution.
This makes the plan brittle outside one machine. Use repo-root relative commands so any contributor/runner can execute it.
🔧 Suggested adjustment
```diff
-cd /home/jeanclawdai/sglang-mamba/worktrees/upstream-sync-2026-Q1
-pytest test/registered/radix_cache/test_mamba_radix_cache_comprehensive.py -v --timeout=120 2>&1 | tail -30
+git rev-parse --show-toplevel >/dev/null || exit 1
+pytest test/registered/radix_cache/test_mamba_radix_cache_comprehensive.py -v --timeout=120 2>&1 | tail -30
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/superpowers/plans/2026-03-31-phase-04-mamba-cache-architecture-reconcile.md` around lines 503 - 504, The plan uses a hard-coded workstation path ("cd /home/jeanclawdai/sglang-mamba/worktrees/upstream-sync-2026-Q1") making it non-portable; change it to run relative to the repo root (e.g., replace the cd with a repo-root discovery like cd "$(git rev-parse --show-toplevel)" or use ./ from the project root) and ensure the pytest invocation (pytest test/registered/radix_cache/test_mamba_radix_cache_comprehensive.py -v --timeout=120 2>&1 | tail -30) is executed from that repo-root context so any contributor can run it reliably.

docs/superpowers/plans/2026-03-31-phase-04-mamba-cache-architecture-reconcile.md-345-351 (1)
345-351: ⚠️ Potential issue | 🟠 Major

Guard the sync call to prevent crashes on non-CUDA devices.

The plan recommends an unconditional `torch.cuda.synchronize()` before extraction. This will fail on CPU or other non-CUDA backends (xpu, hpu, npu, musa) since `torch.cuda.synchronize()` is CUDA-specific. The codebase explicitly supports multiple device types via DeviceConfig, so this guidance needs a device-aware guard.

🔧 Suggested adjustment
```diff
-# Ensure all pending GPU ops on Mamba state are complete before CPU extraction
-torch.cuda.synchronize()
+# Ensure pending CUDA ops complete before CPU extraction (CUDA-only)
+if torch.cuda.is_available() and mamba_pool.device.startswith("cuda"):
+    torch.cuda.synchronize()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/superpowers/plans/2026-03-31-phase-04-mamba-cache-architecture-reconcile.md` around lines 345 - 351, The unconditional torch.cuda.synchronize() will crash on non-CUDA backends; update extract_state_from_pool in mamba_snapshot.py to perform a device-aware sync before the .clone().cpu() calls: detect the Mamba state device (e.g., via the DeviceConfig or the tensor/device.type used in extract_state_from_pool) and only call the appropriate sync for CUDA (torch.cuda.synchronize()) when device.type == "cuda" (or torch.cuda.is_available()); skip or use the correct backend sync for other device types so CPU-only or non-CUDA backends do not call torch.cuda.synchronize().

python/sglang/srt/managers/schedule_batch.py-366-370 (1)
366-370: ⚠️ Potential issue | 🟠 Major

Release `precomputed_embeddings` here as well.

This helper only drops `item.feature`, but `MultimodalDataItem` stores its payload in either `feature` or `precomputed_embeddings`. If this replaces the old `mm_items = None` cleanup path, precomputed-embedding requests now keep the large tensor alive and lose the memory win.

🔧 Proposed fix
```diff
 def release_features(self):
-    """Release feature tensors to free GPU memory."""
+    """Release multimodal payload tensors to free GPU memory."""
     for item in self.mm_items:
         item.feature = None
+        item.precomputed_embeddings = None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/sglang/srt/managers/schedule_batch.py` around lines 366 - 370, The release_features method currently only clears item.feature, leaving MultimodalDataItem.precomputed_embeddings intact and thus retaining GPU memory; update release_features to also set item.precomputed_embeddings = None for each item in self.mm_items (and safely handle cases where mm_items or the attributes may be missing) so both storage paths are released when freeing tensors.

python/sglang/srt/mem_cache/base_prefix_cache.py-113-120 (1)
113-120: ⚠️ Potential issue | 🟠 Major

Converge the `init_load_back` API across abstract base, implementations, and call sites.

The abstract method in `BasePrefixCache` (lines 224–228) still uses the old positional signature `(last_host_node, host_hit_length, …)`, but `hi_mamba_radix_cache.py` has already been updated to accept `InitLoadBackParams` as a single parameter object. Meanwhile, `hiradix_cache.py` and the `session_aware_cache` wrapper still use the old signature, creating an API boundary mismatch: implementations and callers are now pulling in different directions.

To fix: update the abstract method and all remaining implementations (`hiradix_cache`, `session_aware_cache`, and any others) to accept `InitLoadBackParams` as the parameter, matching what `schedule_policy.py` is already attempting to pass. This eliminates the risk of drift between the contract and the implementations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/sglang/srt/mem_cache/base_prefix_cache.py` around lines 113 - 120, Update the init_load_back API to use the InitLoadBackParams object across the abstract base and all implementations: change BasePrefixCache.init_load_back to accept a single parameter typed as InitLoadBackParams, then refactor hiradix_cache.py and session_aware_cache (and any other implementations) to match that signature and consume fields from the InitLoadBackParams instance; ensure callers (e.g., schedule_policy.py) that currently pass an InitLoadBackParams continue to pass it unchanged and adjust any call sites that still pass positional args (last_host_node, host_hit_length, etc.) to construct and pass an InitLoadBackParams instead so the abstract, implementations, and callers are consistent with hi_mamba_radix_cache.py.

test/registered/4-gpu-models/test_qwen3_next_models_mtp.py-217-251 (1)
217-251: ⚠️ Potential issue | 🟠 Major

Restore the previous `SGLANG_ENABLE_SPEC_V2` value.

This flips a process-global flag to `True`, but teardown always writes `False` and `setUpClass` has no rollback if `popen_launch_server(...)` fails. That makes later suites order-dependent. Save the old value, restore it on launch failure, and put that same value back in `tearDownClass`.

🔁 Suggested fix
```diff
 @classmethod
 def setUpClass(cls):
     cls.model = QWEN3_NEXT_MODEL
-    envs.SGLANG_ENABLE_SPEC_V2.set(True)
+    cls._prev_spec_v2 = envs.SGLANG_ENABLE_SPEC_V2.get()
+    envs.SGLANG_ENABLE_SPEC_V2.set(True)
     cls.base_url = DEFAULT_URL_FOR_TEST
-    cls.process = popen_launch_server(
-        cls.model,
-        cls.base_url,
-        timeout=DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,
-        other_args=[
-            "--trust-remote-code",
-            "--speculative-algorithm",
-            "NEXTN",
-            "--speculative-num-steps",
-            "3",
-            "--speculative-eagle-topk",
-            "1",
-            "--speculative-num-draft-tokens",
-            "4",
-            "--mem-fraction-static",
-            "0.8",
-            "--tp",
-            "4",
-            "--chunked-prefill-size",
-            "2048",
-            "--mamba-scheduler-strategy",
-            "extra_buffer",
-            "--mamba-track-interval",
-            "128",
-        ],
-    )
+    try:
+        cls.process = popen_launch_server(
+            cls.model,
+            cls.base_url,
+            timeout=DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,
+            other_args=[
+                "--trust-remote-code",
+                "--speculative-algorithm",
+                "NEXTN",
+                "--speculative-num-steps",
+                "3",
+                "--speculative-eagle-topk",
+                "1",
+                "--speculative-num-draft-tokens",
+                "4",
+                "--mem-fraction-static",
+                "0.8",
+                "--tp",
+                "4",
+                "--chunked-prefill-size",
+                "2048",
+                "--mamba-scheduler-strategy",
+                "extra_buffer",
+                "--mamba-track-interval",
+                "128",
+            ],
+        )
+    except Exception:
+        envs.SGLANG_ENABLE_SPEC_V2.set(cls._prev_spec_v2)
+        raise

 @classmethod
 def tearDownClass(cls):
-    envs.SGLANG_ENABLE_SPEC_V2.set(False)
-    kill_process_tree(cls.process.pid)
+    envs.SGLANG_ENABLE_SPEC_V2.set(getattr(cls, "_prev_spec_v2", False))
+    if hasattr(cls, "process"):
+        kill_process_tree(cls.process.pid)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/registered/4-gpu-models/test_qwen3_next_models_mtp.py` around lines 217 - 251, setUpClass currently flips the process-global flag envs.SGLANG_ENABLE_SPEC_V2 to True without preserving the prior value and tearDownClass always sets it to False; change setUpClass to read and save the previous value (e.g., prev_spec_v2 = envs.SGLANG_ENABLE_SPEC_V2.get()), then set True, and if popen_launch_server fails restore the saved value before propagating the error; also change tearDownClass to restore the saved previous value (not hardcode False) so the original state is preserved; reference setUpClass, tearDownClass, envs.SGLANG_ENABLE_SPEC_V2, and popen_launch_server when making these changes.

python/sglang/srt/disaggregation/decode.py-165-165 (1)
165-165: ⚠️ Potential issue | 🟠 Major

`mamba_size` is wired to thin air right now.

`effective_mamba_size` is computed, but `_init_mamba_pool(...)` still gets `size + pre_alloc_size` for its capacity args. That makes the new `mamba_size` input a no-op, so decode-side Mamba state will still be sized incorrectly whenever that cap is set.

Also applies to: 184-197
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/sglang/srt/disaggregation/decode.py` at line 165, The new mamba_size parameter is ignored because _init_mamba_pool(...) is still called with size + pre_alloc_size; compute the effective_mamba_size (use mamba_size when not None, otherwise keep existing logic) and pass effective_mamba_size into all _init_mamba_pool(...) capacity arguments (replace occurrences using size + pre_alloc_size around the effective_mamba_size computation and in the blocks referenced roughly at lines 184-197). Also ensure default None handling remains correct so callers can override capacity by providing mamba_size.

python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py-51-56 (1)
51-56: ⚠️ Potential issue | 🟠 Major

Guard the CuTe DSL fallback path.

If the import fails here, `cutedsl_fused_sigmoid_gating_delta_rule_update` becomes `None`, but `GDNAttnBackend.__init__` still selects it whenever `SGLANG_USE_CUTEDSL_GDN_DECODE` is enabled. That turns the first decode into a `NoneType` call instead of a clean fallback.

Suggested guard
```diff
 use_cutedsl = Envs.SGLANG_USE_CUTEDSL_GDN_DECODE.get()
+if use_cutedsl and cutedsl_fused_sigmoid_gating_delta_rule_update is None:
+    rank0_log(
+        "CuTe DSL GDN decode requested but kernel is unavailable; falling back."
+    )
+    use_cutedsl = False
 self._kernel_func = (
     cutedsl_fused_sigmoid_gating_delta_rule_update
     if use_cutedsl
     else fused_sigmoid_gating_delta_rule_update
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py` around lines 51 - 56, The import fall-back leaves cutedsl_fused_sigmoid_gating_delta_rule_update as None but GDNAttnBackend.__init__ still picks it when SGLANG_USE_CUTEDSL_GDN_DECODE is enabled, causing a NoneType call on decode; update GDNAttnBackend.__init__ to check that cutedsl_fused_sigmoid_gating_delta_rule_update is not None before selecting it (and only enable the CuTe DSL path when both SGLANG_USE_CUTEDSL_GDN_DECODE is true and cutedsl_fused_sigmoid_gating_delta_rule_update is callable), otherwise fall back to the existing safe implementation so decode never tries to call None.

test/unit/test_mamba_state_scatter_triton.py-164-167 (1)
164-167: ⚠️ Potential issue | 🟠 Major

Fail fast when the fused kernel cannot import on CUDA.

On a CUDA worker, that import is exactly what this test is supposed to validate. `skipTest()` turns a broken Triton module into a false green.

Suggested fix
```diff
-        if fused_mamba_state_scatter_with_mask is None:
-            self.skipTest(
-                f"fused_mamba_state_scatter_with_mask import failed: {_FUSED_IMPORT_ERROR}"
-            )
+        if fused_mamba_state_scatter_with_mask is None:
+            self.fail(
+                f"fused_mamba_state_scatter_with_mask import failed: {_FUSED_IMPORT_ERROR}"
+            )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/unit/test_mamba_state_scatter_triton.py` around lines 164 - 167, The current test silently skips when fused_mamba_state_scatter_with_mask is None by calling self.skipTest, which masks import failures on CUDA; change this to fail fast by replacing the skipTest call with an explicit test failure (e.g., self.fail or raise ImportError) that includes _FUSED_IMPORT_ERROR so the test fails when the Triton fused kernel cannot be imported; locate the check referencing fused_mamba_state_scatter_with_mask and _FUSED_IMPORT_ERROR and swap skipTest(...) for a failing call that reports the import error.

test/unit/test_mamba_state_scatter_triton.py-1-2 (1)
1-2: 🛠️ Refactor suggestion | 🟠 Major

Use pytest for this new test module.

This adds a fresh `unittest` suite under `test/`, which makes it an outlier for fixtures, markers, and selection.

As per coding guidelines, `test/**/*.py`: Use pytest for testing with the 9-phase test plan structure in test/phases/.

Also applies to: 160-354
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/unit/test_mamba_state_scatter_triton.py` around lines 1 - 2, The new test module test/unit/test_mamba_state_scatter_triton.py is written with unittest; convert it to pytest-style tests by removing the unittest import and TestCase subclasses and rewriting tests as plain functions (or pytest-style classes without unittest.TestCase) that use pytest fixtures/markers; ensure you adopt the repository's 9-phase test plan by hooking into the appropriate fixtures from test/phases/ and add any required pytest markers (e.g., `@pytest.mark`.*) used elsewhere in tests so the file is consistent with pytest discovery and selection.

python/sglang/srt/mem_cache/memory_pool_host.py-1480-1487 (1)
1480-1487: ⚠️ Potential issue | 🟠 Major

`HostPoolGroup` serializes only the anchor pool.

`get_data_page()`, `get_dummy_flat_data_page()`, and `set_from_flat_data_page()` all delegate to `anchor_entry.host_pool`. If this wrapper is used for L3 write/read, every extra `PoolEntry` disappears on disk round-trips, so restore can bring back KV without the matching Mamba state, like saving the label on the box but not what's inside it.

Based on learnings, Mamba SSM snapshot persistence is the primary architectural feature being added to SGLang for fast multi-turn inference in the fork.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/sglang/srt/mem_cache/memory_pool_host.py` around lines 1480 - 1487, HostPoolGroup currently only proxies get_data_page, get_dummy_flat_data_page, and set_from_flat_data_page to anchor_entry.host_pool which causes only the anchor pool to be serialized; change these methods in HostPoolGroup to operate on the specific PoolEntry corresponding to the requested index (or iterate/coordinate across all PoolEntry.host_pool instances for dummy/read/write operations) so that every PoolEntry.host_pool state is included in persistence/restore. Locate HostPoolGroup and revise get_data_page(index, flat), get_dummy_flat_data_page(), and set_from_flat_data_page(index, data_page) to reference the correct PoolEntry (not just anchor_entry) and ensure the serialization logic writes/reads each PoolEntry.host_pool state so KV and Mamba state remain consistent after disk round-trips.

python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py-242-247 (1)
242-247: ⚠️ Potential issue | 🟠 Major

`DRAFT_EXTEND_V2` still initializes linear metadata with `None`.

`HybridLinearAttnBackend.init_forward_metadata()` calls every sub-backend. In this file, both `Mamba2AttnBackend.init_forward_metadata()` and `LightningAttentionBackend.init_forward_metadata()` immediately pass `metadata.query_start_loc` into `prepare_*`, so returning `None` here can still fail before layer dispatch ever skips the Mamba path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py` around lines 242 - 247, The branch for forward_batch.forward_mode.is_draft_extend_v2() sets query_start_loc = None which breaks downstream init_forward_metadata calls that immediately forward metadata.query_start_loc into prepare_*; change this to a valid integer (e.g., 0) instead of None so Mamba2AttnBackend.init_forward_metadata and LightningAttentionBackend.init_forward_metadata receive a usable metadata.query_start_loc during HybridLinearAttnBackend.init_forward_metadata execution (ensure the value is an int derived from the forward_batch or default to 0).

python/sglang/srt/mem_cache/session_aware_cache.py-20-27 (1)
20-27: ⚠️ Potential issue | 🟠 Major

Give `_VirtualNode` the tiny HiCache node interface, or stop returning it as `last_host_node`.

`match_prefix()` currently hands downstream HiCache code a sentinel with no `backuped`, `parent`, `get_last_hash_value()`, or `get_prefix_hash_values()`. With streaming sessions plus storage prefetch enabled, the first slot hit will crash with `AttributeError` instead of just skipping prefetch.

🩹 One lightweight way to make the sentinel safe
```diff
 class _VirtualNode:
     """Sentinel node for streaming session requests.
@@
-    pass
+    backuped = False
+    parent = None
+
+    def get_last_hash_value(self):
+        return None
+
+    def get_prefix_hash_values(self, _node):
+        return []
```

Also applies to: 177-181
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/sglang/srt/mem_cache/session_aware_cache.py` around lines 20 - 27, match_prefix() can return the sentinel _VirtualNode as last_host_node but that sentinel lacks the small HiCache node interface leading to AttributeError when downstream code expects attributes/methods like backuped, parent, get_last_hash_value(), and get_prefix_hash_values(); fix by giving _VirtualNode that tiny interface (add attributes backuped and parent and stub methods get_last_hash_value and get_prefix_hash_values that return safe defaults) or change match_prefix() to never return _VirtualNode as last_host_node; update the _VirtualNode class (and the other sentinel at lines ~177-181) to implement these minimal members so streaming-session code and prefetch can safely call them.

python/sglang/srt/managers/scheduler.py-2594-2606 (1)
2594-2606: ⚠️ Potential issue | 🟠 Major

Keep the fast-fail `Req` wired to the original worker and trace.

This fallback constructor drops `http_worker_ipc` and `time_stats`. In multi-http-worker mode the abort can lose its way home, and this path also disappears from the request-level timing/trace data.

🔧 Minimal fix
```diff
 req = Req(
     recv_req.rid,
     recv_req.input_text,
     recv_req.input_ids,
     recv_req.sampling_params,
     vocab_size=self.model_config.vocab_size,
+    http_worker_ipc=recv_req.http_worker_ipc,
+    time_stats=recv_req.time_stats,
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/sglang/srt/managers/scheduler.py` around lines 2594 - 2606, The fast-fail Req being constructed drops the original request's worker/trace info so aborts and timing are lost; when creating the fallback Req (the one using recv_req.rid/... and calling set_finish_with_abort, init_req_max_new_tokens, and _add_request_to_queue) copy over recv_req.http_worker_ipc and recv_req.time_stats (and any trace/worker fields used for routing/metrics) onto the new req so the abort will route back to the original worker and the request-level timing/trace data is preserved.

python/sglang/srt/mem_cache/session_aware_cache.py-68-75 (1)
68-75: ⚠️ Potential issue | 🟠 Major

Release the slot-held Mamba allocation when the session closes.

`save_from_req()` moves `mamba_pool_idx` out of the request, but `release_session()` only frees KV indices and the req-pool slot. On hybrid/Mamba models, closing enough streaming sessions will slowly exhaust the Mamba pool even though the sessions are gone.

Also applies to: 218-238
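A hedged sketch of the cleanup; the attribute name follows the comment, but the pool stub and `free()` API below are assumptions, not the real allocator:

```python
class MambaPoolStub:
    """Stand-in for the Mamba state pool allocator."""
    def __init__(self):
        self.free_slots = []
    def free(self, idx):
        self.free_slots.append(idx)

def release_session(session, mamba_pool):
    """After freeing KV indices and the req-pool slot (elided here),
    also return the session's Mamba slot to the pool so it cannot leak."""
    idx = getattr(session, "mamba_pool_idx", None)
    if idx is not None:
        mamba_pool.free(idx)
        session.mamba_pool_idx = None
```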
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/sglang/srt/mem_cache/session_aware_cache.py` around lines 68 - 75, the session leaves a held Mamba allocation behind because save_from_req() moves mamba_pool_idx out of the request but release_session() only clears req_pool_idx and KV indices; update release_session() to also free the mamba slot when present by checking session.mamba_pool_idx (and any related fields like mamba_ping_pong_track_buffer, mamba_next_track_idx, mamba_last_track_seqlen, mamba_branching_seqlen), return that index to the Mamba pool allocator and set session.mamba_pool_idx (and the corresponding req.mamba_* fields if applicable) to None/cleared; apply the same cleanup logic to the other release branch referenced around the 218-238 area so hybrid/Mamba sessions do not leak pool slots.

python/sglang/srt/managers/session_controller.py-116-133 (1)
116-133: ⚠️ Potential issue | 🟠 Major

Don't rotate a streaming session onto a still-running request.

The streaming path reuses the previous request without checking `last_req.finished()`. If another turn sneaks in while the first one is still decoding, the new request is built from partial `output_ids` and the old request loses its session link mid-flight, like changing train tracks while the cars are still moving.

🧭 Suggested guard
```diff
 elif self.req_nodes:
     assert len(self.req_nodes) == 1
-    _, last_req_node = self.req_nodes.popitem()
+    last_rid, last_req_node = next(iter(self.req_nodes.items()))
     last_req = last_req_node.req
+    if not last_req.finished():
+        abort = True
+        abort_message = (
+            "Streaming session request is appending to a request that hasn't finished."
+        )
+        logging.warning(abort_message)
+    else:
+        self.req_nodes.pop(last_rid)
```

Also applies to: 225-233
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/sglang/srt/managers/session_controller.py` around lines 116 - 133, When reusing the previous request in the streaming path, guard against rotating onto a still-running request: in the block where self.streaming and self.req_nodes is inspected (the code that pops last_req_node and sets last_req), first peek the last_req_node.req and call its finished() (or equivalent) — if finished() is False then do not pop or reuse it; instead set abort = True and abort_message = "Cannot rotate streaming session onto a still-running request." Apply the same guard in the other similar block referenced (the block around lines 225-233) so both places check last_req.finished() before reusing or popping the previous request.
🟡 Minor comments (7)
python/sglang/test/run_eval.py-101-104 (1)
101-104: ⚠️ Potential issue | 🟡 Minor

Fix: Change the `--dataset-path` default to `None` to allow per-eval defaults.

The GPQA fallback pattern looks defensive, but there's a logical trap. Since `--dataset-path` has a default value of `"THUDM/LongBench-v2"` (line 270), the command `python3 -m sglang.test.run_eval --eval-name gpqa ...` will set `args.dataset_path` to the LongBench-v2 default, making it truthy. The `or` fallback to the GPQA CSV URL never triggers; it's like setting up a safety net under the wrong apparatus. GPQAEval then receives the wrong dataset and fails on `pandas.read_csv()`.

The fix is simple: change the `--dataset-path` default to `None` so each evaluator applies its own appropriate default. This is safe because `LongBenchV2Eval` has internal fallback logic (`if not data_source: data_source = DEFAULT_DATASET`), so passing `None` still works correctly for LongBench v2.

Proposed fix: Change default to None
```diff
 parser.add_argument(
     "--dataset-path",
     type=str,
-    default="THUDM/LongBench-v2",
+    default=None,
     help="Path to dataset file or HuggingFace dataset name for LongBench-v2",
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/sglang/test/run_eval.py` around lines 101 - 104, change the CLI default for --dataset-path from the string "THUDM/LongBench-v2" to None so per-evaluator defaults can apply; update the argument definition that creates args.dataset_path (the add_argument call for "dataset_path") to use default=None, ensuring GPQAEval(filename, ...) sees the intended None/fallback path and LongBenchV2Eval still handles None via its internal fallback logic.

docs/superpowers/plans/2026-03-31-phase-04-mamba-cache-architecture-reconcile.md-477-484 (1)
477-484: ⚠️ Potential issue | 🟡 Minor

Remove the first import check; `python.sglang` is not a valid import path in this repo.

The first import (`import python.sglang...`) is like checking if a door exists by walking into a wall: it never works and masks whether the real door (the sys.path pattern) actually opens. Since the package root is `sglang` (per `pyproject.toml`), only the second import pattern with `sys.path.insert(0, 'python')` is valid. The codebase uses this pattern consistently everywhere, so simplify the script to rely on the one path that actually works.

🔧 Suggested adjustment
```diff
-python -c "
-import python.sglang.srt.snapshot.mamba_snapshot as ms
-print('mamba_snapshot imports OK')
-" 2>&1 || python -c "
+python -c "
 import sys; sys.path.insert(0, 'python')
 from sglang.srt.snapshot.mamba_snapshot import MambaSnapshotManager
 print('MambaSnapshotManager importable')
 " 2>&1 || echo "Import check deferred to runtime validation"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/superpowers/plans/2026-03-31-phase-04-mamba-cache-architecture-reconcile.md` around lines 477 - 484, remove the invalid import check that tries to import python.sglang.srt.snapshot.mamba_snapshot and instead keep the working pattern that prepends 'python' to sys.path and imports MambaSnapshotManager from sglang.srt.snapshot.mamba_snapshot; specifically, delete the first python -c block (the "import python.sglang..." check) and rely on the second block that does sys.path.insert(0, 'python') followed by from sglang.srt.snapshot.mamba_snapshot import MambaSnapshotManager so the import validation uses the correct package root and matches other scripts.

python/sglang/srt/mem_cache/base_prefix_cache.py-132-136 (1)
132-136: ⚠️ Potential issue | 🟡 Minor

Keep `host_hit_length` as a count, not a sentinel.

Using `max(kv_host_tokens, 1-if-mamba-on-host)` turns this field into both an odometer and a warning light. Downstream accounting still treats it as a literal token count, so a state-only host restore will be reported as a one-token host hit unless every consumer special-cases the sentinel.
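A hedged sketch of the count-plus-flag shape this finding points toward; only `host_hit_length` and `kv_host_tokens` come from the review text, and the result class and flag name are illustrative:

```python
# Keep host_hit_length as a literal token count and carry the
# state-only (mamba-on-host, zero KV tokens) case in a separate flag,
# instead of overloading the count with a max(..., 1) sentinel.
from dataclasses import dataclass


@dataclass
class MatchResultSketch:
    host_hit_length: int            # always a pure token count
    host_state_only: bool = False   # True for a state-only host restore


def make_result(kv_host_tokens, mamba_on_host):
    # Report the real count; signal the state-only restore separately.
    return MatchResultSketch(
        host_hit_length=kv_host_tokens,
        host_state_only=(kv_host_tokens == 0 and mamba_on_host),
    )


r = make_result(kv_host_tokens=0, mamba_on_host=True)
assert r.host_hit_length == 0      # accounting stays literal
assert r.host_state_only is True   # restore is still signalled
```

Consumers that only do token accounting can keep reading `host_hit_length` unchanged; only the restore path needs to look at the new flag.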
Verify each finding against the current code and only fix it if needed. In `@python/sglang/srt/mem_cache/base_prefix_cache.py` around lines 132 - 136, the host_hit_length field is being used as both a token count and a sentinel via max(kv_host_tokens, 1-if-mamba-on-host); change it to always carry the literal token count (i.e., set host_hit_length = kv_host_tokens) and stop injecting the 1-as-sentinel behavior. Introduce or set a separate boolean/enum flag (e.g., host_state_only or mamba_host_restore) to indicate a state-only/mamba-on-host restore when applicable (use the existing mamba-on-host condition to populate that flag). Update the code paths that compute and consume host_hit_length (refer to host_hit_length and kv_host_tokens in this module) and adjust any docs/comments to state host_hit_length is a pure count and the new flag indicates state-only restores.

python/sglang/srt/managers/io_struct.py-1642-1643 (1)
1642-1643: ⚠️ Potential issue | 🟡 Minor

Validate `timeout` to reject non-positive values.

`timeout` currently accepts negative (or zero) values, which can cause confusing immediate-timeout behavior at session open.

Suggested fix
```diff
 @dataclass
 class OpenSessionReqInput(BaseReq):
     capacity_of_str_len: int
     session_id: Optional[str] = None
     streaming: bool = False
     timeout: Optional[float] = None
+
+    def __post_init__(self):
+        if self.timeout is not None and self.timeout <= 0:
+            raise ValueError("timeout must be > 0 when provided.")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/sglang/srt/managers/io_struct.py` around lines 1642 - 1643, validate the timeout attribute so it cannot be zero or negative: in the class or dataclass that defines the streaming and timeout fields (the object containing attributes streaming: bool and timeout: Optional[float]), add a check in the constructor/__post_init__ (or the setter used for timeout) that raises a ValueError if timeout is not None and timeout <= 0, with a clear message like "timeout must be positive or None"; ensure callers that pass timeout get the same validation behavior when constructing or modifying the instance.

python/sglang/srt/server_args.py-2096-2110 (1)
2096-2110: ⚠️ Potential issue | 🟡 Minor

Reject unknown `mamba_backend` values in `_handle_mamba_backend`.

`argparse` guards the CLI path, but direct `ServerArgs(...)` callers can still pass arbitrary strings, and this method currently treats any non-`flashinfer` value as valid. That lets a typo slip past the front desk and blow up later in backend dispatch.

Suggested guard
```diff
 def _handle_mamba_backend(self):
+    if self.mamba_backend not in MAMBA_BACKEND_CHOICES:
+        raise ValueError(
+            f"Invalid mamba backend: {self.mamba_backend}. "
+            f"Expected one of {MAMBA_BACKEND_CHOICES}."
+        )
+
     if self.mamba_backend == "flashinfer":
         if is_flashinfer_available():
             try:
                 import flashinfer.mamba  # noqa: F401
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/sglang/srt/server_args.py` around lines 2096 - 2110, the _handle_mamba_backend method currently only handles the "flashinfer" case and silently accepts any other mamba_backend value; update it to explicitly reject unknown values by validating self.mamba_backend against allowed backends (e.g., "flashinfer") and raising a ValueError with a clear message when the value is not recognized. Locate the _handle_mamba_backend method and add an else branch (or an initial validation) that checks self.mamba_backend and raises ValueError("Unknown mamba_backend: {self.mamba_backend}. Supported backends: flashinfer") so direct ServerArgs(...) callers cannot pass arbitrary strings.

python/sglang/srt/layers/attention/mamba/ops/ssu_dispatch.py-21-42 (1)
21-42: ⚠️ Potential issue | 🟡 Minor

Move the abstract `__call__` body onto its own line.

`def __call__(...) -> None: ...` is the E704 flake8 error here, so this file will keep pre-commit red until the body is rewritten as a normal block.

✏️ Lint-safe form
```diff
 @abstractmethod
 def __call__(
     self,
     state: torch.Tensor,
     x: torch.Tensor,
@@
     cache_steps: int | None = None,
     retrieve_parent_token: torch.Tensor | None = None,
     intermediate_state_indices: torch.Tensor | None = None,
-) -> None: ...
+) -> None:
+    pass
```

As per coding guidelines, "Lint code using ruff via pre-commit hooks".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/sglang/srt/layers/attention/mamba/ops/ssu_dispatch.py` around lines 21 - 42, the abstractmethod-decorated __call__ signature currently ends with a trailing ellipsis, causing E704; change the stub from a single-line "def __call__(...) -> None: ..." to a proper block for the abstractmethod (e.g., place the colon at the end of the signature line and add an indented body on the next line) and use a suitable no-op or explicit abstract implementation such as "raise NotImplementedError" (or "pass") inside the body so the abstract __call__ in this class is lint-compliant; look for the __call__ method definition with the `@abstractmethod` decorator to apply this change.

python/sglang/srt/mem_cache/memory_pool.py-551-557 (1)
551-557: ⚠️ Potential issue | 🟡 Minor

F-strings missing placeholders — should be regular strings.

Static analysis flagged these f-strings as having no interpolation. They should be plain strings to avoid confusion.
```diff
 assert len(select_index) == len(
     mamba_indices
-), f"Not enough space for mamba cache, try to increase --mamba-full-memory-ratio or --max-mamba-cache-size."
+), "Not enough space for mamba cache, try to increase --mamba-full-memory-ratio or --max-mamba-cache-size."
 if self.enable_mamba_extra_buffer:
     assert len(select_index) == len(
         mamba_ping_pong_track_buffers
-    ), f"Not enough space for mamba ping pong idx, try to increase --mamba-full-memory-ratio."
+    ), "Not enough space for mamba ping pong idx, try to increase --mamba-full-memory-ratio."
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/sglang/srt/mem_cache/memory_pool.py` around lines 551 - 557, Replace the two assert messages that are written as f-strings but contain no interpolation with plain string literals; locate the assertions in memory_pool.py where select_index, mamba_indices, and (when enable_mamba_extra_buffer is True) mamba_ping_pong_track_buffers are compared and change the f"... " messages to regular strings like "Not enough space for mamba cache, try to increase --mamba-full-memory-ratio or --max-mamba-cache-size." and "Not enough space for mamba ping pong idx, try to increase --mamba-full-memory-ratio." to avoid misleading f-strings in the code paths using select_index, mamba_indices, and enable_mamba_extra_buffer.
🧹 Nitpick comments (5)
.gitignore (1)
241-242: Nice organizational touch!

The new "# Other repos" section clearly groups external repository ignores.

If `lmms-eval` refers specifically to a cloned directory (which seems likely), consider using `lmms-eval/` with a trailing slash for precision: think of it like a targeted spell versus an area-of-effect one. Without the slash, it'll match both files and directories named `lmms-eval`, which might be broader than needed.

📁 Optional refinement for directory-specific pattern
```diff
 # Other repos
-lmms-eval
+lmms-eval/
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.gitignore around lines 241 - 242, change the .gitignore entry so it matches only the directory by updating the existing "lmms-eval" pattern to a directory-specific pattern; locate the "lmms-eval" line under the "# Other repos" section and replace it with a trailing-slash pattern (use "lmms-eval/") so only the cloned directory is ignored, not files that might share the same name.

test/registered/models/test_nvidia_nemotron_3_nano.py (1)
29-39: Nice coverage target; please land it in the phase-based pytest plan instead of another `unittest` class.

This grows the legacy registered matrix again, so the flashinfer coverage lives in a parallel lane rather than the repo's main 9-phase flow.

As per coding guidelines, `test/**/*.py`: Use pytest for testing with the 9-phase test plan structure in test/phases/.
Verify each finding against the current code and only fix it if needed. In `@test/registered/models/test_nvidia_nemotron_3_nano.py` around lines 29 - 39, the test class TestNvidiaNemotron3Nano30BBF16FlashInfer should be converted from a unittest-style registered matrix class into a pytest phase-based test under the repo's 9-phase plan: remove the class and instead implement a pytest test in the phases structure that invokes the same model ("nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"), model_config_name ("lm_eval_configs/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16.yaml"), other_args (including "--tp-size","2","--mamba-backend","flashinfer" plus NEMOTRON_3_NANO_THINKING_ARGS) and integrates with the existing phase fixtures/markers used by the test/phases flow so the flashinfer coverage runs in the main 9-phase pipeline rather than as a parallel unittest class.

test/registered/4-gpu-models/test_qwen3_next_models_mtp.py (1)
215-297: Useful spec-v2 coverage; please land it in `test/phases/` with pytest.

Adding another `unittest` suite here keeps the new validation outside the repo's phase plan, which makes the test story harder to read and maintain.

As per coding guidelines, `test/**/*.py`: Use pytest for testing with the 9-phase test plan structure in test/phases/.
Verify each finding against the current code and only fix it if needed. In `@test/registered/4-gpu-models/test_qwen3_next_models_mtp.py` around lines 215 - 297, the TestQwen3NextMTPV2 unittest class should be converted into a pytest test module under test/phases/ to follow the 9-phase test plan; replace the class-based setUpClass/tearDownClass (envs.SGLANG_ENABLE_SPEC_V2 toggling and popen_launch_server/kill_process_tree usage) with pytest fixtures that launch the server (popen_launch_server) and yield the base_url/process, and rewrite test_gsm8k as a plain pytest test function that calls run_eval and asserts against ACC_THRESHOLDS[self.model]["gsm8k"]; ensure helper names (popen_launch_server, kill_process_tree, run_eval, ACC_THRESHOLDS, QWEN3_NEXT_MODEL, envs.SGLANG_ENABLE_SPEC_V2) are imported or referenced the same way and place the file in the appropriate phase slot so it integrates with the repo phase plan.

python/sglang/srt/mem_cache/hi_mamba_radix_cache.py (1)
2079-2086: Potential issue: `_alloc_with_evict` could return None when indexing `[0]`.

At line 2086, `_alloc_with_evict(...)[0]` assumes the allocation succeeded. However, `_alloc_with_evict` returns `None` if allocation fails even after eviction. Although `error_message` is set, which would raise `RuntimeError`, the exception path might not be the intended behavior here.

Consider explicitly checking for `None` before indexing, or ensuring the exception is the desired outcome:

```python
indices = self._alloc_with_evict(
    self.req_to_token_pool.mamba_pool,
    len(last_hit_node.mamba_host_value),
    self.evict_mamba,
    lock_node=last_hit_node,
    error_message="Cannot alloc request mamba cache for host load back",
)
# indices is guaranteed non-None due to error_message raising RuntimeError
req.mamba_pool_idx = indices[0]
```

The current code works correctly because `error_message` is provided, triggering the exception. But the inline `[0]` makes it easy to miss this dependency during code review.
Verify each finding against the current code and only fix it if needed. In `@python/sglang/srt/mem_cache/hi_mamba_radix_cache.py` around lines 2079 - 2086, the code currently indexes the result of _alloc_with_evict(...) directly with [0], which can be None on failure; change it to first capture the return value from _alloc_with_evict (called with self.req_to_token_pool.mamba_pool, len(last_hit_node.mamba_host_value), self.evict_mamba, lock_node=last_hit_node, error_message="Cannot alloc request mamba cache for host load back") into a local variable, then either assert/raise if that variable is None or rely on the provided error_message to raise and only then set req.mamba_pool_idx = indices[0]; this makes the dependency on _alloc_with_evict's exception behavior explicit and avoids an accidental None indexing bug.

python/sglang/srt/mem_cache/hicache_storage.py (1)
493-517: Consider the edge case when `page_size` is not defined on `host_pool`.

Line 498 uses `getattr(host_pool, "page_size", 1) or 1`, which handles missing/None/0 cases well. However, line 514 assumes `host_indices` is indexable by `i * page_size`. If `page_size > 1` but `host_indices` doesn't have the expected length (despite the check at 502), accessing `host_indices[i * page_size]` could still be out of bounds for the last few entries when `len(keys) * page_size != host_indices.numel()`.

The validation at 502-510 should catch this, but the error message could be clearer about what "expected" vs "got" means in terms of page alignment.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: db6f4581-f1c2-4242-9a7b-7a059a7d256f
📒 Files selected for processing (60)

- .beads
- .gitignore
- .remember/tmp/save-session.pid
- CLAUDE.md
- benchmark/hicache/bench_multiturn.py
- docs/superpowers/plans/2026-03-31-phase-04-mamba-cache-architecture-reconcile.md
- python/sglang/cli/main.py
- python/sglang/cli/serve.py
- python/sglang/cli/utils.py
- python/sglang/launch_server.py
- python/sglang/multimodal_gen/runtime/platforms/__init__.py
- python/sglang/srt/batch_overlap/two_batch_overlap.py
- python/sglang/srt/compilation/backend.py
- python/sglang/srt/compilation/compile.py
- python/sglang/srt/connector/__init__.py
- python/sglang/srt/disaggregation/ascend/conn.py
- python/sglang/srt/disaggregation/decode.py
- python/sglang/srt/disaggregation/mooncake/conn.py
- python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py
- python/sglang/srt/layers/attention/mamba/causal_conv1d.py
- python/sglang/srt/layers/attention/mamba/mamba_state_scatter_triton.py
- python/sglang/srt/layers/attention/mamba/ops/__init__.py
- python/sglang/srt/layers/attention/mamba/ops/ssd_combined.py
- python/sglang/srt/layers/attention/mamba/ops/ssu_dispatch.py
- python/sglang/srt/managers/cache_controller.py
- python/sglang/srt/managers/io_struct.py
- python/sglang/srt/managers/schedule_batch.py
- python/sglang/srt/managers/schedule_policy.py
- python/sglang/srt/managers/scheduler.py
- python/sglang/srt/managers/scheduler_output_processor_mixin.py
- python/sglang/srt/managers/session_controller.py
- python/sglang/srt/managers/tokenizer_communicator_mixin.py
- python/sglang/srt/mem_cache/base_prefix_cache.py
- python/sglang/srt/mem_cache/hi_mamba_radix_cache.py
- python/sglang/srt/mem_cache/hicache_storage.py
- python/sglang/srt/mem_cache/hybrid_cache/hybrid_cache_controller.py
- python/sglang/srt/mem_cache/mamba_radix_cache.py
- python/sglang/srt/mem_cache/memory_pool.py
- python/sglang/srt/mem_cache/memory_pool_host.py
- python/sglang/srt/mem_cache/session_aware_cache.py
- python/sglang/srt/model_executor/forward_batch_info.py
- python/sglang/srt/model_executor/model_runner.py
- python/sglang/srt/model_executor/model_runner_kv_cache_mixin.py
- python/sglang/srt/models/nemotron_h.py
- python/sglang/srt/observability/scheduler_metrics_mixin.py
- python/sglang/srt/server_args.py
- python/sglang/srt/speculative/eagle_info.py
- python/sglang/srt/speculative/eagle_info_v2.py
- python/sglang/srt/speculative/eagle_worker.py
- python/sglang/srt/speculative/eagle_worker_v2.py
- python/sglang/srt/utils/common.py
- python/sglang/test/run_eval.py
- test/registered/4-gpu-models/test_qwen3_next_models_mtp.py
- test/registered/layers/mamba/conftest.py
- test/registered/layers/mamba/test_mamba_ssm_ssd.py
- test/registered/models/test_nvidia_nemotron_3_nano.py
- test/registered/radix_cache/test_mamba_radix_cache_comprehensive.py
- test/registered/radix_cache/test_mamba_unittest.py
- test/run_suite.py
- test/unit/test_mamba_state_scatter_triton.py
💤 Files with no reviewable changes (1)
- python/sglang/srt/compilation/compile.py
```diff
@@ -0,0 +1 @@
+215077
```
Remove this temporary PID file from version control.
This appears to be a process ID file from a development session—think of it like accidentally committing your laptop's Bluetooth pairing code. The number 215077 is a process ID that was valid on your machine at one moment in time, but it's meaningless on other machines and becomes stale the instant that process exits.
Files in .remember/tmp/ with .pid extensions are ephemeral runtime state and should never be committed to the repository. They'll cause unnecessary churn, potential merge conflicts, and clutter the commit history.
🧹 Recommended fix
Step 1: Remove the file from this commit:
```shell
git rm .remember/tmp/save-session.pid
```

Step 2: Add this pattern to .gitignore to prevent similar files from being committed in the future:

```diff
+# Runtime state files
+.remember/tmp/
```
Verify each finding against the current code and only fix it if needed.
In @.remember/tmp/save-session.pid at line 1, Remove the committed ephemeral PID
file .remember/tmp/save-session.pid from the repository and stop tracking it:
run git rm on that file (to remove it from the index) and commit the removal;
then add an ignore rule to .gitignore (e.g., ignore .remember/tmp/*.pid or
.remember/tmp/) so future PID/runtime files are not tracked, and commit the
updated .gitignore. Ensure you do not alter other files and that the removal
commit message clearly states the file was removed because it is ephemeral.
Good catch — this is an accidental commit from a dev session. Removing from tracking and adding .remember/tmp/ to .gitignore.
@KHAEntertainment, thanks for confirming! Removing it with git rm --cached .remember/tmp/save-session.pid (to drop it from the index without deleting the file on disk, if you still need it locally) and adding .remember/tmp/ to .gitignore is exactly the right move. That pattern will catch any future .pid or other ephemeral runtime files that land in that directory. 🎉
🧠 Learnings used
Learnt from: KHAEntertainment
Repo: KHAEntertainment/sglang-mamba PR: 12
File: .gitignore:283-284
Timestamp: 2026-03-31T17:46:59.790Z
Learning: In the KHAEntertainment/sglang-mamba repository, the `*.db` pattern in `.gitignore` is intentional and project-wide. It targets only Beads/Dolt artifacts generated by `bd init`. There are no DB fixture files in this project that need to be tracked, so the broad glob is acceptable. Do not flag this as overly broad in future reviews unless DB fixtures are introduced.
Learnt from: KHAEntertainment
Repo: KHAEntertainment/sglang-mamba PR: 12
File: python/sglang/srt/managers/scheduler.py:1446-1479
Timestamp: 2026-03-31T20:56:26.697Z
Learning: In `python/sglang/srt/managers/scheduler.py`, the `handle_restore_snapshot` method's `create_new_request=True` code path is an intentional stub in the KHAEntertainment/sglang-mamba fork. The full lifecycle (rid_to_state registration, http_worker_ipc plumbing, queue-acceptance feedback, and bind/continue RPC) is planned to be addressed in PR 5 (phase-08 snapshot re-port + validation). Do not flag the missing owner registration or fire-and-forget _add_request_to_queue call as bugs requiring a fix in earlier PRs.
If you found this review helpful, would you consider giving us a shout-out on X?
Thank you for using CodeRabbit!
```python
# Barrier logic: release next round when all clients for
# current barrier round have completed
if (
    self.enable_round_barrier
    and current_barrier_round < self.max_rounds
):
    barrier_round_completed += 1
    expected = self.clients_per_round[current_barrier_round]
    if barrier_round_completed == expected:
        print(
            f"\n Barrier: round {current_barrier_round} complete "
            f"({expected} clients), releasing {len(next_round_reqs)} "
            f"requests for round {current_barrier_round + 1}"
        )
        self._send_heartbeat(input_len=100, output_len=100)
        time.sleep(10)
        for req in next_round_reqs:
            self.ready_queue.append(req)
        next_round_reqs = []
        current_barrier_round += 1
        barrier_round_completed = 0
```
Barrier mode currently crashes and still leaks the next round early.
Lines 371-386 use current_barrier_round / barrier_round_completed before assignment, and self.max_rounds / self.clients_per_round are never initialized. Even after fixing that, Line 360 already drains next_round_reqs, so this block does not actually own the gate — it’s an airlock with one door still open.
🛠️ Proposed fix
def response_handler(self):
next_round_reqs = []
+ barrier_round_completed = 0
while True:
try:
client_id, response = self.response_queue.get(
timeout=10
) # Block until response is available
@@
if self.enable_round_barrier:
next_round_reqs.append(new_req)
- if len(next_round_reqs) == self.num_clients:
- for req in next_round_reqs:
- self.ready_queue.append(req)
- next_round_reqs = []
else:
self.ready_queue.append(new_req)
# Barrier logic: release next round when all clients for
# current barrier round have completed
if (
self.enable_round_barrier
- and current_barrier_round < self.max_rounds
+ and current_round < self.num_rounds - 1
):
barrier_round_completed += 1
- expected = self.clients_per_round[current_barrier_round]
+ expected = self.num_clients
if barrier_round_completed == expected:
print(
- f"\n Barrier: round {current_barrier_round} complete "
+ f"\n Barrier: round {current_round} complete "
f"({expected} clients), releasing {len(next_round_reqs)} "
- f"requests for round {current_barrier_round + 1}"
+ f"requests for round {current_round + 1}"
)
self._send_heartbeat(input_len=100, output_len=100)
time.sleep(10)
for req in next_round_reqs:
self.ready_queue.append(req)
next_round_reqs = []
- current_barrier_round += 1
barrier_round_completed = 0🧰 Tools
🪛 Flake8 (7.3.0)
[error] 371-371: undefined name 'current_barrier_round'
(F821)
[error] 373-373: undefined name 'barrier_round_completed'
(F821)
[error] 374-374: undefined name 'current_barrier_round'
(F821)
[error] 377-377: undefined name 'current_barrier_round'
(F821)
[error] 379-379: undefined name 'current_barrier_round'
(F821)
[error] 386-386: undefined name 'current_barrier_round'
(F821)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@benchmark/hicache/bench_multiturn.py` around lines 367 - 387, The barrier
branch uses and compares variables that may be uninitialized
(current_barrier_round, barrier_round_completed, self.max_rounds,
self.clients_per_round) and also loses control because next_round_reqs is
drained earlier; fix by initializing barrier state in the class (e.g., set
self.max_rounds and self.clients_per_round when barrier mode is enabled and
initialize current_barrier_round and barrier_round_completed to 0), ensure this
block only reads/writes those instance vars (not local undefined names), and
change the earlier logic that currently moves items from next_round_reqs into
ready_queue on line ~360 so that when self.enable_round_barrier is True items
are appended to next_round_reqs (not ready_queue) until the barrier condition in
this block (compare barrier_round_completed to
self.clients_per_round[current_barrier_round]) is met; keep the release sequence
here (print, _send_heartbeat, sleep, extend ready_queue from next_round_reqs,
clear next_round_reqs, increment current_barrier_round, reset
barrier_round_completed).
Valid — barrier mode has several undefined variable references. Fixing: initializing barrier_round_completed, replacing current_barrier_round with the correct local variable, and fixing self.max_rounds/self.clients_per_round references.
@KHAEntertainment, glad the analysis was helpful! The key fixes to land are:

- Initialize `barrier_round_completed = 0` at the top of `response_handler` (before the `while True` loop).
- Replace `current_barrier_round` with the already-available local `current_round` variable (or introduce a proper `self.current_barrier_round` instance field initialized in `__init__`).
- Replace `self.max_rounds` → `self.num_rounds` and `self.clients_per_round[...]` → `self.num_clients` (or initialize `self.clients_per_round`/`self.max_rounds` in `__init__` when `enable_round_barrier` is true).
- Make sure the earlier `next_round_reqs` drain (the `if len(next_round_reqs) == self.num_clients` block) is removed or gated so the barrier block at Lines 370–386 is the sole gatekeeper; otherwise the gate is bypassed on every full round.
Once those are in place the barrier should work correctly end-to-end. 🎯
🧠 Learnings used
Learnt from: KHAEntertainment
Repo: KHAEntertainment/sglang-mamba PR: 12
File: python/sglang/srt/managers/scheduler.py:1446-1479
Timestamp: 2026-03-31T20:56:26.697Z
Learning: In `python/sglang/srt/managers/scheduler.py`, the `handle_restore_snapshot` method's `create_new_request=True` code path is an intentional stub in the KHAEntertainment/sglang-mamba fork. The full lifecycle (rid_to_state registration, http_worker_ipc plumbing, queue-acceptance feedback, and bind/continue RPC) is planned to be addressed in PR 5 (phase-08 snapshot re-port + validation). Do not flag the missing owner registration or fire-and-forget _add_request_to_queue call as bugs requiring a fix in earlier PRs.
```python
mem_pool_device = token_to_kv_pool_allocator.get_kvcache()
from sglang.srt.mem_cache.memory_pool import HybridLinearKVPool

if isinstance(mem_pool_device, HybridLinearKVPool):
    mem_pool_device = mem_pool_device.full_kv_pool
self.mem_pool_device = mem_pool_device
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
set -euo pipefail
echo "== Inspect HybridLinearKVPool transfer-counter ownership =="
fd '^memory_pool\.py$' -x rg -n -C3 'class HybridLinearKVPool|def register_layer_transfer_counter|full_kv_pool|layer_transfer_counter' {}
echo
echo "== Inspect HiCacheController use sites =="
fd '^cache_controller\.py$' -x rg -n -C2 'get_kvcache|full_kv_pool|register_layer_transfer_counter' {}
```

Repository: KHAEntertainment/sglang-mamba
Length of output: 9755
Unwrapping HybridLinearKVPool before registering the transfer counter bypasses the hybrid wrapper's synchronization logic.
The pool is unwrapped to full_kv_pool at lines 270–271, then register_layer_transfer_counter is called on the unwrapped pool at line 296. This means the HybridLinearKVPool wrapper instance never receives the counter.
The design intent is clear in HybridLinearKVPool.register_layer_transfer_counter (memory_pool.py, lines 1329–1335): the comment states "The layer-wise wait logic is executed at the Hybrid LinearPool level; no additional wait is needed in the full_kv_pool." The wrapper's _wait_for_layer method checks self.layer_transfer_counter before synchronizing, but that field remains None because registration happens only on the inner pool. Result: the wrapper never waits, even though that's where the wait logic is supposed to live.
Move the registration to happen on the original pool before unwrapping, or store the wrapper reference separately for synchronization.
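A minimal sketch of the suggested ordering, using stand-in classes: the real `HiCacheController` and `HybridLinearKVPool` live in sglang's cache_controller.py and memory_pool.py and are only mimicked here to show the register-before-unwrap shape.

```python
class FullKVPool:
    """Stand-in for the inner full KV pool."""
    def __init__(self):
        self.layer_transfer_counter = None

    def register_layer_transfer_counter(self, counter):
        self.layer_transfer_counter = counter


class HybridLinearKVPool:
    """Stand-in wrapper; the layer-wise wait logic lives at this level."""
    def __init__(self, full_kv_pool):
        self.full_kv_pool = full_kv_pool
        self.layer_transfer_counter = None

    def register_layer_transfer_counter(self, counter):
        # _wait_for_layer checks this field, so it must be set on the wrapper.
        self.layer_transfer_counter = counter


def init_mem_pool_device(pool, counter):
    # Register on the wrapper BEFORE unwrapping, so the wrapper's
    # _wait_for_layer sees a non-None counter.
    pool.register_layer_transfer_counter(counter)
    if isinstance(pool, HybridLinearKVPool):
        pool = pool.full_kv_pool
    return pool
```

After `init_mem_pool_device`, the wrapper holds the counter while the returned inner pool is what gets assigned to `self.mem_pool_device`, which is the separation the comment above argues for.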
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@python/sglang/srt/managers/cache_controller.py` around lines 267 - 272, The
code unwraps HybridLinearKVPool to its full_kv_pool and then registers the
transfer counter on the inner pool, which bypasses the HybridLinearKVPool's
synchronization; instead, ensure register_layer_transfer_counter is called on
the HybridLinearKVPool wrapper (the original mem_pool_device) before you replace
mem_pool_device with mem_pool_device.full_kv_pool, or keep a reference to the
wrapper and call wrapper.register_layer_transfer_counter(...) so that
HybridLinearKVPool._wait_for_layer sees the layer_transfer_counter; update the
logic around mem_pool_device, HybridLinearKVPool, full_kv_pool, and
register_layer_transfer_counter accordingly.
Not addressing — this is upstream HybridCacheController.__init__ code. The HybridLinearKVPool unwrapping and register_layer_transfer_counter call ordering is upstream's design from sgl-project/sglang. Modifying it here would create fork divergence.
@KHAEntertainment, understood — noted that the HybridLinearKVPool unwrapping and register_layer_transfer_counter call ordering in HiCacheController.__init__ is upstream design from sgl-project/sglang and should not be diverged from in the fork. I'll remember this for future reviews.
✏️ Learnings added
Learnt from: KHAEntertainment
Repo: KHAEntertainment/sglang-mamba PR: 13
File: python/sglang/srt/managers/cache_controller.py:267-272
Timestamp: 2026-03-31T22:17:57.025Z
Learning: In `python/sglang/srt/managers/cache_controller.py`, the `HiCacheController.__init__` method's pattern of unwrapping `HybridLinearKVPool` to `full_kv_pool` before assigning `self.mem_pool_device`, and subsequently calling `register_layer_transfer_counter` on the inner pool (not on the `HybridLinearKVPool` wrapper), is intentional upstream design from `sgl-project/sglang`. Do NOT flag the registration ordering or suggest calling `register_layer_transfer_counter` on the wrapper — diverging from upstream is undesirable for the KHAEntertainment/sglang-mamba sync series.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: CR
Repo: KHAEntertainment/sglang-mamba PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T17:06:58.750Z
Learning: Applies to python/sglang/srt/mem_cache/memory_pool.py : Implement HybridReqToTokenPool, MambaPool, and HybridLinearKVPool classes for unified memory pool management across request and token indices
Learnt from: KHAEntertainment
Repo: KHAEntertainment/sglang-mamba PR: 12
File: python/sglang/srt/observability/trace.py:160-163
Timestamp: 2026-03-31T18:57:52.401Z
Learning: In `python/sglang/srt/observability/trace.py`, the `process_tracing_init` function contains an unused `global get_cur_time_ns` declaration (around line 162). This is intentional upstream code from `sgl-project/sglang` (observability cherry-pick). Do NOT flag this as a lint issue or suggest removing it — diverging from upstream is undesirable for the KHAEntertainment/sglang-mamba sync series.
Learnt from: KHAEntertainment
Repo: KHAEntertainment/sglang-mamba PR: 12
File: test/manual/test_tracing.py:72-73
Timestamp: 2026-03-31T18:57:50.323Z
Learning: In the KHAEntertainment/sglang-mamba fork, the bare `except:` in `test/manual/test_tracing.py` (`__clear_trace_file` method, around line 72) is intentionally inherited from upstream `sgl-project/sglang` test code. Do not flag this as a bug requiring a fix in fork PRs; any fix should be submitted upstream to avoid fork divergence. The same rationale applies to other upstream-inherited test structures in this file.
Learnt from: KHAEntertainment
Repo: KHAEntertainment/sglang-mamba PR: 12
File: python/sglang/srt/observability/req_time_stats.py:586-618
Timestamp: 2026-03-31T17:47:54.215Z
Learning: In `python/sglang/srt/observability/req_time_stats.py`, the `SchedulerReqTimeStats.set_retract_time()` method intentionally clears only `last_forward_entry_time` (and related `last_*` fields) but leaves `forward_entry_time` unchanged. This is upstream bookkeeping design from `sgl-project/sglang` (observability cherry-pick commit 3b8930227). Do NOT flag the apparent timer-lap mismatch or suggest resetting `forward_entry_time` in this method — diverging from upstream is undesirable for the KHAEntertainment/sglang-mamba sync series.
Learnt from: KHAEntertainment
Repo: KHAEntertainment/sglang-mamba PR: 12
File: python/pyproject.toml:26-26
Timestamp: 2026-03-31T17:47:07.606Z
Learning: In the KHAEntertainment/sglang-mamba fork, the `decord`/`torchcodec` conditional import gap in `python/sglang/srt/utils/common.py` (functions `load_video` and `encode_video`) and `python/pyproject.toml` is intentionally inherited from upstream `sgl-project/sglang`. Do not flag this as a bug requiring a fix in fork PRs; any fix should be submitted to upstream to avoid fork divergence.
Learnt from: KHAEntertainment
Repo: KHAEntertainment/sglang-mamba PR: 12
File: python/sglang/srt/observability/req_time_stats.py:565-578
Timestamp: 2026-03-31T17:47:12.892Z
Learning: In `python/sglang/srt/observability/req_time_stats.py`, the `SchedulerReqTimeStats.__getstate__` intentionally returns `{}` when `enable_metrics` is `False`. This is upstream design from `sgl-project/sglang` (observability cherry-pick commit 3b8930227) and should NOT be flagged or changed in the KHAEntertainment/sglang-mamba fork, as diverging from upstream is undesirable for the sync series.
Learnt from: KHAEntertainment
Repo: KHAEntertainment/sglang-mamba PR: 12
File: python/sglang/srt/observability/trace.py:124-125
Timestamp: 2026-03-31T17:47:19.266Z
Learning: In `python/sglang/srt/observability/trace.py`, the `TraceCustomIdGenerator.generate_trace_id()` method intentionally uses `self.local_random.getrandbits(64)` (64-bit) to match upstream `sgl-project/sglang` implementation. Do not flag this as a bug — changing it would cause divergence from upstream. The repo `KHAEntertainment/sglang-mamba` syncs from upstream SGLang and deviates only for Mamba-specific architectural changes, not for general observability/tracing code.
- Remove accidental .remember/tmp/save-session.pid from tracking
- Add .remember/tmp/ to .gitignore
- Fix undefined variables in bench_multiturn.py barrier mode
- Add missing self.lora_path in WorkloadGenerator.__init__
- Zero padded tail in nemotron_h piecewise graph mode

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@python/sglang/srt/models/nemotron_h.py`:
- Around line 431-433: The code calls nemotron_mamba2_with_output(hidden_states,
output, self.layer_id) while in piecewise CUDA graphs the decoder absolute
self.layer_id must not be used to index the compact context.attention_layers;
instead compute or look up the dense piecewise-layer index (the offset into
context.attention_layers) and pass that to nemotron_mamba2_with_output (or
maintain a decoder-layer-id → attention-layer-offset map created by the model
runner) so the function receives the correct attention-layer index; update the
calls inside is_in_piecewise_cuda_graph() (and the other occurrence around lines
880-883) to translate self.layer_id into the compact attention index before
invoking nemotron_mamba2_with_output.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1dfb21b5-bb60-4d09-b02d-a459c9a2d739
📒 Files selected for processing (3)
- .gitignore
- benchmark/hicache/bench_multiturn.py
- python/sglang/srt/models/nemotron_h.py
🚧 Files skipped from review as they are similar to previous changes (1)
- .gitignore
```python
if is_in_piecewise_cuda_graph():
    output = torch.empty_like(hidden_states)
    nemotron_mamba2_with_output(hidden_states, output, self.layer_id)
```
Don't use the absolute decoder layer_id to index context.attention_layers.
python/sglang/srt/model_executor/model_runner.py:2173-2202 builds attention_layers as a compact list of only attention/Mamba-capable layers. In hybrid patterns, earlier MLP/MOE blocks compress that list, so passing self.layer_id here is like using seat numbers on a filtered passenger list: you can hit the wrong layer or blow up with IndexError. Please pass a dense piecewise-layer index instead, or keep a decoder-layer-id → attention-layer-offset map for the split op.
Also applies to: 880-883
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@python/sglang/srt/models/nemotron_h.py` around lines 431 - 433, The code
calls nemotron_mamba2_with_output(hidden_states, output, self.layer_id) while in
piecewise CUDA graphs the decoder absolute self.layer_id must not be used to
index the compact context.attention_layers; instead compute or look up the dense
piecewise-layer index (the offset into context.attention_layers) and pass that
to nemotron_mamba2_with_output (or maintain a decoder-layer-id →
attention-layer-offset map created by the model runner) so the function receives
the correct attention-layer index; update the calls inside
is_in_piecewise_cuda_graph() (and the other occurrence around lines 880-883) to
translate self.layer_id into the compact attention index before invoking
nemotron_mamba2_with_output.
Acknowledged — investigating. In NemotronH's hybrid architecture, every decoder block is either a Mamba2 layer or a standard attention layer (no MLP-only blocks), so self.layer_id and the attention_layers index may be 1:1. Need to verify against model_runner.py:2173-2202 to confirm whether the compact list construction preserves this mapping for NemotronH specifically.
If it turns out there IS a gap, we'll add a _dense_layer_id computed at model init time. Tracking.
🧩 Analysis chain
Good question — let me pull the exact attention_layers construction logic and NemotronH's hybrid pattern to verify the indexing assumption.
🏁 Script executed:
```bash
#!/bin/bash
# 1. Check init_piecewise_cuda_graphs in model_runner.py (the changed section + surrounding context)
rg -n "attention_layers|attn_layer|_forward_mamba|moe_layer|moe_fusion" python/sglang/srt/model_executor/model_runner.py -A2 -B2 | head -120
# 2. Check NemotronH config for hybrid_override_pattern or layer type definitions
fd "nemotron_h" python/sglang/srt/configs/ --exec cat {}
# 3. Check what hybrid_override_pattern looks like for actual NemotronH model checkpoints
rg -n "hybrid_override_pattern|ATTENTION|MAMBA|MLP|MOE" python/sglang/srt/configs/nemotron_h.py | head -60
```
Length of output: 18248
@KHAEntertainment — investigation complete. Your hypothesis doesn't hold. MLP blocks (-) are present in the default pattern, which breaks the 1:1 assumption.
What the data shows
The default hybrid_override_pattern is:
M - M - M - M * - M - M - M - M - M * - M - M - M - M - M * ...
In model_runner.py (lines 2173–2202), attention_layers is built by skipping any layer where attn_layer stays None. For NemotronH:
- `NemotronHMambaDecoderLayer` → has `_forward_mamba` → ✅ appended
- `NemotronHAttentionDecoderLayer` → has `layer.mixer.attn` → ✅ appended
- `NemotronHMLPDecoderLayer` → `mixer` is `NemotronHMLP` (no `.attn`, no `_forward_mamba`) → ❌ skipped
- `NemotronHMoEDecoderLayer` → caught by the `moe_block` path but `attn_layer` stays `None` → ❌ skipped
So with the default 52-layer pattern, the indices diverge immediately:
| Absolute layer_id | Type | attention_layers index |
|---|---|---|
| 0 | M | 0 |
| 1 | - (MLP) | skipped |
| 2 | M | 1 ← `self.layer_id=2` would read `attention_layers[2]` = wrong layer |
| 3 | M | 2 |
| 4 | M | 3 |
| 5 | * (Attn) | 4 |
Passing self.layer_id directly into nemotron_mamba2_with_output will read the wrong layer (or cause an IndexError for large layer IDs).
Fix
Compute a dense _attn_layer_id at init time by counting only M/* entries in the pattern up to the current layer:
```python
class NemotronHMambaDecoderLayer(nn.Module):
    def __init__(self, config, layer_idx, ...):
        ...
        self.layer_id = layer_idx
        # Compute dense index into the compact attention_layers list
        self._attn_layer_id = sum(
            1 for i in range(layer_idx)
            if config.hybrid_override_pattern[i] in (MAMBA, ATTENTION)
        )
        ...

    def forward(self, *, hidden_states, residual, forward_batch):
        ...
        if is_in_piecewise_cuda_graph():
            output = torch.empty_like(hidden_states)
            nemotron_mamba2_with_output(hidden_states, output, self._attn_layer_id)
            return output, residual
        ...
```

And update the split op to use the received index directly (no change needed there — it already does `attention_layers[layer_id]`).
The _attn_layer_id for the second call site (lines 880–883 in the nemotron_mamba2_with_output function context itself) is already correct since layer_id there is the value passed in — so it's the caller that needs to pass the dense index.
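To make the index translation concrete, here is a standalone check of the dense-index computation. The short pattern string matches the table above (layers 1 is MLP, 5 is attention); the function name and the single-character `MAMBA`/`ATTENTION` markers are illustrative stand-ins for the real config constants.

```python
MAMBA, ATTENTION = "M", "*"  # assumed single-char markers; "-" marks an MLP block

def dense_attn_layer_id(pattern, layer_idx):
    """Offset into the compact attention_layers list: count only the
    Mamba/attention entries that appear before layer_idx."""
    return sum(1 for ch in pattern[:layer_idx] if ch in (MAMBA, ATTENTION))

# Pattern consistent with the table above:
# layer 0=M, 1=MLP, 2=M, 3=M, 4=M, 5=Attn
pattern = "M-MMM*"
```

Layer 2 maps to compact index 1 and layer 5 to compact index 4, reproducing the table, whereas passing `self.layer_id` directly would read one slot too far after the first MLP block.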
Summary

- Adopts upstream's `SessionController` abstraction (commits 5acb45c, c6cb0c9, e08ef06, e1ee68d). Design decision: a post-`create_req()` hook for snapshot restore keeps `Session.create_req` token-only.
- Reconciles `HiMambaRadixCache`, `HybridCacheController`, and host pool structures with our snapshot/tier system. TierManager (disk persistence, L2→L3) layers on top of upstream's `HybridCacheController` (GPU↔host, L1↔L2). Includes sweep of 10 COMPLEMENT commits.

Design decisions

- Snapshot restore hooks into `Session.create_req()` via a post-creation hook, keeping `Session.create_req` token-only and avoiding modification of upstream's request creation path.

Validation
Part of
Upstream SGLang sync series — PR 2 of 5. Depends on PR 1 (deps + observability). Contains the two highest-risk architectural changes.
🤖 Generated with Claude Code
Summary by CodeRabbit
Release Notes
New Features
- `--mamba-backend` flag

Performance
Bug Fixes