
upstream-sync 2/5: SessionController port + cache architecture reconcile #13

Merged
KHAEntertainment merged 26 commits into main from upstream-sync/pr2-session-cache on Mar 31, 2026

Conversation


KHAEntertainment (Collaborator) commented Mar 31, 2026

Summary

  • Phase 03 — SessionController Port (HIGH): Ports scheduler/session customizations onto upstream's new SessionController abstraction (commits 5acb45c, c6cb0c9, e08ef06, e1ee68d). Design decision: post-create_req() hook for snapshot restore — keeps Session.create_req token-only.
  • Phase 04 — Mamba Cache Architecture Reconcile (HIGH): Reconciles HiMambaRadixCache, HybridCacheController, and host pool structures with our snapshot/tier system. TierManager (disk persistence, L2→L3) layers on top of upstream's HybridCacheController (GPU↔host, L1↔L2). Includes sweep of 10 COMPLEMENT commits.

Design decisions

  1. Snapshot restore flow: Option B — inject restored Mamba state after Session.create_req() via a post-creation hook. Keeps Session.create_req token-only and avoids modifying upstream's request creation path (see the sketch after this list).
  2. TierManager layering: TierManager owns host↔disk (L2↔L3) persistence. Upstream's HybridCacheController owns GPU↔host (L1↔L2) offload. No replacement — additive layering.
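
A minimal sketch of the Option B hook. The names below (create_req_with_restore, snapshot_manager, restore_mamba_state) are illustrative stand-ins, not the identifiers used in the PR:

def create_req_with_restore(session, recv_req, snapshot_manager=None):
    # Upstream path stays token-only: create_req() knows nothing about snapshots.
    req = session.create_req(recv_req)
    # Post-creation hook: inject restored Mamba state once the request exists,
    # leaving upstream's request-creation path untouched.
    if snapshot_manager is not None:
        state = snapshot_manager.load(recv_req.session_id)  # None on miss
        if state is not None:
            snapshot_manager.restore_mamba_state(req, state)
    return req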

Validation

  • Phase 1 (stateless inference) ✅
  • Phase 2 (MambaPool 5/5) ✅
  • Phase 3 (MambaRadixCache gauntlet 16/16) ✅
  • Phase 4 (live server) ✅
  • Phase 7 (snapshot E2E 6/6) ✅
  • Phase 8 (stateful inference) ✅

Part of

Upstream SGLang sync series — PR 2 of 5. Depends on PR 1 (deps + observability). Contains the two highest-risk architectural changes.

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features

    • Added streaming session support for continuous generation workflows with automatic timeout management
    • Introduced Mamba backend selection (Triton or FlashInfer) via --mamba-backend flag
    • Enhanced hierarchical cache system for hybrid Mamba/attention models
  • Performance

    • Introduced fused Triton kernels for Mamba state scatter operations
    • Optimized layer-wise transfer synchronization for better multi-GPU coordination
    • Improved cache hierarchy memory efficiency
  • Bug Fixes

    • Fixed batch extension flag propagation in filtered batches
    • Resolved multimodal feature cleanup for streaming sessions
    • Fixed Mamba prefix cache tracking across speculative decoding

Clarit-AI and others added 24 commits March 30, 2026 23:53
Adopts upstream's SessionController extraction (5acb45c), streaming
session support with SessionAwareCache (c6cb0c9), session gating
(e08ef06), and multimodal release on close (e1ee68d).

Key changes:
- Extract SessionController class from Scheduler.sessions dict
- Add SessionAwareCache wrapper (gated by --enable-streaming-session)
- Refactor handle_generate_request to 3-way routing
- Change Req.session_id → Req.session (Session object reference)
- Add streaming session guards in tokenizer communicator
- Add MultimodalInputs.release_features() method
- Design decision: Option B (post-create_req hook) for snapshot restore

Snapshot hooks remain on Scheduler and will be re-wired in Phase 07
after the cache architecture reconcile (Phase 04).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sgl-kernel 0.3.21 → sglang-kernel 0.4.0 renamed the package but
the marlin repack functions were moved to JIT kernels upstream
(2cdde5d). Make the import chain: JIT kernel → sgl_kernel → None,
so non-AWQ-Marlin workloads (like Mamba) aren't blocked at import time.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The original import error was caused by sglang-kernel 0.4.0 shadowing
sgl-kernel 0.3.21's sgl_kernel module. The fix is to keep sgl-kernel
0.3.21 installed (it has all required symbols including
awq_marlin_moe_repack and min_p_sampling_from_probs). Revert the
import fallback chain and engine version check workaround.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The hybrid linear attention backend unconditionally imports the
CuTe DSL GDN kernel which requires cutlass.cute.nvgpu — a module
not available in all CUTLASS DSL versions. Guard with try/except
so Mamba models can still serve without this optional dependency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…refill cudagraph

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…pool init

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ase semantics

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ontroller

Adds GPU↔host offload path via HybridCacheController and MambaPoolHost.
Our TierManager (disk persistence) layers on top — it interacts with
MambaPool directly for extract/inject, not through HybridCacheController.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
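
A hedged sketch of that layering. TierManager's method names and the MambaPool extract/inject calls are assumptions for illustration, not the real API:

import torch

class TierManager:
    """Owns host<->disk (L2<->L3) persistence; GPU<->host (L1<->L2) offload
    stays with upstream's HybridCacheController. Names are illustrative."""

    def __init__(self, mamba_pool, disk_dir: str):
        self.mamba_pool = mamba_pool  # talked to directly, not via the controller
        self.disk_dir = disk_dir

    def persist(self, slot_idx: int, key: str) -> None:
        # Pull host-side state tensors out of the pool and write them to disk.
        state = self.mamba_pool.extract(slot_idx)
        torch.save(state, f"{self.disk_dir}/{key}.pt")

    def restore(self, slot_idx: int, key: str) -> None:
        # Read the persisted state back and place it into a pool slot.
        state = torch.load(f"{self.disk_dir}/{key}.pt")
        self.mamba_pool.inject(slot_idx, state)
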
- test_mamba_radix_cache_comprehensive: create KV tensors on device (upstream
  _insert_helper now frees duplicate KV pages during insert)
- test_mamba_unittest: remove dangling available_and_evictable_str() standalone
  call, add missing method to MambaRadixCache from upstream
github-actions bot added the documentation, diffusion, hicache, and piecewise-cuda-graph labels on Mar 31, 2026

coderabbitai Bot commented Mar 31, 2026

Walkthrough

This PR introduces a comprehensive Mamba hybrid-cache architecture with hierarchical host-tiered storage, session-aware streaming semantics, and pluggable backend dispatch. Major additions include a Mamba-specific radix cache with host LRU, session-scoped KV state lifecycle management, a Mamba selective-state-update backend abstraction, and enhanced memory pool plumbing for layer-wise transfer synchronization. Approximately 70 files are modified, with roughly 3,500 lines added.
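
As a rough illustration of the backend-dispatch shape: the real interface in ssu_dispatch.py takes many more tensor arguments, so this trimmed-down sketch only borrows the class names from this walkthrough; everything else is assumed:

from abc import ABC, abstractmethod

class MambaSSUBackend(ABC):
    """Selective-state-update backend interface (simplified sketch)."""

    @abstractmethod
    def __call__(self, state, x, dt, A, B, C, **kwargs) -> None:
        """Update `state` in place for the given step inputs."""

class TritonSSUBackend(MambaSSUBackend):
    def __call__(self, state, x, dt, A, B, C, **kwargs) -> None:
        raise NotImplementedError("would dispatch to the Triton kernel")

class FlashInferSSUBackend(MambaSSUBackend):
    def __call__(self, state, x, dt, A, B, C, **kwargs) -> None:
        raise NotImplementedError("would dispatch to the FlashInfer kernel")

def get_ssu_backend(name: str) -> MambaSSUBackend:
    # Mirrors the --mamba-backend selection described above.
    backends = {"triton": TritonSSUBackend, "flashinfer": FlashInferSSUBackend}
    if name not in backends:
        raise ValueError(f"unknown mamba backend: {name}")
    return backends[name]()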

Changes

Cohort / File(s) | Summary
Mamba Hierarchical Cache
python/sglang/srt/mem_cache/hi_mamba_radix_cache.py, python/sglang/srt/mem_cache/mamba_radix_cache.py
New hierarchical Mamba radix cache with device/host residency tracking, LRU eviction, and host KV tombstone management; extended TreeNode with Mamba host state, protection lifecycle, and hash history.
Session-Aware Streaming Cache
python/sglang/srt/mem_cache/session_aware_cache.py, python/sglang/srt/managers/session_controller.py
New SessionAwareCache decorator and SessionController for streaming request KV state restoration and session lifecycle (open/close/reap); extended Session with streaming/timeout fields and request-scope lifecycle tracking.
Host Memory Pool Mamba Support
python/sglang/srt/mem_cache/memory_pool_host.py, python/sglang/srt/mem_cache/memory_pool.py
New MambaPoolHost for hierarchical host-side Mamba state serialization; refactored MambaPool slot clearing and HybridReqToTokenPool with enable_overlap_schedule and layer-transfer counter registration.
Mamba Backend Dispatch
python/sglang/srt/layers/attention/mamba/ops/ssu_dispatch.py, python/sglang/srt/layers/attention/mamba/ops/__init__.py, python/sglang/srt/server_args.py
New abstract MambaSSUBackend interface with TritonSSUBackend and FlashInferSSUBackend implementations; new --mamba-backend CLI flag and _handle_mamba_backend() validation in ServerArgs.
Hybrid Cache Controller & Storage
python/sglang/srt/mem_cache/hybrid_cache/hybrid_cache_controller.py, python/sglang/srt/mem_cache/hicache_storage.py
New HybridCacheController with extra-pool transfer support; extended HiCacheStorage with v2 APIs (batch_exists_v2, batch_get_v2, batch_set_v2) and PoolTransfer/PoolHitPolicy abstractions.
Mamba State Scatter Kernel
python/sglang/srt/layers/attention/mamba/mamba_state_scatter_triton.py
New Triton fused kernel fused_mamba_state_scatter_with_mask replacing per-request advanced-indexing scatter updates for SSM/convolution states.
Scheduler & Cache Integration
python/sglang/srt/managers/scheduler.py, python/sglang/srt/managers/schedule_batch.py, python/sglang/srt/managers/schedule_policy.py
Mamba backend initialization in Scheduler.init; HiMambaRadixCache selection logic; session_controller replacement for direct session dict; new all_extend_in_batch field propagation through batch hierarchy.
Hybrid Attention Backend
python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py
Added conditional import of fused_mamba_state_scatter_with_mask; updated extend-forward metadata logic for draft_extend_v2; refactored update_mamba_state_after_mtp_verify to use fused kernel.
Mamba Causal Conv1d Fallback
python/sglang/srt/layers/attention/mamba/causal_conv1d.py
Added Triton fallback routing when sgl_kernel is unavailable; conditional seq_lens_cpu computation and _get_seq_lens_cpu helper.
Speculative Decoding v2
python/sglang/srt/speculative/eagle_worker_v2.py, python/sglang/srt/speculative/eagle_info_v2.py, python/sglang/srt/speculative/eagle_worker.py, python/sglang/srt/speculative/eagle_info.py
New _mamba_verify_update path in EAGLEWorkerV2; conditional Mamba track index updates; explicit nulling of mask/seqlen fields in verify prepare.
Model Runner Enhancements
python/sglang/srt/model_executor/model_runner.py, python/sglang/srt/model_executor/model_runner_kv_cache_mixin.py, python/sglang/srt/models/nemotron_h.py
Extended attention-layer discovery with mixer-based patterns; refactored NemotronHMambaDecoderLayer with _forward_mamba helper and piecewise CUDA graph path; added layer-wise synchronization via counter.
CLI & Server Configuration
python/sglang/cli/main.py, python/sglang/cli/serve.py, python/sglang/cli/utils.py, python/sglang/launch_server.py
Removed serve subcommand func binding; added --mamba-backend help text and conditional diffusion import; moved encoder_only branch before grpc check in launch_server.
Forward Batch & I/O Structures
python/sglang/srt/model_executor/forward_batch_info.py, python/sglang/srt/managers/io_struct.py
New all_extend_in_batch field on ForwardBatch with propagation; added streaming/timeout fields to OpenSessionReqInput.
Cache Controller Integration
python/sglang/srt/managers/cache_controller.py
HiCacheController now conditionally downcasts HybridLinearKVPool to full_kv_pool on assignment to self.mem_pool_device.
Logging & Observability
python/sglang/srt/utils/common.py, python/sglang/srt/observability/scheduler_metrics_mixin.py, python/sglang/srt/multimodal_gen/runtime/platforms/__init__.py
Lowered platform detection verbosity from info to debug; removed progress-bar trailing newline; unified prefill stats input-throughput logging.
Disaggregation Support
python/sglang/srt/disaggregation/decode.py, python/sglang/srt/disaggregation/ascend/conn.py, python/sglang/srt/disaggregation/mooncake/conn.py
Extended HybridMambaDecodeReqToTokenPool with enable_overlap_schedule and mamba_size; conditional state_data batch register; refactored prefill TP writer mapping logic.
Scheduler Output Processing
python/sglang/srt/managers/scheduler_output_processor_mixin.py
Mamba ping-pong track index updates delegated to pool's get_mamba_ping_pong_other_idx().
Tokenizer & Session I/O
python/sglang/srt/managers/tokenizer_communicator_mixin.py, python/sglang/srt/connector/__init__.py
Added TokenizerManager type annotation on open_session/close_session; added streaming/specv2 validation; device parameter made optional in create_remote_connector.
Tests & Infrastructure
test/registered/layers/mamba/test_mamba_ssm_ssd.py, test/registered/layers/mamba/conftest.py, test/registered/radix_cache/test_mamba_*.py, test/registered/models/test_nvidia_nemotron_3_nano.py, test/registered/4-gpu-models/test_qwen3_next_models_mtp.py, test/unit/test_mamba_state_scatter_triton.py, test/run_suite.py
New conftest fixture for Mamba SSU backend initialization; intermediate_states return path in ssd_minimal_discrete; new MTP/FlashInfer model tests; comprehensive scatter-kernel unit tests; excluded conftest.py from test collection.
Batch Overlap & Compilation
python/sglang/srt/batch_overlap/two_batch_overlap.py, python/sglang/srt/compilation/backend.py, python/sglang/srt/compilation/compile.py
TboForwardBatchPreparer propagates all_extend_in_batch; removed rank0_log calls from SGLangBackend and install_torch_compiled.
Cache & Prefix-Cache APIs
python/sglang/srt/mem_cache/base_prefix_cache.py
New dataclasses IncLockRefResult, DecLockRefParams, DecLockRefResult, InitLoadBackParams for structured lock-ref and load-back parameterization.
Documentation
CLAUDE.md, docs/superpowers/plans/2026-03-31-phase-04-mamba-cache-architecture-reconcile.md
Added Beads issue tracker rules and work-session requirements; new Phase 04 implementation plan for upstream Mamba cache reconciliation with conflict-resolution guidance.
Project Config
.beads, .gitignore
Added upstream-sync beads store path; removed compile_commands.json/1 ignores; added lmms-eval and .remember/tmp/ ignore patterns.
Benchmark
benchmark/hicache/bench_multiturn.py
Added lora_path storage on WorkloadGenerator; round-barrier state machine with synchronized heartbeat dispatch on barrier completion.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Scheduler
    participant SessionController
    participant TreeCache as Prefix Cache
    participant MambaBackend as Mamba Backend<br/>(Dispatch)

    Client->>Scheduler: handle_generate_request()
    alt Session Exists
        Scheduler->>SessionController: get(session_id)
        SessionController->>TreeCache: restore KV state for session
        TreeCache-->>SessionController: MatchResult + sentinel nodes
        SessionController-->>Scheduler: Session object
    else New Session
        Scheduler->>SessionController: open(OpenSessionReqInput)
        SessionController->>TreeCache: initialize new session slot
        SessionController-->>Scheduler: Session created
    end
    
    Scheduler->>Scheduler: init_mamba_backend()
    Note over Scheduler: Select backend based on<br/>server_args.mamba_backend
    Scheduler->>MambaBackend: initialize_mamba_selective_state_update_backend()
    
    Note over Scheduler,MambaBackend: Backend ready for forward passes
    
    alt Session Finishes
        Scheduler->>SessionController: cache_finished_req()
        SessionController->>TreeCache: store KV state in session slot
        alt Streaming Enabled
            SessionController->>SessionController: save session state
            Note over SessionController: State restored on next<br/>request with same session_id
        end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

  • PR #2: Implements Mamba snapshot/persistence system with overlapping modifications to snapshot manager, scheduler integration, and host/tier pool structures.
  • PR #1: Introduces core Mamba state snapshot persistence system affecting the same snapshotting hooks and server args.
  • PR #4: Modifies scheduler snapshot restore/save flow and Mamba snapshot metadata handling that interacts with this PR's session/cache refactoring.

Poem

🎋 Mamba hierarchies bloom on tiers untold,
Sessions stream eternal, their secrets in gold,
Backends dispatch with fused-kernel might,
Cache whispers to host through the bandwidth night. ✨


KHAEntertainment changed the base branch from upstream-sync/pr1-deps-observability to main on March 31, 2026 21:00
…sion-cache

# Conflicts:
#	python/sglang/srt/entrypoints/http_server.py
#	test/sglang/agents/test_tool_execution.py

coderabbitai Bot left a comment


Actionable comments posted: 5

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
python/sglang/srt/utils/common.py (1)

124-128: ⚠️ Potential issue | 🟡 Minor

Update the comment to reflect the code change.

The comment on line 124 states "with a newline at the end", but the BAR_FORMAT string on line 128 no longer includes a trailing \n. This creates a documentation mismatch that could confuse future maintainers.

Since the newline was intentionally removed (as indicated by the PR), please update the comment to accurately describe the current behavior. You might also want to clarify whether removing the newline affects the ray/multiprocessing behavior mentioned in lines 126-127, or if that concern is no longer applicable.

📝 Suggested comment update
-# explicitly use pure text format, with a newline at the end
+# explicitly use pure text format, without a trailing newline
 # this makes it impossible to see the animation in the progress bar
-# but will avoid messing up with ray or multiprocessing, which wraps
-# each line of output with some prefix.
+# (previously included a newline to avoid issues with ray or multiprocessing
+# output wrapping, but this has been changed)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/sglang/srt/utils/common.py` around lines 124 - 128, Update the
outdated comment above BAR_FORMAT to reflect that the format no longer appends a
trailing newline (the BAR_FORMAT string has no "\n"); remove or reword the
sentence that says "with a newline at the end" and clarify whether the intent to
avoid ray/multiprocessing output-prefix issues still applies now that the
newline was removed (i.e., state that using a pure text format without a
trailing newline is intentional and whether it preserves the behavior with
ray/multiprocessing). Ensure the comment references BAR_FORMAT so future
maintainers understand the relationship between the comment and the format
string.
python/sglang/srt/disaggregation/mooncake/conn.py (1)

739-746: ⚠️ Potential issue | 🟠 Major

Reject non-divisible TP ratios before computing writer offsets.

writers_per_decode = self.attn_tp_size // dst_attn_tp_size only works when prefill TP is an integer multiple of decode TP. On a ratio like 6 -> 4, floor division makes several writers “park” in the same destination slice and silently corrupt the transferred Mamba state.

🔧 Proposed guard
             if self.attn_tp_size > dst_attn_tp_size:
+                if self.attn_tp_size % dst_attn_tp_size != 0:
+                    raise RuntimeError(
+                        "Mamba state slice transfer requires prefill attn_tp_size "
+                        "to be an integer multiple of decode attn_tp_size when "
+                        f"prefill TP > decode TP ({self.attn_tp_size} -> {dst_attn_tp_size})."
+                    )
                 # Multiple prefill ranks send to 1 decode rank
                 # Each prefill sends all its dims to the appropriate offset in decode
                 src_dim_start = 0
                 num_dims_to_send = src_dim
                 writers_per_decode = self.attn_tp_size // dst_attn_tp_size
+                if src_dim * writers_per_decode != dst_dim:
+                    raise RuntimeError(
+                        f"Mamba state slice geometry mismatch: {src_dim=} "
+                        f"{writers_per_decode=} {dst_dim=}"
+                    )
                 local_writer_idx = local_tp_rank_in_group % writers_per_decode
                 dst_dim_start = local_writer_idx * src_dim
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/sglang/srt/disaggregation/mooncake/conn.py` around lines 739 - 746,
The code assumes attn_tp_size is an integer multiple of dst_attn_tp_size when
computing writers_per_decode and writer offsets; add an explicit guard before
computing writers_per_decode (e.g., in the same block that accesses attn_tp_size
and dst_attn_tp_size) to detect non-divisible ratios and raise a clear exception
(ValueError or AssertionError) with context including attn_tp_size and
dst_attn_tp_size; this prevents silent corruption when calculating
writers_per_decode, local_writer_idx, and dst_dim_start (referenced symbols:
attn_tp_size, dst_attn_tp_size, writers_per_decode, local_tp_rank_in_group,
dst_dim_start).
python/sglang/cli/main.py (1)

18-28: ⚠️ Potential issue | 🔴 Critical

CLI dispatch is broken: generate will crash immediately, and serve lacks a handler.

The generate subcommand will fail on line 28 with a NameError because generate_parser is never assigned—it's created on line 23 but the result is discarded. The serve subcommand, while it parses, has no handler bound to it, so it will crash at runtime when the dispatcher tries to call a nonexistent func attribute.

The missing import of the serve function (line 3) compounds the issue.

Required fix
 import argparse
 
 from sglang.cli.generate import generate
+from sglang.cli.serve import serve
 from sglang.cli.utils import get_git_commit_hash
 from sglang.version import __version__
@@
-    subparsers.add_parser(
+    serve_parser = subparsers.add_parser(
         "serve",
         help="Launch an SGLang server.",
         add_help=False,
     )
-    subparsers.add_parser(
+    serve_parser.set_defaults(func=serve)
+
+    generate_parser = subparsers.add_parser(
         "generate",
         help="Run inference on a multimodal model.",
         add_help=False,  # Defer help to the specific parser
     )
     generate_parser.set_defaults(func=generate)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/sglang/cli/main.py` around lines 18 - 28, The CLI dispatch is broken
because the return value of subparsers.add_parser was not assigned and the serve
handler was not imported or bound; fix by assigning the parser objects to
variables (e.g., generate_parser = subparsers.add_parser(...) and serve_parser =
subparsers.add_parser(...)), import the serve function at the top alongside
generate, and call set_defaults(func=generate) on generate_parser and
set_defaults(func=serve) on serve_parser so the dispatcher can invoke the
correct handler.
🟠 Major comments (16)
.beads (1)

1-1: ⚠️ Potential issue | 🟠 Major

Avoid committing machine-specific absolute paths in repo config.

This path hard-codes one workstation and will fail for anyone else. It’s like committing your own /etc/hosts and expecting every machine to resolve the same way.

🔧 Proposed fix
-/home/jeanclawdai/.beads-stores/upstream-sync-2026-Q1
+.beads-stores/upstream-sync-2026-Q1

If .beads must stay local-only, prefer ignoring it and documenting a one-time setup step instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.beads at line 1, Remove the machine-specific absolute path
"/home/jeanclawdai/.beads-stores/upstream-sync-2026-Q1" from the .beads file and
replace it with a portable alternative (e.g., a relative path, an
environment-variable placeholder like ${BEADS_STORE_PATH}, or a generic store
name), or remove the file from source control; if you choose to keep it
local-only, add .beads to .gitignore and add a short setup note in the repo
README or CONTRIBUTING documenting the one-time step to create the user-specific
store path or set BEADS_STORE_PATH so others can configure it on their machines.
docs/superpowers/plans/2026-03-31-phase-04-mamba-cache-architecture-reconcile.md (1)

503-504: ⚠️ Potential issue | 🟠 Major

Replace hard-coded workstation path with repo-relative execution.

This makes the plan brittle outside one machine. Use repo-root relative commands so any contributor/runner can execute it.

🔧 Suggested adjustment
-cd /home/jeanclawdai/sglang-mamba/worktrees/upstream-sync-2026-Q1
-pytest test/registered/radix_cache/test_mamba_radix_cache_comprehensive.py -v --timeout=120 2>&1 | tail -30
+git rev-parse --show-toplevel >/dev/null || exit 1
+pytest test/registered/radix_cache/test_mamba_radix_cache_comprehensive.py -v --timeout=120 2>&1 | tail -30
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@docs/superpowers/plans/2026-03-31-phase-04-mamba-cache-architecture-reconcile.md`
around lines 503 - 504, The plan uses a hard-coded workstation path ("cd
/home/jeanclawdai/sglang-mamba/worktrees/upstream-sync-2026-Q1") making it
non-portable; change it to run relative to the repo root (e.g., replace the cd
with a repo-root discovery like cd "$(git rev-parse --show-toplevel)" or use ./
from the project root) and ensure the pytest invocation (pytest
test/registered/radix_cache/test_mamba_radix_cache_comprehensive.py -v
--timeout=120 2>&1 | tail -30) is executed from that repo-root context so any
contributor can run it reliably.
docs/superpowers/plans/2026-03-31-phase-04-mamba-cache-architecture-reconcile.md (1)

345-351: ⚠️ Potential issue | 🟠 Major

Guard the sync call to prevent crashes on non-CUDA devices.

The plan recommends unconditional torch.cuda.synchronize() before extraction. This will fail on CPU or other non-CUDA backends (xpu, hpu, npu, musa) since torch.cuda.synchronize() is CUDA-specific. The codebase explicitly supports multiple device types via DeviceConfig, so this guidance needs a device-aware guard.

🔧 Suggested adjustment
-# Ensure all pending GPU ops on Mamba state are complete before CPU extraction
-torch.cuda.synchronize()
+# Ensure pending CUDA ops complete before CPU extraction (CUDA-only)
+if torch.cuda.is_available() and mamba_pool.device.startswith("cuda"):
+    torch.cuda.synchronize()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@docs/superpowers/plans/2026-03-31-phase-04-mamba-cache-architecture-reconcile.md`
around lines 345 - 351, The unconditional torch.cuda.synchronize() will crash on
non-CUDA backends; update extract_state_from_pool in mamba_snapshot.py to
perform a device-aware sync before the .clone().cpu() calls: detect the Mamba
state device (e.g., via the DeviceConfig or the tensor/device.type used in
extract_state_from_pool) and only call the appropriate sync for CUDA
(torch.cuda.synchronize()) when device.type == "cuda" (or
torch.cuda.is_available()); skip or use the correct backend sync for other
device types so CPU-only or non-CUDA backends do not call
torch.cuda.synchronize().
python/sglang/srt/managers/schedule_batch.py (1)

366-370: ⚠️ Potential issue | 🟠 Major

Release precomputed_embeddings here as well.

This helper only drops item.feature, but MultimodalDataItem stores its payload in either feature or precomputed_embeddings. If this replaces the old mm_items = None cleanup path, precomputed-embedding requests now keep the large tensor alive and lose the memory win.

🔧 Proposed fix
     def release_features(self):
-        """Release feature tensors to free GPU memory."""
+        """Release multimodal payload tensors to free GPU memory."""
         for item in self.mm_items:
             item.feature = None
+            item.precomputed_embeddings = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/sglang/srt/managers/schedule_batch.py` around lines 366 - 370, The
release_features method currently only clears item.feature, leaving
MultimodalDataItem.precomputed_embeddings intact and thus retaining GPU memory;
update release_features to also set item.precomputed_embeddings = None for each
item in self.mm_items (and safely handle cases where mm_items or the attributes
may be missing) so both storage paths are released when freeing tensors.
python/sglang/srt/mem_cache/base_prefix_cache.py (1)

113-120: ⚠️ Potential issue | 🟠 Major

Converge the init_load_back API across abstract base, implementations, and call sites.

The abstract method in BasePrefixCache (lines 224–228) still uses the old positional signature (last_host_node, host_hit_length, …), but hi_mamba_radix_cache.py has already been updated to accept InitLoadBackParams as a single parameter object. Meanwhile, hiradix_cache.py and the session_aware_cache wrapper still use the old signature, creating an API boundary mismatch—implementations and callers are now pulling in different directions.

To fix: Update the abstract method and all remaining implementations (hiradix_cache, session_aware_cache, and any others) to accept InitLoadBackParams as the parameter, matching what schedule_policy.py is already attempting to pass. This eliminates the risk of drift between the contract and the implementations.
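
For concreteness, the converged contract would look roughly like this; the dataclass fields are guessed from the old positional signature and may not match base_prefix_cache.py exactly:

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any

@dataclass
class InitLoadBackParams:
    # Field names guessed from the old positional arguments.
    last_host_node: Any
    host_hit_length: int

class BasePrefixCache(ABC):
    @abstractmethod
    def init_load_back(self, params: InitLoadBackParams):
        """Implementations consume fields from the params object."""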

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/sglang/srt/mem_cache/base_prefix_cache.py` around lines 113 - 120,
Update the init_load_back API to use the InitLoadBackParams object across the
abstract base and all implementations: change BasePrefixCache.init_load_back to
accept a single parameter typed as InitLoadBackParams, then refactor
hiradix_cache.py and session_aware_cache (and any other implementations) to
match that signature and consume fields from the InitLoadBackParams instance;
ensure callers (e.g., schedule_policy.py) that currently pass an
InitLoadBackParams continue to pass it unchanged and adjust any call sites that
still pass positional args (last_host_node, host_hit_length, etc.) to construct
and pass an InitLoadBackParams instead so the abstract, implementations, and
callers are consistent with hi_mamba_radix_cache.py.
test/registered/4-gpu-models/test_qwen3_next_models_mtp.py (1)

217-251: ⚠️ Potential issue | 🟠 Major

Restore the previous SGLANG_ENABLE_SPEC_V2 value.

This flips a process-global flag to True, but teardown always writes False and setUpClass has no rollback if popen_launch_server(...) fails. That makes later suites order-dependent. Save the old value, restore it on launch failure, and put that same value back in tearDownClass.

🔁 Suggested fix
     @classmethod
     def setUpClass(cls):
         cls.model = QWEN3_NEXT_MODEL
-        envs.SGLANG_ENABLE_SPEC_V2.set(True)
+        cls._prev_spec_v2 = envs.SGLANG_ENABLE_SPEC_V2.get()
+        envs.SGLANG_ENABLE_SPEC_V2.set(True)
         cls.base_url = DEFAULT_URL_FOR_TEST
-        cls.process = popen_launch_server(
-            cls.model,
-            cls.base_url,
-            timeout=DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,
-            other_args=[
-                "--trust-remote-code",
-                "--speculative-algorithm",
-                "NEXTN",
-                "--speculative-num-steps",
-                "3",
-                "--speculative-eagle-topk",
-                "1",
-                "--speculative-num-draft-tokens",
-                "4",
-                "--mem-fraction-static",
-                "0.8",
-                "--tp",
-                "4",
-                "--chunked-prefill-size",
-                "2048",
-                "--mamba-scheduler-strategy",
-                "extra_buffer",
-                "--mamba-track-interval",
-                "128",
-            ],
-        )
+        try:
+            cls.process = popen_launch_server(
+                cls.model,
+                cls.base_url,
+                timeout=DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,
+                other_args=[
+                    "--trust-remote-code",
+                    "--speculative-algorithm",
+                    "NEXTN",
+                    "--speculative-num-steps",
+                    "3",
+                    "--speculative-eagle-topk",
+                    "1",
+                    "--speculative-num-draft-tokens",
+                    "4",
+                    "--mem-fraction-static",
+                    "0.8",
+                    "--tp",
+                    "4",
+                    "--chunked-prefill-size",
+                    "2048",
+                    "--mamba-scheduler-strategy",
+                    "extra_buffer",
+                    "--mamba-track-interval",
+                    "128",
+                ],
+            )
+        except Exception:
+            envs.SGLANG_ENABLE_SPEC_V2.set(cls._prev_spec_v2)
+            raise

     @classmethod
     def tearDownClass(cls):
-        envs.SGLANG_ENABLE_SPEC_V2.set(False)
-        kill_process_tree(cls.process.pid)
+        envs.SGLANG_ENABLE_SPEC_V2.set(getattr(cls, "_prev_spec_v2", False))
+        if hasattr(cls, "process"):
+            kill_process_tree(cls.process.pid)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/registered/4-gpu-models/test_qwen3_next_models_mtp.py` around lines 217
- 251, setUpClass currently flips the process-global flag
envs.SGLANG_ENABLE_SPEC_V2 to True without preserving the prior value and
tearDownClass always sets it to False; change setUpClass to read and save the
previous value (e.g., prev_spec_v2 = envs.SGLANG_ENABLE_SPEC_V2.get()), then set
True, and if popen_launch_server fails restore the saved value before
propagating the error; also change tearDownClass to restore the saved previous
value (not hardcode False) so the original state is preserved; reference
setUpClass, tearDownClass, envs.SGLANG_ENABLE_SPEC_V2, and popen_launch_server
when making these changes.
python/sglang/srt/disaggregation/decode.py (1)

165-165: ⚠️ Potential issue | 🟠 Major

mamba_size is wired to thin air right now.

effective_mamba_size is computed, but _init_mamba_pool(...) still gets size + pre_alloc_size for its capacity args. That makes the new mamba_size input a no-op, so decode-side Mamba state will still be sized incorrectly whenever that cap is set.

Also applies to: 184-197
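
The shape of the fix is small. A sketch (the real _init_mamba_pool argument list isn't shown in this comment):

def effective_mamba_size(size: int, pre_alloc_size: int, mamba_size=None) -> int:
    # Honor the explicit cap when provided; otherwise keep the old behavior.
    return mamba_size if mamba_size is not None else size + pre_alloc_size

# ...then pass effective_mamba_size(...) wherever _init_mamba_pool currently
# receives size + pre_alloc_size for its capacity arguments.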

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/sglang/srt/disaggregation/decode.py` at line 165, The new mamba_size
parameter is ignored because _init_mamba_pool(...) is still called with size +
pre_alloc_size; compute the effective_mamba_size (use mamba_size when not None,
otherwise keep existing logic) and pass effective_mamba_size into all
_init_mamba_pool(...) capacity arguments (replace occurrences using size +
pre_alloc_size around the effective_mamba_size computation and in the blocks
referenced roughly at lines 184-197). Also ensure default None handling remains
correct so callers can override capacity by providing mamba_size.
python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py (1)

51-56: ⚠️ Potential issue | 🟠 Major

Guard the CuTe DSL fallback path.

If the import fails here, cutedsl_fused_sigmoid_gating_delta_rule_update becomes None, but GDNAttnBackend.__init__ still selects it whenever SGLANG_USE_CUTEDSL_GDN_DECODE is enabled. That turns the first decode into a NoneType call instead of a clean fallback.

Suggested guard
 use_cutedsl = Envs.SGLANG_USE_CUTEDSL_GDN_DECODE.get()
+if use_cutedsl and cutedsl_fused_sigmoid_gating_delta_rule_update is None:
+    rank0_log(
+        "CuTe DSL GDN decode requested but kernel is unavailable; falling back."
+    )
+    use_cutedsl = False
 self._kernel_func = (
     cutedsl_fused_sigmoid_gating_delta_rule_update
     if use_cutedsl
     else fused_sigmoid_gating_delta_rule_update
 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py` around
lines 51 - 56, The import fall-back leaves
cutedsl_fused_sigmoid_gating_delta_rule_update as None but
GDNAttnBackend.__init__ still picks it when SGLANG_USE_CUTEDSL_GDN_DECODE is
enabled, causing a NoneType call on decode; update GDNAttnBackend.__init__ to
check that cutedsl_fused_sigmoid_gating_delta_rule_update is not None before
selecting it (and only enable the CuTe DSL path when both
SGLANG_USE_CUTEDSL_GDN_DECODE is true and
cutedsl_fused_sigmoid_gating_delta_rule_update is callable), otherwise fall back
to the existing safe implementation so decode never tries to call None.
test/unit/test_mamba_state_scatter_triton.py (1)

164-167: ⚠️ Potential issue | 🟠 Major

Fail fast when the fused kernel cannot import on CUDA.

On a CUDA worker, that import is exactly what this test is supposed to validate. skipTest() turns a broken Triton module into a false green.

Suggested fix
-        if fused_mamba_state_scatter_with_mask is None:
-            self.skipTest(
-                f"fused_mamba_state_scatter_with_mask import failed: {_FUSED_IMPORT_ERROR}"
-            )
+        if fused_mamba_state_scatter_with_mask is None:
+            self.fail(
+                f"fused_mamba_state_scatter_with_mask import failed: {_FUSED_IMPORT_ERROR}"
+            )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/unit/test_mamba_state_scatter_triton.py` around lines 164 - 167, The
current test silently skips when fused_mamba_state_scatter_with_mask is None by
calling self.skipTest, which masks import failures on CUDA; change this to fail
fast by replacing the skipTest call with an explicit test failure (e.g.,
self.fail or raise ImportError) that includes _FUSED_IMPORT_ERROR so the test
fails when the Triton fused kernel cannot be imported; locate the check
referencing fused_mamba_state_scatter_with_mask and _FUSED_IMPORT_ERROR and swap
skipTest(...) for a failing call that reports the import error.
test/unit/test_mamba_state_scatter_triton.py (1)

1-2: 🛠️ Refactor suggestion | 🟠 Major

Use pytest for this new test module.

This adds a fresh unittest suite under test/, which makes it an outlier for fixtures, markers, and selection.

As per coding guidelines, test/**/*.py: Use pytest for testing with the 9-phase test plan structure in test/phases/.

Also applies to: 160-354

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/unit/test_mamba_state_scatter_triton.py` around lines 1 - 2, The new
test module test/unit/test_mamba_state_scatter_triton.py is written with
unittest; convert it to pytest-style tests by removing the unittest import and
TestCase subclasses and rewriting tests as plain functions (or pytest-style
classes without unittest.TestCase) that use pytest fixtures/markers; ensure you
adopt the repository's 9-phase test plan by hooking into the appropriate
fixtures from test/phases/ and add any required pytest markers (e.g.,
@pytest.mark.*) used elsewhere in tests so the file is consistent with pytest
discovery and selection.
python/sglang/srt/mem_cache/memory_pool_host.py (1)

1480-1487: ⚠️ Potential issue | 🟠 Major

HostPoolGroup serializes only the anchor pool.

get_data_page(), get_dummy_flat_data_page(), and set_from_flat_data_page() all delegate to anchor_entry.host_pool. If this wrapper is used for L3 write/read, every extra PoolEntry disappears on disk round-trips, so restore can bring back KV without the matching Mamba state — like saving the label on the box but not what’s inside it.

Based on learnings, Mamba SSM snapshot persistence is the primary architectural feature being added to SGLang for fast multi-turn inference in the fork.
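
One hedged way to make the group serialize every entry. Here pool_entries and the per-pool page methods are assumed names, and the real flat-page format is likely packed bytes rather than a Python list:

class HostPoolGroup:
    """Sketch: fan page reads/writes out to every pool entry, not just the anchor."""

    def __init__(self, pool_entries):
        self.pool_entries = pool_entries  # anchor entry first, extras after

    def get_data_page(self, index):
        # Collect each entry's page so KV and Mamba state travel together.
        return [e.host_pool.get_data_page(index) for e in self.pool_entries]

    def set_from_flat_data_page(self, index, pages):
        assert len(pages) == len(self.pool_entries)
        for entry, page in zip(self.pool_entries, pages):
            entry.host_pool.set_from_flat_data_page(index, page)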

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/sglang/srt/mem_cache/memory_pool_host.py` around lines 1480 - 1487,
HostPoolGroup currently only proxies get_data_page, get_dummy_flat_data_page,
and set_from_flat_data_page to anchor_entry.host_pool which causes only the
anchor pool to be serialized; change these methods in HostPoolGroup to operate
on the specific PoolEntry corresponding to the requested index (or
iterate/coordinate across all PoolEntry.host_pool instances for dummy/read/write
operations) so that every PoolEntry.host_pool state is included in
persistence/restore. Locate HostPoolGroup and revise get_data_page(index, flat),
get_dummy_flat_data_page(), and set_from_flat_data_page(index, data_page) to
reference the correct PoolEntry (not just anchor_entry) and ensure the
serialization logic writes/reads each PoolEntry.host_pool state so KV and Mamba
state remain consistent after disk round-trips.
python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py (1)

242-247: ⚠️ Potential issue | 🟠 Major

DRAFT_EXTEND_V2 still initializes linear metadata with None.

HybridLinearAttnBackend.init_forward_metadata() calls every sub-backend. In this file, both Mamba2AttnBackend.init_forward_metadata() and LightningAttentionBackend.init_forward_metadata() immediately pass metadata.query_start_loc into prepare_*, so returning None here can still fail before layer dispatch ever skips the Mamba path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py` around
lines 242 - 247, The branch for forward_batch.forward_mode.is_draft_extend_v2()
sets query_start_loc = None which breaks downstream init_forward_metadata calls
that immediately forward metadata.query_start_loc into prepare_*; change this to
a valid integer (e.g., 0) instead of None so
Mamba2AttnBackend.init_forward_metadata and
LightningAttentionBackend.init_forward_metadata receive a usable
metadata.query_start_loc during HybridLinearAttnBackend.init_forward_metadata
execution (ensure the value is an int derived from the forward_batch or default
to 0).
python/sglang/srt/mem_cache/session_aware_cache.py (1)

20-27: ⚠️ Potential issue | 🟠 Major

Give _VirtualNode the tiny HiCache node interface, or stop returning it as last_host_node.

match_prefix() currently hands downstream HiCache code a sentinel with no backuped, parent, get_last_hash_value(), or get_prefix_hash_values(). With streaming sessions plus storage prefetch enabled, the first slot hit will crash with AttributeError instead of just skipping prefetch.

🩹 One lightweight way to make the sentinel safe
 class _VirtualNode:
     """Sentinel node for streaming session requests.
@@
-    pass
+    backuped = False
+    parent = None
+
+    def get_last_hash_value(self):
+        return None
+
+    def get_prefix_hash_values(self, _node):
+        return []

Also applies to: 177-181

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/sglang/srt/mem_cache/session_aware_cache.py` around lines 20 - 27,
match_prefix() can return the sentinel _VirtualNode as last_host_node but that
sentinel lacks the small HiCache node interface leading to AttributeError when
downstream code expects attributes/methods like backuped, parent,
get_last_hash_value(), and get_prefix_hash_values(); fix by giving _VirtualNode
that tiny interface (add attributes backuped and parent and stub methods
get_last_hash_value and get_prefix_hash_values that return safe defaults) or
change match_prefix() to never return _VirtualNode as last_host_node; update the
_VirtualNode class (and the other sentinel at lines ~177-181) to implement these
minimal members so streaming-session code and prefetch can safely call them.
python/sglang/srt/managers/scheduler.py (1)

2594-2606: ⚠️ Potential issue | 🟠 Major

Keep the fast-fail Req wired to the original worker and trace.

This fallback constructor drops http_worker_ipc and time_stats. In multi-http-worker mode the abort can lose its way home, and this path also disappears from the request-level timing/trace data.

🔧 Minimal fix
             req = Req(
                 recv_req.rid,
                 recv_req.input_text,
                 recv_req.input_ids,
                 recv_req.sampling_params,
                 vocab_size=self.model_config.vocab_size,
+                http_worker_ipc=recv_req.http_worker_ipc,
+                time_stats=recv_req.time_stats,
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/sglang/srt/managers/scheduler.py` around lines 2594 - 2606, The
fast-fail Req being constructed drops the original request's worker/trace info
so aborts and timing are lost; when creating the fallback Req (the one using
recv_req.rid/... and calling set_finish_with_abort, init_req_max_new_tokens, and
_add_request_to_queue) copy over recv_req.http_worker_ipc and
recv_req.time_stats (and any trace/worker fields used for routing/metrics) onto
the new req so the abort will route back to the original worker and the
request-level timing/trace data is preserved.
python/sglang/srt/mem_cache/session_aware_cache.py (1)

68-75: ⚠️ Potential issue | 🟠 Major

Release the slot-held Mamba allocation when the session closes.

save_from_req() moves mamba_pool_idx out of the request, but release_session() only frees KV indices and the req-pool slot. On hybrid/Mamba models, closing enough streaming sessions will slowly exhaust the Mamba pool even though the sessions are gone.

Also applies to: 218-238
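
A minimal sketch of the missing cleanup; the attribute and allocator names (mamba_pool_idx, mamba_pool.free) are assumptions:

def release_session(self, session):
    # Method sketch for the SessionAwareCache release path.
    # ...existing KV-index and req-pool-slot cleanup stays as-is...
    # Also return the slot-held Mamba allocation, if any.
    mamba_idx = getattr(session, "mamba_pool_idx", None)
    if mamba_idx is not None:
        self.mamba_pool.free(mamba_idx)
        session.mamba_pool_idx = None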

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/sglang/srt/mem_cache/session_aware_cache.py` around lines 68 - 75, The
session leaves a held Mamba allocation behind because save_from_req() moves
mamba_pool_idx out of the request but release_session() only clears req_pool_idx
and KV indices; update release_session() to also free the mamba slot when
present by checking session.mamba_pool_idx (and any related fields like
mamba_ping_pong_track_buffer, mamba_next_track_idx, mamba_last_track_seqlen,
mamba_branching_seqlen), return that index to the Mamba pool allocator and set
session.mamba_pool_idx (and the corresponding req.mamba_* fields if applicable)
to None/cleared; apply the same cleanup logic to the other release branch
referenced around the 218-238 area so hybrid/Mamba sessions do not leak pool
slots.
python/sglang/srt/managers/session_controller.py (1)

116-133: ⚠️ Potential issue | 🟠 Major

Don’t rotate a streaming session onto a still-running request.

The streaming path reuses the previous request without checking last_req.finished(). If another turn sneaks in while the first one is still decoding, the new request is built from partial output_ids and the old request loses its session link mid-flight — like changing train tracks while the cars are still moving.

🧭 Suggested guard
         elif self.req_nodes:
             assert len(self.req_nodes) == 1
-            _, last_req_node = self.req_nodes.popitem()
+            last_rid, last_req_node = next(iter(self.req_nodes.items()))
             last_req = last_req_node.req
+            if not last_req.finished():
+                abort = True
+                abort_message = (
+                    "Streaming session request is appending to a request that hasn't finished."
+                )
+                logging.warning(abort_message)
+            else:
+                self.req_nodes.pop(last_rid)

Also applies to: 225-233

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/sglang/srt/managers/session_controller.py` around lines 116 - 133,
When reusing the previous request in the streaming path, guard against rotating
onto a still-running request: in the block where self.streaming and
self.req_nodes is inspected (the code that pops last_req_node and sets
last_req), first peek the last_req_node.req and call its finished() (or
equivalent) — if finished() is False then do not pop or reuse it; instead set
abort = True and abort_message = "Cannot rotate streaming session onto a
still-running request." Apply the same guard in the other similar block
referenced (the block around lines 225-233) so both places check
last_req.finished() before reusing or popping the previous request.
🟡 Minor comments (7)
python/sglang/test/run_eval.py (1)

101-104: ⚠️ Potential issue | 🟡 Minor

Fix: Change --dataset-path default to None to allow per-eval defaults.

The GPQA fallback pattern looks defensive, but there's a logical trap. Since --dataset-path has a default value of "THUDM/LongBench-v2" (line 270), the command:

python3 -m sglang.test.run_eval --eval-name gpqa

...will set args.dataset_path to the LongBench-v2 default, making it truthy. The or fallback to the GPQA CSV URL never triggers—it's like setting up a safety net under the wrong apparatus. GPQAEval then receives the wrong dataset and fails on pandas.read_csv().

The fix is simple: change the --dataset-path default to None so each evaluator applies its own appropriate default. This is safe because LongBenchV2Eval has internal fallback logic (if not data_source: data_source = DEFAULT_DATASET), so passing None still works correctly for LongBench v2.

Proposed fix: Change default to None
     parser.add_argument(
         "--dataset-path",
         type=str,
-        default="THUDM/LongBench-v2",
+        default=None,
         help="Path to dataset file or HuggingFace dataset name for LongBench-v2",
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/sglang/test/run_eval.py` around lines 101 - 104, Change the CLI
default for --dataset-path from the string "THUDM/LongBench-v2" to None so
per-evaluator defaults can apply; update the argument definition that creates
args.dataset_path (the add_argument call for "dataset_path") to use
default=None, ensuring GPQAEval(filename, ...) sees the intended None/fallback
path and LongBenchV2Eval still handles None via its internal fallback logic.
docs/superpowers/plans/2026-03-31-phase-04-mamba-cache-architecture-reconcile.md (1)

477-484: ⚠️ Potential issue | 🟡 Minor

Remove the first import check; python.sglang is not a valid import path in this repo.

The first import (import python.sglang...) is like checking if a door exists by walking into a wall—it never works and masks whether the real door (the sys.path pattern) actually opens. Since the package root is sglang (per pyproject.toml), only the second import pattern with sys.path.insert(0, 'python') is valid. The codebase uses this pattern consistently everywhere, so simplify the script to rely on the one path that actually works.

🔧 Suggested adjustment
-python -c "
-import python.sglang.srt.snapshot.mamba_snapshot as ms
-print('mamba_snapshot imports OK')
-" 2>&1 || python -c "
+python -c "
 import sys; sys.path.insert(0, 'python')
 from sglang.srt.snapshot.mamba_snapshot import MambaSnapshotManager
 print('MambaSnapshotManager importable')
 " 2>&1 || echo "Import check deferred to runtime validation"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@docs/superpowers/plans/2026-03-31-phase-04-mamba-cache-architecture-reconcile.md`
around lines 477 - 484, Remove the invalid import check that tries to import
python.sglang.srt.snapshot.mamba_snapshot and instead keep the working pattern
that prepends 'python' to sys.path and imports MambaSnapshotManager from
sglang.srt.snapshot.mamba_snapshot; specifically, delete the first python -c
block (the "import python.sglang..." check) and rely on the second block that
does sys.path.insert(0, 'python') followed by from
sglang.srt.snapshot.mamba_snapshot import MambaSnapshotManager so the import
validation uses the correct package root and matches other scripts.
python/sglang/srt/mem_cache/base_prefix_cache.py (1)

132-136: ⚠️ Potential issue | 🟡 Minor

Keep host_hit_length as a count, not a sentinel.

Using max(kv_host_tokens, 1-if-mamba-on-host) turns this field into both an odometer and a warning light. Downstream accounting still treats it as literal token count, so a state-only host restore will be reported as a one-token host hit unless every consumer special-cases the sentinel.
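
A sketch of the separation; the field names here are illustrative, not the actual MatchResult dataclass:

from dataclasses import dataclass
from typing import Any

@dataclass
class MatchResult:
    host_hit_length: int               # literal count of host-resident KV tokens
    mamba_state_on_host: bool = False  # the "warning light", carried separately
    last_host_node: Any = None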

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/sglang/srt/mem_cache/base_prefix_cache.py` around lines 132 - 136, The
host_hit_length field is being used as both a token count and a sentinel via
max(kv_host_tokens, 1 - if-mamba-on-host); change it to always carry the literal
token count (i.e., set host_hit_length = kv_host_tokens) and stop injecting the
1-as-sentinel behavior. Introduce or set a separate boolean/enum flag (e.g.,
host_state_only or mamba_host_restore) to indicate a state-only/mamba-on-host
restore when applicable (use the existing mamba-on-host condition to populate
that flag). Update the code paths that compute and consume host_hit_length
(refer to host_hit_length and kv_host_tokens in this module) and adjust any
docs/comments to state host_hit_length is a pure count and the new flag
indicates state-only restores.
python/sglang/srt/managers/io_struct.py (1)

1642-1643: ⚠️ Potential issue | 🟡 Minor

Validate timeout to reject non-positive values.

timeout currently accepts negative (or zero) values, which can cause confusing immediate-timeout behavior at session open.

Suggested fix
 @dataclass
 class OpenSessionReqInput(BaseReq):
     capacity_of_str_len: int
     session_id: Optional[str] = None
     streaming: bool = False
     timeout: Optional[float] = None
+
+    def __post_init__(self):
+        if self.timeout is not None and self.timeout <= 0:
+            raise ValueError("timeout must be > 0 when provided.")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/sglang/srt/managers/io_struct.py` around lines 1642 - 1643, Validate
the timeout attribute so it cannot be zero or negative: in the class or
dataclass that defines the streaming and timeout fields (the object containing
attributes streaming: bool and timeout: Optional[float]), add a check in the
constructor/__post_init__ (or the setter used for timeout) that raises a
ValueError if timeout is not None and timeout <= 0, with a clear message like
"timeout must be positive or None"; ensure callers that pass timeout get the
same validation behavior when constructing or modifying the instance.
python/sglang/srt/server_args.py (1)

2096-2110: ⚠️ Potential issue | 🟡 Minor

Reject unknown mamba_backend values in _handle_mamba_backend.

argparse guards the CLI path, but direct ServerArgs(...) callers can still pass arbitrary strings and this method currently treats any non-flashinfer value as valid. That lets a typo slip past the front desk and blow up later in backend dispatch.

Suggested guard
 def _handle_mamba_backend(self):
+    if self.mamba_backend not in MAMBA_BACKEND_CHOICES:
+        raise ValueError(
+            f"Invalid mamba backend: {self.mamba_backend}. "
+            f"Expected one of {MAMBA_BACKEND_CHOICES}."
+        )
+
     if self.mamba_backend == "flashinfer":
         if is_flashinfer_available():
             try:
                 import flashinfer.mamba  # noqa: F401
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/sglang/srt/server_args.py` around lines 2096 - 2110, The
_handle_mamba_backend method currently only handles the "flashinfer" case and
silently accepts any other mamba_backend value; update it to explicitly reject
unknown values by validating self.mamba_backend against allowed backends (e.g.,
"flashinfer") and raising a ValueError with a clear message when the value is
not recognized. Locate the _handle_mamba_backend method and add an else branch
(or an initial validation) that checks self.mamba_backend and raises
ValueError("Unknown mamba_backend: {self.mamba_backend}. Supported backends:
flashinfer") so direct ServerArgs(...) callers cannot pass arbitrary strings.
python/sglang/srt/layers/attention/mamba/ops/ssu_dispatch.py (1)

21-42: ⚠️ Potential issue | 🟡 Minor

Move the abstract __call__ body onto its own line.

def __call__(...) -> None: ... is the E704 flake8 error here, so this file will keep pre-commit red until the body is rewritten as a normal block.

✏️ Lint-safe form
     @abstractmethod
     def __call__(
         self,
         state: torch.Tensor,
         x: torch.Tensor,
@@
         cache_steps: int | None = None,
         retrieve_parent_token: torch.Tensor | None = None,
         intermediate_state_indices: torch.Tensor | None = None,
-    ) -> None: ...
+    ) -> None:
+        pass

As per coding guidelines, "Lint code using ruff via pre-commit hooks".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/sglang/srt/layers/attention/mamba/ops/ssu_dispatch.py` around lines 21
- 42, The abstractmethod decorated __call__ signature currently ends with a
trailing ellipsis causing E704; change the stub from a single-line "def
__call__(...) -> None: ..." to a proper block for the abstractmethod (e.g.,
place the colon at end of the signature line and add an indented body on the
next line) and use a suitable no-op or explicit abstract implementation such as
"raise NotImplementedError" (or "pass") inside the body so the abstract __call__
in this class is lint-compliant; look for the __call__ method definition with
the `@abstractmethod` decorator to apply this change.
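
A minimal sketch of the lint-safe shape, with class and parameter names simplified rather than copied from the real ssu_dispatch module:

from abc import ABC, abstractmethod

import torch


class StateScatterOp(ABC):  # illustrative protocol, not the real dispatch class
    @abstractmethod
    def __call__(
        self,
        state: torch.Tensor,
        x: torch.Tensor,
        cache_steps: int | None = None,
    ) -> None:
        # E704 only fires when the body shares the def line; a block body
        # (raise NotImplementedError, pass, or a bare ellipsis) keeps ruff green.
        raise NotImplementedError
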
python/sglang/srt/mem_cache/memory_pool.py-551-557 (1)

551-557: ⚠️ Potential issue | 🟡 Minor

F-strings missing placeholders — should be regular strings.

Static analysis flagged these f-strings as having no interpolation. They should be plain strings to avoid confusion.

         assert len(select_index) == len(
             mamba_indices
-        ), f"Not enough space for mamba cache, try to increase --mamba-full-memory-ratio or --max-mamba-cache-size."
+        ), "Not enough space for mamba cache, try to increase --mamba-full-memory-ratio or --max-mamba-cache-size."
         if self.enable_mamba_extra_buffer:
             assert len(select_index) == len(
                 mamba_ping_pong_track_buffers
-            ), f"Not enough space for mamba ping pong idx, try to increase --mamba-full-memory-ratio."
+            ), "Not enough space for mamba ping pong idx, try to increase --mamba-full-memory-ratio."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/sglang/srt/mem_cache/memory_pool.py` around lines 551 - 557, Replace
the two assert messages that are written as f-strings but contain no
interpolation with plain string literals; locate the assertions in
memory_pool.py where select_index, mamba_indices, and (when
enable_mamba_extra_buffer is True) mamba_ping_pong_track_buffers are compared
and change the f"... " messages to regular strings like "Not enough space for
mamba cache, try to increase --mamba-full-memory-ratio or
--max-mamba-cache-size." and "Not enough space for mamba ping pong idx, try to
increase --mamba-full-memory-ratio." to avoid misleading f-strings in the code
paths using select_index, mamba_indices, and enable_mamba_extra_buffer.
🧹 Nitpick comments (5)
.gitignore (1)

241-242: Nice organizational touch!

The new "# Other repos" section clearly groups external repository ignores.

If lmms-eval refers specifically to a cloned directory (which seems likely), consider using lmms-eval/ with a trailing slash for precision—think of it like a targeted spell versus an area-of-effect one. Without the slash, it'll match both files and directories named lmms-eval, which might be broader than needed.

📁 Optional refinement for directory-specific pattern
 # Other repos
-lmms-eval
+lmms-eval/
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.gitignore around lines 241 - 242, Change the .gitignore entry so it matches
only the directory by updating the existing "lmms-eval" pattern to a
directory-specific pattern; locate the "lmms-eval" line under the "# Other
repos" section and replace it with a trailing-slash pattern (use "lmms-eval/")
so only the cloned directory is ignored, not files that might share the same
name.
test/registered/models/test_nvidia_nemotron_3_nano.py (1)

29-39: Nice coverage target; please land it in the phase-based pytest plan instead of another unittest class.

This grows the legacy registered matrix again, so the flashinfer coverage lives in a parallel lane rather than the repo’s main 9-phase flow.

As per coding guidelines, test/**/*.py: Use pytest for testing with the 9-phase test plan structure in test/phases/.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/registered/models/test_nvidia_nemotron_3_nano.py` around lines 29 - 39,
The test class TestNvidiaNemotron3Nano30BBF16FlashInfer should be converted from
a unittest-style registered matrix class into a pytest phase-based test under
the repo’s 9-phase plan: remove the class and instead implement a pytest test in
the phases structure that invokes the same model
("nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"), model_config_name
("lm_eval_configs/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16.yaml"), other_args
(including "--tp-size","2","--mamba-backend","flashinfer" plus
NEMOTRON_3_NANO_THINKING_ARGS) and integrates with the existing phase
fixtures/markers used by the test/phases flow so the flashinfer coverage runs in
the main 9-phase pipeline rather than as a parallel unittest class.
test/registered/4-gpu-models/test_qwen3_next_models_mtp.py (1)

215-297: Useful spec-v2 coverage; please land it in test/phases/ with pytest.

Adding another unittest suite here keeps the new validation outside the repo’s phase plan, which makes the test story harder to read and maintain.

As per coding guidelines, test/**/*.py: Use pytest for testing with the 9-phase test plan structure in test/phases/.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/registered/4-gpu-models/test_qwen3_next_models_mtp.py` around lines 215
- 297, The TestQwen3NextMTPV2 unittest class should be converted into a pytest
test module under test/phases/ to follow the 9-phase test plan; replace the
class-based setUpClass/tearDownClass (envs.SGLANG_ENABLE_SPEC_V2 toggling and
popen_launch_server/kill_process_tree usage) with pytest fixtures that launch
the server (popen_launch_server) and yield the base_url/process, and rewrite
test_gsm8k as a plain pytest test function that calls run_eval and asserts
against ACC_THRESHOLDS[self.model]["gsm8k"]; ensure helper names
(popen_launch_server, kill_process_tree, run_eval, ACC_THRESHOLDS,
QWEN3_NEXT_MODEL, envs.SGLANG_ENABLE_SPEC_V2) are imported or referenced the
same way and place the file in the appropriate phase slot so it integrates with
the repo phase plan.
python/sglang/srt/mem_cache/hi_mamba_radix_cache.py (1)

2079-2086: Potential issue: _alloc_with_evict can return None, so indexing its result with [0] is fragile.

At line 2086, _alloc_with_evict(...)[0] assumes the allocation succeeded. However, _alloc_with_evict returns None if allocation fails even after eviction. Here error_message is set, so the failure path raises RuntimeError instead of returning None, but relying on that implicit behavior may not be intended.

Consider explicitly checking for None before indexing, or ensuring the exception is the desired outcome:

indices = self._alloc_with_evict(
    self.req_to_token_pool.mamba_pool,
    len(last_hit_node.mamba_host_value),
    self.evict_mamba,
    lock_node=last_hit_node,
    error_message="Cannot alloc request mamba cache for host load back",
)
# indices is guaranteed non-None due to error_message raising RuntimeError
req.mamba_pool_idx = indices[0]

The current code works correctly because error_message is provided, triggering the exception. But the inline [0] makes it easy to miss this dependency during code review.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/sglang/srt/mem_cache/hi_mamba_radix_cache.py` around lines 2079 -
2086, The code currently indexes the result of _alloc_with_evict(...) directly
with [0], which can be None on failure; change it to first capture the return
value from _alloc_with_evict (called with self.req_to_token_pool.mamba_pool,
len(last_hit_node.mamba_host_value), self.evict_mamba, lock_node=last_hit_node,
error_message="Cannot alloc request mamba cache for host load back") into a
local variable, then either assert/raise if that variable is None or rely on the
provided error_message to raise and only then set req.mamba_pool_idx =
indices[0]; this makes the dependency on _alloc_with_evict’s exception behavior
explicit and avoids an accidental None indexing bug.
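
A self-contained toy of the contract in question; the allocator below is simplified, while the real method takes pool and eviction arguments as in the snippet above:

from typing import List, Optional


def alloc_with_evict(
    need: int, free: int, error_message: Optional[str] = None
) -> Optional[List[int]]:
    # Mirrors the contract: raise when error_message is given, else return None.
    if need > free:  # allocation failed even after (simulated) eviction
        if error_message is not None:
            raise RuntimeError(error_message)
        return None
    return list(range(need))


indices = alloc_with_evict(4, 8, error_message="Cannot alloc request mamba cache")
if indices is None:
    # Unreachable while error_message is set, but it makes the dependency on
    # the raising path explicit instead of hiding it behind an inline [0].
    raise RuntimeError("mamba cache allocation failed after eviction")
mamba_pool_idx = indices[0]
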
python/sglang/srt/mem_cache/hicache_storage.py (1)

493-517: Consider the edge case where page_size is not defined on host_pool.

Line 498 uses getattr(host_pool, "page_size", 1) or 1, which handles missing/None/0 cases well. However, line 514 indexes host_indices at i * page_size. If page_size > 1 and host_indices is shorter than expected (i.e., len(keys) * page_size != host_indices.numel()), that access can run out of bounds for the last few entries.

The validation at lines 502-510 should catch this, but the error message could state more clearly what "expected" vs "got" means in terms of page alignment.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: db6f4581-f1c2-4242-9a7b-7a059a7d256f

📥 Commits

Reviewing files that changed from the base of the PR and between 8bbfae0 and c813876.

📒 Files selected for processing (60)
  • .beads
  • .gitignore
  • .remember/tmp/save-session.pid
  • CLAUDE.md
  • benchmark/hicache/bench_multiturn.py
  • docs/superpowers/plans/2026-03-31-phase-04-mamba-cache-architecture-reconcile.md
  • python/sglang/cli/main.py
  • python/sglang/cli/serve.py
  • python/sglang/cli/utils.py
  • python/sglang/launch_server.py
  • python/sglang/multimodal_gen/runtime/platforms/__init__.py
  • python/sglang/srt/batch_overlap/two_batch_overlap.py
  • python/sglang/srt/compilation/backend.py
  • python/sglang/srt/compilation/compile.py
  • python/sglang/srt/connector/__init__.py
  • python/sglang/srt/disaggregation/ascend/conn.py
  • python/sglang/srt/disaggregation/decode.py
  • python/sglang/srt/disaggregation/mooncake/conn.py
  • python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py
  • python/sglang/srt/layers/attention/mamba/causal_conv1d.py
  • python/sglang/srt/layers/attention/mamba/mamba_state_scatter_triton.py
  • python/sglang/srt/layers/attention/mamba/ops/__init__.py
  • python/sglang/srt/layers/attention/mamba/ops/ssd_combined.py
  • python/sglang/srt/layers/attention/mamba/ops/ssu_dispatch.py
  • python/sglang/srt/managers/cache_controller.py
  • python/sglang/srt/managers/io_struct.py
  • python/sglang/srt/managers/schedule_batch.py
  • python/sglang/srt/managers/schedule_policy.py
  • python/sglang/srt/managers/scheduler.py
  • python/sglang/srt/managers/scheduler_output_processor_mixin.py
  • python/sglang/srt/managers/session_controller.py
  • python/sglang/srt/managers/tokenizer_communicator_mixin.py
  • python/sglang/srt/mem_cache/base_prefix_cache.py
  • python/sglang/srt/mem_cache/hi_mamba_radix_cache.py
  • python/sglang/srt/mem_cache/hicache_storage.py
  • python/sglang/srt/mem_cache/hybrid_cache/hybrid_cache_controller.py
  • python/sglang/srt/mem_cache/mamba_radix_cache.py
  • python/sglang/srt/mem_cache/memory_pool.py
  • python/sglang/srt/mem_cache/memory_pool_host.py
  • python/sglang/srt/mem_cache/session_aware_cache.py
  • python/sglang/srt/model_executor/forward_batch_info.py
  • python/sglang/srt/model_executor/model_runner.py
  • python/sglang/srt/model_executor/model_runner_kv_cache_mixin.py
  • python/sglang/srt/models/nemotron_h.py
  • python/sglang/srt/observability/scheduler_metrics_mixin.py
  • python/sglang/srt/server_args.py
  • python/sglang/srt/speculative/eagle_info.py
  • python/sglang/srt/speculative/eagle_info_v2.py
  • python/sglang/srt/speculative/eagle_worker.py
  • python/sglang/srt/speculative/eagle_worker_v2.py
  • python/sglang/srt/utils/common.py
  • python/sglang/test/run_eval.py
  • test/registered/4-gpu-models/test_qwen3_next_models_mtp.py
  • test/registered/layers/mamba/conftest.py
  • test/registered/layers/mamba/test_mamba_ssm_ssd.py
  • test/registered/models/test_nvidia_nemotron_3_nano.py
  • test/registered/radix_cache/test_mamba_radix_cache_comprehensive.py
  • test/registered/radix_cache/test_mamba_unittest.py
  • test/run_suite.py
  • test/unit/test_mamba_state_scatter_triton.py
💤 Files with no reviewable changes (1)
  • python/sglang/srt/compilation/compile.py

Comment thread .remember/tmp/save-session.pid Outdated
@@ -0,0 +1 @@
215077

@coderabbitai coderabbitai Bot Mar 31, 2026


⚠️ Potential issue | 🔴 Critical

Remove this temporary PID file from version control.

This appears to be a process ID file from a development session—think of it like accidentally committing your laptop's Bluetooth pairing code. The number 215077 is a process ID that was valid on your machine at one moment in time, but it's meaningless on other machines and becomes stale the instant that process exits.

Files in .remember/tmp/ with .pid extensions are ephemeral runtime state and should never be committed to the repository. They'll cause unnecessary churn, potential merge conflicts, and clutter the commit history.

🧹 Recommended fix

Step 1: Remove the file from this commit:

git rm .remember/tmp/save-session.pid

Step 2: Add this pattern to .gitignore to prevent similar files from being committed in the future:

+# Runtime state files
+.remember/tmp/
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.remember/tmp/save-session.pid at line 1, Remove the committed ephemeral PID
file .remember/tmp/save-session.pid from the repository and stop tracking it:
run git rm on that file (to remove it from the index) and commit the removal;
then add an ignore rule to .gitignore (e.g., ignore .remember/tmp/*.pid or
.remember/tmp/) so future PID/runtime files are not tracked, and commit the
updated .gitignore. Ensure you do not alter other files and that the removal
commit message clearly states the file was removed because it is ephemeral.

Collaborator Author


Good catch — this is an accidental commit from a dev session. Removing from tracking and adding .remember/tmp/ to .gitignore.



@KHAEntertainment, thanks for confirming! Removing it with git rm --cached .remember/tmp/save-session.pid (to drop it from the index without deleting the file on disk, if you still need it locally) and adding .remember/tmp/ to .gitignore is exactly the right move. That pattern will catch any future .pid or other ephemeral runtime files that land in that directory. 🎉


🧠 Learnings used
Learnt from: KHAEntertainment
Repo: KHAEntertainment/sglang-mamba PR: 12
File: .gitignore:283-284
Timestamp: 2026-03-31T17:46:59.790Z
Learning: In the KHAEntertainment/sglang-mamba repository, the `*.db` pattern in `.gitignore` is intentional and project-wide. It targets only Beads/Dolt artifacts generated by `bd init`. There are no DB fixture files in this project that need to be tracked, so the broad glob is acceptable. Do not flag this as overly broad in future reviews unless DB fixtures are introduced.

Learnt from: KHAEntertainment
Repo: KHAEntertainment/sglang-mamba PR: 12
File: python/sglang/srt/managers/scheduler.py:1446-1479
Timestamp: 2026-03-31T20:56:26.697Z
Learning: In `python/sglang/srt/managers/scheduler.py`, the `handle_restore_snapshot` method's `create_new_request=True` code path is an intentional stub in the KHAEntertainment/sglang-mamba fork. The full lifecycle (rid_to_state registration, http_worker_ipc plumbing, queue-acceptance feedback, and bind/continue RPC) is planned to be addressed in PR 5 (phase-08 snapshot re-port + validation). Do not flag the missing owner registration or fire-and-forget _add_request_to_queue call as bugs requiring a fix in earlier PRs.


Comment on lines +367 to +387
# Barrier logic: release next round when all clients for
# current barrier round have completed
if (
self.enable_round_barrier
and current_barrier_round < self.max_rounds
):
barrier_round_completed += 1
expected = self.clients_per_round[current_barrier_round]
if barrier_round_completed == expected:
print(
f"\n Barrier: round {current_barrier_round} complete "
f"({expected} clients), releasing {len(next_round_reqs)} "
f"requests for round {current_barrier_round + 1}"
)
self._send_heartbeat(input_len=100, output_len=100)
time.sleep(10)
for req in next_round_reqs:
self.ready_queue.append(req)
next_round_reqs = []
current_barrier_round += 1
barrier_round_completed = 0

@coderabbitai coderabbitai Bot Mar 31, 2026


⚠️ Potential issue | 🔴 Critical

Barrier mode currently crashes and still leaks the next round early.

Lines 371-386 use current_barrier_round / barrier_round_completed before assignment, and self.max_rounds / self.clients_per_round are never initialized. Even after fixing that, Line 360 already drains next_round_reqs, so this block does not actually own the gate — it’s an airlock with one door still open.

🛠️ Proposed fix
     def response_handler(self):
         next_round_reqs = []
+        barrier_round_completed = 0
         while True:
             try:
                 client_id, response = self.response_queue.get(
                     timeout=10
                 )  # Block until response is available
@@
                     if self.enable_round_barrier:
                         next_round_reqs.append(new_req)
-                        if len(next_round_reqs) == self.num_clients:
-                            for req in next_round_reqs:
-                                self.ready_queue.append(req)
-                            next_round_reqs = []
                     else:
                         self.ready_queue.append(new_req)

                 # Barrier logic: release next round when all clients for
                 # current barrier round have completed
                 if (
                     self.enable_round_barrier
-                    and current_barrier_round < self.max_rounds
+                    and current_round < self.num_rounds - 1
                 ):
                     barrier_round_completed += 1
-                    expected = self.clients_per_round[current_barrier_round]
+                    expected = self.num_clients
                     if barrier_round_completed == expected:
                         print(
-                            f"\n  Barrier: round {current_barrier_round} complete "
+                            f"\n  Barrier: round {current_round} complete "
                             f"({expected} clients), releasing {len(next_round_reqs)} "
-                            f"requests for round {current_barrier_round + 1}"
+                            f"requests for round {current_round + 1}"
                         )
                         self._send_heartbeat(input_len=100, output_len=100)
                         time.sleep(10)
                         for req in next_round_reqs:
                             self.ready_queue.append(req)
                         next_round_reqs = []
-                        current_barrier_round += 1
                         barrier_round_completed = 0
🧰 Tools
🪛 Flake8 (7.3.0)

[error] 371-371: undefined name 'current_barrier_round'

(F821)


[error] 373-373: undefined name 'barrier_round_completed'

(F821)


[error] 374-374: undefined name 'current_barrier_round'

(F821)


[error] 377-377: undefined name 'current_barrier_round'

(F821)


[error] 379-379: undefined name 'current_barrier_round'

(F821)


[error] 386-386: undefined name 'current_barrier_round'

(F821)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmark/hicache/bench_multiturn.py` around lines 367 - 387, The barrier
branch uses and compares variables that may be uninitialized
(current_barrier_round, barrier_round_completed, self.max_rounds,
self.clients_per_round) and also loses control because next_round_reqs is
drained earlier; fix by initializing barrier state in the class (e.g., set
self.max_rounds and self.clients_per_round when barrier mode is enabled and
initialize current_barrier_round and barrier_round_completed to 0), ensure this
block only reads/writes those instance vars (not local undefined names), and
change the earlier logic that currently moves items from next_round_reqs into
ready_queue on line ~360 so that when self.enable_round_barrier is True items
are appended to next_round_reqs (not ready_queue) until the barrier condition in
this block (compare barrier_round_completed to
self.clients_per_round[current_barrier_round]) is met; keep the release sequence
here (print, _send_heartbeat, sleep, extend ready_queue from next_round_reqs,
clear next_round_reqs, increment current_barrier_round, reset
barrier_round_completed).

Collaborator Author


Valid — barrier mode has several undefined variable references. Fixing: initializing barrier_round_completed, replacing current_barrier_round with the correct local variable, and fixing self.max_rounds/self.clients_per_round references.



@KHAEntertainment, glad the analysis was helpful! The key fixes to land are:

  1. Initialize barrier_round_completed = 0 at the top of response_handler (before the while True loop).
  2. Replace current_barrier_round with the already-available local current_round variable (or introduce a proper self.current_barrier_round instance field initialized in __init__).
  3. Replace self.max_rounds → self.num_rounds and self.clients_per_round[...] → self.num_clients (or initialize self.clients_per_round / self.max_rounds in __init__ when enable_round_barrier is true).
  4. Make sure the earlier next_round_reqs drain (the if len(next_round_reqs) == self.num_clients block) is removed or gated so the barrier block at Lines 370–386 is the sole gatekeeper — otherwise the gate is bypassed on every full round.

Once those are in place the barrier should work correctly end-to-end. 🎯
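
Pulling those four fixes together, a condensed sketch of the corrected handler; the queue plumbing is as in the benchmark, while _make_followup_request is a hypothetical stand-in for the existing request-construction code:

import time


def response_handler(self):
    next_round_reqs = []
    barrier_round_completed = 0  # fix 1: initialized before the loop
    current_round = 0            # fix 2: a real local instead of current_barrier_round
    while True:
        client_id, response = self.response_queue.get(timeout=10)
        new_req = self._make_followup_request(client_id, response)  # hypothetical helper
        if not self.enable_round_barrier:
            self.ready_queue.append(new_req)
            continue
        next_round_reqs.append(new_req)  # fix 4: no early drain; the gate below is the only release
        if current_round < self.num_rounds - 1:  # fix 3: num_rounds, not max_rounds
            barrier_round_completed += 1
            if barrier_round_completed == self.num_clients:  # fix 3: num_clients
                self._send_heartbeat(input_len=100, output_len=100)
                time.sleep(10)
                self.ready_queue.extend(next_round_reqs)
                next_round_reqs = []
                current_round += 1
                barrier_round_completed = 0
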


🧠 Learnings used
Learnt from: KHAEntertainment
Repo: KHAEntertainment/sglang-mamba PR: 12
File: python/sglang/srt/managers/scheduler.py:1446-1479
Timestamp: 2026-03-31T20:56:26.697Z
Learning: In `python/sglang/srt/managers/scheduler.py`, the `handle_restore_snapshot` method's `create_new_request=True` code path is an intentional stub in the KHAEntertainment/sglang-mamba fork. The full lifecycle (rid_to_state registration, http_worker_ipc plumbing, queue-acceptance feedback, and bind/continue RPC) is planned to be addressed in PR 5 (phase-08 snapshot re-port + validation). Do not flag the missing owner registration or fire-and-forget _add_request_to_queue call as bugs requiring a fix in earlier PRs.


Comment thread benchmark/hicache/bench_multiturn.py
Comment on lines +267 to +272
mem_pool_device = token_to_kv_pool_allocator.get_kvcache()
from sglang.srt.mem_cache.memory_pool import HybridLinearKVPool

if isinstance(mem_pool_device, HybridLinearKVPool):
mem_pool_device = mem_pool_device.full_kv_pool
self.mem_pool_device = mem_pool_device

@coderabbitai coderabbitai Bot Mar 31, 2026


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Inspect HybridLinearKVPool transfer-counter ownership =="
fd '^memory_pool\.py$' -x rg -n -C3 'class HybridLinearKVPool|def register_layer_transfer_counter|full_kv_pool|layer_transfer_counter' {}

echo
echo "== Inspect HiCacheController use sites =="
fd '^cache_controller\.py$' -x rg -n -C2 'get_kvcache|full_kv_pool|register_layer_transfer_counter' {}

Repository: KHAEntertainment/sglang-mamba

Length of output: 9755


Unwrapping HybridLinearKVPool before registering the transfer counter bypasses the hybrid wrapper's synchronization logic.

The pool is unwrapped to full_kv_pool at lines 270–271, then register_layer_transfer_counter is called on the unwrapped pool at line 296. This means the HybridLinearKVPool wrapper instance never receives the counter.

The design intent is clear in HybridLinearKVPool.register_layer_transfer_counter (memory_pool.py, lines 1329–1335): the comment states "The layer-wise wait logic is executed at the Hybrid LinearPool level; no additional wait is needed in the full_kv_pool." The wrapper's _wait_for_layer method checks self.layer_transfer_counter before synchronizing, but that field remains None because registration happens only on the inner pool. Result: the wrapper never waits, even though that's where the wait logic is supposed to live.

Move the registration to happen on the original pool before unwrapping, or store the wrapper reference separately for synchronization.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/sglang/srt/managers/cache_controller.py` around lines 267 - 272, The
code unwraps HybridLinearKVPool to its full_kv_pool and then registers the
transfer counter on the inner pool, which bypasses the HybridLinearKVPool's
synchronization; instead, ensure register_layer_transfer_counter is called on
the HybridLinearKVPool wrapper (the original mem_pool_device) before you replace
mem_pool_device with mem_pool_device.full_kv_pool, or keep a reference to the
wrapper and call wrapper.register_layer_transfer_counter(...) so that
HybridLinearKVPool._wait_for_layer sees the layer_transfer_counter; update the
logic around mem_pool_device, HybridLinearKVPool, full_kv_pool, and
register_layer_transfer_counter accordingly.
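
Whatever the fork decides here, the registration-ordering point is easy to see in a runnable toy; the classes below are simplified stand-ins, not the real memory_pool types:

class InnerPool:  # stand-in for full_kv_pool
    def __init__(self):
        self.layer_transfer_counter = None

    def register_layer_transfer_counter(self, counter):
        self.layer_transfer_counter = counter


class HybridWrapper:  # stand-in for HybridLinearKVPool
    def __init__(self, inner):
        self.full_kv_pool = inner
        self.layer_transfer_counter = None  # the layer-wise wait logic lives here

    def register_layer_transfer_counter(self, counter):
        self.layer_transfer_counter = counter

    def wait_for_layer(self, layer_id):
        # Mirrors _wait_for_layer: only waits when a counter was registered.
        if self.layer_transfer_counter is not None:
            print(f"waiting for layer {layer_id}")


wrapper = HybridWrapper(InnerPool())
pool = wrapper.full_kv_pool                     # unwrap first (upstream's ordering)
pool.register_layer_transfer_counter(object())  # counter lands on the inner pool
wrapper.wait_for_layer(0)                       # prints nothing: wrapper counter is still None
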

Collaborator Author


Not addressing — this is upstream HybridCacheController.__init__ code. The HybridLinearKVPool unwrapping and register_layer_transfer_counter call ordering is upstream's design from sgl-project/sglang. Modifying it here would create fork divergence.



@KHAEntertainment, understood — noted that the HybridLinearKVPool unwrapping and register_layer_transfer_counter call ordering in HiCacheController.__init__ is upstream design from sgl-project/sglang and should not be diverged from in the fork. I'll remember this for future reviews.


✏️ Learnings added
Learnt from: KHAEntertainment
Repo: KHAEntertainment/sglang-mamba PR: 13
File: python/sglang/srt/managers/cache_controller.py:267-272
Timestamp: 2026-03-31T22:17:57.025Z
Learning: In `python/sglang/srt/managers/cache_controller.py`, the `HiCacheController.__init__` method's pattern of unwrapping `HybridLinearKVPool` to `full_kv_pool` before assigning `self.mem_pool_device`, and subsequently calling `register_layer_transfer_counter` on the inner pool (not on the `HybridLinearKVPool` wrapper), is intentional upstream design from `sgl-project/sglang`. Do NOT flag the registration ordering or suggest calling `register_layer_transfer_counter` on the wrapper — diverging from upstream is undesirable for the KHAEntertainment/sglang-mamba sync series.



🧠 Learnings used
Learnt from: CR
Repo: KHAEntertainment/sglang-mamba PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-29T17:06:58.750Z
Learning: Applies to python/sglang/srt/mem_cache/memory_pool.py : Implement HybridReqToTokenPool, MambaPool, and HybridLinearKVPool classes for unified memory pool management across request and token indices

Learnt from: KHAEntertainment
Repo: KHAEntertainment/sglang-mamba PR: 12
File: python/sglang/srt/managers/scheduler.py:1446-1479
Timestamp: 2026-03-31T20:56:26.697Z
Learning: In `python/sglang/srt/managers/scheduler.py`, the `handle_restore_snapshot` method's `create_new_request=True` code path is an intentional stub in the KHAEntertainment/sglang-mamba fork. The full lifecycle (rid_to_state registration, http_worker_ipc plumbing, queue-acceptance feedback, and bind/continue RPC) is planned to be addressed in PR 5 (phase-08 snapshot re-port + validation). Do not flag the missing owner registration or fire-and-forget _add_request_to_queue call as bugs requiring a fix in earlier PRs.

Learnt from: KHAEntertainment
Repo: KHAEntertainment/sglang-mamba PR: 12
File: python/sglang/srt/observability/trace.py:160-163
Timestamp: 2026-03-31T18:57:52.401Z
Learning: In `python/sglang/srt/observability/trace.py`, the `process_tracing_init` function contains an unused `global get_cur_time_ns` declaration (around line 162). This is intentional upstream code from `sgl-project/sglang` (observability cherry-pick). Do NOT flag this as a lint issue or suggest removing it — diverging from upstream is undesirable for the KHAEntertainment/sglang-mamba sync series.

Learnt from: KHAEntertainment
Repo: KHAEntertainment/sglang-mamba PR: 12
File: test/manual/test_tracing.py:72-73
Timestamp: 2026-03-31T18:57:50.323Z
Learning: In the KHAEntertainment/sglang-mamba fork, the bare `except:` in `test/manual/test_tracing.py` (`__clear_trace_file` method, around line 72) is intentionally inherited from upstream `sgl-project/sglang` test code. Do not flag this as a bug requiring a fix in fork PRs; any fix should be submitted upstream to avoid fork divergence. The same rationale applies to other upstream-inherited test structures in this file.

Learnt from: KHAEntertainment
Repo: KHAEntertainment/sglang-mamba PR: 12
File: python/sglang/srt/observability/req_time_stats.py:586-618
Timestamp: 2026-03-31T17:47:54.215Z
Learning: In `python/sglang/srt/observability/req_time_stats.py`, the `SchedulerReqTimeStats.set_retract_time()` method intentionally clears only `last_forward_entry_time` (and related `last_*` fields) but leaves `forward_entry_time` unchanged. This is upstream bookkeeping design from `sgl-project/sglang` (observability cherry-pick commit 3b8930227). Do NOT flag the apparent timer-lap mismatch or suggest resetting `forward_entry_time` in this method — diverging from upstream is undesirable for the KHAEntertainment/sglang-mamba sync series.

Learnt from: KHAEntertainment
Repo: KHAEntertainment/sglang-mamba PR: 12
File: python/pyproject.toml:26-26
Timestamp: 2026-03-31T17:47:07.606Z
Learning: In the KHAEntertainment/sglang-mamba fork, the `decord`/`torchcodec` conditional import gap in `python/sglang/srt/utils/common.py` (functions `load_video` and `encode_video`) and `python/pyproject.toml` is intentionally inherited from upstream `sgl-project/sglang`. Do not flag this as a bug requiring a fix in fork PRs; any fix should be submitted to upstream to avoid fork divergence.

Learnt from: KHAEntertainment
Repo: KHAEntertainment/sglang-mamba PR: 12
File: python/sglang/srt/observability/req_time_stats.py:565-578
Timestamp: 2026-03-31T17:47:12.892Z
Learning: In `python/sglang/srt/observability/req_time_stats.py`, the `SchedulerReqTimeStats.__getstate__` intentionally returns `{}` when `enable_metrics` is `False`. This is upstream design from `sgl-project/sglang` (observability cherry-pick commit 3b8930227) and should NOT be flagged or changed in the KHAEntertainment/sglang-mamba fork, as diverging from upstream is undesirable for the sync series.

Learnt from: KHAEntertainment
Repo: KHAEntertainment/sglang-mamba PR: 12
File: python/sglang/srt/observability/trace.py:124-125
Timestamp: 2026-03-31T17:47:19.266Z
Learning: In `python/sglang/srt/observability/trace.py`, the `TraceCustomIdGenerator.generate_trace_id()` method intentionally uses `self.local_random.getrandbits(64)` (64-bit) to match upstream `sgl-project/sglang` implementation. Do not flag this as a bug — changing it would cause divergence from upstream. The repo `KHAEntertainment/sglang-mamba` syncs from upstream SGLang and deviates only for Mamba-specific architectural changes, not for general observability/tracing code.

Comment thread python/sglang/srt/models/nemotron_h.py
- Remove accidental .remember/tmp/save-session.pid from tracking
- Add .remember/tmp/ to .gitignore
- Fix undefined variables in bench_multiturn.py barrier mode
- Add missing self.lora_path in WorkloadGenerator.__init__
- Zero padded tail in nemotron_h piecewise graph mode

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@python/sglang/srt/models/nemotron_h.py`:
- Around line 431-433: The code calls nemotron_mamba2_with_output(hidden_states,
output, self.layer_id) while in piecewise CUDA graphs the decoder absolute
self.layer_id must not be used to index the compact context.attention_layers;
instead compute or look up the dense piecewise-layer index (the offset into
context.attention_layers) and pass that to nemotron_mamba2_with_output (or
maintain a decoder-layer-id → attention-layer-offset map created by the model
runner) so the function receives the correct attention-layer index; update the
calls inside is_in_piecewise_cuda_graph() (and the other occurrence around lines
880-883) to translate self.layer_id into the compact attention index before
invoking nemotron_mamba2_with_output.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1dfb21b5-bb60-4d09-b02d-a459c9a2d739

📥 Commits

Reviewing files that changed from the base of the PR and between c813876 and be1c955.

📒 Files selected for processing (3)
  • .gitignore
  • benchmark/hicache/bench_multiturn.py
  • python/sglang/srt/models/nemotron_h.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • .gitignore

Comment on lines +431 to +433
if is_in_piecewise_cuda_graph():
output = torch.empty_like(hidden_states)
nemotron_mamba2_with_output(hidden_states, output, self.layer_id)

@coderabbitai coderabbitai Bot Mar 31, 2026


⚠️ Potential issue | 🔴 Critical

Don't use the absolute decoder layer_id to index context.attention_layers.

python/sglang/srt/model_executor/model_runner.py:2173-2202 builds attention_layers as a compact list of only attention/Mamba-capable layers. In hybrid patterns, earlier MLP/MOE blocks compress that list, so passing self.layer_id here is like using seat numbers on a filtered passenger list: you can hit the wrong layer or blow up with IndexError. Please pass a dense piecewise-layer index instead, or keep a decoder-layer-id → attention-layer-offset map for the split op.

Also applies to: 880-883

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/sglang/srt/models/nemotron_h.py` around lines 431 - 433, The code
calls nemotron_mamba2_with_output(hidden_states, output, self.layer_id) while in
piecewise CUDA graphs the decoder absolute self.layer_id must not be used to
index the compact context.attention_layers; instead compute or look up the dense
piecewise-layer index (the offset into context.attention_layers) and pass that
to nemotron_mamba2_with_output (or maintain a decoder-layer-id →
attention-layer-offset map created by the model runner) so the function receives
the correct attention-layer index; update the calls inside
is_in_piecewise_cuda_graph() (and the other occurrence around lines 880-883) to
translate self.layer_id into the compact attention index before invoking
nemotron_mamba2_with_output.

Collaborator Author


Acknowledged — investigating. In NemotronH's hybrid architecture, every decoder block is either a Mamba2 layer or a standard attention layer (no MLP-only blocks), so self.layer_id and the attention_layers index may be 1:1. Need to verify against model_runner.py:2173-2202 to confirm whether the compact list construction preserves this mapping for NemotronH specifically.

If it turns out there IS a gap, we'll add a _dense_layer_id computed at model init time. Tracking.



🧩 Analysis chain

Good question — let me pull the exact attention_layers construction logic and NemotronH's hybrid pattern to verify the indexing assumption.


🏁 Script executed:

#!/bin/bash
# 1. Check init_piecewise_cuda_graphs in model_runner.py (the changed section + surrounding context)
rg -n "attention_layers|attn_layer|_forward_mamba|moe_layer|moe_fusion" python/sglang/srt/model_executor/model_runner.py -A2 -B2 | head -120

# 2. Check NemotronH config for hybrid_override_pattern or layer type definitions
fd "nemotron_h" python/sglang/srt/configs/ --exec cat {}

# 3. Check what hybrid_override_pattern looks like for actual NemotronH model checkpoints
rg -n "hybrid_override_pattern|ATTENTION|MAMBA|MLP|MOE" python/sglang/srt/configs/nemotron_h.py | head -60

Length of output: 18248


@KHAEntertainment — investigation complete. Your hypothesis doesn't hold. MLP blocks (-) are present in the default pattern, which breaks the 1:1 assumption.

What the data shows

The default hybrid_override_pattern is:

M - M - M - M * - M - M - M - M - M * - M - M - M - M - M * ...

In model_runner.py (lines 2173–2202), attention_layers is built by skipping any layer where attn_layer stays None. For NemotronH:

  • NemotronHMambaDecoderLayer → has _forward_mamba → ✅ appended
  • NemotronHAttentionDecoderLayer → has layer.mixer.attn → ✅ appended
  • NemotronHMLPDecoderLayer → mixer is NemotronHMLP (no .attn, no _forward_mamba) → ❌ skipped
  • NemotronHMoEDecoderLayer → caught by the moe_block path but attn_layer stays None → ❌ skipped

So with the default 52-layer pattern, the indices diverge immediately:

| Absolute layer_id | Type | attention_layers index |
| --- | --- | --- |
| 0 | M | 0 |
| 1 | - (MLP) | skipped |
| 2 | M | 1 ← self.layer_id=2 would read attention_layers[2] = wrong layer |
| 3 | M | 2 |
| 4 | M | 3 |
| 5 | * (Attn) | 4 |

Passing self.layer_id directly into nemotron_mamba2_with_output will read the wrong layer (or cause an IndexError for large layer IDs).

Fix

Compute a dense _attn_layer_id at init time by counting only M/* entries in the pattern up to the current layer:

class NemotronHMambaDecoderLayer(nn.Module):
    def __init__(self, config, layer_idx, ...):
        ...
        self.layer_id = layer_idx
        # Compute dense index into the compact attention_layers list
        self._attn_layer_id = sum(
            1 for i in range(layer_idx)
            if config.hybrid_override_pattern[i] in (MAMBA, ATTENTION)
        )
        ...

    def forward(self, *, hidden_states, residual, forward_batch):
        ...
        if is_in_piecewise_cuda_graph():
            output = torch.empty_like(hidden_states)
            nemotron_mamba2_with_output(hidden_states, output, self._attn_layer_id)
            return output, residual
        ...

And update the split op to use the received index directly (no change needed there — it already does attention_layers[layer_id]).

The _attn_layer_id for the second call site (lines 880–883 in the nemotron_mamba2_with_output function context itself) is already correct since layer_id there is the value passed in — so it's the caller that needs to pass the dense index.


🧠 Learnings used
Learnt from: KHAEntertainment
Repo: KHAEntertainment/sglang-mamba PR: 13
File: python/sglang/srt/managers/cache_controller.py:267-272
Timestamp: 2026-03-31T22:17:57.025Z
Learning: In `python/sglang/srt/managers/cache_controller.py`, the `HiCacheController.__init__` method's pattern of unwrapping `HybridLinearKVPool` to `full_kv_pool` before assigning `self.mem_pool_device`, and subsequently calling `register_layer_transfer_counter` on the inner pool (not on the `HybridLinearKVPool` wrapper), is intentional upstream design from `sgl-project/sglang`. Do NOT flag the registration ordering or suggest calling `register_layer_transfer_counter` on the wrapper — diverging from upstream is undesirable for the KHAEntertainment/sglang-mamba sync series.

Learnt from: KHAEntertainment
Repo: KHAEntertainment/sglang-mamba PR: 12
File: python/pyproject.toml:26-26
Timestamp: 2026-03-31T17:47:07.606Z
Learning: In the KHAEntertainment/sglang-mamba fork, the `decord`/`torchcodec` conditional import gap in `python/sglang/srt/utils/common.py` (functions `load_video` and `encode_video`) and `python/pyproject.toml` is intentionally inherited from upstream `sgl-project/sglang`. Do not flag this as a bug requiring a fix in fork PRs; any fix should be submitted to upstream to avoid fork divergence.

Learnt from: KHAEntertainment
Repo: KHAEntertainment/sglang-mamba PR: 12
File: python/sglang/srt/managers/scheduler.py:1446-1479
Timestamp: 2026-03-31T20:56:26.697Z
Learning: In `python/sglang/srt/managers/scheduler.py`, the `handle_restore_snapshot` method's `create_new_request=True` code path is an intentional stub in the KHAEntertainment/sglang-mamba fork. The full lifecycle (rid_to_state registration, http_worker_ipc plumbing, queue-acceptance feedback, and bind/continue RPC) is planned to be addressed in PR 5 (phase-08 snapshot re-port + validation). Do not flag the missing owner registration or fire-and-forget _add_request_to_queue call as bugs requiring a fix in earlier PRs.

@KHAEntertainment KHAEntertainment merged commit a2b5ab8 into main Mar 31, 2026
57 of 66 checks passed
@KHAEntertainment KHAEntertainment deleted the upstream-sync/pr2-session-cache branch April 1, 2026 04:00

Labels

diffusion, documentation (Improvements or additions to documentation), hicache, piecewise-cuda-graph


2 participants