[HiCache][HybridModel]: Support mamba state offloading & HybridCacheController#20457

Merged
xiezhq-hermann merged 33 commits into sgl-project:main from hzh0425:hicache/hicache-refactor4
Mar 24, 2026

hzh0425 (Collaborator) commented Mar 12, 2026

TODO:

  • Support PageFirst Layout for MambaPoolHost
  • Check the Mamba offloading logic for Hi-MambaRadixTree
  • Fix the Mamba memory leak
  • Copy Mamba state from CPU to the request state
  • Benchmark
  • Future plan:
    • Support the new V2 storage interface for the mooncake & 3fs backends
    • Support pipeline parallelism (PP)

Startup

nohup python3 -m sglang.launch_server \
  --model-path xxx/Qwen3.5-9B \
  --mamba-scheduler-strategy extra_buffer \
  --page-size 64 \
  --port 30000 \
  --tp 4 \
  --enable-hierarchical-cache \
  --hicache-ratio 2 \
  --hicache-size 0 \
  --hicache-write-policy write_through \
  --hicache-storage-backend file \
  --hicache-storage-prefetch-policy wait_complete \
  --hicache-io-backend direct \
  --hicache-mem-layout page_first_direct \
  > sglang.out &
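For scripted runs, the same flag set can be kept in one place and turned into an argument list. The following is a minimal sketch, not part of the PR: `SERVER_FLAGS` and `build_launch_argv` are hypothetical names, and the flag values are copied verbatim from the startup command above (including the `xxx/` placeholder in the model path).

```python
import shlex

# Flags copied from the startup command above. The "xxx/Qwen3.5-9B" path is
# the author's placeholder and is kept as written.
SERVER_FLAGS = {
    "--model-path": "xxx/Qwen3.5-9B",
    "--mamba-scheduler-strategy": "extra_buffer",
    "--page-size": 64,
    "--port": 30000,
    "--tp": 4,
    "--enable-hierarchical-cache": None,  # boolean switch, takes no value
    "--hicache-ratio": 2,
    "--hicache-size": 0,
    "--hicache-write-policy": "write_through",
    "--hicache-storage-backend": "file",
    "--hicache-storage-prefetch-policy": "wait_complete",
    "--hicache-io-backend": "direct",
    "--hicache-mem-layout": "page_first_direct",
}


def build_launch_argv(flags):
    """Turn a flag mapping into an argv list for sglang.launch_server."""
    argv = ["python3", "-m", "sglang.launch_server"]
    for name, value in flags.items():
        argv.append(name)
        if value is not None:  # switches without a value contribute one token
            argv.append(str(value))
    return argv


if __name__ == "__main__":
    # Print a shell-quoted command line that can be pasted into a terminal.
    print(shlex.join(build_launch_argv(SERVER_FLAGS)))
```

Keeping the flags in a mapping makes it easy to vary a single setting (for example `--hicache-ratio`) between benchmark runs without retyping the whole command.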

Accuracy Tests

AIME 2025, 16 samples per problem.
First round:

nemo-run_1/0 ----------------------------------------- aime25 ----------------------------------------
nemo-run_1/0 evaluation_mode   | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
nemo-run_1/0 pass@1[avg-of-16] | 30          | 6214       | 502         | 64.79% ± 6.20%   | 2.71%    
nemo-run_1/0 majority@16       | 30          | 6214       | 502         | 76.67%           | 0.00%    
nemo-run_1/0 pass@16           | 30          | 6214       | 502         | 83.33%           | 0.00%    

Second round (with flush_cache):

nemo-run_1/0 ----------------------------------------- aime25 ----------------------------------------
nemo-run_1/0 evaluation_mode   | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
nemo-run_1/0 pass@1[avg-of-16] | 30          | 6390       | 512         | 64.38% ± 5.40%   | 3.33%    
nemo-run_1/0 majority@16       | 30          | 6390       | 512         | 73.33%           | 0.00%    
nemo-run_1/0 pass@16           | 30          | 6390       | 512         | 86.67%           | 0.00%    
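One way to read the two tables is to check whether the pass@1 change between rounds falls inside the reported spread. The sketch below is illustrative only: it assumes the `±` column is a symmetric spread around the mean (the log does not say whether it is a standard deviation or a confidence interval), and `intervals_overlap` is a hypothetical helper.

```python
# pass@1[avg-of-16] rows copied from the two tables above: (mean %, +/- %).
FIRST_ROUND = (64.79, 6.20)
SECOND_ROUND = (64.38, 5.40)


def intervals_overlap(a, b):
    """Return True if [mean - spread, mean + spread] of a and b intersect."""
    a_lo, a_hi = a[0] - a[1], a[0] + a[1]
    b_lo, b_hi = b[0] - b[1], b[0] + b[1]
    return a_lo <= b_hi and b_lo <= a_hi


# Difference in mean pass@1 between the second and the first round.
delta = SECOND_ROUND[0] - FIRST_ROUND[0]
print(f"pass@1 change: {delta:+.2f} points")
print("within reported spread:", intervals_overlap(FIRST_ROUND, SECOND_ROUND))
```

Under that assumption, the 0.41-point difference is much smaller than either round's spread, which is consistent with the cache path not changing accuracy.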

Benchmarking and Profiling

Server flags for this benchmark: --max-mamba-cache-size 50 --hicache-storage-backend file --hicache-ratio 2

With L3 (Mamba offloading):
  Cache Hit Rate: 0.897569
Per-round metrics:
  Round 0: Average TTFT = 0.20s, Cache Hit Rate = 0.000000 (10 requests, 10 clients)
  Round 1: Average TTFT = 0.19s, Cache Hit Rate = 0.583333 (10 requests, 10 clients)
  Round 2: Average TTFT = 0.19s, Cache Hit Rate = 0.750000 (10 requests, 10 clients)
  Round 3: Average TTFT = 0.19s, Cache Hit Rate = 0.821429 (10 requests, 10 clients)
  Round 4: Average TTFT = 0.21s, Cache Hit Rate = 0.861111 (10 requests, 10 clients)
  Round 5: Average TTFT = 0.24s, Cache Hit Rate = 0.886364 (10 requests, 10 clients)
  Round 6: Average TTFT = 0.26s, Cache Hit Rate = 0.903846 (10 requests, 10 clients)
  Round 7: Average TTFT = 0.28s, Cache Hit Rate = 0.916667 (10 requests, 10 clients)
  Round 8: Average TTFT = 0.32s, Cache Hit Rate = 0.926471 (10 requests, 10 clients)
  Round 9: Average TTFT = 0.36s, Cache Hit Rate = 0.934211 (10 requests, 10 clients)
  Round 10: Average TTFT = 0.39s, Cache Hit Rate = 0.940476 (10 requests, 10 clients)
  Round 11: Average TTFT = 0.42s, Cache Hit Rate = 0.945652 (10 requests, 10 clients)


Main branch, L3 (without Mamba offloading):
  Cache Hit Rate: 0.547049
Per-round metrics:
  Round 0: Average TTFT = 0.16s, Cache Hit Rate = 0.000000 (10 requests, 10 clients)
  Round 1: Average TTFT = 0.15s, Cache Hit Rate = 0.583333 (10 requests, 10 clients)
  Round 2: Average TTFT = 0.18s, Cache Hit Rate = 0.600000 (10 requests, 10 clients)
  Round 3: Average TTFT = 0.24s, Cache Hit Rate = 0.575000 (10 requests, 10 clients)
  Round 4: Average TTFT = 0.31s, Cache Hit Rate = 0.602778 (10 requests, 10 clients)
  Round 5: Average TTFT = 0.30s, Cache Hit Rate = 0.443182 (10 requests, 10 clients)
  Round 6: Average TTFT = 0.24s, Cache Hit Rate = 0.723077 (10 requests, 10 clients)
  Round 7: Average TTFT = 0.41s, Cache Hit Rate = 0.550000 (10 requests, 10 clients)
  Round 8: Average TTFT = 0.39s, Cache Hit Rate = 0.648529 (10 requests, 10 clients)
  Round 9: Average TTFT = 0.44s, Cache Hit Rate = 0.467105 (10 requests, 10 clients)
  Round 10: Average TTFT = 0.40s, Cache Hit Rate = 0.564286 (10 requests, 10 clients)
  Round 11: Average TTFT = 0.52s, Cache Hit Rate = 0.447826 (10 requests, 10 clients)
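The per-round lines above follow a fixed pattern, so the two runs can be compared programmatically. This is a minimal sketch under that assumption; `parse_rounds` and `summarize` are hypothetical helpers, and the sample strings contain only the first and last round of each run to keep the example short.

```python
import re
from statistics import mean

# Matches lines such as:
#   Round 3: Average TTFT = 0.19s, Cache Hit Rate = 0.821429 (10 requests, ...)
ROUND_RE = re.compile(
    r"Round (\d+): Average TTFT = ([\d.]+)s, Cache Hit Rate = ([\d.]+)"
)


def parse_rounds(log_text):
    """Extract (round, ttft_seconds, hit_rate) tuples from a benchmark log."""
    return [
        (int(rnd), float(ttft), float(hit))
        for rnd, ttft, hit in ROUND_RE.findall(log_text)
    ]


def summarize(log_text):
    """Summarize a run: round count, mean TTFT, and last-round metrics."""
    rounds = parse_rounds(log_text)
    return {
        "rounds": len(rounds),
        "mean_ttft": mean(ttft for _, ttft, _ in rounds),
        "last_ttft": rounds[-1][1],
        "last_hit_rate": rounds[-1][2],
    }


# First and last rounds copied from the two runs above.
WITH_OFFLOAD = """
  Round 0: Average TTFT = 0.20s, Cache Hit Rate = 0.000000 (10 requests, 10 clients)
  Round 11: Average TTFT = 0.42s, Cache Hit Rate = 0.945652 (10 requests, 10 clients)
"""
MAIN_BRANCH = """
  Round 0: Average TTFT = 0.16s, Cache Hit Rate = 0.000000 (10 requests, 10 clients)
  Round 11: Average TTFT = 0.52s, Cache Hit Rate = 0.447826 (10 requests, 10 clients)
"""

if __name__ == "__main__":
    for name, log in [("offload", WITH_OFFLOAD), ("main", MAIN_BRANCH)]:
        print(name, summarize(log))
```

In the full logs, round 11 shows a TTFT of 0.42 s with a 0.945652 hit rate when Mamba offloading is enabled, against 0.52 s and 0.447826 on the main branch.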

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com>

github-actions (bot) added the hicache (Hierarchical Caching for SGLang) label on Mar 12, 2026
hzh0425 force-pushed the hicache/hicache-refactor4 branch from dde5a4c to 3556ed6 on March 14, 2026 15:04
Comment thread python/sglang/srt/managers/schedule_policy.py Outdated
Comment thread test/registered/4-gpu-models/test_qwen3_next_models.py Outdated
Comment thread python/sglang/srt/mem_cache/memory_pool.py Outdated
Comment thread python/sglang/srt/mem_cache/memory_pool.py Outdated
Comment thread python/sglang/srt/mem_cache/hicache_storage.py
hzh0425 (Collaborator, Author) commented Mar 16, 2026

/tag-and-rerun-ci

hzh0425 force-pushed the hicache/hicache-refactor4 branch from c10a89a to ec983d3 on March 20, 2026 08:50

hzh0425 (Collaborator, Author) commented Mar 23, 2026

/rerun-stage stage-c-test-8-gpu-h200

github-actions (Contributor) commented

✅ Triggered stage-c-test-8-gpu-h200 to run independently (skipping dependencies).

Comment thread python/sglang/srt/disaggregation/decode.py
hzh0425 (Collaborator, Author) commented Mar 24, 2026

[images] The Stage C Mamba + PD CI has passed.

@xiezhq-hermann xiezhq-hermann merged commit 0986bed into sgl-project:main Mar 24, 2026
341 of 417 checks passed
adityavaid pushed a commit to adityavaid/sglang that referenced this pull request Mar 24, 2026
…ontroller (sgl-project#20457)

Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com>
Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com>
Co-authored-by: ispobock <ispobaoke@gmail.com>
0-693 pushed a commit to 0-693/sglang that referenced this pull request Mar 25, 2026
…ontroller (sgl-project#20457)

Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com>
Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com>
Co-authored-by: ispobock <ispobaoke@gmail.com>
johnnycxm pushed a commit to johnnycxm/sglang that referenced this pull request Mar 25, 2026
…ontroller (sgl-project#20457)

Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com>
Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com>
Co-authored-by: ispobock <ispobaoke@gmail.com>
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
…ontroller (sgl-project#20457)

Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com>
Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com>
Co-authored-by: ispobock <ispobaoke@gmail.com>
parasol-aser pushed a commit to parasol-aser/sglang that referenced this pull request Apr 11, 2026
Implements the HiCacheStorage v2 interface for the 3FS backend so that
hybrid models (Mamba/linear-attention, and in the future DSA) can offload
both KV pages and auxiliary per-pool state to 3FS via HybridCacheController.

- Introduce _Hf3fsPoolEngine: a per-pool bundle of (file, client list,
  executor, metadata client, rank namespace, is_zero_copy, skip_backup)
  so each registered host pool has its own 3FS file and metadata scope.
- Construct the KV engine in __init__ so v1 callers keep working unchanged.
- Implement register_mem_host_pool_v2 to lazily allocate auxiliary
  (MAMBA/...) engines with their own preallocated files, clients and
  metadata namespaces. Idempotent and order-agnostic.
- Implement batch_exists_v2 / batch_get_v2 / batch_set_v2 mirroring the
  HiCacheFile semantics, including ALL_PAGES and TRAILING_PAGES hit
  policies, min-across-pools final hit, and per-pool result dicts.
- Refactor _batch_get / _batch_set to take an engine argument so both
  v1 and v2 entry points share the same IO core.
- Key namespacing: auxiliary pools prefix the metadata key with the
  pool name, KV keeps the bare key for backwards compatibility. MHA
  zero-copy -k/-v suffixing remains strictly KV-scoped.
- Per-pool skip_backup so MLA rank>0 still skips KV but backs up MAMBA
  on every rank. Fix a pre-existing bug where skip_backup returned a
  scalar True instead of a per-key list.
- close() now iterates all engines; _engines is populated before the
  SIGTERM handler is installed.
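
The key-namespacing and per-key `skip_backup` bullets above can be illustrated roughly as follows. This is a sketch under stated assumptions, not the backend's actual code: the function names, the lowercase pool strings, and the `:` separator are hypothetical, since the commit message only says that auxiliary pools prefix the key with the pool name and that KV keeps the bare key.

```python
from typing import Callable, List

KV_POOL = "kv"        # hypothetical pool identifiers, for illustration only
MAMBA_POOL = "mamba"


def metadata_key(pool: str, key: str) -> str:
    """KV keeps the bare key; auxiliary pools get a pool-name prefix."""
    # The ":" separator is an assumption; the commit message does not give it.
    return key if pool == KV_POOL else f"{pool}:{key}"


def zero_copy_keys(pool: str, key: str) -> List[str]:
    """MHA zero-copy -k/-v suffixing applies to KV only, never to other pools."""
    if pool == KV_POOL:
        return [f"{key}-k", f"{key}-v"]
    return [metadata_key(pool, key)]


def batch_set_results(
    keys: List[str],
    skip_backup: bool,
    write: Callable[[str], bool] = lambda key: True,  # stand-in for real IO
) -> List[bool]:
    """Return one result per key, even when the backup is skipped."""
    if skip_backup:
        # The bug described above returned a scalar True on this path.
        return [True] * len(keys)
    return [write(key) for key in keys]
```

Returning a list on the skip path keeps the result shape identical for callers that zip results with keys, which is why the scalar return was a bug.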

Test plan:
- New test/registered/hicache/test_hicache_storage_3fs_hybrid.py uses the
  mock HF3FS client to cover: construction sanity, KV-only v2 fallback,
  ALL_PAGES and TRAILING_PAGES exists semantics, v2 set/get round-trip,
  MHA zero-copy + mamba interplay, MLA skip_backup KV-only scoping,
  partial-pool failure, and a no-pool error contract.
- Extended test_hicache_storage_3fs_backend.py with TestHf3fsBackendHybrid
  end-to-end test for a hybrid model, gated on model availability.
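
The "min-across-pools final hit" rule and the partial-pool failure case can be pictured with a small reduction over per-pool hit counts. This is only a sketch: `final_hit` is a hypothetical helper, the per-pool counts are assumed to be already computed by each pool's own hit policy, and the exception type for the no-pool case is an assumption because the commit message does not name it.

```python
from typing import Dict


def final_hit(per_pool_hits: Dict[str, int]) -> int:
    """Combine per-pool hit counts; a prefix is usable only if every pool has it."""
    if not per_pool_hits:
        # Assumed error type; the source only mentions a "no-pool error contract".
        raise ValueError("no host pools registered")
    return min(per_pool_hits.values())


if __name__ == "__main__":
    # KV has 5 pages but Mamba state is available for only 3 of them.
    print(final_hit({"kv": 5, "mamba": 3}))
    # A pool that fails entirely drives the final hit to zero.
    print(final_hit({"kv": 5, "mamba": 0}))
```

Taking the minimum means a missing auxiliary state can never be papered over by a longer KV hit, at the cost of discarding some KV pages that were found.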

Scope: PoolName.KV + PoolName.MAMBA. DSA is deferred until a caller
exists (see PLAN.md §3 and Appendix B).

Tracking issue: sgl-project#22572
Reference PRs: sgl-project#21259, sgl-project#20457

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
…ontroller (sgl-project#20457)

Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com>
Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com>
Co-authored-by: ispobock <ispobaoke@gmail.com>
Labels: hicache (Hierarchical Caching for SGLang), high priority, run-ci

6 participants