[HiCache][HybridModel]: Support mamba state offloading & HybridCacheController by hzh0425 · Pull Request #20457 · sgl-project/sglang

hzh0425 · 2026-03-12T13:35:06Z

TODO:

- Support the page-first layout for MambaPoolHost
- Check the mamba offloading logic for Hi-MambaRadixTree
- Fix the MambaMemory leak issue
- Copy mamba state from CPU to the request state
- Benchmark

Future plan:

- Support the new V2 storage interface for the mooncake & 3fs backends
- Support pp (pipeline parallelism)

Startup

nohup python3 -m sglang.launch_server \
  --model-path xxx/Qwen3.5-9B \
  --mamba-scheduler-strategy extra_buffer \
  --page-size 64 \
  --port 30000 \
  --tp 4 \
  --enable-hierarchical-cache \
  --hicache-ratio 2 \
  --hicache-size 0 \
  --hicache-write-policy write_through \
  --hicache-storage-backend file \
  --hicache-storage-prefetch-policy wait_complete \
  --hicache-io-backend direct \
  --hicache-mem-layout page_first_direct \
  > sglang.out &

Accuracy Tests

AIME 2025, repeat 16 test:
first round:

nemo-run_1/0 ----------------------------------------- aime25 ----------------------------------------
nemo-run_1/0 evaluation_mode   | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
nemo-run_1/0 pass@1[avg-of-16] | 30          | 6214       | 502         | 64.79% ± 6.20%   | 2.71%    
nemo-run_1/0 majority@16       | 30          | 6214       | 502         | 76.67%           | 0.00%    
nemo-run_1/0 pass@16           | 30          | 6214       | 502         | 83.33%           | 0.00%    

second round (with flush_cache)

nemo-run_1/0 ----------------------------------------- aime25 ----------------------------------------
nemo-run_1/0 evaluation_mode   | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
nemo-run_1/0 pass@1[avg-of-16] | 30          | 6390       | 512         | 64.38% ± 5.40%   | 3.33%    
nemo-run_1/0 majority@16       | 30          | 6390       | 512         | 73.33%           | 0.00%    
nemo-run_1/0 pass@16           | 30          | 6390       | 512         | 86.67%           | 0.00%
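As a quick sanity check on the two rounds above, the sketch below compares the reported pass@1 values. It assumes the "±" figure is a symmetric spread around the mean; the log does not say whether it is a standard deviation or a confidence interval. The helper name `intervals_overlap` is made up for illustration.

```python
def intervals_overlap(mean_a: float, spread_a: float,
                      mean_b: float, spread_b: float) -> bool:
    """Return True if [mean - spread, mean + spread] of both runs intersect."""
    return abs(mean_a - mean_b) <= spread_a + spread_b


# pass@1[avg-of-16] values copied from the two tables above: (mean %, spread %)
first_round = (64.79, 6.20)
second_round = (64.38, 5.40)

delta = first_round[0] - second_round[0]
print(f"pass@1 delta: {delta:.2f} points")             # pass@1 delta: 0.41 points
print(intervals_overlap(*first_round, *second_round))  # True
```

Under that assumption, the 0.41-point difference between the first round and the post-flush_cache round is far smaller than either reported spread.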

Benchmarking and Profiling

Server flags used: --max-mamba-cache-size 50 --hicache-storage-backend file --hicache-ratio 2

With L3 (mamba offloading):
  Cache Hit Rate: 0.897569
Per-round metrics:
  Round 0: Average TTFT = 0.20s, Cache Hit Rate = 0.000000 (10 requests, 10 clients)
  Round 1: Average TTFT = 0.19s, Cache Hit Rate = 0.583333 (10 requests, 10 clients)
  Round 2: Average TTFT = 0.19s, Cache Hit Rate = 0.750000 (10 requests, 10 clients)
  Round 3: Average TTFT = 0.19s, Cache Hit Rate = 0.821429 (10 requests, 10 clients)
  Round 4: Average TTFT = 0.21s, Cache Hit Rate = 0.861111 (10 requests, 10 clients)
  Round 5: Average TTFT = 0.24s, Cache Hit Rate = 0.886364 (10 requests, 10 clients)
  Round 6: Average TTFT = 0.26s, Cache Hit Rate = 0.903846 (10 requests, 10 clients)
  Round 7: Average TTFT = 0.28s, Cache Hit Rate = 0.916667 (10 requests, 10 clients)
  Round 8: Average TTFT = 0.32s, Cache Hit Rate = 0.926471 (10 requests, 10 clients)
  Round 9: Average TTFT = 0.36s, Cache Hit Rate = 0.934211 (10 requests, 10 clients)
  Round 10: Average TTFT = 0.39s, Cache Hit Rate = 0.940476 (10 requests, 10 clients)
  Round 11: Average TTFT = 0.42s, Cache Hit Rate = 0.945652 (10 requests, 10 clients)


Main branch L3 (without mamba offloading):
  Cache Hit Rate: 0.547049
Per-round metrics:
  Round 0: Average TTFT = 0.16s, Cache Hit Rate = 0.000000 (10 requests, 10 clients)
  Round 1: Average TTFT = 0.15s, Cache Hit Rate = 0.583333 (10 requests, 10 clients)
  Round 2: Average TTFT = 0.18s, Cache Hit Rate = 0.600000 (10 requests, 10 clients)
  Round 3: Average TTFT = 0.24s, Cache Hit Rate = 0.575000 (10 requests, 10 clients)
  Round 4: Average TTFT = 0.31s, Cache Hit Rate = 0.602778 (10 requests, 10 clients)
  Round 5: Average TTFT = 0.30s, Cache Hit Rate = 0.443182 (10 requests, 10 clients)
  Round 6: Average TTFT = 0.24s, Cache Hit Rate = 0.723077 (10 requests, 10 clients)
  Round 7: Average TTFT = 0.41s, Cache Hit Rate = 0.550000 (10 requests, 10 clients)
  Round 8: Average TTFT = 0.39s, Cache Hit Rate = 0.648529 (10 requests, 10 clients)
  Round 9: Average TTFT = 0.44s, Cache Hit Rate = 0.467105 (10 requests, 10 clients)
  Round 10: Average TTFT = 0.40s, Cache Hit Rate = 0.564286 (10 requests, 10 clients)
  Round 11: Average TTFT = 0.52s, Cache Hit Rate = 0.447826 (10 requests, 10 clients)
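To compare the two runs without reading each line by eye, the per-round output can be parsed and summarized. This is a minimal sketch, not part of the PR: `parse_rounds` and `summarize` are hypothetical helpers, and the regular expression assumes the exact line format shown above. The sample text is three lines taken from the mamba-offloading run; paste the full output of either run to summarize it.

```python
import re
from statistics import mean

# Matches lines such as:
#   Round 3: Average TTFT = 0.19s, Cache Hit Rate = 0.821429 (10 requests, 10 clients)
ROUND_RE = re.compile(
    r"Round (\d+): Average TTFT = ([\d.]+)s, Cache Hit Rate = ([\d.]+)"
)


def parse_rounds(log: str):
    """Return a list of (round_index, ttft_seconds, hit_rate) tuples."""
    return [(int(r), float(t), float(h)) for r, t, h in ROUND_RE.findall(log)]


def summarize(log: str) -> dict:
    """Summarize a run: round count, mean TTFT, and the last round's hit rate."""
    rounds = parse_rounds(log)
    if not rounds:
        raise ValueError("no per-round lines found")
    return {
        "rounds": len(rounds),
        "mean_ttft": mean(t for _, t, _ in rounds),
        "last_hit_rate": rounds[-1][2],
    }


# Excerpt from the "With L3 (mamba offloading)" run above.
SAMPLE = """
  Round 0: Average TTFT = 0.20s, Cache Hit Rate = 0.000000 (10 requests, 10 clients)
  Round 1: Average TTFT = 0.19s, Cache Hit Rate = 0.583333 (10 requests, 10 clients)
  Round 11: Average TTFT = 0.42s, Cache Hit Rate = 0.945652 (10 requests, 10 clients)
"""

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

In the full output above, the final-round hit rate is 0.945652 with mamba offloading and 0.447826 on the main branch.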

Checklist

Format your code according to "Format code with pre-commit".
Add unit tests according to "Run and add unit tests".
Update documentation according to "Write documentations".
Provide accuracy and speed benchmark results according to "Test the accuracy" and "Benchmark the speed".
Follow the SGLang code style guidance.

Review Process

Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
After green CI and required approvals, ask Merge Oncalls to merge.

Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com>

gemini-code-assist · 2026-03-12T13:35:12Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

hzh0425 · 2026-03-16T03:04:51Z

/tag-and-rerun-ci

hzh0425 · 2026-03-23T03:05:06Z

/rerun-stage stage-c-test-8-gpu-h200

github-actions · 2026-03-23T03:05:31Z

✅ Triggered stage-c-test-8-gpu-h200 to run independently (skipping dependencies).

github-actions · 2026-03-23T03:05:37Z

🔗 View workflow run

hzh0425 · 2026-03-24T03:02:04Z

StageC Mamba + PD CI has passed.

…ontroller (sgl-project#20457) Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com> Co-authored-by: 晟海 <huangtingwei.htw@antgroup.com> Co-authored-by: ispobock <ispobaoke@gmail.com>

Implements the HiCacheStorage v2 interface for the 3FS backend so that hybrid models (Mamba/linear-attention, and in the future DSA) can offload both KV pages and auxiliary per-pool state to 3FS via HybridCacheController.

- Introduce _Hf3fsPoolEngine: a per-pool bundle of (file, client list, executor, metadata client, rank namespace, is_zero_copy, skip_backup) so each registered host pool has its own 3FS file and metadata scope.
- Construct the KV engine in __init__ so v1 callers keep working unchanged.
- Implement register_mem_host_pool_v2 to lazily allocate auxiliary (MAMBA/...) engines with their own preallocated files, clients and metadata namespaces. Idempotent and order-agnostic.
- Implement batch_exists_v2 / batch_get_v2 / batch_set_v2 mirroring the HiCacheFile semantics, including ALL_PAGES and TRAILING_PAGES hit policies, min-across-pools final hit, and per-pool result dicts.
- Refactor _batch_get / _batch_set to take an engine argument so both v1 and v2 entry points share the same IO core.
- Key namespacing: auxiliary pools prefix the metadata key with the pool name, KV keeps the bare key for backwards compatibility. MHA zero-copy -k/-v suffixing remains strictly KV-scoped.
- Per-pool skip_backup so MLA rank>0 still skips KV but backs up MAMBA on every rank. Fix a pre-existing bug where skip_backup returned a scalar True instead of a per-key list.
- close() now iterates all engines; _engines is populated before the SIGTERM handler is installed.

Test plan:

- New test/registered/hicache/test_hicache_storage_3fs_hybrid.py uses the mock HF3FS client to cover: construction sanity, KV-only v2 fallback, ALL_PAGES and TRAILING_PAGES exists semantics, v2 set/get round-trip, MHA zero-copy + mamba interplay, MLA skip_backup KV-only scoping, partial-pool failure, and a no-pool error contract.
- Extended test_hicache_storage_3fs_backend.py with TestHf3fsBackendHybrid end-to-end test for a hybrid model, gated on model availability.

Scope: PoolName.KV + PoolName.MAMBA. DSA is deferred until a caller exists (see PLAN.md §3 and Appendix B).

Tracking issue: sgl-project#22572
Reference PRs: sgl-project#21259, sgl-project#20457

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
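Two of the ideas in the commit message above, key namespacing and the min-across-pools final hit, can be illustrated with a small sketch. This is not the 3FS backend's code: the function names, the `"kv"` and `"mamba"` labels, the `":"` separator, and the `ValueError` on an empty pool set are all assumptions made for illustration. The per-pool hit counts are taken as given inputs, so the sketch does not model how ALL_PAGES or TRAILING_PAGES produce them.

```python
from typing import Dict

KV_POOL = "kv"  # hypothetical label; the real PoolName values are not shown here


def namespaced_key(pool_name: str, key: str, sep: str = ":") -> str:
    """KV keeps the bare key; auxiliary pools prefix the key with the pool name.

    The separator is an assumption; the commit message does not specify one.
    """
    if pool_name == KV_POOL:
        return key
    return f"{pool_name}{sep}{key}"


def final_hit(per_pool_hits: Dict[str, int]) -> int:
    """Min-across-pools: a prefix counts as hit only as far as every pool covers it."""
    if not per_pool_hits:
        # Assumed error behaviour for the no-pool case.
        raise ValueError("at least one pool must be registered")
    return min(per_pool_hits.values())


if __name__ == "__main__":
    print(namespaced_key("kv", "abc123"))     # abc123
    print(namespaced_key("mamba", "abc123"))  # mamba:abc123
    print(final_hit({"kv": 8, "mamba": 6}))   # 6
```

Taking the minimum reflects that a hybrid model needs both the KV pages and the matching auxiliary state to reuse a prefix, so the pool with the shortest hit bounds the result.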

Init support for himamba tree offloading

0d2651c

Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com>

github-actions Bot added the hicache label (Hierarchical Caching for SGLang) Mar 12, 2026

hzh0425 assigned ispobock and pansicheng Mar 12, 2026

hzh0425 added 2 commits March 13, 2026 14:13

Fix mamba prefix match logic

638ff30

Enhance evict logic & Loading host mamba into both req and node's mamba space

3556ed6

hzh0425 force-pushed the hicache/hicache-refactor4 branch from dde5a4c to 3556ed6 March 14, 2026 15:04

hzh0425 added 5 commits March 15, 2026 07:36

Fix hang issue

282737a

Refactor prefetch logic; fix memory leak issue

f2a0629

Support 'page_first_direct' layout for mambaPoolHost

7316d0e

Optimize code style

6874520

Optimize code style

a03ae57

hzh0425 marked this pull request as ready for review March 15, 2026 15:39

hzh0425 requested review from Fridge003, Ying1123, hanming-lu, hnyls2002, ispobock, merrymercy, xiezhq-hermann and yizhang2077 as code owners March 15, 2026 15:39

hzh0425 commented Mar 15, 2026

Comment thread python/sglang/srt/managers/schedule_policy.py Outdated

ispobock assigned xiezhq-hermann Mar 16, 2026

ispobock reviewed Mar 16, 2026

Comment thread test/registered/4-gpu-models/test_qwen3_next_models.py Outdated

Comment thread python/sglang/srt/mem_cache/memory_pool.py Outdated

Comment thread python/sglang/srt/mem_cache/memory_pool.py Outdated

Optimize code style

2836195

hzh0425 commented Mar 16, 2026

Comment thread python/sglang/srt/mem_cache/hicache_storage.py

Remove pp filter

91f954f

github-actions Bot added the run-ci label Mar 16, 2026

hzh0425 force-pushed the hicache/hicache-refactor4 branch from c10a89a to ec983d3 March 20, 2026 08:50

Merge branch 'main' into hicache/hicache-refactor4

df9ca77

xiezhq-hermann approved these changes Mar 21, 2026

xiezhq-hermann and others added 3 commits March 21, 2026 00:03

Merge branch 'main' into hicache/hicache-refactor4

1026893

Merge branch 'main' into hicache/hicache-refactor4

c884ca1

Fix decode missing

9b7c5cb

hzh0425 requested review from ByronHsu and ShangmingCai as code owners March 23, 2026 02:40

Merge branch 'main' into hicache/hicache-refactor4

68461e9

Merge branch 'main' into hicache/hicache-refactor4

98e2e27

ShangmingCai reviewed Mar 23, 2026

Comment thread python/sglang/srt/disaggregation/decode.py

ShangmingCai approved these changes Mar 24, 2026

xiezhq-hermann merged commit 0986bed into sgl-project:main Mar 24, 2026
341 of 417 checks passed

hzh0425 mentioned this pull request Mar 25, 2026

[HiCache & HybridModel] mooncake backend support DSA & mamba model #21259

Merged

5 tasks

hzh0425 mentioned this pull request Apr 1, 2026

[Roadmap]: SGLang Distributed KVCache System For Agentic Workload #21846

Open

25 tasks

hzh0425 mentioned this pull request Apr 11, 2026

[Feature] 3FS Storage backend Support HybridModel(Linear、DSA) #22572

Closed

2 tasks

parasol-aser mentioned this pull request Apr 11, 2026

[mem] Add v2 hybrid-pool (KV + MAMBA) support to HiCacheHF3FS #22601

Closed

13 tasks

lawrence-harmonic mentioned this pull request May 7, 2026

feat: option to disable Mamba cache offload #24561

Draft

xiezhq-hermann merged 33 commits into sgl-project:main from hzh0425:hicache/hicache-refactor4

6 participants
