[HiCache & HybridModel] mooncake backend support DSA & mamba model #21259

xiezhq-hermann merged 50 commits into sgl-project:main
Conversation
Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com>
…tor6 # Conflicts: # python/sglang/srt/mem_cache/hi_mamba_radix_cache.py
This reverts commit abfc3d3.
…tor6 # Conflicts: # python/sglang/srt/managers/schedule_policy.py # python/sglang/srt/mem_cache/hi_mamba_radix_cache.py
Co-authored-by: hzh0425 <hzh0425@apache.org> Co-authored-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
Implements the HiCacheStorage v2 interface for the 3FS backend so that hybrid models (Mamba/linear-attention, and in the future DSA) can offload both KV pages and auxiliary per-pool state to 3FS via HybridCacheController.

- Introduce `_Hf3fsPoolEngine`: a per-pool bundle of (file, client list, executor, metadata client, rank namespace, is_zero_copy, skip_backup) so each registered host pool has its own 3FS file and metadata scope.
- Construct the KV engine in `__init__` so v1 callers keep working unchanged.
- Implement `register_mem_host_pool_v2` to lazily allocate auxiliary (MAMBA/...) engines with their own preallocated files, clients, and metadata namespaces. Idempotent and order-agnostic.
- Implement `batch_exists_v2` / `batch_get_v2` / `batch_set_v2` mirroring the HiCacheFile semantics, including ALL_PAGES and TRAILING_PAGES hit policies, a min-across-pools final hit, and per-pool result dicts.
- Refactor `_batch_get` / `_batch_set` to take an engine argument so both v1 and v2 entry points share the same IO core.
- Key namespacing: auxiliary pools prefix the metadata key with the pool name; KV keeps the bare key for backwards compatibility. MHA zero-copy `-k`/`-v` suffixing remains strictly KV-scoped.
- Per-pool `skip_backup` so MLA rank>0 still skips KV but backs up MAMBA on every rank. Fix a pre-existing bug where `skip_backup` returned a scalar `True` instead of a per-key list.
- `close()` now iterates all engines; `_engines` is populated before the SIGTERM handler is installed.

Test plan:
- New `test/registered/hicache/test_hicache_storage_3fs_hybrid.py` uses the mock HF3FS client to cover: construction sanity, KV-only v2 fallback, ALL_PAGES and TRAILING_PAGES exists semantics, v2 set/get round-trip, MHA zero-copy + mamba interplay, MLA skip_backup KV-only scoping, partial-pool failure, and a no-pool error contract.
- Extended `test_hicache_storage_3fs_backend.py` with a `TestHf3fsBackendHybrid` end-to-end test for a hybrid model, gated on model availability.

Scope: PoolName.KV + PoolName.MAMBA.
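As an illustration of the min-across-pools final hit described above, a minimal sketch (the function name and pool names here are hypothetical, not the actual SGLang API):

```python
def final_hit_pages(per_pool_hits):
    """Combine per-pool hit counts into the final hit.

    Each registered pool reports how many pages it can serve for a
    request. A page is only usable if *every* pool can serve it, so
    the final hit is the minimum across pools (illustrative sketch).
    """
    if not per_pool_hits:
        return 0
    return min(per_pool_hits.values())

# Example: KV can serve 8 pages but the MAMBA pool only has state
# for the first 5, so only 5 pages count as hits overall.
hits = final_hit_pages({"kv": 8, "mamba": 5})
```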
DSA is deferred until a caller exists (see PLAN.md §3 and Appendix B). Tracking issue: sgl-project#22572 Reference PRs: sgl-project#21259, sgl-project#20457 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
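The skip_backup fix mentioned above (a scalar `True` where callers expect a per-key list) can be sketched as follows; the function name and signature are illustrative, not the real SGLang code:

```python
def skip_backup_mask(keys, pool_name, rank):
    """Hypothetical per-pool skip decision.

    MLA models with tensor-parallel rank > 0 skip KV backup (the
    pages are replicated across ranks), but every rank still backs
    up its MAMBA state. Callers iterate one flag per key, so this
    must return a per-key list -- returning a bare True was the
    pre-existing bug.
    """
    skip = (pool_name == "KV" and rank > 0)
    return [skip] * len(keys)
```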
/rerun-failed-ci

…g into hicache/hicache-refactor4

Thanks for this work. We’ve run accuracy tests based on this PR, and everything looks good.

Thank you so much!! @ykwd

/rerun-failed-ci
```python
if getattr(entry, "share_indices_with_anchor", False):
    entry.host_pool.free(indices)
```
Shouldn't this be done by the anchor pool already?
```python
def _page_backup(self, operation):
    # Backup extra pools
    if operation.pool_transfers:
        self._resolve_shared_pool_transfers(operation)
        results = self.storage_backend.batch_set_v2(operation.pool_transfers)
        operation.pool_storage_result.update_extra_pool_hit_pages(results)

    # Backup kv pools
    super()._page_backup(operation)
```

This looks atomic, but `HiCacheController._page_backup` does not seem to be atomic. Will it cause a mismatch bug in the future?
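The concern above is that the auxiliary-pool and KV backups are two independent writes, so one can succeed while the other fails. A self-contained sketch of that failure window (all names here are invented for illustration, not the SGLang API):

```python
class FakeStorage:
    """Toy stand-in for a storage backend with separate pools."""
    def __init__(self, fail_kv=False):
        self.aux = {}
        self.kv = {}
        self.fail_kv = fail_kv

    def batch_set_aux(self, items):
        self.aux.update(items)
        return True

    def batch_set_kv(self, items):
        if self.fail_kv:
            return False  # simulate a partial failure after aux succeeded
        self.kv.update(items)
        return True

def page_backup(storage, aux_items, kv_items):
    # Two independent writes, mirroring the order in the snippet above.
    aux_ok = storage.batch_set_aux(aux_items)
    kv_ok = storage.batch_set_kv(kv_items)
    # If KV fails after aux succeeded, storage now holds auxiliary
    # state with no matching KV pages. A read-side min-across-pools
    # hit check is what keeps such entries from being served as hits.
    return aux_ok and kv_ok
```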
ShangmingCai left a comment:

No other comments, looks good.
/rerun-failed-ci
…gl-project#21259) Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com> Co-authored-by: hzh0425 <hzh0425@apache.org> Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com> Co-authored-by: ispobock <ispobaoke@gmail.com> Co-authored-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
I encountered a sporadic crash. Launch command:

```
python -m sglang.launch_server \
  --model-path GLM-5.1-FP8 \
  --host 0.0.0.0 --port 8000 \
  --enable-metrics --tp 8 \
  --reasoning-parser glm45 --tool-call-parser glm47 \
  --page-size 64 --mem-fraction-static 0.85 \
  --enable-hierarchical-cache --hicache-size 60 \
  --hicache-mem-layout page_first_direct \
  --hicache-io-backend direct \
  --hicache-write-policy write_through \
  --admin-api-key secret_for_hicache \
  --skip-server-warmup
```

Then I attached L3 storage via the admin API. The scheduler crashed sporadically with:

This is intermittent; retrying the same operation usually succeeds. It seems like a race condition where TP ranks reach
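Intermittent failures like this often come from a dynamic reconfiguration racing across TP ranks. Purely as an illustration (none of these names exist in SGLang), one common mitigation is to gate the attach behind a collective barrier so no rank proceeds until the backend is installed:

```python
import threading

class AttachCoordinator:
    """Illustrative sketch: make a dynamically attached storage backend
    visible to all worker ranks only after every rank has reached the
    gate, so no rank races ahead and uses a half-installed backend."""

    def __init__(self, num_ranks):
        self.barrier = threading.Barrier(num_ranks)
        self.storage = None

    def attach(self, rank, storage):
        if rank == 0:
            self.storage = storage  # a single rank installs the backend
        self.barrier.wait()         # nobody proceeds until all ranks arrive
        return self.storage         # every rank now observes the same object
```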
Thank you! We have also noticed this issue. Do you happen to have a reliable way to reproduce it? You can reach me on Slack: Tingwei Huang.

I cannot reliably reproduce it.
…gl-project#21259) Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com> Co-authored-by: hzh0425 <hzh0425@apache.org> Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com> Co-authored-by: ispobock <ispobaoke@gmail.com> Co-authored-by: Vladislav Nosivskoy <vladnosiv@gmail.com>
```python
if entry is self.anchor_entry:
    continue
if getattr(entry, "share_indices_with_anchor", False):
    entry.host_pool.free(indices)
```
Motivation
Adds support for the Mooncake backend, covering both Mamba and DSA models.
Roadmap: #21846
Modifications
Accuracy Tests
mamba
DSA (DeepSeek-V3.2-Exp)
mamba
gsm8k (first round)
second round (with flush_cache)
mmlu
DSA
gsm8k
mmlu
first round
flush_cache
second round
Performance Tests
bench serving
first round (no cache)
flush_cache
second round
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci