server: enable prompt cache for FULL-removal memory via context checkpoints #22932
…points

Hybrid memory backends (compressed-KV / SWA-only / recurrent) advertise COMMON_CONTEXT_SEQ_RM_TYPE_FULL because they cannot reconstruct intermediate state at an arbitrary earlier token. In update_slots() the prefix-cache restore path then triggers GGML_ABORT on pos_min == -1, so prompt cache is silently unusable for an entire class of modern LLMs (DeepSeek V4 Flash, recurrent-state models, fully-SWA models, etc.).

Add a FULL-removal branch that uses the existing context-checkpoint infrastructure (already populated at do_checkpoint time for these models) but matches by n_tokens instead of pos_min, restores via llama_state_seq_set_data_ext with LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY, and replays the suffix. Checkpoint invalidation is similarly switched to n_tokens for FULL memory.

Models with partial seq removal support are unchanged.

Reference: discussion in ggml-org#13194 about why hybrid/SWA/recurrent prompts re-process from scratch.

Performance (DeepSeek V4 Flash, FULL-only hybrid-iswa memory, single RTX PRO 6000 Blackwell 96GB):
- 44.8K-token prompt with shared prefix: 96.4s cold -> 4.2s warm (22.94x)
- 12-turn agent loop: 2.45x cumulative wall, 8.33x p90 TTFT
- probes/regression: 10/10 functional probes pass; multi-slot decode unchanged (N=2 ratio 1.25, same as pre-cache baseline).

Co-authored-by: Codex <codex@local>
Hi @leon7609, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
Re the ggml-gh-bot automated check above:
Thanks for your patience on the policy; I should have caught both issues before opening.
@leon7609 create an issue with a reproduction rather than creating a PR directly.
AI assistance disclosure
This PR was authored with substantial coding-agent assistance. Disclosing per the contribution guidelines:
- I ran `parallel_load.py` and `probes.py` against my own DSv4 setup over several days, profiled the gap between cold-prefill and prefix-reuse cases, and confirmed the cache-disable behavior was the bottleneck for my agent workloads. The diagnosis is mine.
- The diff against `ggml-org/master` in this PR (the `if (slot.ctx_seq_rm_type == COMMON_CONTEXT_SEQ_RM_TYPE_FULL) { ... }` branch, the `else` wrap of the legacy `pos_min` path, and the checkpoint-invalidation switch) was written by Claude (Anthropic) under my direction.

I'm publishing as a human-supervised contribution; I stand behind the technical claims and the diff. If reviewers want a smaller or different shape (e.g. fewer lines under the new branch, no `else` wrap, alternative invalidation logic), I'll happily revise.

Problem
`update_slots()` in `tools/server/server-context.cpp` aborts on `pos_min == -1`. This blocks prompt cache for any model whose memory backend reports `COMMON_CONTEXT_SEQ_RM_TYPE_FULL`.

Hybrid memory models (compressed-KV, recurrent-state, SWA-only) can't return a meaningful `pos_min` for an arbitrary earlier token, so the existing prefix-cache restore path can't run. The result is that every turn of every agent / TD / RAG loop re-prefills the entire prompt from scratch. On a 32K-token system+history prompt that's 30-60s TTFT per turn, even when 95% of the prompt is identical to the previous turn.

Approach
Use the existing context-checkpoint infrastructure (already populated for FULL memory at `do_checkpoint` time; see line ~2618) and add a FULL-removal branch in `update_slots()` that matches by `n_tokens` instead of `pos_min`. Restore via `llama_state_seq_set_data_ext` with `LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY`, then replay the suffix.

Models with partial seq removal support hit the legacy `else` branch and behave exactly as before.

Diff
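The `n_tokens`-based find performed by the new branch can be sketched roughly as follows. This is a minimal illustration, not the code in the diff: `checkpoint_sketch`, `find_checkpoint_by_n_tokens` and `suffix_to_replay` are hypothetical names, and the selection rule (the largest checkpoint that still fits inside the reusable prefix) is an assumption inferred from the description above. The actual state restore via `llama_state_seq_set_data_ext` is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a server context checkpoint. The real struct
// also carries the serialized state bytes; only the metadata is modelled.
struct checkpoint_sketch {
    int32_t pos_min;   // not meaningful for FULL-removal memory
    int32_t pos_max;   // tail-state oriented, not used for matching here
    int32_t n_tokens;  // number of prompt tokens covered by the snapshot
};

// Assumed rule: pick the checkpoint covering the most tokens while still
// fitting inside the reusable prefix (n_tokens <= n_past).
// Returns the index of that checkpoint, or -1 if none qualifies.
inline int find_checkpoint_by_n_tokens(const std::vector<checkpoint_sketch> & ckpts,
                                       int32_t n_past) {
    int best = -1;
    for (size_t i = 0; i < ckpts.size(); ++i) {
        const checkpoint_sketch & cur = ckpts[i];
        if (cur.n_tokens <= 0 || cur.n_tokens > n_past) {
            continue; // empty, or extends past the shared prefix
        }
        if (best < 0 || cur.n_tokens > ckpts[static_cast<size_t>(best)].n_tokens) {
            best = static_cast<int>(i);
        }
    }
    return best;
}

// After restoring checkpoint c, this many prompt tokens still need to be
// decoded again (the "replay the suffix" step).
inline int32_t suffix_to_replay(const checkpoint_sketch & c, int32_t n_prompt) {
    assert(c.n_tokens <= n_prompt);
    return n_prompt - c.n_tokens;
}
```

With checkpoints at 1024, 4096 and 8192 tokens and a 5000-token shared prefix, the sketch selects the 4096-token checkpoint and leaves 904 tokens to replay; if no checkpoint fits, the caller would fall back to a full re-prefill.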
Single file, `tools/server/server-context.cpp`, +51 / -2:
- Inside `if (n_past > 0 && n_past < slot.prompt.n_tokens())`, before the `pos_min == -1` abort: a new `if (slot.ctx_seq_rm_type == COMMON_CONTEXT_SEQ_RM_TYPE_FULL)` branch that does the `n_tokens`-based find + restore.
- The existing `pos_min`-based code is wrapped in an `else` so non-FULL memory keeps the current behavior.
- Checkpoint invalidation uses `cur.n_tokens > n_past` for FULL memory and `cur.pos_max > pos_next` otherwise.

No new public API. No new flags. Reuses the existing `--cache-ram` budget and existing `llama_state_seq_*` calls.

Performance
Tested on a DeepSeek V4 Flash IQ2XXS GGUF (284B-A13B MoE, hybrid-iswa memory, 1M native ctx) on a single RTX PRO 6000 Blackwell 96GB.
Single-request prefix replay
12-turn agent loop (4K system prompt + growing history + 200-500 tokens/turn output)
Cache hit ratio ramps from 0.83 (turn 2) to 0.93 (turn 12). TTFT stays nearly flat at ~3s as context grows from 5K to 17K tokens.
Regression
10/10 functional probes pass on the patched build (smoke / latency / tps / json_obj / json_schema / tools / cn_prose / thinking / long_12k / long_32k). N=2 parallel decode ratio is 1.25, unchanged from the no-cache baseline.
Memory footprint
Checkpoint sizes scale linearly with token count and live in CPU RAM (existing `--cache-ram <MiB>` budget). `--cache-ram 8192` (8 GiB default in my testing) holds roughly 24× 32K-token snapshots.

Compatibility
`else` branch).

Relates to
Notes for reviewers
The key correctness lesson during development was that checkpoint invalidation must key on `n_tokens`, not `pos_max`, for FULL memory: hybrid recurrent `pos_min` / `pos_max` are tail-state oriented and don't correlate with restorability. An earlier draft that reused the existing `pos_max > pos_next` invalidation broke in subtle ways for long-running agent loops.

I'm happy to split this into smaller PRs (e.g. the FULL-restore branch separate from the invalidation tweak) if reviewers prefer.
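The invalidation rule from the note above can be illustrated with a small sketch. It only mirrors the two predicates named in this PR (`cur.n_tokens > n_past` for FULL memory, `cur.pos_max > pos_next` otherwise); `seq_rm_kind`, `ckpt_info`, `should_invalidate` and `prune_checkpoints` are hypothetical names, and the values used to exercise it are made up.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical two-way split mirroring the cases discussed here; this is
// not the real enum from the codebase.
enum class seq_rm_kind { PART, FULL };

// Hypothetical checkpoint metadata, reduced to the fields the rule reads.
struct ckpt_info {
    int32_t pos_max;   // for FULL memory this tracks tail state only
    int32_t n_tokens;  // prompt tokens covered by the snapshot
};

// True when the checkpoint can no longer be restored for the current
// prompt and should be dropped.
inline bool should_invalidate(const ckpt_info & cur, seq_rm_kind kind,
                              int32_t n_past, int32_t pos_next) {
    if (kind == seq_rm_kind::FULL) {
        return cur.n_tokens > n_past;  // covers tokens beyond the shared prefix
    }
    return cur.pos_max > pos_next;     // legacy position-based rule
}

// Erase every checkpoint the rule marks as invalid, keeping the rest in order.
inline void prune_checkpoints(std::vector<ckpt_info> & ckpts, seq_rm_kind kind,
                              int32_t n_past, int32_t pos_next) {
    ckpts.erase(std::remove_if(ckpts.begin(), ckpts.end(),
                    [&](const ckpt_info & c) {
                        return should_invalidate(c, kind, n_past, pos_next);
                    }),
                ckpts.end());
}
```

Given two checkpoints that both report a small `pos_max` but cover 4096 and 8192 tokens, pruning at `n_past = 5000` drops only the 8192-token one under the FULL rule, while the position-based rule keeps both, including a snapshot that extends past the shared prefix.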
Hardware: NVIDIA RTX PRO 6000 Blackwell Workstation (96 GB), CUDA 13.0, driver 595.71.05. Tested model: `antirez/deepseek-v4-gguf` IQ2XXS (86 GB).