server: enable prompt cache for FULL-removal memory via context checkpoints by leon7609 · Pull Request #22932 · ggml-org/llama.cpp

leon7609 · 2026-05-11T02:17:57Z

AI assistance disclosure

This PR was authored with substantial coding-agent assistance. Disclosing per the contribution guidelines:

Original bug investigation: I (the author) ran parallel_load.py and probes.py against my own DSv4 setup over several days, profiled the gap between cold-prefill and prefix-reuse cases, and confirmed the cache-disable behavior was the bottleneck for my agent workloads. The diagnosis is mine.
Patch implementation: the initial patch on a downstream DSv4 fork was generated by Codex under my supervision; the port to current ggml-org/master in this PR (the if (slot.ctx_seq_rm_type == COMMON_CONTEXT_SEQ_RM_TYPE_FULL) { ... } branch, the else wrap of the legacy pos_min path, and the checkpoint-invalidation switch) was written by Claude (Anthropic) under my direction.
Description: drafted by Claude based on the technical work, edited by me.
Benchmarks and verification: all reproducible commands below were executed on my hardware and the results are my measurements.

I'm publishing as a human-supervised contribution; I stand behind the technical claims and the diff. If reviewers want a smaller / different shape (e.g. fewer lines under the new branch, no else wrap, alternative invalidation logic), I'll happily revise.

Problem

tools/server/server-context.cpp's update_slots() aborts on pos_min == -1. This blocks prompt cache for any model whose memory backend reports COMMON_CONTEXT_SEQ_RM_TYPE_FULL:

common_context_can_seq_rm: memory only supports full sequence removal
slot update_slots: forcing full prompt re-processing due to lack of cache data
  (likely due to SWA or hybrid/recurrent memory, see #13194)

Hybrid memory models — compressed-KV, recurrent-state, SWA-only — can't return a meaningful pos_min for an arbitrary earlier token, so the existing prefix-cache restore path can't run. The result is that every turn of every agent / TD / RAG loop re-prefills the entire prompt from scratch. On a 32K-token system+history prompt that's 30-60s TTFT per turn, even when 95% of the prompt is identical to the previous turn.

Approach

Use the existing context-checkpoint infrastructure (already populated for FULL memory at do_checkpoint time — see line ~2618) and add a FULL-removal branch in update_slots() that matches by n_tokens instead of pos_min. Restore via llama_state_seq_set_data_ext with LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY, then replay the suffix.

Models with partial seq removal support hit the legacy else branch and behave exactly as before.
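
The matching step described above can be sketched as follows. This is a minimal illustration, not the code from the PR: `checkpoint_sketch`, `find_checkpoint_by_n_tokens` and `n_tokens_to_replay` are hypothetical names, and the selection rule (take the checkpoint covering the most tokens that still fits inside the reusable prefix, `n_tokens <= n_past`) is an assumption inferred from the description. The actual restore call into `llama_state_seq_set_data_ext` is deliberately left out.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for a server-side context checkpoint.
// The real structure in server-context.cpp carries more fields.
struct checkpoint_sketch {
    int32_t n_tokens;           // number of prompt tokens covered by the snapshot
    std::vector<uint8_t> data;  // serialized sequence state (opaque here)
};

// Return the index of the checkpoint that covers the most tokens while still
// fitting inside the reusable prefix (n_tokens <= n_past), or -1 if none does.
int find_checkpoint_by_n_tokens(const std::vector<checkpoint_sketch> & ckpts, int32_t n_past) {
    assert(n_past >= 0);
    int best = -1;
    for (int i = 0; i < (int) ckpts.size(); ++i) {
        const bool usable = ckpts[i].n_tokens <= n_past;
        if (usable && (best < 0 || ckpts[i].n_tokens > ckpts[best].n_tokens)) {
            best = i;
        }
    }
    return best;
}

// Number of prompt tokens that still have to be replayed after restoring ckpt.
int32_t n_tokens_to_replay(const checkpoint_sketch & ckpt, int32_t n_prompt) {
    assert(ckpt.n_tokens <= n_prompt);
    return n_prompt - ckpt.n_tokens;
}
```

With a checkpoint at 44,833 tokens and a 46,109-token follow-up prompt, this sketch would select that checkpoint and leave 1,276 tokens to replay.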

Diff

Single file, tools/server/server-context.cpp, +51 / -2:

Inside if (n_past > 0 && n_past < slot.prompt.n_tokens()), before the pos_min == -1 abort: a new if (slot.ctx_seq_rm_type == COMMON_CONTEXT_SEQ_RM_TYPE_FULL) branch that does the n_tokens-based find + restore.
The existing pos_min-based code is wrapped in an else so non-FULL memory keeps the current behavior.
The "erase invalidated checkpoints" loop uses cur.n_tokens > n_past for FULL memory and cur.pos_max > pos_next otherwise.
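
The two invalidation predicates in the last item can be illustrated with a small standalone sketch. `ckpt_meta` and `erase_invalidated` are hypothetical names introduced only for this example; only the two comparisons (`cur.n_tokens > n_past` for FULL memory, `cur.pos_max > pos_next` otherwise) come from the description above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical metadata kept per checkpoint (assumed fields, for illustration).
struct ckpt_meta {
    int32_t n_tokens;  // tokens covered by the snapshot
    int32_t pos_max;   // highest position stored in the snapshot
};

// Drop checkpoints that can no longer be restored for the current prompt.
// full_only selects the n_tokens-based rule used for FULL-removal memory;
// otherwise the legacy pos_max-based rule applies.
void erase_invalidated(std::vector<ckpt_meta> & ckpts, bool full_only,
                       int32_t n_past, int32_t pos_next) {
    assert(n_past >= 0);
    ckpts.erase(
        std::remove_if(ckpts.begin(), ckpts.end(),
            [&](const ckpt_meta & cur) {
                return full_only ? cur.n_tokens > n_past
                                 : cur.pos_max  > pos_next;
            }),
        ckpts.end());
}
```

Keeping both rules behind one predicate mirrors the stated goal that non-FULL memory keeps its current behavior while FULL memory keys purely on token count.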

No new public API. No new flags. Reuses existing --cache-ram budget and existing llama_state_seq_* calls.

Performance

Tested on a DeepSeek V4 Flash IQ2XXS GGUF (284B-A13B MoE, hybrid-iswa memory, 1M native ctx) on a single RTX PRO 6000 Blackwell 96GB.

Single-request prefix replay

First request:   44,837-token prompt, 96.4 s TTFT (cold prefill)
Second request:  46,109-token prompt, 4.2 s TTFT (cache hit, 44,833 cached tokens)
Speedup:         22.94x
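
For readers who want to re-derive the figures above, here is a small sketch of the arithmetic. `replay_stats` and `summarize_replay` are hypothetical helpers, not part of llama.cpp. With the rounded TTFT values quoted here the ratio comes out near 22.95; the reported 22.94x was presumably computed from unrounded measurements.

```cpp
#include <cassert>
#include <cstdint>

// Summary of one warm request relative to a cold one (illustrative only).
struct replay_stats {
    double  cached_fraction;  // share of the prompt served from the checkpoint
    int64_t n_replayed;       // tokens that still had to be prefilled
    double  speedup;          // cold TTFT divided by warm TTFT
};

replay_stats summarize_replay(int64_t n_prompt, int64_t n_cached,
                              double ttft_cold_s, double ttft_warm_s) {
    assert(n_prompt > 0 && n_cached >= 0 && n_cached <= n_prompt);
    assert(ttft_warm_s > 0.0);
    return {
        (double) n_cached / (double) n_prompt,
        n_prompt - n_cached,
        ttft_cold_s / ttft_warm_s,
    };
}
```

Calling `summarize_replay(46109, 44833, 96.4, 4.2)` gives a cached fraction of about 0.972 and 1,276 replayed tokens.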

12-turn agent loop (4K system prompt + growing history + 200-500 tokens/turn output)

Metric	Cache disabled	Cache enabled	Delta
Cumulative wall	298.5 s	121.7 s	2.45x
p50 TTFT	18.7 s	3.15 s	5.94x
p90 TTFT	29.0 s	3.48 s	8.33x
Cache hit ratio (avg over 12 turns)	0	0.81	n/a

Cache hit ratio ramps 0.83 (turn 2) → 0.93 (turn 12). TTFT stays nearly flat ~3s as context grows 5K → 17K tokens.

Regression

10/10 functional probes pass on the patched build (smoke / latency / tps / json_obj / json_schema / tools / cn_prose / thinking / long_12k / long_32k). N=2 parallel decode ratio is 1.25, unchanged from the no-cache baseline.

Memory footprint

Checkpoint sizes scale linearly with token count and live in CPU RAM (existing --cache-ram <MiB> budget):

Tokens	Checkpoint
8,192	106.5 MiB
16,384	160.3 MiB
24,576	214.0 MiB
32,768	267.8 MiB
44,833	346.9 MiB

--cache-ram 8192 (8 GiB default in my testing) holds roughly 24× 32K-token snapshots.
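
The linear-scaling claim can be checked against the table with an ordinary least-squares fit. The sketch below is illustrative only; `linear_fit`, `fit_line` and `snapshots_in_budget` are hypothetical names. On the five rows above the fit gives roughly 0.0066 MiB per token plus a fixed cost of about 53 MiB. Plain division of an 8192 MiB budget by 267.8 MiB yields 30 whole snapshots as an upper bound, which is higher than the "roughly 24" quoted above; the source does not explain the difference.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct linear_fit {
    double slope;      // MiB per token
    double intercept;  // fixed MiB per checkpoint
};

// Ordinary least-squares fit of y = slope * x + intercept.
linear_fit fit_line(const std::vector<double> & x, const std::vector<double> & y) {
    assert(x.size() == y.size() && x.size() >= 2);
    const double n = (double) x.size();
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sx  += x[i];
        sy  += y[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }
    const double denom = n * sxx - sx * sx;
    assert(denom != 0.0);  // requires at least two distinct x values
    const double slope = (n * sxy - sx * sy) / denom;
    return { slope, (sy - slope * sx) / n };
}

// Whole snapshots of a given size that fit in a RAM budget (both in MiB).
int snapshots_in_budget(double budget_mib, double snapshot_mib) {
    assert(snapshot_mib > 0.0);
    return (int) (budget_mib / snapshot_mib);
}
```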

Compatibility

Models with partial seq removal: unchanged behavior (legacy else branch).
Models with FULL-only seq removal (hybrid-iswa, recurrent, SWA-only, compressed-KV): cache now works via checkpoints.
Models with no seq removal support: cache still disabled (as today).

Relates to

kv-cache : add SWA support #13194 (FULL-only memory cache limitation discussion)

Notes for reviewers

The key correctness lesson during development was that, for FULL memory, checkpoint invalidation must key on n_tokens rather than pos_max: hybrid recurrent pos_min/pos_max are tail-state oriented and don't correlate with restorability. An earlier draft that reused the existing pos_max > pos_next invalidation broke in subtle ways for long-running agent loops.

I'm happy to split this into smaller PRs (e.g. the FULL-restore branch separate from the invalidation tweak) if reviewers prefer.

Hardware: NVIDIA RTX PRO 6000 Blackwell Workstation (96 GB), CUDA 13.0, driver 595.71.05. Tested model: antirez/deepseek-v4-gguf IQ2XXS (86 GB).

…points

Hybrid memory backends (compressed-KV / SWA-only / recurrent) advertise COMMON_CONTEXT_SEQ_RM_TYPE_FULL because they cannot reconstruct intermediate state at an arbitrary earlier token. In update_slots() the prefix-cache restore path then triggers GGML_ABORT on pos_min == -1, so prompt cache is silently unusable for an entire class of modern LLMs (DeepSeek V4 Flash, recurrent-state models, fully-SWA models, etc.).

Add a FULL-removal branch that uses the existing context-checkpoint infrastructure (already populated at do_checkpoint time for these models) but matches by n_tokens instead of pos_min, restores via llama_state_seq_set_data_ext with LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY, and replays the suffix. Checkpoint invalidation is similarly switched to n_tokens for FULL memory. Models with partial seq removal support are unchanged.

Reference: discussion in ggml-org#13194 about why hybrid/SWA/recurrent prompts re-process from scratch.

Performance (DeepSeek V4 Flash, FULL-only hybrid-iswa memory, single RTX PRO 6000 Blackwell 96GB):
- 44.8K-token prompt with shared prefix: 96.4s cold -> 4.2s warm (22.94x)
- 12-turn agent loop: 2.45x cumulative wall, 8.33x p90 TTFT
- probes/regression: 10/10 functional probes pass; multi-slot decode unchanged (N=2 ratio 1.25, same as pre-cache baseline).

Co-authored-by: Codex <codex@local>

ggml-gh-bot · 2026-05-11T02:22:06Z

Hi @leon7609, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 2 open PRs.
AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

leon7609 · 2026-05-11T02:44:53Z

Re the ggml-gh-bot automated check above:

Multiple open PRs: I've closed the other one (server: adaptive low-yield MTP speculation fallback #22931). Will re-open after this is reviewed.
AI-generated content disclosure: I've updated the PR description with an explicit "AI assistance disclosure" section at the top, identifying which parts of the work involved coding-agent assistance and which were my own (investigation, benchmarking, supervision). The original commit message is also AI-drafted — happy to amend it with a force-push if reviewers want a more hand-written one. Let me know which form you'd prefer.

Thanks for the patience on the policy — should have caught both issues before opening.

am17an · 2026-05-11T04:52:56Z

@leon7609 create an issue with a reproduction rather than creating a PR directly

leon7609 requested a review from a team as a code owner May 11, 2026 02:17

github-actions Bot added examples server labels May 11, 2026

leon7609 mentioned this pull request May 11, 2026

server: adaptive low-yield MTP speculation fallback #22931

Closed

am17an closed this May 11, 2026

leon7609 mentioned this pull request May 11, 2026

server: prompt cache permanently disabled for FULL-only memory backends (Mamba, Jamba, hybrid-iswa, SWA-only) #22940

Open

leon7609 wants to merge 1 commit into ggml-org:master from leon7609:feat/server-prompt-cache-full-removal-memory
