server: enable prompt cache for FULL-removal memory via context checkpoints#22932

Closed
leon7609 wants to merge 1 commit into
ggml-org:masterfrom
leon7609:feat/server-prompt-cache-full-removal-memory

@leon7609 leon7609 commented May 11, 2026

AI assistance disclosure

This PR was authored with substantial coding-agent assistance. Disclosing per the contribution guidelines:

  • Original bug investigation: I (the author) ran parallel_load.py and probes.py against my own DSv4 setup over several days, profiled the gap between cold-prefill and prefix-reuse cases, and confirmed the cache-disable behavior was the bottleneck for my agent workloads. The diagnosis is mine.
  • Patch implementation: the initial patch on a downstream DSv4 fork was generated by Codex under my supervision; the port to current ggml-org/master in this PR (the if (slot.ctx_seq_rm_type == COMMON_CONTEXT_SEQ_RM_TYPE_FULL) { ... } branch, the else wrap of the legacy pos_min path, and the checkpoint-invalidation switch) was written by Claude (Anthropic) under my direction.
  • Description: drafted by Claude based on the technical work, edited by me.
  • Benchmarks and verification: all reproducible commands below were executed on my hardware and the results are my measurements.

I'm publishing this as a human-supervised contribution; I stand behind the technical claims and the diff. If reviewers want a smaller or different shape (e.g. fewer lines under the new branch, no else wrap, alternative invalidation logic), I'll happily revise.


Problem

tools/server/server-context.cpp's update_slots() aborts on pos_min == -1. This blocks prompt cache for any model whose memory backend reports COMMON_CONTEXT_SEQ_RM_TYPE_FULL:

common_context_can_seq_rm: memory only supports full sequence removal
slot update_slots: forcing full prompt re-processing due to lack of cache data
  (likely due to SWA or hybrid/recurrent memory, see #13194)

Hybrid memory models — compressed-KV, recurrent-state, SWA-only — can't return a meaningful pos_min for an arbitrary earlier token, so the existing prefix-cache restore path can't run. The result is that every turn of every agent / TD / RAG loop re-prefills the entire prompt from scratch. On a 32K-token system+history prompt that's 30-60s TTFT per turn, even when 95% of the prompt is identical to the previous turn.

Approach

Use the existing context-checkpoint infrastructure (already populated for FULL memory at do_checkpoint time — see line ~2618) and add a FULL-removal branch in update_slots() that matches by n_tokens instead of pos_min. Restore via llama_state_seq_set_data_ext with LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY, then replay the suffix.

Models with partial seq removal support hit the legacy else branch and behave exactly as before.
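
A minimal sketch of the n_tokens-based match described above, not the code from the diff. The checkpoint_sketch struct and both function names are hypothetical stand-ins for the server's own types, and the "largest n_tokens not exceeding n_past" rule is an assumption about how the match is chosen. The actual restore call (llama_state_seq_set_data_ext) is left out so the example stays self-contained.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the server's checkpoint record; only the field
// used for matching is modelled here.
struct checkpoint_sketch {
    int32_t n_tokens;           // prompt tokens covered when the snapshot was taken
    std::vector<uint8_t> data;  // opaque serialized sequence state
};

// Return the index of the checkpoint that covers the most tokens without
// exceeding n_past (the length of the prefix shared with the new prompt),
// or -1 if no checkpoint qualifies. Assumed selection rule, see lead-in.
int find_checkpoint_by_n_tokens(const std::vector<checkpoint_sketch> & ckpts, int32_t n_past) {
    assert(n_past >= 0);
    int best = -1;
    for (int i = 0; i < (int) ckpts.size(); ++i) {
        const int32_t n = ckpts[i].n_tokens;
        if (n <= n_past && (best < 0 || n > ckpts[best].n_tokens)) {
            best = i;
        }
    }
    return best;
}

// Number of prompt tokens that still have to be decoded after restoring a
// checkpoint that covers n_restored tokens.
int32_t suffix_to_replay(int32_t n_prompt, int32_t n_restored) {
    assert(n_restored >= 0 && n_restored <= n_prompt);
    return n_prompt - n_restored;
}
```

With the numbers from the single-request benchmark in this description, suffix_to_replay(46109, 44833) leaves 1,276 tokens to decode instead of the whole prompt.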

Diff

Single file, tools/server/server-context.cpp, +51 / -2:

  • Inside if (n_past > 0 && n_past < slot.prompt.n_tokens()), before the pos_min == -1 abort: a new if (slot.ctx_seq_rm_type == COMMON_CONTEXT_SEQ_RM_TYPE_FULL) branch that does the n_tokens-based find + restore.
  • The existing pos_min-based code is wrapped in an else so non-FULL memory keeps the current behavior.
  • The "erase invalidated checkpoints" loop uses cur.n_tokens > n_past for FULL memory and cur.pos_max > pos_next otherwise.
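
The invalidation rule in the last bullet can be sketched as follows. This is an illustration under stated assumptions rather than the patch itself: ckpt_meta and erase_invalidated are hypothetical names, and the boolean full_removal_only stands in for the slot.ctx_seq_rm_type == COMMON_CONTEXT_SEQ_RM_TYPE_FULL check.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical checkpoint metadata; only the two fields named in the PR
// description are modelled.
struct ckpt_meta {
    int32_t n_tokens;  // tokens covered by the snapshot
    int32_t pos_max;   // highest position the memory backend reported
};

// Erase checkpoints that the current request has invalidated and return how
// many were dropped. FULL-removal memory keys on n_tokens; other memory types
// keep the legacy pos_max comparison.
std::size_t erase_invalidated(std::vector<ckpt_meta> & ckpts, bool full_removal_only,
                              int32_t n_past, int32_t pos_next) {
    assert(n_past >= 0 && pos_next >= 0);
    const std::size_t before = ckpts.size();
    ckpts.erase(std::remove_if(ckpts.begin(), ckpts.end(),
                               [&](const ckpt_meta & cur) {
                                   return full_removal_only ? cur.n_tokens > n_past
                                                            : cur.pos_max  > pos_next;
                               }),
                ckpts.end());
    return before - ckpts.size();
}
```

Keeping both comparisons in one predicate mirrors the description's intent that non-FULL memory takes exactly the same path as before.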

No new public API. No new flags. Reuses existing --cache-ram budget and existing llama_state_seq_* calls.

Performance

Tested on a DeepSeek V4 Flash IQ2XXS GGUF (284B-A13B MoE, hybrid-iswa memory, 1M native ctx) on a single RTX PRO 6000 Blackwell 96GB.

Single-request prefix replay

First request:   44,837-token prompt, 96.4 s TTFT (cold prefill)
Second request:  46,109-token prompt, 4.2 s TTFT (cache hit, 44,833 cached tokens)
Speedup:         22.94x

12-turn agent loop (4K system prompt + growing history + 200-500 tokens/turn output)

| Metric | Cache disabled | Cache enabled | Delta |
|---|---|---|---|
| Cumulative wall | 298.5 s | 121.7 s | 2.45x |
| p50 TTFT | 18.7 s | 3.15 s | 5.94x |
| p90 TTFT | 29.0 s | 3.48 s | 8.33x |
| Cache hit ratio (avg over 12 turns) | 0 | 0.81 | n/a |

Cache hit ratio ramps from 0.83 (turn 2) to 0.93 (turn 12). TTFT stays nearly flat at ~3 s as context grows from 5K to 17K tokens.

Regression

10/10 functional probes pass on the patched build (smoke / latency / tps / json_obj / json_schema / tools / cn_prose / thinking / long_12k / long_32k). N=2 parallel decode ratio is 1.25, unchanged from the no-cache baseline.

Memory footprint

Checkpoint sizes grow linearly with token count (about 53.8 MiB per additional 8,192 tokens on top of a fixed part of roughly 53 MiB) and live in CPU RAM (existing --cache-ram <MiB> budget):

| Tokens | Checkpoint size |
|---|---|
| 8,192 | 106.5 MiB |
| 16,384 | 160.3 MiB |
| 24,576 | 214.0 MiB |
| 32,768 | 267.8 MiB |
| 44,833 | 346.9 MiB |

--cache-ram 8192 (8 GiB, the value used in my testing) holds roughly 30 32K-token snapshots at 267.8 MiB each, or about 23 snapshots of the 44.8K-token prompt at 346.9 MiB each.
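
For sizing a --cache-ram budget against other prompt lengths, the reported checkpoint sizes can be extrapolated with a straight-line fit. The sketch below fits a line through two of the rows above and counts how many whole snapshots fit in a budget. All names are hypothetical, and the linear model is an assumption taken from the measurements in this description, not from the server code.

```cpp
#include <cassert>
#include <cmath>

// Straight-line model of checkpoint size: fixed_mib + mib_per_token * n_tokens.
struct ckpt_size_model {
    double mib_per_token;
    double fixed_mib;
};

// Fit the line through two measured (tokens, MiB) points.
ckpt_size_model fit_two_points(double t0, double s0, double t1, double s1) {
    assert(t1 != t0);
    const double slope = (s1 - s0) / (t1 - t0);
    return { slope, s0 - slope * t0 };
}

// Predicted checkpoint size in MiB for a prompt of n_tokens.
double predict_mib(const ckpt_size_model & m, double n_tokens) {
    return m.fixed_mib + m.mib_per_token * n_tokens;
}

// Whole snapshots of a given size that fit into a RAM budget (both in MiB).
int snapshots_that_fit(double budget_mib, double snapshot_mib) {
    assert(snapshot_mib > 0.0);
    return (int) std::floor(budget_mib / snapshot_mib);
}
```

Fitting through the 8,192-token and 32,768-token rows and calling predict_mib for 16,384 or 44,833 tokens lands within half a MiB of the measured values, which is a quick way to sanity-check the linearity claim before relying on it for other prompt lengths.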

Compatibility

  • Models with partial seq removal: unchanged behavior (legacy else branch).
  • Models with FULL-only seq removal (hybrid-iswa, recurrent, SWA-only, compressed-KV): cache now works via checkpoints.
  • Models with no seq removal support: cache still disabled (as today).

Relates to

Notes for reviewers

The key correctness lesson during development was that, for FULL memory, checkpoint invalidation must key on n_tokens rather than pos_max: hybrid recurrent pos_min/pos_max are tail-state oriented and don't correlate with restorability. An earlier draft that reused the existing pos_max > pos_next invalidation broke in subtle ways for long-running agent loops.

I'm happy to split this into smaller PRs (e.g. the FULL-restore branch separate from the invalidation tweak) if reviewers prefer.

Hardware: NVIDIA RTX PRO 6000 Blackwell Workstation (96 GB), CUDA 13.0, driver 595.71.05. Tested model: antirez/deepseek-v4-gguf IQ2XXS (86 GB).

…points

Hybrid memory backends (compressed-KV / SWA-only / recurrent) advertise
COMMON_CONTEXT_SEQ_RM_TYPE_FULL because they cannot reconstruct intermediate
state at an arbitrary earlier token. In update_slots() the prefix-cache
restore path then triggers GGML_ABORT on pos_min == -1, so prompt cache is
silently unusable for an entire class of modern LLMs (DeepSeek V4 Flash,
recurrent-state models, fully-SWA models, etc.).

Add a FULL-removal branch that uses the existing context-checkpoint
infrastructure (already populated at do_checkpoint time for these models)
but matches by n_tokens instead of pos_min, restores via
llama_state_seq_set_data_ext with LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY, and
replays the suffix. Checkpoint invalidation is similarly switched to
n_tokens for FULL memory.

Models with partial seq removal support are unchanged.

Reference: discussion in ggml-org#13194 about why hybrid/SWA/recurrent prompts
re-process from scratch.

Performance (DeepSeek V4 Flash, FULL-only hybrid-iswa memory, single
RTX PRO 6000 Blackwell 96GB):

- 44.8K-token prompt with shared prefix: 96.4s cold -> 4.2s warm (22.94x)
- 12-turn agent loop: 2.45x cumulative wall, 8.33x p90 TTFT
- probes/regression: 10/10 functional probes pass; multi-slot decode
  unchanged (N=2 ratio 1.25, same as pre-cache baseline).

Co-authored-by: Codex <codex@local>
@leon7609 leon7609 requested a review from a team as a code owner May 11, 2026 02:17
ggml-gh-bot Bot commented May 11, 2026

Hi @leon7609, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 2 open PRs.

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.


Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@leon7609

Author

Re the ggml-gh-bot automated check above:

  • Multiple open PRs: I've closed the other one (server: adaptive low-yield MTP speculation fallback #22931). Will re-open after this is reviewed.
  • AI-generated content disclosure: I've updated the PR description with an explicit "AI assistance disclosure" section at the top, identifying which parts of the work involved coding-agent assistance and which were my own (investigation, benchmarking, supervision). The original commit message is also AI-drafted — happy to amend it with a force-push if reviewers want a more hand-written one. Let me know which form you'd prefer.

Thanks for the patience on the policy — should have caught both issues before opening.

@am17an am17an closed this May 11, 2026
am17an commented May 11, 2026

Contributor

@leon7609 create an issue with a reproduction rather than creating a PR directly
