Summary
tools/server/server-context.cpp::update_slots() aborts the prompt-cache restore path with
common_context_can_seq_rm: memory only supports full sequence removal
slot update_slots: forcing full prompt re-processing due to lack of cache data
(likely due to SWA or hybrid/recurrent memory, see #13194)
for any model whose memory backend returns COMMON_CONTEXT_SEQ_RM_TYPE_FULL from common_context_can_seq_rm(). This is a permanent disablement, not a degraded mode: every turn of every agent / RAG / multi-turn loop re-prefills the entire prompt from scratch, even when 95%+ of it is identical to the previous turn.
Affected backends (any one of these triggers FULL-removal):
- Pure recurrent state models (Mamba, Mamba-2, RWKV)
- Hybrid attention + SSM (Jamba, Granite-Hybrid)
- Hybrid-iswa with compressed KV (e.g. DeepSeek-V4 family)
- Any model running with SWA-only KV cache where the window has slid past the prefill range
The context-checkpoint infrastructure that would make cache restore work for these models is already populated by do_checkpoint() (server-context.cpp around L2618) — it just isn't consulted when pos_min == -1, because the existing restore path keys on pos_min for partial-removal memory.
Reproduction
Any FULL-only memory model + standard llama-server with the cache enabled exhibits this. Minimal recipe (Mamba-370M GGUF, ~400 MB Q4, no GPU required):
MODEL=/path/to/mamba-or-jamba.gguf

# Start the server with the prompt cache enabled. Redirect to a file rather
# than piping through tee, so that $! is the server's PID and not tee's.
./build/bin/llama-server \
    -m "$MODEL" \
    -c 8192 \
    --cache-ram 4096 \
    --port 18080 \
    --no-warmup > server.log 2>&1 &
SERVER_PID=$!

# Wait until the model is loaded instead of relying on a fixed sleep.
until curl -sf http://localhost:18080/health > /dev/null; do sleep 1; done

# Long prompt, sent twice byte-for-byte; the second request should hit the cache.
PROMPT=$(python3 -c 'print("The following is a passage. " + ("Lorem ipsum dolor sit amet, consectetur adipiscing elit. " * 250) + " Please summarize.")')

for label in "cold" "expected-cached"; do
    echo "=== request: $label ==="
    time curl -sS http://localhost:18080/v1/chat/completions \
        -H 'Content-Type: application/json' \
        -d "$(python3 -c 'import json,sys; print(json.dumps({"messages":[{"role":"user","content":sys.stdin.read()}], "max_tokens":32, "temperature":0}))' <<< "$PROMPT")" \
        | python3 -c 'import json,sys; print(json.load(sys.stdin).get("usage"))'
done

# Both log lines appear once per request on affected models.
grep -E "memory only supports full sequence removal|forcing full prompt re-processing" server.log
kill $SERVER_PID
Observed behavior
- Request 2 wall time ≈ Request 1 wall time (full prefill repeated).
- server.log shows the "forcing full prompt re-processing due to lack of cache data" line on every turn.
- usage.prompt_tokens_cached (when exposed) reports 0 even though the prefix is byte-identical.
Expected behavior
- Request 2 prefill ≈ free (state restored from the checkpoint that do_checkpoint() populated after Request 1).
- TTFT for Request 2 drops by 1–2 orders of magnitude on long prompts.
Why this matters
On long-prompt workloads (agent / RAG / chat with system + few-shot), the missing cache produces 5–50× higher TTFT than the same workload on a partial-removal-memory model of equal parameter count.
Concrete numbers from my own setup (a hybrid-iswa workload with a 284B-A13B MoE; details available on request, since the model itself is third-party): without cache, a 12-turn agent loop took 298.5 s cumulative wall, p50 TTFT 18.7 s. With a candidate patch reusing existing checkpoints, the same loop ran in 121.7 s, p50 TTFT 3.15 s — same model, same hardware, same probe set passing (10/10 functional probes including json_schema / tools / thinking / long_12k / long_32k).
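For reference, the ratios implied by those four measurements work out to roughly 2.45× on cumulative wall time and 5.94× on p50 TTFT. The snippet below only reproduces that arithmetic; the function names are illustrative and it is not part of any benchmark harness.

```cpp
#include <cassert>
#include <cstdio>

// Ratio of baseline to patched timing; a value above 1 means the patched run is faster.
double speedup(double baseline_s, double patched_s) {
    assert(patched_s > 0.0);
    return baseline_s / patched_s;
}

// Prints the two ratios derived from the timings quoted in this issue.
void print_speedups() {
    std::printf("cumulative wall: %.2fx\n", speedup(298.5, 121.7));
    std::printf("p50 TTFT:        %.2fx\n", speedup(18.7, 3.15));
}
```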
Approach (candidate patch ready, not submitted)
~50-line patch in tools/server/server-context.cpp:
- In update_slots(), before the pos_min == -1 abort: add a slot.ctx_seq_rm_type == COMMON_CONTEXT_SEQ_RM_TYPE_FULL branch that searches slot.prompt.checkpoints by n_tokens ≤ n_past, restores via llama_state_seq_set_data_ext(..., LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY), replays the suffix.
- Wrap the existing pos_min-based code in else so partial-removal models are entirely unchanged.
- Invalidate checkpoints by cur.n_tokens > n_past (FULL) vs cur.pos_max > pos_next (partial).
No new public API. No new flags. Reuses the existing --cache-ram budget and the existing llama_state_seq_* calls.
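A minimal sketch of the selection and invalidation rules listed above, assuming a simplified checkpoint record. The struct and function names (checkpoint, find_restorable, invalidate_full, invalidate_partial) are hypothetical stand-ins and are not taken from server-context.cpp or from the candidate patch. The actual state restore and suffix replay are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified checkpoint record. Only the fields named in this
// issue are modeled; the real struct in server-context.cpp may differ.
struct checkpoint {
    int32_t n_tokens;           // prompt tokens covered by the saved state
    int32_t pos_max;            // highest stored position (partial-removal path)
    std::vector<uint8_t> data;  // opaque serialized sequence state
};

// FULL-removal path: pick the checkpoint covering the most tokens while still
// satisfying n_tokens <= n_past. Returns its index, or -1 if none qualifies.
// A caller would restore that state, then re-process the prompt starting at
// token index n_tokens.
int find_restorable(const std::vector<checkpoint> & cps, int32_t n_past) {
    assert(n_past >= 0);
    int best = -1;
    for (int i = 0; i < (int) cps.size(); ++i) {
        if (cps[i].n_tokens <= n_past &&
            (best < 0 || cps[i].n_tokens > cps[best].n_tokens)) {
            best = i;
        }
    }
    return best;
}

// FULL memory: drop checkpoints that cover more tokens than the reusable prefix.
void invalidate_full(std::vector<checkpoint> & cps, int32_t n_past) {
    cps.erase(std::remove_if(cps.begin(), cps.end(),
                  [n_past](const checkpoint & cur) { return cur.n_tokens > n_past; }),
              cps.end());
}

// Partial-removal memory: drop checkpoints whose positions extend past pos_next.
void invalidate_partial(std::vector<checkpoint> & cps, int32_t pos_next) {
    cps.erase(std::remove_if(cps.begin(), cps.end(),
                  [pos_next](const checkpoint & cur) { return cur.pos_max > pos_next; }),
              cps.end());
}
```

With checkpoints at 100, 500 and 900 tokens and n_past = 600, this selects the 500-token checkpoint, so only tokens from index 500 onward would need re-processing.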
I had previously opened this as PR #22932 — closed by maintainer with the (correct) suggestion to open an issue with reproduction first. Submitting now per that guidance. Happy to send the patch for review once the approach is acknowledged.
Related
AI-assistance disclosure
This issue text and the candidate patch were drafted with coding-agent assistance (Claude). The bug investigation and reproduction recipe were done on my own setup; the technical claims are mine. I'm aware of the project's AI-content guidelines and am happy to revise descriptions / commit messages into a more hand-written form on request.