server: prompt cache permanently disabled for FULL-only memory backends (Mamba, Jamba, hybrid-iswa, SWA-only) #22940

@leon7609

Summary

tools/server/server-context.cpp::update_slots() aborts the prompt-cache restore path with

common_context_can_seq_rm: memory only supports full sequence removal
slot update_slots: forcing full prompt re-processing due to lack of cache data
  (likely due to SWA or hybrid/recurrent memory, see #13194)

for any model whose memory backend returns COMMON_CONTEXT_SEQ_RM_TYPE_FULL from common_context_can_seq_rm(). This is a permanent disablement, not a degraded mode: every turn of every agent / RAG / multi-turn loop re-prefills the entire prompt from scratch, even when 95%+ of it is identical to the previous turn.
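To make the "95%+ identical" point concrete, here is a minimal sketch of how much of a new prompt could in principle be skipped when the previous turn's state is restorable. The helper names are hypothetical and the token lists are synthetic; this only illustrates the prefix-overlap idea and is not the server's actual code.

```python
from typing import Sequence


def common_prefix_len(prev_tokens: Sequence[int], new_tokens: Sequence[int]) -> int:
    """Number of leading tokens shared by the previous and the new prompt."""
    n = 0
    for a, b in zip(prev_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n


def reusable_fraction(prev_tokens: Sequence[int], new_tokens: Sequence[int]) -> float:
    """Fraction of the new prompt covered by the shared prefix."""
    if not new_tokens:
        return 0.0
    return common_prefix_len(prev_tokens, new_tokens) / len(new_tokens)


# Synthetic example: 95 shared tokens, then the turns diverge.
prev_turn = list(range(95)) + [1000]
next_turn = list(range(95)) + [2000, 2001, 2002, 2003, 2004]
print(common_prefix_len(prev_turn, next_turn))   # shared prefix length
print(reusable_fraction(prev_turn, next_turn))   # share of the new prompt that is a repeat
```

With the behavior described above, all of `next_turn` is re-processed even though only the last five tokens are new.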

Affected backends (any one of these triggers FULL-removal):

  • Pure recurrent state models (Mamba, Mamba-2, RWKV)
  • Hybrid attention + SSM (Jamba, Granite-Hybrid)
  • Hybrid-iswa with compressed KV (e.g. DeepSeek-V4 family)
  • Any model running with SWA-only KV cache where the window has slid past the prefill range

The context-checkpoint infrastructure needed to make cache restore work for these models already exists: do_checkpoint() (server-context.cpp around L2618) populates the checkpoints. They are simply never consulted when pos_min == -1, because the existing restore path keys on pos_min, as used for partial-removal memory.

Reproduction

Any FULL-only memory model + standard llama-server with the cache enabled exhibits this. Minimal recipe (Mamba-370M GGUF, ~400 MB Q4, no GPU required):

MODEL=/path/to/mamba-or-jamba.gguf

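# Redirect to a file instead of piping through tee, so that $! is the
# server's PID (for a background pipeline, $! is the PID of its last command).
./build/bin/llama-server \
  -m "$MODEL" \
  -c 8192 \
  --cache-ram 4096 \
  --port 18080 \
  --no-warmup > server.log 2>&1 &
SERVER_PID=$!
# Wait until the model has loaded instead of relying on a fixed sleep.
until curl -sf http://localhost:18080/health > /dev/null; do sleep 1; done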

PROMPT=$(python3 -c 'print("The following is a passage. " + ("Lorem ipsum dolor sit amet, consectetur adipiscing elit. " * 250) + " Please summarize.")')

for label in "cold" "expected-cached"; do
  echo "=== request: $label ==="
  time curl -sS http://localhost:18080/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d "$(python3 -c 'import json,sys; print(json.dumps({"messages":[{"role":"user","content":sys.stdin.read()}], "max_tokens":32, "temperature":0}))' <<< "$PROMPT")" \
    | python3 -c 'import json,sys; print(json.load(sys.stdin).get("usage"))'
done

grep -E "memory only supports full sequence removal|forcing full prompt re-processing" server.log
kill $SERVER_PID

Observed behavior

  • Request 2 wall time ≈ Request 1 wall time (full prefill repeated).
  • server.log shows the "forcing full prompt re-processing due to lack of cache data" line on every turn.
  • usage.prompt_tokens_cached (when exposed) reports 0 even though the prefix is byte-identical.

Expected behavior

  • Request 2 prefill ≈ free (state restored from the checkpoint that do_checkpoint() populated after Request 1).
  • TTFT for Request 2 drops by 1–2 orders of magnitude on long prompts.

Why this matters

On long-prompt workloads (agent / RAG / chat with system + few-shot), the missing cache produces 5–50× higher TTFT than the same workload on a partial-removal-memory model of equal parameter count.

Concrete numbers from my own setup (a hybrid-iswa workload with a 284B-A13B MoE; details available on request, since the model itself is third-party): without cache, a 12-turn agent loop took 298.5 s cumulative wall, p50 TTFT 18.7 s. With a candidate patch reusing existing checkpoints, the same loop ran in 121.7 s, p50 TTFT 3.15 s — same model, same hardware, same probe set passing (10/10 functional probes including json_schema / tools / thinking / long_12k / long_32k).
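The relative gains implied by these figures can be checked with a few lines of arithmetic. The variable names are illustrative; the four input values are the ones quoted above.

```python
# Figures quoted above for the 12-turn agent loop.
wall_no_cache_s = 298.5
wall_with_patch_s = 121.7
ttft_p50_no_cache_s = 18.7
ttft_p50_with_patch_s = 3.15

wall_speedup = wall_no_cache_s / wall_with_patch_s
ttft_speedup = ttft_p50_no_cache_s / ttft_p50_with_patch_s

print(f"cumulative wall time: {wall_speedup:.2f}x faster")
print(f"p50 TTFT:             {ttft_speedup:.2f}x faster")
```

That works out to roughly 2.5× on cumulative wall time and 5.9× on p50 TTFT for this particular workload.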

Approach (candidate patch ready, not submitted)

~50-line patch in tools/server/server-context.cpp:

  • In update_slots(), before the pos_min == -1 abort, add a slot.ctx_seq_rm_type == COMMON_CONTEXT_SEQ_RM_TYPE_FULL branch that searches slot.prompt.checkpoints for a checkpoint with n_tokens ≤ n_past, restores it via llama_state_seq_set_data_ext(..., LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY), and replays the remaining suffix.
  • Wrap the existing pos_min-based code in an else branch so that partial-removal models are entirely unchanged.
  • Invalidate checkpoints when cur.n_tokens > n_past (FULL), versus cur.pos_max > pos_next for partial-removal memory.
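The selection and invalidation rules in the bullets above can be sketched in a language-neutral way as follows. This is a Python illustration under stated assumptions, not the C++ patch itself: the Checkpoint class and the function names are hypothetical, and picking the largest eligible checkpoint is an assumption (the description above only states the n_tokens ≤ n_past condition).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Checkpoint:
    n_tokens: int  # prompt tokens covered by the saved state (hypothetical layout)
    data: bytes    # opaque serialized sequence state


def invalidate_checkpoints(cps: List[Checkpoint], n_past: int) -> List[Checkpoint]:
    """FULL-removal rule: drop checkpoints that extend past the reusable prefix."""
    return [c for c in cps if c.n_tokens <= n_past]


def pick_checkpoint(cps: List[Checkpoint], n_past: int) -> Optional[Checkpoint]:
    """Assumed policy: the eligible checkpoint covering the most tokens."""
    usable = [c for c in cps if c.n_tokens <= n_past]
    return max(usable, key=lambda c: c.n_tokens, default=None)


def plan_restore(cps: List[Checkpoint], n_past: int, n_prompt: int) -> Tuple[Optional[Checkpoint], range]:
    """Return the checkpoint to restore and the token positions left to replay."""
    best = pick_checkpoint(cps, n_past)
    start = best.n_tokens if best is not None else 0  # no checkpoint: full prefill
    return best, range(start, n_prompt)


cps = [Checkpoint(100, b"a"), Checkpoint(500, b"b"), Checkpoint(900, b"c")]
best, replay = plan_restore(cps, n_past=600, n_prompt=1000)
print(best.n_tokens, len(replay))
```

In this sketch a 1000-token prompt sharing 600 tokens with the previous turn restores the 500-token checkpoint and replays the remaining 500 tokens, while the 900-token checkpoint is discarded as stale.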

No new public API. No new flags. Reuses the existing --cache-ram budget and the existing llama_state_seq_* calls.

I had previously opened this as PR #22932 — closed by maintainer with the (correct) suggestion to open an issue with reproduction first. Submitting now per that guidance. Happy to send the patch for review once the approach is acknowledged.

AI-assistance disclosure

This issue text and the candidate patch were drafted with coding-agent assistance (Claude). The bug investigation and reproduction recipe were done on my own setup; the technical claims are mine. I'm aware of the project's AI-content guidelines and am happy to revise descriptions / commit messages into a more hand-written form on request.
