server: prompt cache permanently disabled for FULL-only memory backends (Mamba, Jamba, hybrid-iswa, SWA-only) #22940

### Summary

`tools/server/server-context.cpp::update_slots()` aborts the prompt-cache restore path with

```
common_context_can_seq_rm: memory only supports full sequence removal
slot update_slots: forcing full prompt re-processing due to lack of cache data
  (likely due to SWA or hybrid/recurrent memory, see #13194)
```

for **any** model whose memory backend returns `COMMON_CONTEXT_SEQ_RM_TYPE_FULL` from `common_context_can_seq_rm()`. This is a permanent disablement, not a degraded mode: every turn of every agent / RAG / multi-turn loop re-prefills the entire prompt from scratch, even when 95%+ of it is identical to the previous turn.

Affected backends (any one of these triggers FULL-removal):

- Pure recurrent state models (Mamba, Mamba-2, RWKV)
- Hybrid attention + SSM (Jamba, Granite-Hybrid)
- Hybrid-iswa with compressed KV (e.g. DeepSeek-V4 family)
- Any model running with SWA-only KV cache where the window has slid past the prefill range

The context-checkpoint infrastructure that would make cache restore work for these models is **already populated** by `do_checkpoint()` (server-context.cpp around L2618) — it just isn't consulted when `pos_min == -1`, because the existing restore path keys on `pos_min` for partial-removal memory.

### Reproduction

Any FULL-only memory model + standard `llama-server` with the cache enabled exhibits this. Minimal recipe (Mamba-370M GGUF, ~400 MB Q4, no GPU required):

```bash
MODEL=/path/to/mamba-or-jamba.gguf

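# Redirect to a file instead of piping through tee: with a pipeline, $! would be tee's PID, not the server's.
./build/bin/llama-server \
  -m "$MODEL" \
  -c 8192 \
  --cache-ram 4096 \
  --port 18080 \
  --no-warmup > server.log 2>&1 &
SERVER_PID=$!

# Wait until the model has finished loading rather than sleeping for a fixed time.
until curl -sf http://localhost:18080/health > /dev/null; do sleep 1; done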

PROMPT=$(python3 -c 'print("The following is a passage. " + ("Lorem ipsum dolor sit amet, consectetur adipiscing elit. " * 250) + " Please summarize.")')

for label in "cold" "expected-cached"; do
  echo "=== request: $label ==="
  time curl -sS http://localhost:18080/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d "$(python3 -c 'import json,sys; print(json.dumps({"messages":[{"role":"user","content":sys.stdin.read()}], "max_tokens":32, "temperature":0}))' <<< "$PROMPT")" \
    | python3 -c 'import json,sys; print(json.load(sys.stdin).get("usage"))'
done

grep -E "memory only supports full sequence removal|forcing full prompt re-processing" server.log
kill $SERVER_PID
```

### Observed behavior

- Request 2 wall time ≈ Request 1 wall time (full prefill repeated).
- `server.log` shows the `forcing full prompt re-processing due to lack of cache data` line on every turn.
- `usage.prompt_tokens_cached` (when exposed) reports 0 even though the prefix is byte-identical.
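
The log observation can be checked mechanically. The helper below is a minimal sketch that assumes the two messages appear in `server.log` with the wording quoted in the Summary; the name `count_forced_reprefills` is hypothetical and not part of llama.cpp.

```python
from typing import Dict

# Log fragments copied from the Summary. Matching is by substring, so any
# timestamp or slot prefix in front of the message does not matter.
FULL_RM_MARKER = "memory only supports full sequence removal"
REPREFILL_MARKER = "forcing full prompt re-processing"


def count_forced_reprefills(log_text: str) -> Dict[str, int]:
    """Count how often each fallback message occurs in a server log."""
    counts = {FULL_RM_MARKER: 0, REPREFILL_MARKER: 0}
    for line in log_text.splitlines():
        for marker in counts:
            if marker in line:
                counts[marker] += 1
    return counts
```

Calling `count_forced_reprefills(open("server.log").read())` after the two requests gives the counts directly; a non-zero count for the re-processing message after the second request indicates that the fallback was taken.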

### Expected behavior

- Request 2 prefill ≈ free (state restored from the checkpoint that `do_checkpoint()` populated after Request 1).
- TTFT for Request 2 drops by 1–2 orders of magnitude on long prompts.

### Why this matters

On long-prompt workloads (agent / RAG / chat with system + few-shot), the missing cache produces 5–50× higher TTFT than the same workload on a partial-removal-memory model of equal parameter count.

Concrete numbers from my own setup (a hybrid-iswa workload with a 284B-A13B MoE; details available on request, since the model itself is third-party): without the cache, a 12-turn agent loop took 298.5 s of cumulative wall time with a p50 TTFT of 18.7 s. With a candidate patch that reuses the existing checkpoints, the same loop ran in 121.7 s with a p50 TTFT of 3.15 s. Model and hardware were identical, and the same functional probe set passed in both runs (10/10, including json_schema / tools / thinking / long_12k / long_32k).
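
For reference, the ratios implied by those figures can be worked out directly. The snippet only restates the arithmetic on the four numbers quoted above; the variable names are illustrative.

```python
# Figures quoted in the paragraph above (12-turn agent loop).
wall_no_cache_s = 298.5
wall_patched_s = 121.7
ttft_p50_no_cache_s = 18.7
ttft_p50_patched_s = 3.15

wall_speedup = wall_no_cache_s / wall_patched_s          # ratio of cumulative wall time
ttft_speedup = ttft_p50_no_cache_s / ttft_p50_patched_s  # ratio of median TTFT
wall_saved_s = wall_no_cache_s - wall_patched_s          # absolute saving in seconds

print(f"wall-time speedup: {wall_speedup:.2f}x")
print(f"p50 TTFT speedup:  {ttft_speedup:.2f}x")
print(f"wall time saved:   {wall_saved_s:.1f} s")
```

That works out to roughly 2.45× on cumulative wall time and 5.94× on p50 TTFT, which places this particular workload near the lower end of the 5–50× range mentioned above.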

### Approach (candidate patch ready, not submitted)

~50-line patch in `tools/server/server-context.cpp`:

- In `update_slots()`, before the `pos_min == -1` abort, add a `slot.ctx_seq_rm_type == COMMON_CONTEXT_SEQ_RM_TYPE_FULL` branch that searches `slot.prompt.checkpoints` for a checkpoint with `n_tokens ≤ n_past`, restores it via `llama_state_seq_set_data_ext(..., LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY)`, and replays the remaining suffix.
- Wrap the existing `pos_min`-based code in an `else` so that partial-removal models are entirely unchanged.
- Invalidate checkpoints with `cur.n_tokens > n_past` on the FULL path, versus `cur.pos_max > pos_next` on the partial path.

No new public API. No new flags. Reuses the existing `--cache-ram` budget and the existing `llama_state_seq_*` calls.
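
To make the selection and invalidation rules concrete, here is a small Python model of the FULL-path bookkeeping. It is a sketch under stated assumptions, not the C++ patch: `Checkpoint`, `pick_checkpoint`, `drop_stale`, and `plan_prefill` are hypothetical names, `n_past` is taken to mean the length of the prefix shared with the cached prompt, and choosing the largest eligible checkpoint is an assumption (the bullets only require `n_tokens ≤ n_past`).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Checkpoint:
    # Number of prompt tokens the saved state covers. The field name follows
    # the `n_tokens` used in the bullets above; the class itself is illustrative.
    n_tokens: int


def pick_checkpoint(checkpoints: List[Checkpoint], n_past: int) -> Optional[Checkpoint]:
    """Return the eligible checkpoint covering the most tokens, or None if there is none."""
    usable = [c for c in checkpoints if c.n_tokens <= n_past]
    return max(usable, key=lambda c: c.n_tokens, default=None)


def drop_stale(checkpoints: List[Checkpoint], n_past: int) -> List[Checkpoint]:
    """FULL-path invalidation: discard checkpoints that extend past the shared prefix."""
    return [c for c in checkpoints if c.n_tokens <= n_past]


def plan_prefill(checkpoints: List[Checkpoint], n_past: int, n_prompt: int) -> Tuple[int, int]:
    """Return (tokens restored from a checkpoint, tokens that still need evaluation)."""
    best = pick_checkpoint(checkpoints, n_past)
    restored = best.n_tokens if best is not None else 0  # 0 means a full re-prefill
    return restored, n_prompt - restored


checkpoints = [Checkpoint(512), Checkpoint(2048), Checkpoint(4096)]
print(plan_prefill(checkpoints, n_past=3000, n_prompt=3200))  # (2048, 1152)
print(plan_prefill(checkpoints, n_past=100, n_prompt=3200))   # (0, 3200)
```

In the first call the 4096-token checkpoint is skipped because it extends past the shared prefix, so the state is restored at 2048 tokens and the remaining 1152 tokens are replayed. In the second call no checkpoint qualifies and the model falls back to a full prefill, which is the behavior every request gets today.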

I previously opened this as PR #22932, which a maintainer closed with the (correct) suggestion to open an issue with a reproduction first. I am submitting this issue per that guidance and am happy to send the patch for review once the approach is acknowledged.

### Related

- #13194 — SWA / iswa cache discussion that introduced the FULL-removal mode

---

### AI-assistance disclosure

This issue text and the candidate patch were drafted with coding-agent assistance (Claude). The bug investigation and reproduction recipe were done on my own setup; the technical claims are mine. I'm aware of the project's AI-content guidelines and am happy to revise descriptions / commit messages into a more hand-written form on request.

