server: prompt cache checkpoints are slot-local, missing across slots under -np > 1 #22942

### Summary

Under `-np > 1`, `llama-server`'s prompt-cache checkpoints are **slot-local**: `slot.prompt.checkpoints` lives on each slot, and the slot-selection logic (`server_context::get_available_slot` and the LCP / token-match path inside `update_slots`) only consults the candidate slot's own checkpoints. When a second request with a matching prefix is routed to a different slot than the one holding the matching checkpoint, the request falls through to a cold prefill — even though a perfectly usable checkpoint exists elsewhere on the same server.

This is most visible on concurrent agent / RAG / chat workloads where two clients share the same system prompt + few-shot prefix but happen to land on different slots due to slot availability.
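
The failure mode can be illustrated with a toy model. This is not llama-server code: `toy_slot`, `reusable_local`, and `reusable_anywhere` are hypothetical names, and a checkpoint is reduced to the token prefix it covers. The sketch only shows why a request routed to an empty slot finds nothing to reuse while a full match sits on a busy neighbour.

```cpp
// Toy model of slot-local checkpoint lookup. All types and names here are
// hypothetical and do not mirror llama-server's real data structures.
#include <cassert>
#include <cstddef>
#include <vector>

using tokens = std::vector<int>;

// Length of the longest common prefix of two token sequences.
std::size_t common_prefix(const tokens & a, const tokens & b) {
    std::size_t n = 0;
    while (n < a.size() && n < b.size() && a[n] == b[n]) {
        ++n;
    }
    return n;
}

struct toy_slot {
    bool   busy = false;
    tokens checkpoint; // prefix this slot could restore; empty if none
};

// Slot-local view: only the slot that received the request is consulted.
std::size_t reusable_local(const std::vector<toy_slot> & slots,
                           std::size_t routed_to,
                           const tokens & prompt) {
    return common_prefix(slots[routed_to].checkpoint, prompt);
}

// Server-wide view: the best match held by any slot, busy or not.
std::size_t reusable_anywhere(const std::vector<toy_slot> & slots,
                              const tokens & prompt) {
    std::size_t best = 0;
    for (const toy_slot & s : slots) {
        const std::size_t n = common_prefix(s.checkpoint, prompt);
        if (n > best) {
            best = n;
        }
    }
    return best;
}
```

With slot 0 busy and holding a 5-token checkpoint, a 6-token prompt that shares those 5 tokens and is routed to slot 1 gets `reusable_local(...) == 0`, while `reusable_anywhere(...) == 5`.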

### Reproduction (recipe)

```bash
# Two slots, prompt cache on, any model (FULL-only memory makes it more visible
# but the slot-local checkpoint limitation applies to all memory backends).
./build/bin/llama-server -m "$MODEL" -c 16384 --cache-ram 4096 -np 2 \
  --port 18080 --no-warmup > server.log 2>&1 &

PROMPT=$(python3 -c 'print("System: you are an assistant.\n\n" + ("Background: " + "lorem ipsum " * 300) + "\n\nQuestion: summarize.")')

# 1. Prime a slot with the prompt so that it holds a checkpoint. On a fresh
#    server this is assumed to be slot 0; confirm in server.log.
curl -sS http://localhost:18080/v1/chat/completions -H 'Content-Type: application/json' \
  -d "$(python3 -c 'import json,sys; print(json.dumps({"messages":[{"role":"user","content":sys.stdin.read()}],"max_tokens":16}))' <<< "$PROMPT")" \
  >/dev/null

# 2. Keep slot 0 busy with a long-running unrelated request. Slot routing is
#    not guaranteed, so confirm in server.log that this request landed on slot 0.
LONG=$(python3 -c 'print("Write a 2000-word essay on " + "x " * 50)')
curl -sS http://localhost:18080/v1/chat/completions -H 'Content-Type: application/json' \
  -d "$(python3 -c 'import json,sys; print(json.dumps({"messages":[{"role":"user","content":sys.stdin.read()}],"max_tokens":4096}))' <<< "$LONG")" &
LONG_PID=$!

sleep 1

# 3. While slot 0 is busy, send the prompt again. Only slot 1 is free to take it.
echo "=== retry on slot 1 (expected: cache hit from slot 0) ==="
time curl -sS http://localhost:18080/v1/chat/completions -H 'Content-Type: application/json' \
  -d "$(python3 -c 'import json,sys; print(json.dumps({"messages":[{"role":"user","content":sys.stdin.read()}],"max_tokens":16}))' <<< "$PROMPT")" \
  | python3 -c 'import json,sys; print(json.load(sys.stdin).get("usage"))'

wait $LONG_PID
```

### Observed behavior

Step 3 takes the full cold-prefill wall time (e.g. ~15 s on a 3K-token prompt), and `server.log` shows slot 1 doing a fresh prefill from scratch.

### Expected behavior

Step 3 should restore slot 0's checkpoint through a shared cache and process only the tail tokens, returning in under a second when the prefix matches fully.

### Concrete measurement (from a candidate patch)

On a `-np 2` test server with a candidate fix that introduces a shared checkpoint pool + global LCP-aware slot selection, an exact-prefix probe drops from **15.42 s** (storage-only baseline, cache exists but slot-routing can't find it) to **0.755 s** (scheduling picks the slot whose checkpoint matches, or restores into an idle slot from the shared pool). Same hardware, same model, same prompt, same probe set.
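
For scale, the two timings quoted above work out as follows. This is plain arithmetic on the reported numbers, not a new measurement:

```math
\frac{15.42\ \text{s}}{0.755\ \text{s}} \approx 20.4, \qquad 1 - \frac{0.755}{15.42} \approx 0.951
```

That is roughly a 20× reduction in wall time, or about 95% of the probe's latency removed.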

### Why this matters

In production workloads where:

- Multiple concurrent users share a long system prompt (chatbots, agents)
- A single user runs multiple parallel pipelines (RAG fan-out, eval batches)
- A scheduled worker (e.g. background digest) competes with an interactive worker for the same server

…the slot a request lands on is effectively random, so per-slot cache locality wastes most of the cache's potential. With shared checkpoints, the cache hit rate under `-np N` should approach the single-slot ideal.

Memory cost should not increase, and it drops whenever slots share a prefix: today, two slots holding the same prefix store two copies of the checkpoint; with a shared pool, one copy serves both.

### Approach (candidate patch ready, not submitted)

Two-layer split, each layer independently shippable:

1. **Storage layer**: lift `slot.prompt.checkpoints` to a server-global pool keyed by `(model_id, n_tokens, byte-hash-of-state)`. Use refcounted handles; LRU eviction must not free a checkpoint that is currently being restored or held by a live slot. Per-slot pointers become read references into the pool, not owners.

2. **Scheduling layer**: in `get_available_slot` (and adjacent token-match logic), do the LCP / n_tokens search over the **global pool**, not just the candidate slot. Two scheduler policies surface naturally:

   - `prefer_idle_slot_with_global_restore` (default): pick the best idle slot, restore the matching checkpoint into it from the pool.
   - `prefer_owning_slot_with_wait` (opt-in): if a busy slot already has the checkpoint loaded, wait for it instead of restoring elsewhere — only useful when restore cost > expected wait time.
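
The two policies above can be sketched as a single decision function. This is a minimal illustration, not the candidate patch: `pool_entry`, `decision`, and `pick_slot` are hypothetical, the policy names are borrowed from the list above, and the sketch assumes a checkpoint is usable only when its whole token prefix matches the start of the prompt.

```cpp
// Hypothetical sketch of global-pool slot selection; not llama-server code.
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using tokens = std::vector<int>;

enum class sched_policy {
    prefer_idle_slot_with_global_restore, // default in the proposal
    prefer_owning_slot_with_wait          // opt-in in the proposal
};

struct pool_entry {
    tokens prefix;    // tokens covered by the checkpoint
    int    loaded_in; // slot that already has it loaded, or -1
};

struct decision {
    int  slot;      // slot to use, or -1 if the request has to be deferred
    int  entry;     // pool entry to reuse, or -1 for a cold prefill
    bool must_wait; // true when the chosen slot is currently busy
};

// Assumption: a checkpoint helps only if the prompt starts with all of it.
bool is_prefix(const tokens & p, const tokens & prompt) {
    return p.size() <= prompt.size() && std::equal(p.begin(), p.end(), prompt.begin());
}

decision pick_slot(const std::vector<bool> & slot_busy,
                   const std::vector<pool_entry> & pool,
                   const tokens & prompt,
                   sched_policy policy) {
    // Search the whole pool for the longest usable checkpoint.
    int best = -1;
    std::size_t best_len = 0;
    for (std::size_t i = 0; i < pool.size(); ++i) {
        const tokens & p = pool[i].prefix;
        if (p.size() > best_len && is_prefix(p, prompt)) {
            best = static_cast<int>(i);
            best_len = p.size();
        }
    }
    // First idle slot, if any.
    int idle = -1;
    for (std::size_t s = 0; s < slot_busy.size(); ++s) {
        if (!slot_busy[s]) { idle = static_cast<int>(s); break; }
    }
    if (best >= 0) {
        const int owner = pool[best].loaded_in;
        // An idle owner needs no restore at all, whatever the policy.
        if (owner >= 0 && !slot_busy[owner]) return {owner, best, false};
        // Opt-in: queue behind the busy owner instead of restoring elsewhere.
        if (owner >= 0 && policy == sched_policy::prefer_owning_slot_with_wait) {
            return {owner, best, true};
        }
    }
    // Default: restore into an idle slot (slot == -1 means none is free).
    return {idle, best, false};
}
```

With the matching checkpoint loaded in busy slot 0 and slot 1 idle, the default policy returns slot 1 with a restore from the pool, while the opt-in policy returns slot 0 with `must_wait` set.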

Correctness gotchas the patch resolves:

- Concurrent reads of the same checkpoint bytes (slot A restoring while slot B already restored from the same handle).
- Eviction during in-flight restore: refcount + deferred-free.
- Backward compatibility: when `-np 1`, behavior is identical to today.
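
One way to get the refcount and eviction behavior described above is to let `std::shared_ptr` carry the reference count. The sketch below is an assumption about how such a pool could look, not the candidate patch: `checkpoint_pool`, `put`, and `find` are hypothetical, lookup is simplified to `n_tokens` alone instead of the full `(model_id, n_tokens, byte-hash-of-state)` key, and eviction skips entries that are still referenced rather than freeing them later.

```cpp
// Hypothetical refcounted checkpoint pool with LRU eviction; not llama-server code.
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

struct checkpoint {
    std::size_t               n_tokens;
    std::vector<std::uint8_t> state; // never modified after publication
};

// Read-only handle: concurrent restores can share the same bytes.
using checkpoint_handle = std::shared_ptr<const checkpoint>;

class checkpoint_pool {
public:
    explicit checkpoint_pool(std::size_t budget_bytes) : budget(budget_bytes) {}

    // Publish a checkpoint and return a handle that keeps it alive.
    checkpoint_handle put(std::size_t n_tokens, std::vector<std::uint8_t> state) {
        std::lock_guard<std::mutex> lock(mu);
        used += state.size();
        checkpoint_handle h = std::make_shared<checkpoint>(checkpoint{n_tokens, std::move(state)});
        lru.push_front(h);
        evict_locked();
        return h;
    }

    // Simplified lookup by token count; marks the entry as most recently used.
    checkpoint_handle find(std::size_t n_tokens) {
        std::lock_guard<std::mutex> lock(mu);
        for (auto it = lru.begin(); it != lru.end(); ++it) {
            if ((*it)->n_tokens == n_tokens) {
                lru.splice(lru.begin(), lru, it);
                return lru.front();
            }
        }
        return nullptr;
    }

    std::size_t bytes_used() const {
        std::lock_guard<std::mutex> lock(mu);
        return used;
    }

private:
    // Walk from the least recently used end and drop entries that only the
    // pool references. New references are handed out only under `mu`, so an
    // entry with use_count() == 1 cannot be picked up while we hold the lock.
    void evict_locked() {
        for (auto it = lru.end(); used > budget && it != lru.begin();) {
            --it;
            if (it->use_count() == 1) {
                used -= (*it)->state.size();
                it = lru.erase(it);
            }
        }
    }

    mutable std::mutex           mu;
    std::size_t                  budget;
    std::size_t                  used = 0;
    std::list<checkpoint_handle> lru;
};
```

In this sketch the pool can temporarily exceed its budget while every entry is held by a slot. A real implementation would need to decide whether that is acceptable under `--cache-ram` or whether new checkpoints should be rejected instead.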

No new public API (existing `--cache-ram` budget covers the global pool). One internal interface change: checkpoint handle becomes refcounted.

### Related

- #22940 (prompt cache disabled for FULL-only memory backends — separate but related; both are about the cache restore path under-utilizing existing infrastructure)

---

### AI-assistance disclosure

This issue text and the candidate patch were drafted with coding-agent assistance (Claude / Codex). The bug observation came from concurrent-workload measurements on my own setup; the candidate patch was implemented by Codex under my supervision. Technical claims are mine.

