server: prompt cache checkpoints are slot-local, missing across slots under -np > 1 #22942

@leon7609

Summary

Under -np > 1, llama-server's prompt-cache checkpoints are slot-local: slot.prompt.checkpoints lives on each slot, and the slot-selection logic (server_context::get_available_slot and the LCP / token-match path inside update_slots) only consults the candidate slot's own checkpoints. When a second request with a matching prefix is routed to a different slot than the one holding the matching checkpoint, the request falls through to a cold prefill — even though a perfectly usable checkpoint exists elsewhere on the same server.

This is most visible on concurrent agent / RAG / chat workloads where two clients share the same system prompt + few-shot prefix but happen to land on different slots due to slot availability.
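
To make the gap concrete, here is a toy model of the two lookup strategies. It is not llama.cpp code: the `Slot` class, the function names, and the representation of a checkpoint as a plain token prefix are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


def lcp_len(a, b):
    """Length of the longest common prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


@dataclass
class Slot:
    slot_id: int
    busy: bool = False
    # Hypothetical model: each checkpoint is just the token prefix it covers.
    checkpoints: list = field(default_factory=list)


def reusable_tokens_slot_local(slot, prompt):
    """Slot-local lookup: only the candidate slot's own checkpoints count."""
    return max((lcp_len(c, prompt) for c in slot.checkpoints), default=0)


def reusable_tokens_global(slots, prompt):
    """Global lookup: any checkpoint held anywhere on the server counts."""
    return max((reusable_tokens_slot_local(s, prompt) for s in slots), default=0)


prompt = list(range(100))
slot0 = Slot(0, busy=True, checkpoints=[prompt[:96]])  # primed earlier, now busy
slot1 = Slot(1)                                        # idle, holds nothing

print(reusable_tokens_slot_local(slot1, prompt))       # 0  -> cold prefill
print(reusable_tokens_global([slot0, slot1], prompt))  # 96 -> only the tail remains
```

In this model the request that lands on slot 1 reuses nothing under slot-local lookup, while a global lookup would leave only the last 4 tokens to process.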

Reproduction (recipe)

# Two slots, prompt cache on, any model (FULL-only memory makes it more visible
# but the slot-local checkpoint limitation applies to all memory backends).
./build/bin/llama-server -m "$MODEL" -c 16384 --cache-ram 4096 -np 2 \
  --port 18080 --no-warmup > server.log 2>&1 &

PROMPT=$(python3 -c 'print("System: you are an assistant.\n\n" + ("Background: " + "lorem ipsum " * 300) + "\n\nQuestion: summarize.")')

# 1. Prime slot 0 with the prompt (so slot 0 now holds a checkpoint).
curl -sS http://localhost:18080/v1/chat/completions -H 'Content-Type: application/json' \
  -d "$(python3 -c 'import json,sys; print(json.dumps({"messages":[{"role":"user","content":sys.stdin.read()}],"max_tokens":16}))' <<< "$PROMPT")" \
  >/dev/null

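# 2. Send a long-running unrelated request to keep slot 0 busy. The server,
#    not the client, picks the slot, so confirm in server.log that this
#    request actually landed on slot 0 before trusting step 3.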
LONG=$(python3 -c 'print("Write a 2000-word essay on " + "x " * 50)')
curl -sS http://localhost:18080/v1/chat/completions -H 'Content-Type: application/json' \
  -d "$(python3 -c 'import json,sys; print(json.dumps({"messages":[{"role":"user","content":sys.stdin.read()}],"max_tokens":4096}))' <<< "$LONG")" &
LONG_PID=$!

sleep 1

# 3. While slot 0 is busy, send the prompt again. It must route to slot 1.
echo "=== retry on slot 1 (expected: cache hit from slot 0) ==="
time curl -sS http://localhost:18080/v1/chat/completions -H 'Content-Type: application/json' \
  -d "$(python3 -c 'import json,sys; print(json.dumps({"messages":[{"role":"user","content":sys.stdin.read()}],"max_tokens":16}))' <<< "$PROMPT")" \
  | python3 -c 'import json,sys; print(json.load(sys.stdin).get("usage"))'

wait $LONG_PID

Observed behavior

Step 3 takes the full cold-prefill wall time (e.g. ~15 s on a 3K-token prompt), and server.log shows slot 1 doing a fresh prefill from scratch.

Expected behavior

Step 3 should restore from slot 0's checkpoint via a shared cache and return after processing only the tail tokens (sub-second on a fully matching prefix).

Concrete measurement (from a candidate patch)

On a -np 2 test server, a candidate fix that introduces a shared checkpoint pool plus global LCP-aware slot selection cuts an exact-prefix probe from 15.42 s to 0.755 s. The 15.42 s figure is the storage-only baseline, where the cache exists but slot routing can't find it. The 0.755 s figure is with scheduling that either picks the slot whose checkpoint matches or restores into an idle slot from the shared pool. Same hardware, same model, same prompt, same probe set.
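
For scale, the two timings above work out as follows. The variable names are mine; the only inputs are the two measured numbers.

```python
baseline_s = 15.42  # storage-only baseline from the measurement above
patched_s = 0.755   # candidate patch, same probe

speedup = baseline_s / patched_s
saved_fraction = 1 - patched_s / baseline_s

print(f"speedup: {speedup:.1f}x")                # speedup: 20.4x
print(f"wall time saved: {saved_fraction:.1%}")  # wall time saved: 95.1%
```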

Why this matters

In production workloads where:

  • Multiple concurrent users share a long system prompt (chatbots, agents)
  • A single user runs multiple parallel pipelines (RAG fan-out, eval batches)
  • A scheduled worker (e.g. background digest) competes with an interactive worker for the same server

…the slot a request lands on is effectively random, so the per-slot cache locality wastes most of the cache's potential. With shared checkpoints, the cache hit rate under -np N approaches the single-slot ideal.

The memory cost is also lower than today's: two slots holding the same prefix currently store two copies of the checkpoint, whereas a shared pool stores one copy that serves both.

Approach (candidate patch ready, not submitted)

Two-layer split, each layer independently shippable:

  1. Storage layer: lift slot.prompt.checkpoints to a server-global pool keyed by (model_id, n_tokens, byte-hash-of-state). Use refcounted handles; LRU eviction must not free a checkpoint that is currently being restored or held by a live slot. Per-slot pointers become read references into the pool, not owners.

  2. Scheduling layer: in get_available_slot (and adjacent token-match logic), do the LCP / n_tokens search over the global pool, not just the candidate slot. Two scheduler policies surface naturally:

    • prefer_idle_slot_with_global_restore (default): pick the best idle slot, restore the matching checkpoint into it from the pool.
    • prefer_owning_slot_with_wait (opt-in): if a busy slot already has the checkpoint loaded, wait for it instead of restoring elsewhere — only useful when restore cost > expected wait time.
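
The two policies can be sketched as a single decision function. This is a simplified illustration, not the candidate patch. The `Slot` and `Decision` types, the `est_wait_s` and `est_restore_s` estimates, and the reduction of "best idle slot" to "first idle slot" are all assumptions; only the two policy names come from the list above.

```python
from dataclasses import dataclass


def lcp_len(a, b):
    """Length of the longest common prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


@dataclass
class Slot:
    slot_id: int
    busy: bool
    loaded: tuple = ()       # tokens currently resident in the slot
    est_wait_s: float = 0.0  # hypothetical estimate of time until the slot frees up


@dataclass
class Decision:
    slot_id: int
    action: str  # "restore_from_pool", "wait_for_owner", or "cold_prefill"
    reuse_tokens: int


def choose_slot(slots, pool, prompt,
                policy="prefer_idle_slot_with_global_restore",
                est_restore_s=0.0):
    # Search the whole pool, not just one slot's checkpoints.
    pool_best = max((lcp_len(c, prompt) for c in pool), default=0)
    idle = [s for s in slots if not s.busy]
    busy_owners = [s for s in slots
                   if s.busy and pool_best > 0
                   and lcp_len(s.loaded, prompt) >= pool_best]

    if policy == "prefer_owning_slot_with_wait" and busy_owners:
        owner = min(busy_owners, key=lambda s: s.est_wait_s)
        # Waiting only pays off when restoring elsewhere would cost more.
        if not idle or est_restore_s > owner.est_wait_s:
            return Decision(owner.slot_id, "wait_for_owner", pool_best)

    if not idle:
        return None  # nothing free: the caller would queue the request

    target = idle[0]  # simplification of "best idle slot"
    if pool_best > 0:
        return Decision(target.slot_id, "restore_from_pool", pool_best)
    return Decision(target.slot_id, "cold_prefill", 0)


prompt = tuple(range(100))
pool = [prompt[:96]]
slots = [Slot(0, busy=True, loaded=prompt[:96], est_wait_s=0.2),
         Slot(1, busy=False)]

d_default = choose_slot(slots, pool, prompt)
d_wait = choose_slot(slots, pool, prompt,
                     policy="prefer_owning_slot_with_wait", est_restore_s=0.5)
print(d_default)  # restores into idle slot 1
print(d_wait)     # waits for slot 0, since 0.5 s restore > 0.2 s wait
```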

Correctness gotchas the patch resolves:

  • Concurrent reads of the same checkpoint bytes (slot A restoring while slot B already restored from the same handle).
  • Eviction during in-flight restore: refcount + deferred-free.
  • Backward compatibility: with -np 1, behavior is identical to today.

No new public API (existing --cache-ram budget covers the global pool). One internal interface change: checkpoint handle becomes refcounted.

AI-assistance disclosure

This issue text and the candidate patch were drafted with coding-agent assistance (Claude / Codex). The bug observation came from concurrent-workload measurements on my own setup; the candidate patch was implemented by Codex under my supervision. Technical claims are mine.
