Eval bug: KV cache drops ~4k tokens per turn on Qwen3.6-35B-A3B (since build b9235) #23589

@orangeswim

Name and Version

llama-server b9297-b0df4c0cf (Windows CUDA 13.1 x64)
Regression since b9235. Last good: b9222.

Operating systems

Windows

GGML backends

CUDA

Hardware

RTX 3090 24GB, 32GB system RAM

Models

| Model | File | Source | Miss (tokens) |
|-------|------|--------|---------------|
| Qwen3.6-35B-A3B Q4_K_XL MTP | Qwen3.6-35B-A3B-UD-Q4_K_XL-mtp.gguf | unsloth/Qwen3.6-35B-A3B-MTP-GGUF | 4,108 |
| Qwen3.6-35B-A3B Q5_K_M | Qwen3.6-35B-A3B-UD-Q5_K_M.gguf | unsloth/Qwen3.6-35B-A3B-GGUF | 4,108 |
| Qwen3.6-27B Q4_K_XL | Qwen3.6-27B-UD-Q4_K_XL.gguf | unsloth/Qwen3.6-27B-GGUF | 16 |
| Qwen3.5-4B Q4_K_XL | Qwen3.5-4B-UD-Q4_K_XL.gguf | unsloth/Qwen3.5-4B-GGUF | 16 |
| Gemma 4 E4B Q8_K_XL | gemma-4-E4B-it-UD-Q8_K_XL.gguf | unsloth/gemma-4-E4B-it-GGUF | 12 |
| Gemma 4 26B-A4B IQ2_XXS | gemma-4-26B-A4B-it-UD-IQ2_XXS.gguf | unsloth/gemma-4-26B-A4B-it-GGUF | 12 |
| Gemma 3 4B Q2_K | google.gemma-3-4b-it.Q2_K.gguf | DevQuasar/google.gemma-3-4b-it-GGUF | 12 |

Problem description & steps to reproduce

While running benchmarks on local models, I hit this issue after updating llama.cpp to the latest build. After investigation I tracked it down to commit ccee426.
I built b9297 with ccee426 reverted and confirmed that the cache drop no longer occurs.

Issue:

Commit ccee426 (PR #23280) fixed a crash on hybrid attention models but introduced a cache reuse regression: exactly one batch's worth of cached tokens (the -b value) is dropped and recomputed on every multi-turn request. At 150k context with -b 4096, follow-up prefill goes from 317ms → 3,074ms (~10x slower). The miss scales with batch size: -b 4096 → miss 4,108, -b 2048 → miss 2,060. For agentic or multi-turn use, this adds roughly 1-3 seconds to every turn, and the penalty grows with context length.
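
To make the scaling explicit, here is a minimal sketch of the relationship observed above (miss = batch size + new tokens). The function names are made up for illustration, and the value of 12 new tokens is simply what these runs showed for this model and prompt, not a constant of llama-server.

```shell
# Sketch of the observed relationship: miss = batch size + new tokens.
# The 12 new tokens come from the reported runs; other models/templates differ.

# expected_miss BATCH NEW_TOKENS -> tokens recomputed on the follow-up request
expected_miss() {
  local batch=$1 new_tokens=$2
  echo $(( batch + new_tokens ))
}

# dropped_tokens MISS NEW_TOKENS -> cached tokens that were thrown away
dropped_tokens() {
  local miss=$1 new_tokens=$2
  echo $(( miss - new_tokens ))
}

for b in 4096 2048 16; do
  echo "-b $b -> expected miss $(expected_miss "$b" 12)"
done
# -b 4096 -> expected miss 4108
# -b 2048 -> expected miss 2060
# -b 16 -> expected miss 28
```

Running `dropped_tokens` on a measured miss gives a number close to the -b value when the regression is present, and close to 0 when it is not.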

Reproduction:

Small reproducible test
llama-server -m Qwen3.6-35B-A3B.gguf -ngl 99 -c 4096 -b 16 -ub 16 \
  --ctx-checkpoints 4 --host 0.0.0.0 --port 8080

# Generate ~500 tokens of filler
CONTENT=$(python3 -c "print('The quick brown fox jumps over the lazy dog. ' * 50)")
ESCAPED=$(python3 -c "import json,sys; print(json.dumps(sys.argv[1])[1:-1])" "$CONTENT")

# Request 1: fresh prompt
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$(printf '{"model":"local","messages":[{"role":"user","content":"%s"}],"temperature":0,"max_tokens":1,"stream":false}' "$ESCAPED")" \
  | jq '{input: .usage.prompt_tokens, cached: (.usage.prompt_tokens_details.cached_tokens // 0)}'

# Request 2: same prompt + follow-up (tests cache reuse)
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$(printf '{"model":"local","messages":[{"role":"user","content":"%s"},{"role":"assistant","content":"OK"},{"role":"user","content":"Hello"}],"temperature":0,"max_tokens":1,"stream":false}' "$ESCAPED")" \
  | jq '{input: .usage.prompt_tokens, cached: (.usage.prompt_tokens_details.cached_tokens // 0), miss: (.usage.prompt_tokens - (.usage.prompt_tokens_details.cached_tokens // 0))}'

Expected: miss = ~12 (only new tokens). Actual: miss = ~28 (includes one batch of dropped cache). The miss equals batch size + new tokens.

Large Context Test
CONTENT=$(head -c 58860 wiki.txt)
ESCAPED=$(python3 -c "import json,sys; print(json.dumps(sys.argv[1])[1:-1])" "$CONTENT")

# Request 1
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$(printf '{"model":"local","messages":[{"role":"user","content":"%s"}],"temperature":0,"max_tokens":1,"stream":false}' "$ESCAPED")" \
  | jq '{input: .usage.prompt_tokens, cached: (.usage.prompt_tokens_details.cached_tokens // 0)}'

# Request 2 (same prompt + follow-up)
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$(printf '{"model":"local","messages":[{"role":"user","content":"%s"},{"role":"assistant","content":"OK"},{"role":"user","content":"Hello"}],"temperature":0,"max_tokens":1,"stream":false}' "$ESCAPED")" \
  | jq '{input: .usage.prompt_tokens, cached: (.usage.prompt_tokens_details.cached_tokens // 0), miss: (.usage.prompt_tokens - (.usage.prompt_tokens_details.cached_tokens // 0))}'

Expected: miss = ~16 (only new tokens). Actual: miss = ~4,100.
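
For comparing runs, the miss and hit rate can be derived from the two usage fields the jq filters above print. The following is a small sketch using integer arithmetic only; `cache_stats` is a hypothetical helper name, and the sample inputs are the 150k-context numbers reported under "Relevant log output".

```shell
# cache_stats PROMPT_TOKENS CACHED_TOKENS
# Prints the miss count and the cache hit rate with one decimal place.
# Integer-only arithmetic: the rate is truncated, not rounded.
cache_stats() {
  local input=$1 cached=$2
  if [ "$input" -le 0 ]; then
    echo "prompt_tokens must be > 0" >&2
    return 1
  fi
  local miss=$(( input - cached ))
  local tenths=$(( cached * 1000 / input ))   # hit rate in tenths of a percent
  echo "miss=${miss} hit=$(( tenths / 10 )).$(( tenths % 10 ))%"
}

cache_stats 148124 147986   # b9186:                 miss=138 hit=99.9%
cache_stats 148124 143894   # b9297, 4 checkpoints:  miss=4230 hit=97.1%
```

The gap between the two miss values (4,230 versus 138) is 4,092 tokens, which is close to the -b 4096 batch size used in that run.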

Root cause

ccee426 changes two lines in tools/server/server-context.cpp:

- const auto pos_min_thold = std::max(0, pos_next - n_swa);
+ const auto pos_min_thold = std::max(0, pos_next - n_swa - 1);

- if (n_past > 0 && n_past < slot.prompt.n_tokens()) {
+ if (n_past > 0 && n_past <= slot.prompt.n_tokens()) {

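The <= change causes the checkpoint search logic to run when n_past == slot.prompt.n_tokens() (full prefix match). On Qwen3.6-35B-A3B, pos_min is high, so the pos_min >= pos_min_thold check triggers. The most recent checkpoint then fails the strict comparison against the threshold, so the server falls back to an older checkpoint roughly one batch earlier. If no checkpoint qualifies (for example with 0 checkpoints), the code forces a full reset (n_past = 0). Either way, at least one batch of cached tokens is lost.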

Other models (including Qwen3.6-27B, Qwen3.5-4B, Gemma dense/MoE/SWA) keep pos_min low enough that the threshold check passes and existing KV cache is reused normally. I was not able to reproduce it with the other models I tried.
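
The effect of the two changed lines can be illustrated with a simplified model. This is only a sketch, not the server code: the function names are hypothetical, it assumes pos_next equals n_past (consistent with the threshold of 17609 in the verbose log), and n_tokens=17610 is chosen to represent the full-prefix-match case.

```shell
# Simplified model of the two conditions changed by ccee426 (not the real code).

# pos_min_thold POS_NEXT N_SWA EXTRA -> max(0, POS_NEXT - N_SWA - EXTRA)
# EXTRA is 0 for the old formula and 1 for the new one.
pos_min_thold() {
  local v=$(( $1 - $2 - $3 ))
  if [ "$v" -lt 0 ]; then v=0; fi
  echo "$v"
}

# runs_checkpoint_search N_PAST N_TOKENS MODE
# MODE "old": n_past > 0 && n_past <  n_tokens
# MODE "new": n_past > 0 && n_past <= n_tokens
# Exit status 0 means the checkpoint search path is entered.
runs_checkpoint_search() {
  local n_past=$1 n_tokens=$2 mode=$3
  [ "$n_past" -gt 0 ] || return 1
  if [ "$mode" = "old" ]; then
    [ "$n_past" -lt "$n_tokens" ]
  else
    [ "$n_past" -le "$n_tokens" ]
  fi
}

# n_past, pos_min and n_swa are from the verbose log; n_tokens is an assumption.
n_past=17610; n_tokens=17610; pos_min=17613; n_swa=0
for mode in old new; do
  if runs_checkpoint_search "$n_past" "$n_tokens" "$mode"; then
    echo "$mode: checkpoint search runs"
  else
    echo "$mode: checkpoint search skipped"
  fi
done
thold=$(pos_min_thold "$n_past" "$n_swa" 1)
echo "pos_min_thold=$thold, pos_min >= thold: $([ "$pos_min" -ge "$thold" ] && echo yes || echo no)"
```

With these inputs only the "new" condition enters the search path, and pos_min (17613) is at or above the threshold (17609), which matches the path seen in the log.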

Verbose log confirms this path (server: llama-server -m Qwen3.6-35B-A3B-UD-Q4_K_XL-mtp.gguf -ngl 99 -c 20000 -b 4096 -ub 4096 --no-mmap --cache-type-k q8_0 --cache-type-v q8_0 --n-cpu-moe 9 -fa on -t 8 --ctx-checkpoints 4 --log-file log.txt):

n_past = 17610, pos_min = 17613, n_swa = 0
Checking checkpoint with [17609, 17609] against 17609...  ← fails (17609 < 17609 is false)
Checking checkpoint with [13517, 13517] against 17609...  ← used (13517 < 17609)
restored context checkpoint (n_tokens = 13518, n_past = 13518)
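
The checkpoint selection in this log can be replayed with a small sketch. It assumes the search walks checkpoints from newest to oldest and accepts the first one whose position is strictly below the threshold, as the log annotations suggest; `pick_checkpoint` is a hypothetical name, and each checkpoint is reduced to a single position because min and max are equal in the log.

```shell
# pick_checkpoint THOLD POS...
# POS... are checkpoint positions ordered newest first. Prints the first one
# that is strictly below THOLD, or returns 1 if none qualifies.
pick_checkpoint() {
  local thold=$1 pos
  shift
  for pos in "$@"; do
    if [ "$pos" -lt "$thold" ]; then
      echo "$pos"
      return 0
    fi
  done
  return 1
}

n_past=17610   # from the log
thold=17609    # threshold the log compares against
if pos=$(pick_checkpoint "$thold" 17609 13517); then
  restored=$(( pos + 1 ))   # the log restores n_past = 13518 from position 13517
  echo "restored n_past=$restored, cached tokens recomputed=$(( n_past - restored ))"
else
  echo "no usable checkpoint: full reprocess (n_past=0)"
fi
# restored n_past=13518, cached tokens recomputed=4092
```

The 4,092 recomputed tokens are the distance between the two checkpoints, which is roughly the -b 4096 batch size. With an empty checkpoint list the function returns 1, which corresponds to the 0-checkpoint run where nothing was reused.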

First Bad Commit

| Build | Miss | Status |
|-------|------|--------|
| b9222 | 16 | Last good |
| b9235 | 4,108 | First bad (only ccee426 touches server-context in this range) |

Verified: reverting the two lines on latest master (b22ff4b) restores correct behavior (miss = 16). This revert was not tested against the original crash from #23280, so it may reintroduce that issue.

Relevant log output

Qwen3.6-35B-A3B, 150k context, builds b9186 and b9297

| Build | Checkpoints | Req2 Cached | Miss | Req2 Prefill |
|-------|-------------|-------------|------|--------------|
| b9186 | 1 | 147,986 / 148,124 (99.9%) | 138 | 317ms |
| b9186 | 4 | 147,986 / 148,124 (99.9%) | 138 | 318ms |
| b9297 | 0 | 0 / 148,124 (0%) | 148,124 | 67,937ms |
| b9297 | 4 | 143,894 / 148,124 (97.1%) | 4,230 | 3,074ms |
| b9297 | 8 | 143,894 / 148,124 (97.1%) | 4,230 | 3,095ms |
| b9297 | 16 | 143,894 / 148,124 (97.1%) | 4,230 | 3,095ms |

Cross-model test on b9297 (all clean except Qwen3.6-35B-A3B)

| Model | Architecture | Miss |
|-------|--------------|------|
| Qwen3.6-35B-A3B | MoE + hybrid | 4,108 |
| Qwen3.6-27B | Dense | 16 |
| Qwen3.5-4B | Dense | 16 |
| Gemma 4 E4B | Dense | 12 |
| Gemma 4 26B-A4B | MoE | 12 |
| Gemma 3 4B | Dense + SWA | 12 |

Batch Size Test

| Batch Size | Miss | Miss minus new tokens |
|------------|------|-----------------------|
| 4096 | 4,108 | 4,096 |
| 2048 | 2,060 | 2,048 |
| 16 | 28 | 16 |

Identical Request Repeated (b9297, 18k context)

Sent the exact same 18k-token request twice in a row to test cache lookup without conversation changes. Shows one shot works normally.

| Request | Input Tokens | Cached | Prefill |
|---------|-------------|--------|---------|
| Req 1 | 17,614 | 0 | 5,141ms |
| Req 2 (identical) | 17,614 | 17,610 (99.98%) | 88ms |
