Name and Version
llama-server b9297-b0df4c0cf (Windows CUDA 13.1 x64)
Regression since b9235. Last good: b9222.
Operating systems
Windows
GGML backends
CUDA
Hardware
RTX 3090 24GB, 32GB system RAM
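Models
Qwen3.6-35B-A3B-UD-Q4_K_XL-mtp.gguf
Qwen3.6-35B-A3B-UD-Q5_K_M.gguf
Qwen3.6-27B-UD-Q4_K_XL.gguf
Qwen3.5-4B-UD-Q4_K_XL.gguf
gemma-4-E4B-it-UD-Q8_K_XL.gguf
gemma-4-26B-A4B-it-UD-IQ2_XXS.gguf
google.gemma-3-4b-it.Q2_K.gguf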
Problem description & steps to reproduce
While benchmarking local models, I hit this issue after updating llama.cpp to the latest build. After investigating, I tracked it down to the commit below.
I built b9297 with ccee426 reverted and confirmed that the cache drop no longer occurs.
Issue:
Commit ccee426 (PR #23280) fixed a crash on hybrid attention models but introduced a cache reuse regression: exactly one batch's worth of cached tokens (the -b value) is dropped and recomputed on every multi-turn request. At 150k context with -b 4096, follow-up prefill goes from 317ms to 3,074ms (~10x slower). The miss scales with batch size: -b 4096 gives a miss of 4,108, and -b 2048 gives a miss of 2,060. For agentic or multi-turn use, this can add 1-3 seconds to every turn, and the penalty grows with context length.
Reproduction:
Small reproducible test
llama-server -m Qwen3.6-35B-A3B.gguf -ngl 99 -c 4096 -b 16 -ub 16 \
--ctx-checkpoints 4 --host 0.0.0.0 --port 8080
# Generate ~500 tokens of filler
CONTENT=$(python3 -c "print('The quick brown fox jumps over the lazy dog. ' * 50)")
ESCAPED=$(python3 -c "import json,sys; print(json.dumps(sys.argv[1])[1:-1])" "$CONTENT")
# Request 1: fresh prompt
curl -s http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d "$(printf '{"model":"local","messages":[{"role":"user","content":"%s"}],"temperature":0,"max_tokens":1,"stream":false}' "$ESCAPED")" \
| jq '{input: .usage.prompt_tokens, cached: (.usage.prompt_tokens_details.cached_tokens // 0)}'
# Request 2: same prompt + follow-up (tests cache reuse)
curl -s http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d "$(printf '{"model":"local","messages":[{"role":"user","content":"%s"},{"role":"assistant","content":"OK"},{"role":"user","content":"Hello"}],"temperature":0,"max_tokens":1,"stream":false}' "$ESCAPED")" \
| jq '{input: .usage.prompt_tokens, cached: (.usage.prompt_tokens_details.cached_tokens // 0), miss: (.usage.prompt_tokens - (.usage.prompt_tokens_details.cached_tokens // 0))}'
Expected: miss = ~12 (only new tokens). Actual: miss = ~28 (includes one batch of dropped cache). The miss equals batch size + new tokens.
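For anyone who prefers not to juggle shell quoting, the same two-request check can be sketched in Python. This is only a sketch that mirrors the curl/jq commands above: the endpoint, request body, and `usage` fields are taken from those commands, while the function names (`build_payload`, `cache_stats`, `post`, `run`) are made up for illustration.

```python
import json
import urllib.request

# Same endpoint as the curl calls above.
URL = "http://localhost:8080/v1/chat/completions"


def build_payload(content, follow_up=False):
    """Build the body for request 1, or for request 2 when follow_up=True."""
    messages = [{"role": "user", "content": content}]
    if follow_up:
        messages += [
            {"role": "assistant", "content": "OK"},
            {"role": "user", "content": "Hello"},
        ]
    return {
        "model": "local",
        "messages": messages,
        "temperature": 0,
        "max_tokens": 1,
        "stream": False,
    }


def cache_stats(response):
    """Mirror the jq filter: input tokens, cached tokens (default 0), miss."""
    usage = response["usage"]
    cached = (usage.get("prompt_tokens_details") or {}).get("cached_tokens") or 0
    return {
        "input": usage["prompt_tokens"],
        "cached": cached,
        "miss": usage["prompt_tokens"] - cached,
    }


def post(payload, url=URL):
    """Send one chat completion request and return the decoded JSON."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def run(content):
    """Request 1 warms the cache; request 2 measures how much is reused."""
    first = cache_stats(post(build_payload(content)))
    second = cache_stats(post(build_payload(content, follow_up=True)))
    return first, second
```

With the server from the small test running, `run("The quick brown fox jumps over the lazy dog. " * 50)` returns the stats for both requests. The `miss` value of the second one is the number to compare against the expected ~12.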
Large Context Test
CONTENT=$(head -c 58860 wiki.txt)
ESCAPED=$(python3 -c "import json,sys; print(json.dumps(sys.argv[1])[1:-1])" "$CONTENT")
# Request 1
curl -s http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d "$(printf '{"model":"local","messages":[{"role":"user","content":"%s"}],"temperature":0,"max_tokens":1,"stream":false}' "$ESCAPED")" \
| jq '{input: .usage.prompt_tokens, cached: (.usage.prompt_tokens_details.cached_tokens // 0)}'
# Request 2 (same prompt + follow-up)
curl -s http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d "$(printf '{"model":"local","messages":[{"role":"user","content":"%s"},{"role":"assistant","content":"OK"},{"role":"user","content":"Hello"}],"temperature":0,"max_tokens":1,"stream":false}' "$ESCAPED")" \
| jq '{input: .usage.prompt_tokens, cached: (.usage.prompt_tokens_details.cached_tokens // 0), miss: (.usage.prompt_tokens - (.usage.prompt_tokens_details.cached_tokens // 0))}'
Expected: miss = ~16 (only new tokens). Actual: miss = ~4,100.
Root cause
ccee426 changes two lines in tools/server/server-context.cpp:
- const auto pos_min_thold = std::max(0, pos_next - n_swa);
+ const auto pos_min_thold = std::max(0, pos_next - n_swa - 1);
- if (n_past > 0 && n_past < slot.prompt.n_tokens()) {
+ if (n_past > 0 && n_past <= slot.prompt.n_tokens()) {
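The <= change causes the checkpoint search logic to run when n_past == slot.prompt.n_tokens() (full prefix match). On Qwen3.6-35B-A3B, pos_min is high, so the pos_min >= pos_min_thold condition triggers and the server looks for a checkpoint to restore. The newest checkpoint fails the strict comparison against pos_min_thold, so an older checkpoint roughly one batch back is restored and that batch of cached tokens is recomputed. With no checkpoint available, the code forces a full reset (n_past = 0).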
Other models (including Qwen3.6-27B, Qwen3.5-4B, Gemma dense/MoE/SWA) keep pos_min low enough that the threshold check passes and existing KV cache is reused normally. I was not able to reproduce it with the other models I tried.
Verbose log confirms this path (server: llama-server -m Qwen3.6-35B-A3B-UD-Q4_K_XL-mtp.gguf -ngl 99 -c 20000 -b 4096 -ub 4096 --no-mmap --cache-type-k q8_0 --cache-type-v q8_0 --n-cpu-moe 9 -fa on -t 8 --ctx-checkpoints 4 --log-file log.txt):
n_past = 17610, pos_min = 17613, n_swa = 0
Checking checkpoint with [17609, 17609] against 17609... ← fails (17609 < 17609 is false)
Checking checkpoint with [13517, 13517] against 17609... ← used (13517 < 17609)
restored context checkpoint (n_tokens = 13518, n_past = 13518)
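The log lines above can be replayed with a small model of the checkpoint selection. This is a simplified sketch, not the server code. It assumes that checkpoints are scanned newest first, that a checkpoint is accepted only if its pos_min is strictly below the threshold (as the log annotations suggest), that pos_next equals n_past here, and that restoring a checkpoint sets n_past to pos_max + 1 (which matches the logged n_past = 13518). The function names are made up for illustration.

```python
def pos_min_threshold(pos_next, n_swa, patched=True):
    """Threshold from the diff: patched=True is the post-ccee426 form."""
    return max(0, pos_next - n_swa - (1 if patched else 0))


def pick_checkpoint(checkpoints, thold):
    """Return the newest (pos_min, pos_max) with pos_min < thold, else None.

    Assumption: checkpoints are listed oldest first and scanned newest first.
    """
    for pos_min, pos_max in reversed(checkpoints):
        if pos_min < thold:
            return pos_min, pos_max
    return None


# Values taken from the verbose log.
n_past, n_swa = 17610, 0
checkpoints = [(13517, 13517), (17609, 17609)]

thold = pos_min_threshold(n_past, n_swa)     # 17609
chosen = pick_checkpoint(checkpoints, thold)  # newest one is rejected: 17609 < 17609 is false
restored_n_past = chosen[1] + 1               # assumption: resume right after pos_max
recomputed = n_past - restored_n_past

print(thold, chosen, restored_n_past, recomputed)
# prints: 17609 (13517, 13517) 13518 4092
```

Under these assumptions the model recomputes 4,092 tokens, which is close to the -b 4096 batch size used for this log.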
First Bad Commit
| Build | Miss | Status |
|-------|------|--------|
| b9222 | 16 | Last good |
| b9235 | 4,108 | First bad (only ccee426 touches server-context in this range) |
Verified: reverting the two lines on latest master (b22ff4b) restores correct behavior (miss = 16). This revert was not tested against the original crash from #23280, so it may reintroduce that issue.
Relevant log output
Qwen3.6-35B-A3B at 150k context, builds b9186 and b9297
| Build | Checkpoints | Req2 Cached | Miss | Req2 Prefill |
|-------|-------------|-------------|------|--------------|
| b9186 | 1 | 147,986 / 148,124 (99.9%) | 138 | 317ms |
| b9186 | 4 | 147,986 / 148,124 (99.9%) | 138 | 318ms |
| b9297 | 0 | 0 / 148,124 (0%) | 148,124 | 67,937ms |
| b9297 | 4 | 143,894 / 148,124 (97.1%) | 4,230 | 3,074ms |
| b9297 | 8 | 143,894 / 148,124 (97.1%) | 4,230 | 3,095ms |
| b9297 | 16 | 143,894 / 148,124 (97.1%) | 4,230 | 3,095ms |
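The derived columns can be cross-checked with a few lines of Python. This is a sketch that hard-codes the measurements from the 150k-context runs and recomputes miss, hit rate, and slowdown relative to the first b9186 row. The `summarize` helper and the tuple layout are illustrative only.

```python
# (build, checkpoints, cached_tokens, prompt_tokens, req2_prefill_ms)
rows = [
    ("b9186", 1, 147_986, 148_124, 317),
    ("b9186", 4, 147_986, 148_124, 318),
    ("b9297", 0, 0, 148_124, 67_937),
    ("b9297", 4, 143_894, 148_124, 3_074),
    ("b9297", 8, 143_894, 148_124, 3_095),
    ("b9297", 16, 143_894, 148_124, 3_095),
]

baseline_ms = rows[0][4]  # b9186 with 1 checkpoint


def summarize(row):
    """Recompute miss, hit percentage, and slowdown for one row."""
    build, n_ckpt, cached, total, ms = row
    return {
        "build": build,
        "checkpoints": n_ckpt,
        "miss": total - cached,
        "hit_pct": round(100 * cached / total, 1),
        "slowdown": round(ms / baseline_ms, 1),
    }


for row in rows:
    print(summarize(row))
```

For the b9297 row with 4 checkpoints this gives a miss of 4,230, a 97.1% hit rate, and a slowdown of about 9.7x, consistent with the "~10x slower" figure quoted in the issue summary.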
Cross-model test on b9297 (all clean except Qwen3.6-35B-A3B)
| Model | Architecture | Miss |
|-------|--------------|------|
| Qwen3.6-35B-A3B | MoE + hybrid | 4,108 |
| Qwen3.6-27B | Dense | 16 |
| Qwen3.5-4B | Dense | 16 |
| Gemma 4 E4B | Dense | 12 |
| Gemma 4 26B-A4B | MoE | 12 |
| Gemma 3 4B | Dense + SWA | 12 |
Batch Size Test
| Batch Size | Miss | Miss minus new tokens |
|------------|------|-----------------------|
| 4096 | 4,108 | 4,096 |
| 2048 | 2,060 | 2,048 |
| 16 | 28 | 16 |
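The pattern in this table is simply miss = batch size + new tokens. A minimal check, using only the three measurements above and the 12 new tokens implied by each row (miss minus batch size):

```python
# Observed misses per -b value, from the batch size test.
observed = {4096: 4108, 2048: 2060, 16: 28}

NEW_TOKENS = 12  # miss minus batch size is 12 in every row


def predicted_miss(batch_size, new_tokens=NEW_TOKENS):
    """Miss under the regression: one dropped batch plus the new tokens."""
    return batch_size + new_tokens


for batch_size, miss in observed.items():
    print(batch_size, miss, predicted_miss(batch_size) == miss)
```

All three rows match the prediction. Without the regression the miss would be just the new tokens, independent of -b.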
Identical Request Repeated (b9297, 18k context)
Sent the exact same 18k-token request twice in a row to test cache lookup without conversation changes.
This shows that an identical repeated request reuses the cache normally.
| Request | Input Tokens | Cached | Prefill |
|---------|-------------|--------|---------|
| Req 1 | 17,614 | 0 | 5,141ms |
| Req 2 (identical) | 17,614 | 17,610 (99.98%) | 88ms |