Eval bug: KV cache drops ~4k tokens per turn on Qwen3.6-35B-A3B (since build b9235)

### Name and Version

llama-server b9297-b0df4c0cf (Windows CUDA 13.1 x64)
Regression since b9235. Last good: b9222.

### Operating systems

Windows

### GGML backends

CUDA

### Hardware

RTX 3090 24GB, 32GB system RAM

### Models


| Model | File | Source | Miss |
|-------|------|--------|------|
| Qwen3.6-35B-A3B Q4_K_XL MTP | `Qwen3.6-35B-A3B-UD-Q4_K_XL-mtp.gguf` | [unsloth/Qwen3.6-35B-A3B-MTP-GGUF](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF) | 4,108 |
| Qwen3.6-35B-A3B Q5_K_M | `Qwen3.6-35B-A3B-UD-Q5_K_M.gguf` | [unsloth/Qwen3.6-35B-A3B-GGUF](https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF) | 4,108 |
| Qwen3.6-27B Q4_K_XL | `Qwen3.6-27B-UD-Q4_K_XL.gguf` | [unsloth/Qwen3.6-27B-GGUF](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF) | 16 |
| Qwen3.5-4B Q4_K_XL | `Qwen3.5-4B-UD-Q4_K_XL.gguf` | [unsloth/Qwen3.5-4B-GGUF](https://huggingface.co/unsloth/Qwen3.5-4B-GGUF) | 16 |
| Gemma 4 E4B Q8_K_XL | `gemma-4-E4B-it-UD-Q8_K_XL.gguf` | [unsloth/gemma-4-E4B-it-GGUF](https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF) | 12 |
| Gemma 4 26B-A4B IQ2_XXS | `gemma-4-26B-A4B-it-UD-IQ2_XXS.gguf` | [unsloth/gemma-4-26B-A4B-it-GGUF](https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF) | 12 |
| Gemma 3 4B Q2_K | `google.gemma-3-4b-it.Q2_K.gguf` | [DevQuasar/google.gemma-3-4b-it-GGUF](https://huggingface.co/DevQuasar/google.gemma-3-4b-it-GGUF) | 12 |


### Problem description & steps to reproduce

While benchmarking local models, I hit this issue after updating llama.cpp to the latest build. I traced it to commit `ccee426`: building b9297 with `ccee426` reverted fixes the cache drop.

**Issue:**

Commit `ccee426` (PR #23280) fixed a crash on hybrid attention models but introduced a cache reuse regression: exactly one batch's worth of cached tokens (the `-b` value) is dropped and recomputed on every multi-turn request. At 150k context with `-b 4096`, follow-up prefill goes from 317ms → 3,074ms (~10x slower). The miss scales with batch size: `-b 4096` → miss 4,108, `-b 2048` → miss 2,060. For agentic or multi-turn use, this adds roughly 1-3 seconds to every turn, and the penalty grows with context length.
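
The scaling can be checked with simple arithmetic. The sketch below assumes 12 new tokens per follow-up, which is the value implied by the reported misses (4,108 minus 4,096); `expected_miss` is a hypothetical helper name and not part of llama.cpp.

```python
# Arithmetic check of the figures reported above (not llama.cpp code).
NEW_TOKENS = 12  # assumption: implied by miss 4,108 with -b 4096


def expected_miss(batch_size: int, new_tokens: int = NEW_TOKENS) -> int:
    """Miss on a follow-up request when one batch of cache is dropped."""
    return batch_size + new_tokens


for batch in (4096, 2048, 16):
    print(f"-b {batch}: miss {expected_miss(batch)}")

# Prefill slowdown at 150k context with -b 4096 (times in ms, from the report).
slowdown = 3074 / 317
print(f"slowdown: {slowdown:.1f}x")
```

If the miss really is batch size plus new tokens, a smaller `-b` should shrink the penalty, at the likely cost of slower initial prefill.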

**Reproduction:**

<details>
<summary>Small reproducible test</summary>

```bash
# Start the server (leave it running; issue the requests below from a second terminal)
llama-server -m Qwen3.6-35B-A3B.gguf -ngl 99 -c 4096 -b 16 -ub 16 \
  --ctx-checkpoints 4 --host 0.0.0.0 --port 8080

# Generate ~500 tokens of filler
CONTENT=$(python3 -c "print('The quick brown fox jumps over the lazy dog. ' * 50)")
ESCAPED=$(python3 -c "import json,sys; print(json.dumps(sys.argv[1])[1:-1])" "$CONTENT")

# Request 1: fresh prompt
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$(printf '{"model":"local","messages":[{"role":"user","content":"%s"}],"temperature":0,"max_tokens":1,"stream":false}' "$ESCAPED")" \
  | jq '{input: .usage.prompt_tokens, cached: (.usage.prompt_tokens_details.cached_tokens // 0)}'

# Request 2: same prompt + follow-up (tests cache reuse)
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$(printf '{"model":"local","messages":[{"role":"user","content":"%s"},{"role":"assistant","content":"OK"},{"role":"user","content":"Hello"}],"temperature":0,"max_tokens":1,"stream":false}' "$ESCAPED")" \
  | jq '{input: .usage.prompt_tokens, cached: (.usage.prompt_tokens_details.cached_tokens // 0), miss: (.usage.prompt_tokens - (.usage.prompt_tokens_details.cached_tokens // 0))}'
```

Expected: miss = ~12 (only new tokens). Actual: miss = ~28 (includes one batch of dropped cache). The miss equals batch size + new tokens.
</details>



<details>
<summary>Large Context Test</summary>

```bash
CONTENT=$(head -c 58860 wiki.txt)
ESCAPED=$(python3 -c "import json,sys; print(json.dumps(sys.argv[1])[1:-1])" "$CONTENT")

# Request 1
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$(printf '{"model":"local","messages":[{"role":"user","content":"%s"}],"temperature":0,"max_tokens":1,"stream":false}' "$ESCAPED")" \
  | jq '{input: .usage.prompt_tokens, cached: (.usage.prompt_tokens_details.cached_tokens // 0)}'

# Request 2 (same prompt + follow-up)
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$(printf '{"model":"local","messages":[{"role":"user","content":"%s"},{"role":"assistant","content":"OK"},{"role":"user","content":"Hello"}],"temperature":0,"max_tokens":1,"stream":false}' "$ESCAPED")" \
  | jq '{input: .usage.prompt_tokens, cached: (.usage.prompt_tokens_details.cached_tokens // 0), miss: (.usage.prompt_tokens - (.usage.prompt_tokens_details.cached_tokens // 0))}'
```
Expected: miss = ~16 (only new tokens). Actual: miss = ~4,100.

</details>
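
For anyone who prefers to avoid hand-escaping JSON in the shell, the request bodies and the `jq` filter from the tests above could be mirrored in Python roughly as follows. This is only a sketch: `build_payload` and `cache_stats` are hypothetical helper names, and the sample `usage` object contains made-up values rather than real server output.

```python
import json


def build_payload(content: str, follow_up: bool) -> str:
    """Build the chat request body used in the tests above."""
    messages = [{"role": "user", "content": content}]
    if follow_up:
        # Same prompt plus a short follow-up turn, to exercise cache reuse.
        messages += [
            {"role": "assistant", "content": "OK"},
            {"role": "user", "content": "Hello"},
        ]
    return json.dumps({
        "model": "local",
        "messages": messages,
        "temperature": 0,
        "max_tokens": 1,
        "stream": False,
    })


def cache_stats(usage: dict) -> dict:
    """Mirror the jq filter: input, cached, and miss token counts."""
    prompt = usage["prompt_tokens"]
    cached = (usage.get("prompt_tokens_details") or {}).get("cached_tokens") or 0
    return {"input": prompt, "cached": cached, "miss": prompt - cached}


# Illustrative values only (not server output).
sample_usage = {"prompt_tokens": 520, "prompt_tokens_details": {"cached_tokens": 492}}
print(cache_stats(sample_usage))
```

The payload string can be sent with any HTTP client to `/v1/chat/completions`, and the `usage` field of the response passed to `cache_stats`.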

### Root cause

`ccee426` changes two lines in `tools/server/server-context.cpp`:

```diff
- const auto pos_min_thold = std::max(0, pos_next - n_swa);
+ const auto pos_min_thold = std::max(0, pos_next - n_swa - 1);

- if (n_past > 0 && n_past < slot.prompt.n_tokens()) {
+ if (n_past > 0 && n_past <= slot.prompt.n_tokens()) {
```

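The `<=` change makes the checkpoint search run even when `n_past == slot.prompt.n_tokens()` (full prefix match). On Qwen3.6-35B-A3B, `pos_min` is high, so the `pos_min >= pos_min_thold` condition triggers. The most recent checkpoint then fails the strict `<` comparison against the lowered threshold (see the log below), so the server falls back to an older checkpoint roughly one batch earlier, or to a full reset (`n_past = 0`) when no usable checkpoint exists. Either way, at least one batch of cached tokens is lost and recomputed.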

Other models (including Qwen3.6-27B, Qwen3.5-4B, Gemma dense/MoE/SWA) keep `pos_min` low enough that the threshold check passes and existing KV cache is reused normally. I was not able to reproduce it with the other models I tried.

Verbose log confirms this path (server: `llama-server -m Qwen3.6-35B-A3B-UD-Q4_K_XL-mtp.gguf -ngl 99 -c 20000 -b 4096 -ub 4096 --no-mmap --cache-type-k q8_0 --cache-type-v q8_0 --n-cpu-moe 9 -fa on -t 8 --ctx-checkpoints 4 --log-file log.txt`):
```
n_past = 17610, pos_min = 17613, n_swa = 0
Checking checkpoint with [17609, 17609] against 17609...  ← fails (17609 < 17609 is false)
Checking checkpoint with [13517, 13517] against 17609...  ← used (13517 < 17609)
restored context checkpoint (n_tokens = 13518, n_past = 13518)
```
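
The comparison in the log can be replayed with a toy model. This is a simplified sketch based only on the diff and the log annotations above, not the actual `server-context.cpp` logic; it assumes `pos_next` equals `n_past` (17610) and that restoring a checkpoint resumes at `pos_max + 1`. The function names are hypothetical.

```python
# Toy model of the checkpoint selection seen in the log (not the llama.cpp source).
def pos_min_threshold(pos_next: int, n_swa: int, patched: bool) -> int:
    """Threshold before (patched=False) and after (patched=True) ccee426."""
    return max(0, pos_next - n_swa - (1 if patched else 0))


def pick_checkpoint(checkpoints, thold):
    """Return the first (newest-first) checkpoint with pos_min strictly below thold."""
    for pos_min, pos_max in checkpoints:
        if pos_min < thold:
            return (pos_min, pos_max)
    return None  # no usable checkpoint: everything would be reprocessed


# Values from the log; pos_next = n_past = 17610 is an assumption.
checkpoints = [(17609, 17609), (13517, 13517)]  # newest first
n_past, n_swa = 17610, 0

new_thold = pos_min_threshold(n_past, n_swa, patched=True)
old_thold = pos_min_threshold(n_past, n_swa, patched=False)

chosen = pick_checkpoint(checkpoints, new_thold)
restored_n_past = chosen[1] + 1  # assumption: resume right after the checkpoint
print("threshold:", new_thold, "chosen:", chosen)
print("tokens recomputed:", n_past - restored_n_past)
print("newest checkpoint passes old threshold:", checkpoints[0][0] < old_thold)
```

Under these assumptions the newest checkpoint misses the patched threshold by exactly one position, and the fallback costs 4,092 tokens, close to the `-b 4096` batch size. Whether the pre-`ccee426` code would have entered the search at all also depends on the second changed condition, which this sketch does not model.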

### First Bad Commit

| Build | Miss | Status |
|-------|------|--------|
| b9222 | 16 | Last good |
| b9235 | 4,108 | First bad (only `ccee426` touches server-context in this range) |

Verified: reverting the two lines on latest master (b22ff4b) restores correct behavior (miss = 16). This revert was not tested against the original crash from #23280, so it may reintroduce that issue.

### Relevant log output

<details>
<summary>Qwen3.6-35B-A3B 150k from build b9186-b9297</summary>

| Build | Checkpoints | Req2 Cached | Miss | Req2 Prefill |
|-------|-------------|-------------|------|--------------|
| b9186 | 1 | 147,986 / 148,124 (99.9%) | 138 | 317ms |
| b9186 | 4 | 147,986 / 148,124 (99.9%) | 138 | 318ms |
| b9297 | 0 | 0 / 148,124 (0%) | 148,124 | 67,937ms |
| b9297 | 4 | 143,894 / 148,124 (97.1%) | 4,230 | 3,074ms |
| b9297 | 8 | 143,894 / 148,124 (97.1%) | 4,230 | 3,095ms |
| b9297 | 16 | 143,894 / 148,124 (97.1%) | 4,230 | 3,095ms |
</details>

<details>
<summary>Cross-model test on b9297 (all clean except Qwen3.6-35B-A3B)</summary>

| Model | Architecture | Miss |
|-------|-------------|------|
| Qwen3.6-35B-A3B | MoE + hybrid | 4,108 |
| Qwen3.6-27B | Dense | 16 |
| Qwen3.5-4B | Dense | 16 |
| Gemma 4 E4B | Dense | 12 |
| Gemma 4 26B-A4B | MoE | 12 |
| Gemma 3 4B | Dense + SWA | 12 |

</details>

<details>
<summary>Batch Size Test</summary>

| Batch Size | Miss | Minus new tokens |
|------------|------|------------------|
| 4096 | 4,108 | 4,096 |
| 2048 | 2,060 | 2,048 |
| 16 | 28 | 16 |

</details>

<details>
<summary>Identical Request Repeated (b9297, 18k context)</summary>

Sent the exact same 18k-token request twice in a row to test cache lookup without conversation changes. A single repeated request is cached normally.

| Request | Input Tokens | Cached | Prefill |
|---------|-------------|--------|---------|
| Req 1 | 17,614 | 0 | 5,141ms |
| Req 2 (identical) | 17,614 | 17,610 (99.98%) | 88ms |
</details>

