Eval bug: llama-server Qwen3.5 vision behave weirdly on certain batchsize settings.

### Name and Version

```
$ /home/tkskkurumi/llama.cpp/build_832aa9476/bin/llama-server --version
ggml_cuda_init: found 7 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090 D, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
  Device 2: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
  Device 3: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
  Device 4: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
  Device 5: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
  Device 6: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
version: 8155 (832aa9476)
built with GNU 13.3.0 for Linux x86_64
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Jul_16_07:30:01_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.48
Build cuda_13.0.r13.0/compiler.36260728_0
```

### Operating systems

Linux

### GGML backends

CUDA

### Hardware

RTX 4090d 48GB / RTX 3080 20GB

### Models

[Qwen3.5 122B UD-Q3_K_XL by unsloth](https://huggingface.co/unsloth/Qwen3.5-122B-A10B-GGUF/tree/main/UD-Q3_K_XL)
[Qwen3.5 35B UD-Q3_K_XL by unsloth](https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF/blob/main/Qwen3.5-35B-A3B-UD-Q3_K_XL.gguf)

### Problem description & steps to reproduce

The bug is firstly spotted on Qwen3.5 122B and I reproduced on Qwen3.5 35B as well.
Using 3gpus (4090d 48GB + 3080 20GB + 3080 20GB, -ts 22,13,12) to run [Qwen3.5 122B unsloth UD-Q3_K_XL](https://huggingface.co/unsloth/Qwen3.5-122B-A10B-GGUF/tree/main/UD-Q3_K_XL). On certain batchsize settings (-b 6144 -ub 2048), Qwen3.5 vision response is totally irrelevant to image input/claim that image is blank/claim that no image is provided.

<img width="934" height="816" alt="Image" src="https://github.com/user-attachments/assets/8798a6f8-6ee0-401e-9928-825601b14a9a" />
irrelevent to given image.
<img width="851" height="696" alt="Image" src="https://github.com/user-attachments/assets/4dbb05e5-bf21-4b88-aca9-95ef94a79db6" />
irrelevent to given image.
<img width="847" height="677" alt="Image" src="https://github.com/user-attachments/assets/5fa14023-9e6c-4e1f-b33e-f224866c880c" />
claim that image is blank

But on (-b 8192 -ub 2048) and (-b 3072 -ub 1024) it works well.

Then I also tested [Qwen3.5 35B unsloth UD-Q3_K_XL](https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF/blob/main/Qwen3.5-35B-A3B-UD-Q3_K_XL.gguf) on 3x 3080 20GB. 

|model|b|ub|result|
|-|-|-|-|
|Qwen3.5 122B 3bit|6144|2048|❌response is irrelevant to image|
|Qwen3.5 122B 3bit|8192|2048|✅work well |
|Qwen3.5 122B 3bit|3072|1024|✅work well |
|-|-|-|-|
|Qwen3.5 35B 3bit|8192|2048|✅work well |
|Qwen3.5 35B 3bit|6144|2048|❌response is irrelevant to image|
|Qwen3.5 35B 3bit|3072|1024|❌response is irrelevant to image|
|-|-|-|-|
|Qwen3.5 35B 3bit --chat-template-kwargs '{"enable_thinking": false}'|6144|2048|❌claim that provided image is blank❌|
|Qwen3.5 122B 3bit --chat-template-kwargs '{"enable_thinking": false}'|6144|2048|❌response is irrelavent to image|

mmproj is all using mmproj-F16.gguf.


### First Bad Commit

_No response_

### Relevant log output

<details>
<summary>bad run logs</summary>


```console

srv  params_from_: Chat format: peg-constructed
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = 4121895662
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 790, total state size = 65.286 MiB
srv          load:  - looking for better prompt, base f_keep = 0.005, sim = 0.007
srv        update:  - cache state: 3 prompts, 193.000 MiB (limits: 1024.000 MiB, 40192 tokens, 40192 est)
srv        update:    - prompt 0x574b8fea9220:     639 tokens, checkpoints:  0,    63.852 MiB
srv        update:    - prompt 0x574ba91fa270:     101 tokens, checkpoints:  0,    63.863 MiB
srv        update:    - prompt 0x71369c00f7c0:     790 tokens, checkpoints:  0,    65.286 MiB
srv  get_availabl: prompt cache update took 162.39 ms
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist 
slot launch_slot_: id  0 | task 401 | processing task, is_child = 0
slot update_slots: id  0 | task 401 | new prompt, n_ctx_slot = 40192, n_keep = 0, task.n_tokens = 560
slot update_slots: id  0 | task 401 | n_past = 4, slot.prompt.tokens.size() = 790, seq_id = 0, pos_min = 283, n_swa = 1
slot update_slots: id  0 | task 401 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  0 | task 401 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 401 | prompt processing progress, n_tokens = 4, batch.n_tokens = 4, progress = 0.007143
srv  log_server_r: done request: POST /v1/chat/completions 192.168.31.175 200
slot update_slots: id  0 | task 401 | n_tokens = 4, memory_seq_rm [4, end)
srv  process_chun: processing image...
encoding image slice...
image slice encoded in 63 ms
srv  process_chun: image processed in 63 ms
slot init_sampler: id  0 | task 401 | init sampler, took 0.01 ms, tokens: text = 14, total = 560
slot update_slots: id  0 | task 401 | prompt processing done, n_tokens = 560, batch.n_tokens = 10
find_slot: non-consecutive token position 55 after 3 for sequence 0 with 10 new tokens
find_slot: non-consecutive token position 55 after 3 for sequence 0 with 10 new tokens
slot print_timing: id  0 | task 401 | 
prompt eval time =     152.89 ms /   560 tokens (    0.27 ms per token,  3662.67 tokens per second)
       eval time =    1428.20 ms /   132 tokens (   10.82 ms per token,    92.42 tokens per second)
      total time =    1581.10 ms /   692 tokens
slot      release: id  0 | task 401 | stop processing: n_tokens = 691, truncated = 0
srv  update_slots: all slots are idle

```
</details>


<details>
<summary>good run Logs</summary>
srv  params_from_: Chat format: peg-constructed
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist 
slot launch_slot_: id  0 | task 0 | processing task, is_child = 0
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 40192, n_keep = 0, task.n_tokens = 573
slot update_slots: id  0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 4, batch.n_tokens = 4, progress = 0.006981
srv  log_server_r: done request: POST /v1/chat/completions 192.168.31.175 200
slot update_slots: id  0 | task 0 | n_tokens = 4, memory_seq_rm [4, end)
srv  process_chun: processing image...
encoding image slice...
image slice encoded in 198 ms
decoding image batch 1/1, n_tokens_batch = 559
find_slot: non-consecutive token position 4 after 3 for sequence 0 with 559 new tokens
find_slot: non-consecutive token position 4 after 3 for sequence 0 with 559 new tokens
image decoded (batch 1/1) in 401 ms
srv  process_chun: image processed in 599 ms
slot init_sampler: id  0 | task 0 | init sampler, took 0.01 ms, tokens: text = 14, total = 573
slot update_slots: id  0 | task 0 | prompt processing done, n_tokens = 573, batch.n_tokens = 10
find_slot: non-consecutive token position 56 after 4 for sequence 0 with 10 new tokens
find_slot: non-consecutive token position 56 after 4 for sequence 0 with 10 new tokens
slot print_timing: id  0 | task 0 | 
prompt eval time =     790.00 ms /   573 tokens (    1.38 ms per token,   725.32 tokens per second)
       eval time =    4201.49 ms /   387 tokens (   10.86 ms per token,    92.11 tokens per second)
      total time =    4991.49 ms /   960 tokens
slot      release: id  0 | task 0 | stop processing: n_tokens = 959, truncated = 0
srv  update_slots: all slots are idle
```
</details>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Eval bug: llama-server Qwen3.5 vision behave weirdly on certain batchsize settings. #19929

Name and Version

Operating systems

GGML backends

Hardware

Models

Problem description & steps to reproduce

First Bad Commit

Relevant log output

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

model	b	ub	result
Qwen3.5 122B 3bit	6144	2048	❌response is irrelevant to image
Qwen3.5 122B 3bit	8192	2048	✅work well
Qwen3.5 122B 3bit	3072	1024	✅work well
-	-	-	-
Qwen3.5 35B 3bit	8192	2048	✅work well
Qwen3.5 35B 3bit	6144	2048	❌response is irrelevant to image
Qwen3.5 35B 3bit	3072	1024	❌response is irrelevant to image
-	-	-	-
Qwen3.5 35B 3bit --chat-template-kwargs '{"enable_thinking": false}'	6144	2048	❌claim that provided image is blank❌
Qwen3.5 122B 3bit --chat-template-kwargs '{"enable_thinking": false}'	6144	2048	❌response is irrelavent to image

Eval bug: llama-server Qwen3.5 vision behave weirdly on certain batchsize settings. #19929

Description

Name and Version

Operating systems

GGML backends

Hardware

Models

Problem description & steps to reproduce

First Bad Commit

Relevant log output

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions