Skip to content

Eval bug: llama-server Qwen3.5 vision behave weirdly on certain batchsize settings. #19929

@TkskKurumi

Description

@TkskKurumi

Name and Version

$ /home/tkskkurumi/llama.cpp/build_832aa9476/bin/llama-server --version
ggml_cuda_init: found 7 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090 D, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
  Device 2: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
  Device 3: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
  Device 4: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
  Device 5: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
  Device 6: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
version: 8155 (832aa9476)
built with GNU 13.3.0 for Linux x86_64
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Jul_16_07:30:01_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.48
Build cuda_13.0.r13.0/compiler.36260728_0

Operating systems

Linux

GGML backends

CUDA

Hardware

RTX 4090d 48GB / RTX 3080 20GB

Models

Qwen3.5 122B UD-Q3_K_XL by unsloth
Qwen3.5 35B UD-Q3_K_XL by unsloth

Problem description & steps to reproduce

The bug is firstly spotted on Qwen3.5 122B and I reproduced on Qwen3.5 35B as well.
Using 3gpus (4090d 48GB + 3080 20GB + 3080 20GB, -ts 22,13,12) to run Qwen3.5 122B unsloth UD-Q3_K_XL. On certain batchsize settings (-b 6144 -ub 2048), Qwen3.5 vision response is totally irrelevant to image input/claim that image is blank/claim that no image is provided.

Image irrelevent to given image. Image irrelevent to given image. Image claim that image is blank

But on (-b 8192 -ub 2048) and (-b 3072 -ub 1024) it works well.

Then I also tested Qwen3.5 35B unsloth UD-Q3_K_XL on 3x 3080 20GB.

model b ub result
Qwen3.5 122B 3bit 6144 2048 ❌response is irrelevant to image
Qwen3.5 122B 3bit 8192 2048 ✅work well
Qwen3.5 122B 3bit 3072 1024 ✅work well
- - - -
Qwen3.5 35B 3bit 8192 2048 ✅work well
Qwen3.5 35B 3bit 6144 2048 ❌response is irrelevant to image
Qwen3.5 35B 3bit 3072 1024 ❌response is irrelevant to image
- - - -
Qwen3.5 35B 3bit --chat-template-kwargs '{"enable_thinking": false}' 6144 2048 ❌claim that provided image is blank❌
Qwen3.5 122B 3bit --chat-template-kwargs '{"enable_thinking": false}' 6144 2048 ❌response is irrelavent to image

mmproj is all using mmproj-F16.gguf.

First Bad Commit

No response

Relevant log output

bad run logs
srv  params_from_: Chat format: peg-constructed
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = 4121895662
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 790, total state size = 65.286 MiB
srv          load:  - looking for better prompt, base f_keep = 0.005, sim = 0.007
srv        update:  - cache state: 3 prompts, 193.000 MiB (limits: 1024.000 MiB, 40192 tokens, 40192 est)
srv        update:    - prompt 0x574b8fea9220:     639 tokens, checkpoints:  0,    63.852 MiB
srv        update:    - prompt 0x574ba91fa270:     101 tokens, checkpoints:  0,    63.863 MiB
srv        update:    - prompt 0x71369c00f7c0:     790 tokens, checkpoints:  0,    65.286 MiB
srv  get_availabl: prompt cache update took 162.39 ms
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist 
slot launch_slot_: id  0 | task 401 | processing task, is_child = 0
slot update_slots: id  0 | task 401 | new prompt, n_ctx_slot = 40192, n_keep = 0, task.n_tokens = 560
slot update_slots: id  0 | task 401 | n_past = 4, slot.prompt.tokens.size() = 790, seq_id = 0, pos_min = 283, n_swa = 1
slot update_slots: id  0 | task 401 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  0 | task 401 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 401 | prompt processing progress, n_tokens = 4, batch.n_tokens = 4, progress = 0.007143
srv  log_server_r: done request: POST /v1/chat/completions 192.168.31.175 200
slot update_slots: id  0 | task 401 | n_tokens = 4, memory_seq_rm [4, end)
srv  process_chun: processing image...
encoding image slice...
image slice encoded in 63 ms
srv  process_chun: image processed in 63 ms
slot init_sampler: id  0 | task 401 | init sampler, took 0.01 ms, tokens: text = 14, total = 560
slot update_slots: id  0 | task 401 | prompt processing done, n_tokens = 560, batch.n_tokens = 10
find_slot: non-consecutive token position 55 after 3 for sequence 0 with 10 new tokens
find_slot: non-consecutive token position 55 after 3 for sequence 0 with 10 new tokens
slot print_timing: id  0 | task 401 | 
prompt eval time =     152.89 ms /   560 tokens (    0.27 ms per token,  3662.67 tokens per second)
       eval time =    1428.20 ms /   132 tokens (   10.82 ms per token,    92.42 tokens per second)
      total time =    1581.10 ms /   692 tokens
slot      release: id  0 | task 401 | stop processing: n_tokens = 691, truncated = 0
srv  update_slots: all slots are idle
good run Logs srv params_from_: Chat format: peg-constructed slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1 slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist slot launch_slot_: id 0 | task 0 | processing task, is_child = 0 slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 40192, n_keep = 0, task.n_tokens = 573 slot update_slots: id 0 | task 0 | n_tokens = 0, memory_seq_rm [0, end) slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 4, batch.n_tokens = 4, progress = 0.006981 srv log_server_r: done request: POST /v1/chat/completions 192.168.31.175 200 slot update_slots: id 0 | task 0 | n_tokens = 4, memory_seq_rm [4, end) srv process_chun: processing image... encoding image slice... image slice encoded in 198 ms decoding image batch 1/1, n_tokens_batch = 559 find_slot: non-consecutive token position 4 after 3 for sequence 0 with 559 new tokens find_slot: non-consecutive token position 4 after 3 for sequence 0 with 559 new tokens image decoded (batch 1/1) in 401 ms srv process_chun: image processed in 599 ms slot init_sampler: id 0 | task 0 | init sampler, took 0.01 ms, tokens: text = 14, total = 573 slot update_slots: id 0 | task 0 | prompt processing done, n_tokens = 573, batch.n_tokens = 10 find_slot: non-consecutive token position 56 after 4 for sequence 0 with 10 new tokens find_slot: non-consecutive token position 56 after 4 for sequence 0 with 10 new tokens slot print_timing: id 0 | task 0 | prompt eval time = 790.00 ms / 573 tokens ( 1.38 ms per token, 725.32 tokens per second) eval time = 4201.49 ms / 387 tokens ( 10.86 ms per token, 92.11 tokens per second) total time = 4991.49 ms / 960 tokens slot release: id 0 | task 0 | stop processing: n_tokens = 959, truncated = 0 srv update_slots: all slots are idle ```

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions