srv params_from_: Chat format: peg-constructed
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = 4121895662
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 790, total state size = 65.286 MiB
srv load: - looking for better prompt, base f_keep = 0.005, sim = 0.007
srv update: - cache state: 3 prompts, 193.000 MiB (limits: 1024.000 MiB, 40192 tokens, 40192 est)
srv update: - prompt 0x574b8fea9220: 639 tokens, checkpoints: 0, 63.852 MiB
srv update: - prompt 0x574ba91fa270: 101 tokens, checkpoints: 0, 63.863 MiB
srv update: - prompt 0x71369c00f7c0: 790 tokens, checkpoints: 0, 65.286 MiB
srv get_availabl: prompt cache update took 162.39 ms
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id 0 | task 401 | processing task, is_child = 0
slot update_slots: id 0 | task 401 | new prompt, n_ctx_slot = 40192, n_keep = 0, task.n_tokens = 560
slot update_slots: id 0 | task 401 | n_past = 4, slot.prompt.tokens.size() = 790, seq_id = 0, pos_min = 283, n_swa = 1
slot update_slots: id 0 | task 401 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id 0 | task 401 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 0 | task 401 | prompt processing progress, n_tokens = 4, batch.n_tokens = 4, progress = 0.007143
srv log_server_r: done request: POST /v1/chat/completions 192.168.31.175 200
slot update_slots: id 0 | task 401 | n_tokens = 4, memory_seq_rm [4, end)
srv process_chun: processing image...
encoding image slice...
image slice encoded in 63 ms
srv process_chun: image processed in 63 ms
slot init_sampler: id 0 | task 401 | init sampler, took 0.01 ms, tokens: text = 14, total = 560
slot update_slots: id 0 | task 401 | prompt processing done, n_tokens = 560, batch.n_tokens = 10
find_slot: non-consecutive token position 55 after 3 for sequence 0 with 10 new tokens
find_slot: non-consecutive token position 55 after 3 for sequence 0 with 10 new tokens
slot print_timing: id 0 | task 401 |
prompt eval time = 152.89 ms / 560 tokens ( 0.27 ms per token, 3662.67 tokens per second)
eval time = 1428.20 ms / 132 tokens ( 10.82 ms per token, 92.42 tokens per second)
total time = 1581.10 ms / 692 tokens
slot release: id 0 | task 401 | stop processing: n_tokens = 691, truncated = 0
srv update_slots: all slots are idle
Name and Version
Operating systems
Linux
GGML backends
CUDA
Hardware
RTX 4090d 48GB / RTX 3080 20GB
Models
Qwen3.5 122B UD-Q3_K_XL by unsloth
Qwen3.5 35B UD-Q3_K_XL by unsloth
Problem description & steps to reproduce
The bug was first spotted on Qwen3.5 122B, and I reproduced it on Qwen3.5 35B as well.
I used 3 GPUs (4090d 48GB + 3080 20GB + 3080 20GB, -ts 22,13,12) to run Qwen3.5 122B unsloth UD-Q3_K_XL. With certain batch size settings (-b 6144 -ub 2048), the Qwen3.5 vision response is completely unrelated to the image input, claims that the image is blank, or claims that no image was provided.
With (-b 8192 -ub 2048) and (-b 3072 -ub 1024), however, it works well.
I then also tested Qwen3.5 35B unsloth UD-Q3_K_XL on 3x 3080 20GB.
All runs use mmproj-F16.gguf as the mmproj.
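For anyone trying to reproduce this, here is a rough sketch that prints one `llama-server` command line per batch setting mentioned above. The model file name is a placeholder (an assumption, not the actual path used), and the real invocation had other options (context size, GPU offload, and so on) that are not listed in this report, so they are left out here.

```python
import shlex

# Hypothetical file names: substitute the real GGUF paths.
MODEL = "Qwen3.5-122B-UD-Q3_K_XL.gguf"
MMPROJ = "mmproj-F16.gguf"

# (batch size, micro-batch size, outcome as reported in this issue)
CONFIGS = [
    (6144, 2048, "bad"),
    (8192, 2048, "good"),
    (3072, 1024, "good"),
]


def server_command(batch, ubatch, tensor_split="22,13,12"):
    """Build a llama-server command line for one batch configuration."""
    args = [
        "llama-server",
        "-m", MODEL,
        "--mmproj", MMPROJ,
        "-ts", tensor_split,
        "-b", str(batch),
        "-ub", str(ubatch),
    ]
    return shlex.join(args)


for batch, ubatch, outcome in CONFIGS:
    print(f"{outcome:>4}: {server_command(batch, ubatch)}")
```

Sending the same image request to each configuration in turn should show whether the failure follows the `-b`/`-ub` pair.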
First Bad Commit
No response
Relevant log output
bad run logs
good run logs
```
srv params_from_: Chat format: peg-constructed
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id 0 | task 0 | processing task, is_child = 0
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 40192, n_keep = 0, task.n_tokens = 573
slot update_slots: id 0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 4, batch.n_tokens = 4, progress = 0.006981
srv log_server_r: done request: POST /v1/chat/completions 192.168.31.175 200
slot update_slots: id 0 | task 0 | n_tokens = 4, memory_seq_rm [4, end)
srv process_chun: processing image...
encoding image slice...
image slice encoded in 198 ms
decoding image batch 1/1, n_tokens_batch = 559
find_slot: non-consecutive token position 4 after 3 for sequence 0 with 559 new tokens
find_slot: non-consecutive token position 4 after 3 for sequence 0 with 559 new tokens
image decoded (batch 1/1) in 401 ms
srv process_chun: image processed in 599 ms
slot init_sampler: id 0 | task 0 | init sampler, took 0.01 ms, tokens: text = 14, total = 573
slot update_slots: id 0 | task 0 | prompt processing done, n_tokens = 573, batch.n_tokens = 10
find_slot: non-consecutive token position 56 after 4 for sequence 0 with 10 new tokens
find_slot: non-consecutive token position 56 after 4 for sequence 0 with 10 new tokens
slot print_timing: id 0 | task 0 |
prompt eval time = 790.00 ms / 573 tokens ( 1.38 ms per token, 725.32 tokens per second)
eval time = 4201.49 ms / 387 tokens ( 10.86 ms per token, 92.11 tokens per second)
total time = 4991.49 ms / 960 tokens
slot release: id 0 | task 0 | stop processing: n_tokens = 959, truncated = 0
srv update_slots: all slots are idle
```
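One difference between the two logs in this report: the good run prints `decoding image batch 1/1, n_tokens_batch = 559` and `image decoded (batch 1/1) in 401 ms`, while the bad run shows neither line and reports `image processed in 63 ms`. The sketch below pulls those fields out of a pasted log so runs can be compared quickly. It is only a log-reading helper written for this report, not part of llama.cpp, and it does not establish whether the missing lines are the cause of the bad output.

```python
import re


def image_decode_summary(log_text):
    """Summarize the image-handling steps found in a llama-server log."""
    batches = re.findall(
        r"decoding image batch \d+/\d+, n_tokens_batch = (\d+)", log_text
    )
    processed = re.search(r"image processed in (\d+) ms", log_text)
    return {
        # Sum of n_tokens_batch over all "decoding image batch" lines.
        "image_tokens_decoded": sum(int(n) for n in batches),
        "image_processed_ms": int(processed.group(1)) if processed else None,
        "forced_reprocess": "forcing full prompt re-processing" in log_text,
    }


# Excerpts copied from the bad and good run logs in this report.
bad = """srv process_chun: processing image...
encoding image slice...
image slice encoded in 63 ms
srv process_chun: image processed in 63 ms"""

good = """srv process_chun: processing image...
encoding image slice...
image slice encoded in 198 ms
decoding image batch 1/1, n_tokens_batch = 559
image decoded (batch 1/1) in 401 ms
srv process_chun: image processed in 599 ms"""

print("bad :", image_decode_summary(bad))
print("good:", image_decode_summary(good))
```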