Name and Version
drros@epyc-ws:~/llama.cpp$ ./build/bin/llama-cli --version
ggml_cuda_init: found 3 CUDA devices (Total VRAM: 71963 MiB):
Device 0: NVIDIA RTX PRO 4000 Blackwell, compute capability 12.0, VMM: yes, VRAM: 23987 MiB
Device 1: NVIDIA RTX PRO 4000 Blackwell, compute capability 12.0, VMM: yes, VRAM: 23987 MiB
Device 2: NVIDIA RTX PRO 4000 Blackwell, compute capability 12.0, VMM: yes, VRAM: 23987 MiB
version: 8940 (78433f6)
built with GNU 13.3.0 for Linux x86_64
Operating systems
Linux
GGML backends
CUDA
Hardware
AMD EPYC 9274F / 384 GB DDR5 4800 MT/s / 3x RTX PRO 4000 Blackwell
Models
Qwen 3.6 27B - Unsloth's Q8-XL quant with BF16 mmproj.
Problem description & steps to reproduce
Sometimes the model puts its answer inside the tags, or doubles the tags (I'm not sure which). In the web interface it looks like this:
Meanwhile, the logs show no parsing warnings or any other non-standard messages.
I'm not sure whether this is a problem with the model itself or a parsing / chat template bug. It does seem to be the cause of strange stops in agentic workloads: sometimes the model just stops after a while, and if asked to "continue" it runs fine again.
I tend to think this is a model issue, since it has also happened a couple of times in vLLM (although I haven't tried recent nightlies for a couple of days).
I'm also not sure how to reproduce this in a controlled environment: it has happened to me only 2-3 times in the web interface, in chats that have nothing in common. The screenshot, for example, is just my younger daughter's math practice, but it has also happened in some Python dev chats.
Model load params and logs are in the logs section.
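Since this is hard to reproduce on demand, one option is to save the raw text of suspicious responses and check them for doubled or unbalanced tags offline. Below is a minimal sketch of such a check. The tag names `<think>` and `</think>` are an assumption on my side (the report does not name the tags), and `check_tags` / `TagReport` are hypothetical helper names, not part of llama.cpp.

```python
from dataclasses import dataclass


@dataclass
class TagReport:
    opens: int       # number of opening tags found
    closes: int      # number of closing tags found
    balanced: bool   # every open has a later close, and no close comes first
    doubled: bool    # more than one opening or more than one closing tag


def check_tags(text: str,
               open_tag: str = "<think>",      # assumed tag name
               close_tag: str = "</think>") -> TagReport:
    """Scan text left to right and report on tag counts and nesting."""
    opens = closes = depth = 0
    balanced = True
    i = 0
    while i < len(text):
        if text.startswith(open_tag, i):
            opens += 1
            depth += 1
            i += len(open_tag)
        elif text.startswith(close_tag, i):
            closes += 1
            depth -= 1
            if depth < 0:
                # A closing tag appeared with no matching opening tag.
                balanced = False
                depth = 0
            i += len(close_tag)
        else:
            i += 1
    if depth != 0:
        # At least one opening tag was never closed.
        balanced = False
    return TagReport(opens, closes, balanced, opens > 1 or closes > 1)


if __name__ == "__main__":
    print(check_tags("<think>plan</think>answer"))
    print(check_tags("<think><think>plan</think>answer"))
```

Running this over logged responses would at least show whether the doubled tags are already present in the raw model output or only appear after parsing.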
First Bad Commit
No response
Relevant log output
Logs
drros@epyc-ws:~/llama.cpp$ LLAMA_SET_ROWS=1 ./build/bin/llama-server --models-preset ../my-models.ini --models-max 1 -np 6 --port 30000 --host 192.168.0.60 --webui-mcp-proxy --no-mmproj-offload -kvu --mlock --spec-type ngram-mod --spec-ngram-size-n 48 --draft-min 4 --draft-max 64
srv load: spawning server instance with name=qwen3.6-27b-ud-q8-thinking-coding-vision on port 54915
srv load: spawning server instance with args:
srv load: /home/drros/llama.cpp/build/bin/llama-server
srv load: --chat-template-kwargs
srv load: {"preserve_thinking":true}
srv load: --draft-max
srv load: 64
srv load: --draft-n-min
srv load: 4
srv load: --host
srv load: 127.0.0.1
srv load: --image-min-tokens
srv load: 2048
srv load: --min-p
srv load: 0.0
srv load: --mlock
srv load: --no-mmap
srv load: --no-mmproj-offload
srv load: --port
srv load: 54915
srv load: --presence-penalty
srv load: 0.0
srv load: --repeat-penalty
srv load: 1.0
srv load: --spec-ngram-size-n
srv load: 48
srv load: --spec-type
srv load: ngram-mod
srv load: --temperature
srv load: 0.6
srv load: --top-k
srv load: 20
srv load: --top-p
srv load: 0.95
srv load: --webui-mcp-proxy
srv load: --alias
srv load: qwen3.6-27b-ud-q8-thinking-coding-vision
srv load: --ctx-size
srv load: 262144
srv load: --cache-ram
srv load: 65536
srv load: --swa-checkpoints
srv load: 128
srv load: --kv-unified
srv load: --model
srv load: /mnt/ds1nfs/codellamaweights/qwen3.6-27b-q8-xl/Qwen3.6-27B-UD-Q8_K_XL.gguf
srv load: --mmproj
srv load: /mnt/ds1nfs/codellamaweights/qwen3.6-27b-q8-xl/mmproj-BF16.gguf
srv load: --parallel
srv load: 6
srv load: --reasoning
srv load: on
srv load: --split-mode
srv load: tensor
srv load: --ubatch-size
srv load: 2048