Eval bug: Answer in think tags. Qwen 3.6 27B #22398

@drrros

Name and Version

drros@epyc-ws:~/llama.cpp$ ./build/bin/llama-cli --version
ggml_cuda_init: found 3 CUDA devices (Total VRAM: 71963 MiB):
Device 0: NVIDIA RTX PRO 4000 Blackwell, compute capability 12.0, VMM: yes, VRAM: 23987 MiB
Device 1: NVIDIA RTX PRO 4000 Blackwell, compute capability 12.0, VMM: yes, VRAM: 23987 MiB
Device 2: NVIDIA RTX PRO 4000 Blackwell, compute capability 12.0, VMM: yes, VRAM: 23987 MiB
version: 8940 (78433f6)
built with GNU 13.3.0 for Linux x86_64

Operating systems

Linux

GGML backends

CUDA

Hardware

AMD EPYC 9274F / 384 GB DDR5 4800 MT/s / 3x RTX PRO 4000 Blackwell

Models

Qwen 3.6 27B - Unsloth's Q8-XL quant with BF16 mmproj.

Problem description & steps to reproduce

Sometimes the model emits its answer inside the think tags, or doubles the tags (I'm not sure which), but it looks like this in the web interface:

Image
  • Meanwhile, the logs show no parsing warnings or any other non-standard messages.
    Of course, I'm not sure whether this is an issue with the model itself or a parsing / template bug. It does seem to be the cause of strange stops in agentic workloads: sometimes the model just stops after a while, and if asked to "continue" it runs fine again.
    I tend to think this is a model issue, as it has also happened a couple of times in vLLM (although I haven't tried recent nightlies for a couple of days).
    I'm also not sure how to reproduce this in a controlled environment, as it has happened to me just 2-3 times in the web interface, in chats that had nothing in common. This screenshot, for example, is just my younger daughter's math practice, but it has also happened in some Python dev chats.
    Model load params and logs are in the logs section.
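
Since there is no controlled repro yet, one option is to scan raw model outputs for the symptoms described above. The following is a minimal sketch, not something from llama.cpp: the function name `check_think_tags` and the anomaly labels are hypothetical, and it assumes the raw text uses literal `<think>` / `</think>` tags. It also assumes the opening tag may be missing from the output because the chat template can supply it in the prompt.

```python
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def check_think_tags(text: str) -> list[str]:
    """Return hypothetical anomaly labels for one raw model output."""
    problems = []
    opens = text.count(THINK_OPEN)
    closes = text.count(THINK_CLOSE)

    # More than one opening or closing tag suggests doubled tags.
    if opens > 1:
        problems.append("doubled_open_tag")
    if closes > 1:
        problems.append("doubled_close_tag")

    # An opening tag without a matching close means the reasoning block
    # never ended, so any answer would be stuck inside it.
    if opens > closes:
        problems.append("unclosed_think")

    # Whatever follows the last closing tag is treated as the visible answer.
    if closes:
        answer = text.rsplit(THINK_CLOSE, 1)[1].strip()
    elif opens:
        answer = ""
    else:
        answer = text.strip()

    # Tags were present but nothing was left over as an answer.
    if (opens or closes) and not answer:
        problems.append("empty_answer")
    return problems


if __name__ == "__main__":
    samples = [
        "<think>plan</think>42",
        "<think>plan 42",
        "<think>a</think><think>b</think>",
    ]
    for sample in samples:
        print(repr(sample), check_think_tags(sample))
```

Running this over saved raw responses (before any reasoning parsing) could help tell apart a model that never closes its think block from a parser that misplaces a well-formed answer. It cannot detect output that contains no tags at all.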

First Bad Commit

No response

Relevant log output

Logs
drros@epyc-ws:~/llama.cpp$ LLAMA_SET_ROWS=1 ./build/bin/llama-server --models-preset ../my-models.ini  --models-max 1 -np 6 --port 30000 --host 192.168.0.60 --webui-mcp-proxy --no-mmproj-offload -kvu --mlock --spec-type ngram-mod --spec-ngram-size-n 48  --draft-min 4 --draft-max 64

srv          load: spawning server instance with name=qwen3.6-27b-ud-q8-thinking-coding-vision on port 54915
srv          load: spawning server instance with args:
srv          load:   /home/drros/llama.cpp/build/bin/llama-server
srv          load:   --chat-template-kwargs
srv          load:   {"preserve_thinking":true}
srv          load:   --draft-max
srv          load:   64
srv          load:   --draft-n-min
srv          load:   4
srv          load:   --host
srv          load:   127.0.0.1
srv          load:   --image-min-tokens
srv          load:   2048
srv          load:   --min-p
srv          load:   0.0
srv          load:   --mlock
srv          load:   --no-mmap
srv          load:   --no-mmproj-offload
srv          load:   --port
srv          load:   54915
srv          load:   --presence-penalty
srv          load:   0.0
srv          load:   --repeat-penalty
srv          load:   1.0
srv          load:   --spec-ngram-size-n
srv          load:   48
srv          load:   --spec-type
srv          load:   ngram-mod
srv          load:   --temperature
srv          load:   0.6
srv          load:   --top-k
srv          load:   20
srv          load:   --top-p
srv          load:   0.95
srv          load:   --webui-mcp-proxy
srv          load:   --alias
srv          load:   qwen3.6-27b-ud-q8-thinking-coding-vision
srv          load:   --ctx-size
srv          load:   262144
srv          load:   --cache-ram
srv          load:   65536
srv          load:   --swa-checkpoints
srv          load:   128
srv          load:   --kv-unified
srv          load:   --model
srv          load:   /mnt/ds1nfs/codellamaweights/qwen3.6-27b-q8-xl/Qwen3.6-27B-UD-Q8_K_XL.gguf
srv          load:   --mmproj
srv          load:   /mnt/ds1nfs/codellamaweights/qwen3.6-27b-q8-xl/mmproj-BF16.gguf
srv          load:   --parallel
srv          load:   6
srv          load:   --reasoning
srv          load:   on
srv          load:   --split-mode
srv          load:   tensor
srv          load:   --ubatch-size
srv          load:   2048
