Eval bug: Qwen3.5 9B often prints tool calls in XML and stops when thinking is enabled - tool calls inside thinking block

### Name and Version

ggml_cuda_init: found 1 ROCm devices (Total VRAM: 16368 MiB):
  Device 0: AMD Radeon RX 6900 XT, gfx1030 (0x1030), VMM: no, Wave Size: 32, VRAM: 16368 MiB
version: 8461 (cea560f)
built with GNU 15.2.0 for Linux x86_64

### Operating systems

Linux

### GGML backends

HIP

### Hardware

CPU: AMD Ryzen 9 5950X
GPU: AMD Radeon RX 6900 XT
OS: NixOS 26.05 (Yarara) x86_64

I think the GGML backend is HIP because that's what the nixpkgs package `llama-cpp-rocm` enables - https://github.com/NixOS/nixpkgs/blob/a1e8ce6b50ffa87ad0d39881c47eb214982330dc/pkgs/by-name/ll/llama-cpp/package.nix

### Models

unsloth Qwen3.5-9B-UD-Q4_K_XL
https://huggingface.co/unsloth/Qwen3.5-9B-GGUF

### Problem description & steps to reproduce

I'm running llama.cpp server with the command:
```
llama-server --port 8001 --models-preset ~/llama.cpp/models.ini --host 0.0.0.0
```
My `models.ini` contains:
```ini
version = 1

kv-unified = true
cache-type-k = q8_0
cache-type-v = q8_0

[Qwen3.5-9B-UD-Q4-Coding]
model = /home/user/llama.cpp/unsloth/qwen3.5-9b-ud-q4_k_xl/unsloth_Qwen3.5-9B-GGUF_Qwen3.5-9B-UD-Q4_K_XL.gguf
mmproj = /home/user/llama.cpp/unsloth/qwen3.5-9b-ud-q4_k_xl/unsloth_Qwen3.5-9B-GGUF_mmproj-BF16.gguf
alias = Qwen3.5-9B-UD-Q4-Coding

ctx-size = 131072
fit = on
flash-attn = true
chat-template-kwargs = {"enable_thinking":true}

temp = 0.6
top-p = 0.95
top-k = 20
min-p = 0.00
presence-penalty = 0.0
repeat-penalty = 1.0
```

Then I use the [Zed editor](https://zed.dev/) to connect to my model from another computer. Here's the relevant bits of my chat with it using Zed's feature to export your agent chat to markdown:

<details>
<summary>Chat markdown</summary>

``````md
## User

What tools do you have?

## Assistant

<thinking>
The user is asking about what tools I have available. I should provide a clear, helpful summary of the tools I can use, formatted nicely in markdown. I don't need to use any tools for this - I can just explain them from the information I have.

</thinking>

I have access to several tools that can help you work with code and projects. Here's what I can do:

## File Operations

- **`read_file`** - Read the content of files in your project. If files are large, I'll get an outline with line numbers so we can dive into specific sections.
- **`list_directory`** - List the contents of a directory to explore the project structure.
- **`find_path`** - Fast file path matching using glob patterns (e.g., `**/*.rs` to find all Rust files). Great for locating files by name.
- **`grep`** - Search file contents using regular expressions. Prefer this over path search when looking for symbols or specific code patterns.

## Project Diagnostics

- **`diagnostics`** - Get errors and warnings for the project or a specific file. Useful for checking the health of your code after making changes.

## Web & Network

- **`fetch`** - Fetch URLs and return the content as Markdown. Handy for loading documentation or external resources.
- **`open`** - Open files or URLs with your default application (browser, text editor, etc.).

## System Tools

- **`now`** - Get the current datetime in RFC 3339 format (UTC or local time).

## Advanced

- **`spawn_agent`** - Delegate tasks to agents when you need parallel work or complex tasks that benefit from a dedicated session.

---

All of these tools are designed to work with your project in the root directory `/path/to/project`. Just let me know what you're working on and I can help you navigate and manipulate the code!

## User

Read through [@main.nu](file:///path/to/project/main.nu) and give me a summary of what it does

## Assistant

<thinking>

I need to read the actual content of main.nu file to provide a summary - the file outline only shows 7 symbols which is not enough detail.

</thinking>

I can see the file outline shows this is a Nushell script with some notes about <omitted>. However, I only see a file outline with symbol names. Let me read the full content to give you a comprehensive summary.



**Tool Call: Read file `main.nu`**
Status: Completed

```main.nu
SUCCESS: File outline retrieved. This file is too large to read all at once, so the outline below shows the file's structure with line numbers.

IMPORTANT: Do NOT retry this call without line numbers - you will get the same outline.
Instead, use the line numbers below to read specific sections by calling this tool again with start_line and end_line parameters.

# File outline for /path/to/project/main.nu

<omitted>

Showing symbols 1-7 (total symbols: 7)


NEXT STEPS: To read a specific symbol's implementation, call read_file with the same path plus start_line and end_line from the outline above.
For example, to read a function shown as [L100-150], use start_line: 100 and end_line: 150.
```


## Assistant

<thinking>

The file is too large and I only see a structure outline. I need to read specific sections. Let me start from the beginning to understand the overall structure.

<tool_call>
<function=read_file>
<parameter=path>
main.nu
</parameter>
<parameter=start_line>
1
</parameter>
<parameter=end_line>
100
</parameter>
</function>
</tool_call>
</thinking>
``````

</details>

After the agent prints the tool call in XML it simply stops running as if it's finished.

Sometimes it succeeds making one tool call, but then it starts printing XML and I give up on it. I have no idea if it may or may not make another successful call if you try enough times after it starts printing XML.

**However**, if I disable thinking with `chat-template-kwargs = {"enable_thinking":false}`, then it makes all its tool calls without problems (at least as far as I've tested), e.g. it can read the whole 1000-line file in several consecutive tool calls.

I've also tried running llama-server with the flag `--no-cache-prompt` as mentioned in https://github.com/ggml-org/llama.cpp/issues/20614 but that had no effect.

I think the same or similar issue is also reported here
https://github.com/ollama/ollama/issues/14745
https://github.com/ollama/ollama/issues/14493

### First Bad Commit

I've had this problem since llama.cpp 8255. I haven't tried earlier versions.

### Relevant log output

Here's the log from `llama-server` from running the model until the XML tool call. It shows that it's served several requests because the model actually tried to read the file multiple times but it got the path wrong. When it finally got it right, that's when it printed the XML.

[log.txt](https://github.com/user-attachments/files/26157574/log.txt)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Eval bug: Qwen3.5 9B often prints tool calls in XML and stops when thinking is enabled - tool calls inside thinking block #20837

Name and Version

Operating systems

GGML backends

Hardware

Models

Problem description & steps to reproduce

First Bad Commit

Relevant log output

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Eval bug: Qwen3.5 9B often prints tool calls in XML and stops when thinking is enabled - tool calls inside thinking block #20837

Description

Name and Version

Operating systems

GGML backends

Hardware

Models

Problem description & steps to reproduce

First Bad Commit

Relevant log output

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions