Description
Name and Version
version: 8189 (4d828bd1a)
built with GNU 13.3.0 for Linux x86_64
Operating systems
Linux
GGML backends
CUDA
Hardware
NVIDIA GeForce RTX 3090
Models
Qwen3.5-35B-A3B (UD-Q4_K_M quantization via Unsloth)
https://huggingface.co/unsloth/Qwen3.5-35B-A3B-UD-Q4_K_M-GGUF
Problem description & steps to reproduce
The Anthropic Messages API (`/v1/messages`) silently drops `thinking` content blocks when converting to the internal OpenAI chat format. In `tools/server/server-common.cpp`, the function `convert_anthropic_to_oai()` handles the content block types `text`, `image`, `tool_use`, and `tool_result`, but has no handler for `thinking` blocks: they are silently ignored, and prior assistant messages are converted without the `reasoning_content` field.
This was discovered while using Claude Code with Qwen3.5 via llama-server's Anthropic-compatible endpoint. The impact depends on how each model's chat template handles thinking in conversation history (see comment below for details on Qwen3.5's specific behavior).
Steps to reproduce:

- Serve a thinking-capable model with `llama-server` and tool calling enabled
- Use a client that calls `/v1/messages` with `thinking.type: "enabled"` and sends `thinking` blocks back in conversation history (e.g., Claude Code, or any Anthropic API client following the extended thinking spec)
- Observe that `thinking` blocks are absent from the converted OpenAI messages (visible via the `converted request` debug log)
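For illustration, a request of the kind that triggers the drop could look like the following. This is a hand-written sketch based on the Anthropic extended thinking spec, not a payload captured from this report; field values and the `signature` placeholder are assumptions:

```json
{
  "model": "qwen3.5-35b-a3b",
  "max_tokens": 1024,
  "thinking": { "type": "enabled", "budget_tokens": 512 },
  "messages": [
    { "role": "user", "content": "What branch am I on?" },
    {
      "role": "assistant",
      "content": [
        { "type": "thinking", "thinking": "Let me check the branch first.", "signature": "..." },
        { "type": "text", "text": "Checking now." }
      ]
    },
    { "role": "user", "content": "And what changed recently?" }
  ]
}
```

On the second turn, the assistant message's `thinking` block is what `convert_anthropic_to_oai()` currently discards.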
Fix: two changes in `tools/server/server-common.cpp`, function `convert_anthropic_to_oai()`:

- Add a handler for `thinking` blocks that accumulates reasoning content
- Set `reasoning_content` on the converted message so the chat template can use it
First Bad Commit
This has always been the case — convert_anthropic_to_oai() has never handled thinking blocks.
Relevant log output
```
slot process_toke: id 0 | task 10718 | n_decoded = 17, n_remaining = 31983, next token: 25 ':'
Grammar still awaiting trigger after token 248046 (`<|im_end|>`)
slot process_toke: id 0 | task 10718 | stopped by EOS
slot process_toke: id 0 | task 10718 | n_decoded = 18, n_remaining = 31982, next token: 248046 ''
prompt eval time = 9701.77 ms / 23996 tokens
eval time = 157.32 ms / 18 tokens
Parsed message: {"role":"assistant","content":"","reasoning_content":"Let me also check what branch you're on and what recent work has been done:"}
data: {"type":"message_delta","delta":{"stop_reason":"end_turn","stop_sequence":null},"usage":{"output_tokens":18}}
```

Related
- Eval bug: Qwen3-Coder-Next generates premature EOS instead of tool call (/continued response) #19513 has a similar symptom (premature EOS instead of a tool call) but a different root cause and model: it affected Qwen3-Coder-Next (non-thinking) on the Responses API due to contiguous assistant-message fragmentation, and was fixed by server: merge contiguous Responses input items into a single assistant message #19773.