fix(client): use chat_template_kwargs for Vllm provider instead of An… #1480
h3c-hexin wants to merge 1 commit
…thropic-style thinking field
vLLM is an OpenAI-compatible server, so it speaks OpenAI's chat
completions protocol and does not understand Anthropic-native extension
fields. The Vllm branch of apply_reasoning_effort was previously injecting
`thinking: {type: "disabled" | "enabled"}` at the top level of the request
body — but vLLM silently ignores this unknown field.
For Qwen3 (and other reasoning-capable models hosted via vLLM), the
canonical way to toggle thinking is the OpenAI-protocol extension
`chat_template_kwargs.enable_thinking`, which vLLM forwards into the
model's chat template. See the vLLM docs on chat templating and the
official Qwen3 template's `enable_thinking` parameter.
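To make the forwarding step concrete, here is a toy stand-in for the tail of a Qwen3-style chat template. It is a simplification under stated assumptions: the function name is hypothetical, the real template is Jinja rather than Rust, and other models (or newer Qwen releases) may handle the variable differently.

```rust
/// Toy approximation of how a Qwen3-style template builds the generation
/// prompt. `enable_thinking` is `None` when the request carried no
/// `chat_template_kwargs`, which leaves thinking at the template default (on).
/// Hypothetical helper for illustration only.
fn generation_prompt(enable_thinking: Option<bool>) -> String {
    let mut prompt = String::from("<|im_start|>assistant\n");
    if enable_thinking == Some(false) {
        // Pre-filling an empty think block signals that the reasoning phase
        // is already over, so the model goes straight to the answer.
        prompt.push_str("<think>\n\n</think>\n\n");
    }
    prompt
}
```

The point of the sketch is that an unknown top-level field never reaches this code path, whereas `chat_template_kwargs` does.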
Symptom before this fix
-----------------------
With reasoning_effort="off" against a Qwen3 deployment, the model still
generated 10+ seconds of reasoning tokens, which vLLM placed into the
non-OpenAI-standard `reasoning` field. This client does not consume
`reasoning`, so the user saw a ~13s "freeze" before any content
delta arrived. Manual curl reproduction:
  # Default request: thinking stays on.
  $ curl http://vllm-host:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "...", "messages":[{"role":"user","content":"hi"}],
        "max_tokens": 100
      }'
  -> reasoning: "Thinking Process: 1. Analyze the User's Input..."
  -> content: null (max_tokens spent in reasoning)

  # Same request with thinking disabled through the chat template.
  $ curl http://vllm-host:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "...", "messages":[{"role":"user","content":"hi"}],
        "max_tokens": 100,
        "chat_template_kwargs": {"enable_thinking": false}
      }'
  -> reasoning: null
  -> content: "你好!我是 Qwen…" (full answer, < 5s)
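The perceived freeze follows from how time-to-first-text is counted when `reasoning` deltas are skipped. A minimal sketch, assuming a streamed response reduced to the two fields above; the `Delta` type and both function names are hypothetical, not the client's real ones.

```rust
/// One streamed chunk, reduced to the two fields discussed above.
/// Hypothetical type for illustration.
struct Delta {
    at_ms: u64, // milliseconds since the request was sent
    reasoning: Option<String>,
    content: Option<String>,
}

/// Time to first visible text: the first delta with non-empty `content`.
/// Deltas that carry only `reasoning` are skipped, as in a client that
/// does not surface that field.
fn ttft_ms(deltas: &[Delta]) -> Option<u64> {
    deltas
        .iter()
        .find(|d| d.content.as_deref().map_or(false, |c| !c.is_empty()))
        .map(|d| d.at_ms)
}

/// Characters the model generated that the user never saw.
fn hidden_chars(deltas: &[Delta]) -> usize {
    deltas
        .iter()
        .filter_map(|d| d.reasoning.as_deref())
        .map(|r| r.chars().count())
        .sum()
}
```

With thinking enabled, every reasoning delta pushes the first `content` delta later, so `ttft_ms` grows with the length of the hidden trace.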
After this fix
--------------
Vllm branch now mirrors NvidiaNim (which already uses
chat_template_kwargs, but with the NVIDIA-specific `thinking` key rather
than the Qwen-standard `enable_thinking`).
End-to-end measurement against vLLM hosting Qwen3.6-35B-A3B-FP8:
- TTFT (time-to-first-text-delta): 13039ms -> 274ms
- Total LLM call: 13s -> 5.7s
- Effective output rate: 3 chars/s (most time in hidden reasoning)
-> 46 chars/s
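For readers who want the figures above as ratios, the arithmetic can be sketched as follows. The helper names are hypothetical; the inputs are the measurements quoted in this message. They work out to roughly a 47.6x lower TTFT (about a 97.9% reduction), a 2.3x shorter total call, and a 15.3x higher output rate.

```rust
/// Ratio of the "before" figure to the "after" figure.
fn ratio(before: f64, after: f64) -> f64 {
    before / after
}

/// Percentage by which a latency dropped.
fn reduction_pct(before_ms: f64, after_ms: f64) -> f64 {
    (before_ms - after_ms) / before_ms * 100.0
}
```

For example, `ratio(13039.0, 274.0)` gives the TTFT improvement and `ratio(46.0, 3.0)` the output-rate improvement.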
Note: Sglang and other OpenAI-compatible servers (Fireworks, Novita)
likely have the same issue but I have not verified the right field for
each. They are left unchanged in this PR; the Vllm branch is the
minimal, verified fix.
Code Review
This pull request modifies crates/tui/src/client.rs to correctly configure reasoning effort for vLLM providers by using chat_template_kwargs.enable_thinking instead of Anthropic-specific fields. This change addresses an issue where vLLM would ignore the Anthropic-style configuration, leading to performance delays. Feedback suggests extending this implementation to ApiProvider::Sglang for consistency, as it is also an OpenAI-compatible server that supports these keyword arguments.
Thanks @h3c-hexin — your contribution landed in
Closing this PR now that the code is on If you want to land more work and would prefer your future PRs merge cleanly without a harvest step, the
…the Anthropic field
`apply_reasoning_effort`'s vLLM branch was injecting
`thinking: {type: "disabled"}` at the top of the request body to
turn off model reasoning. But vLLM speaks OpenAI's
chat-completions protocol, not Anthropic-native extension fields,
and silently ignored that directive — the model emitted a full
hidden reasoning trace into the non-OpenAI-standard `reasoning`
field (which this client does not surface), so users saw a
~13-second perceived freeze before the first content token
arrived.
The vLLM branch now emits the OpenAI extension
`chat_template_kwargs.enable_thinking`, which vLLM forwards into the
model's chat template. This is the canonical way to toggle Qwen3's
`<think>` mode; other reasoning-capable models served via vLLM honor
it only if their chat template reads that variable. End-to-end
measurement against vLLM hosting Qwen3.6-35B-A3B-FP8:
- TTFT: 13039ms → 274ms
- Total LLM call: 13s → 5.7s
- Output rate: 3 ch/s → 46 ch/s
The `high` / `max` reasoning levels likewise route through
`chat_template_kwargs` so the toggle is consistent across effort
levels. No change for any non-vLLM provider (NVIDIA NIM continues
to use the NVIDIA-specific `chat_template_kwargs.thinking` key;
Anthropic-native providers keep the Anthropic-native field).
Resolved a 3-way merge conflict against the v0.8.32 AtlasCloud
harvest (PR Hmbown#1436) so AtlasCloud stays in the no-op match arm
alongside OpenAI / Ollama while the new vLLM arm gets its own
branch. Note for future Sglang / Fireworks / Novita work: those
servers likely have the same bug but each has its own
chat_template_kwargs schema; this PR is intentionally minimal
to the verified-fix scope.
Harvested from PR Hmbown#1480 by @h3c-hexin
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
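The match-arm layout described in this commit can be sketched as below. This is an illustration under assumptions, not the code in crates/tui/src/client.rs: the enum is cut down to the providers named here, the variant spellings other than `Vllm` and `NvidiaNim` are guesses, the boolean value for the NVIDIA `thinking` key is assumed, and the function returns a JSON fragment as a string rather than mutating a request body.

```rust
/// Reduced, hypothetical version of the provider enum.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ApiProvider {
    Vllm,
    NvidiaNim,
    OpenAi,
    Ollama,
    AtlasCloud,
}

/// Returns the JSON fragment to merge into the request body, if any.
fn reasoning_kwargs(provider: ApiProvider, effort: &str) -> Option<String> {
    let enabled = match effort {
        "off" => false,
        "high" | "max" => true,
        _ => return None, // other levels are left out of this sketch
    };
    match provider {
        // vLLM: Qwen-standard key, forwarded into the chat template.
        ApiProvider::Vllm => Some(format!(
            "\"chat_template_kwargs\":{{\"enable_thinking\":{}}}",
            enabled
        )),
        // NVIDIA NIM: same container, NVIDIA-specific key.
        ApiProvider::NvidiaNim => Some(format!(
            "\"chat_template_kwargs\":{{\"thinking\":{}}}",
            enabled
        )),
        // No-op arm: nothing is added for these providers.
        ApiProvider::OpenAi | ApiProvider::Ollama | ApiProvider::AtlasCloud => None,
    }
}
```

Keeping vLLM in its own arm, rather than folding it into the no-op group, is what lets the other OpenAI-compatible providers stay unchanged.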
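session.reasoning_effort was previously set only on the first
SendMessage. Sessions driven purely by SpawnSubAgent (the pinvou3
workflow host, which issues no SendMessage at all now that the
conversational 品悟 has been removed) therefore stayed at None, and
vLLM/Qwen3.6 fell back to its default of thinking fully on. Each step
produced several thousand characters of thinking, which slowed TTFT
and introduced a thinking infinite-loop failure mode (evidence: the
6/12 taizi runs, where runaway thinking repeatedly hit the 16384
ceiling).
Add EngineConfig.reasoning_effort (default None = previous behavior).
The engine initializes the session from it at creation, and a later
SendMessage can still override it per op. The host sets "off" for
local vLLM, which the existing apply_reasoning_effort turns into
chat_template_kwargs enable_thinking=false (the PR Hmbown#1480 path).
All 4222 tests pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>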
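The default-then-override behavior in this commit can be sketched as follows. The struct shapes, the `String` representation of the effort level, and the method names are assumptions for illustration; only the `EngineConfig.reasoning_effort` and `session.reasoning_effort` field names come from the message.

```rust
/// Hypothetical, reduced engine configuration.
#[derive(Default)]
struct EngineConfig {
    /// `None` keeps the previous behavior (no effort until a SendMessage sets one).
    reasoning_effort: Option<String>,
}

/// Hypothetical, reduced session state.
struct Session {
    reasoning_effort: Option<String>,
}

impl Session {
    /// The session starts from the engine-level default, so a session that
    /// never receives a SendMessage still has an effort level.
    fn new(config: &EngineConfig) -> Self {
        Session {
            reasoning_effort: config.reasoning_effort.clone(),
        }
    }

    /// A SendMessage that carries an effort overrides the current value;
    /// one that carries none leaves it untouched.
    fn on_send_message(&mut self, op_effort: Option<&str>) {
        if let Some(effort) = op_effort {
            self.reasoning_effort = Some(effort.to_string());
        }
    }
}
```

A host that only spawns sub-agents would set `reasoning_effort` to `Some("off".to_string())` in the config and never call the override path.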
Testing
cargo test --all-features
cargo fmt --all -- --check
cargo clippy --all-targets --all-features