fix(client): use chat_template_kwargs for Vllm provider instead of An… by h3c-hexin · Pull Request #1480 · Hmbown/CodeWhale

h3c-hexin · 2026-05-12T01:36:52Z

…thropic-style thinking field

vLLM is an OpenAI-compatible server, so it speaks OpenAI's chat completions protocol and does not understand Anthropic-native extension fields. The Vllm branch of apply_reasoning_effort was previously injecting thinking: {type: "disabled" | "enabled"} at the top level of the request body — but vLLM silently ignores this unknown field.

For Qwen3 (and other reasoning-capable models hosted via vLLM), the canonical way to toggle thinking is the OpenAI-protocol extension chat_template_kwargs.enable_thinking, which vLLM forwards into the model's chat template. See the vLLM docs on chat templating and the enable_thinking parameter of the official Qwen3 template.

Symptom before this fix

With reasoning_effort="off" against a Qwen3 deployment, the model still generated 10+ seconds of reasoning tokens, which vLLM placed into the non-OpenAI-standard reasoning field. This client does not consume reasoning, so the user observes a ~13s "freeze" before any content delta arrives. Manual curl reproduction:

$ curl http://vllm-host:8000/v1/chat/completions \
    -H "Content-Type: application/json" -d '{
    "model": "...", "messages":[{"role":"user","content":"hi"}],
    "max_tokens": 100
  }'
-> reasoning: "Thinking Process: 1. Analyze the User's Input..."
-> content: null   (max_tokens spent in reasoning)

$ curl http://vllm-host:8000/v1/chat/completions \
    -H "Content-Type: application/json" -d '{
    "model": "...", "messages":[{"role":"user","content":"hi"}],
    "max_tokens": 100,
    "chat_template_kwargs": {"enable_thinking": false}
  }'
-> reasoning: null
-> content: "你好！我是 Qwen…"   (full answer, < 5s)
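The freeze can be pictured with a small sketch of a client that only looks at content deltas. The `Delta` and `Summary` types and the `summarize` function are hypothetical and are not taken from this repository; the sketch only assumes that streamed chunks carry either a `reasoning` or a `content` string, as in the curl output above.

```rust
/// One streamed chunk, reduced to the two fields discussed above.
/// Hypothetical shape: the real client's delta type is not shown in this PR.
struct Delta {
    reasoning: Option<String>,
    content: Option<String>,
}

/// What a client that ignores `reasoning` effectively experiences.
struct Summary {
    /// Characters generated into `reasoning` before any visible text.
    hidden_reasoning_chars: usize,
    /// Index of the first chunk with non-empty `content`, if any.
    first_content_chunk: Option<usize>,
}

fn summarize(stream: &[Delta]) -> Summary {
    // The user sees nothing until a chunk carries non-empty content.
    let first = stream
        .iter()
        .position(|d| d.content.as_deref().map_or(false, |c| !c.is_empty()));

    // Everything before that point is time spent on text the client drops.
    let upto = first.unwrap_or(stream.len());
    let hidden: usize = stream[..upto]
        .iter()
        .filter_map(|d| d.reasoning.as_deref())
        .map(|r| r.chars().count())
        .sum();

    Summary {
        hidden_reasoning_chars: hidden,
        first_content_chunk: first,
    }
}
```

In the first curl run above, `first_content_chunk` would be `None`, because the whole `max_tokens` budget went into `reasoning`; in the second run it would be the first chunk.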

After this fix

The Vllm branch now mirrors NvidiaNim (which already uses chat_template_kwargs, but with the NVIDIA-specific thinking key rather than the Qwen-standard enable_thinking).
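A minimal sketch of the per-provider mapping described here, not the actual code in crates/tui/src/client.rs. The `Provider` enum, the tiny `Json` type, and `apply_reasoning_effort_sketch` are hypothetical stand-ins (the real client presumably builds the body with a JSON library). It assumes that the NVIDIA `thinking` key takes a boolean and that effort levels other than `off`, `high`, and `max` leave the body untouched; neither is confirmed by this PR.

```rust
use std::collections::BTreeMap;

/// Hypothetical subset of the client's provider enum.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Provider {
    Vllm,
    NvidiaNim,
    OpenAi,
}

/// Just enough JSON for this sketch.
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Bool(bool),
    Object(BTreeMap<String, Json>),
}

/// Adds the provider-specific thinking toggle to a request body.
fn apply_reasoning_effort_sketch(
    body: &mut BTreeMap<String, Json>,
    provider: Provider,
    effort: &str,
) {
    // Assumption: only these three levels map to an explicit toggle.
    let enabled = match effort {
        "off" => false,
        "high" | "max" => true,
        _ => return,
    };

    // Both providers nest the toggle under `chat_template_kwargs`;
    // only the key name differs.
    let key = match provider {
        Provider::Vllm => "enable_thinking",
        Provider::NvidiaNim => "thinking",
        // No-op arm: nothing is injected for plain OpenAI-style providers.
        Provider::OpenAi => return,
    };

    let mut kwargs = BTreeMap::new();
    kwargs.insert(key.to_string(), Json::Bool(enabled));
    body.insert("chat_template_kwargs".to_string(), Json::Object(kwargs));
}
```

For `Provider::Vllm` with effort `"off"`, the body gains `chat_template_kwargs: {enable_thinking: false}`, which matches the second curl request above, and no top-level `thinking` field is written.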

End-to-end measurement against vLLM hosting Qwen3.6-35B-A3B-FP8:

TTFT (time-to-first-text-delta): 13039ms -> 274ms
Total LLM call: 13s -> 5.7s
Effective output rate: 3 chars/s (most time in hidden reasoning) -> 46 chars/s

Note: Sglang and other OpenAI-compatible servers (Fireworks, Novita) likely have the same issue but I have not verified the right field for each. They are left unchanged in this PR; the Vllm branch is the minimal, verified fix.

Summary

Testing

cargo test --all-features
cargo fmt --all -- --check
cargo clippy --all-targets --all-features

Checklist

Updated docs or comments as needed
Added or updated tests where relevant
Verified TUI behavior manually if UI changes


gemini-code-assist

Code Review

This pull request modifies crates/tui/src/client.rs to correctly configure reasoning effort for vLLM providers by using chat_template_kwargs.enable_thinking instead of Anthropic-specific fields. This change addresses an issue where vLLM would ignore the Anthropic-style configuration, leading to performance delays. Feedback suggests extending this implementation to ApiProvider::Sglang for consistency, as it is also an OpenAI-compatible server that supports these keyword arguments.

h3c-hexin

good

github-actions · 2026-05-12T07:54:34Z

Thanks @h3c-hexin — your contribution landed in dcc2c448ebe3 on main:

fix(client): vLLM uses chat_template_kwargs to toggle reasoning, not the Anthropic field

Closing this PR now that the code is on main. Credit lives in the commit message and (where applicable) the CHANGELOG.md entry for the next release. Apologies for not closing this at the time of the merge — the auto-close workflow is new in v0.8.31.

If you want to land more work and would prefer your future PRs merge cleanly without a harvest step, the CONTRIBUTING.md doc has a short note on what makes a contribution mergeable as-is.

@h3c-hexin

…the Anthropic field

`apply_reasoning_effort`'s vLLM branch was injecting `thinking: {type: "disabled"}` at the top of the request body to turn off model reasoning. But vLLM speaks OpenAI's chat-completions protocol, not Anthropic-native extension fields, and silently ignored that directive: the model emitted a full hidden reasoning trace into the non-OpenAI-standard `reasoning` field (which this client does not surface), so users saw a ~13-second perceived freeze before the first content token arrived.

The vLLM branch now emits the OpenAI extension `chat_template_kwargs.enable_thinking`, the canonical way to toggle Qwen3's `<think>` mode, DeepSeek-R1's reasoning trace, and any other reasoning-capable model served via vLLM.

End-to-end measurement against vLLM hosting Qwen3.6-35B-A3B-FP8:

- TTFT: 13039ms → 274ms
- Total LLM call: 13s → 5.7s
- Output rate: 3 ch/s → 46 ch/s

The `high` / `max` reasoning levels likewise route through `chat_template_kwargs` so the toggle is consistent across effort levels. No change for any non-vLLM provider (NVIDIA NIM continues to use the NVIDIA-specific `chat_template_kwargs.thinking` key; Anthropic-native providers keep the Anthropic-native field).

Resolved a 3-way merge conflict against the v0.8.32 AtlasCloud harvest (PR Hmbown#1436) so AtlasCloud stays in the no-op match arm alongside OpenAI / Ollama while the new vLLM arm gets its own branch.

Note for future Sglang / Fireworks / Novita work: those servers likely have the same bug but each has its own chat_template_kwargs schema; this PR is intentionally minimal to the verified-fix scope.

Harvested from PR Hmbown#1480 by @h3c-hexin

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

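session.reasoning_effort was previously assigned only when the first SendMessage arrived. Sessions driven purely by SpawnSubAgent (the pinvou3 workflow host, which sends no SendMessage at all now that the conversational 品悟 mode has been removed) therefore stayed at None, and vLLM/Qwen3.6 fell back to its default of thinking fully enabled. Each step produced several thousand characters of reasoning, which slowed TTFT and introduced a runaway-thinking failure mode (observed on 6/12, when taizi's reasoning ran out of control and repeatedly hit the 16384 limit).

This adds EngineConfig.reasoning_effort (default None, which keeps the previous behavior). The engine initializes the session value from it when the session is created, and later SendMessage ops can still override it per op. The host sets "off" for local vLLM, which the existing apply_reasoning_effort translates into chat_template_kwargs enable_thinking=false (the PR Hmbown#1480 path). All 4222 tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>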
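The initialization order in this commit message can be sketched as follows. Only the names `EngineConfig.reasoning_effort`, `session.reasoning_effort`, `SendMessage`, and `SpawnSubAgent` come from the message; the struct layouts, the `Op` enum, and `Session::handle` are hypothetical. The sketch also assumes that a SendMessage carrying no effort leaves the current value in place.

```rust
/// Hypothetical engine configuration; only `reasoning_effort` is named in the commit.
#[derive(Debug, Default, Clone)]
struct EngineConfig {
    /// `None` keeps the earlier behavior: effort is set only by SendMessage.
    reasoning_effort: Option<String>,
}

/// Hypothetical subset of engine operations.
enum Op {
    SendMessage { reasoning_effort: Option<String> },
    SpawnSubAgent,
}

#[derive(Debug)]
struct Session {
    reasoning_effort: Option<String>,
}

impl Session {
    /// The session starts from the engine-level default instead of `None`.
    fn new(config: &EngineConfig) -> Self {
        Session {
            reasoning_effort: config.reasoning_effort.clone(),
        }
    }

    fn handle(&mut self, op: &Op) {
        match op {
            // A SendMessage that carries an effort overrides the session value.
            Op::SendMessage { reasoning_effort: Some(effort) } => {
                self.reasoning_effort = Some(effort.clone());
            }
            // Assumption: everything else leaves the current value untouched,
            // so a SpawnSubAgent-only session keeps the configured default.
            Op::SendMessage { .. } | Op::SpawnSubAgent => {}
        }
    }
}
```

With `reasoning_effort: Some("off".to_string())` in the config, a session that only ever receives `SpawnSubAgent` keeps `"off"` and never falls back to the model's default thinking mode.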

gemini-code-assist Bot reviewed May 12, 2026

Comment thread crates/tui/src/client.rs

h3c-hexin commented May 12, 2026

github-actions Bot closed this May 12, 2026

h3c-hexin deleted the fix/vllm-openai-chat-template-kwargs branch May 14, 2026 01:16

Biilow-Bailang mentioned this pull request Jun 12, 2026

feat: port the pinvou3 workflow foundation layer to v0.8.57 (companion to pinvou3#5) h3c-hexin/DeepSeek-TUI#2

Merged

h3c-hexin wants to merge 1 commit into Hmbown:main from h3c-hexin:fix/vllm-openai-chat-template-kwargs
