fix(client): use chat_template_kwargs for Vllm provider instead of An…#1480

Closed
h3c-hexin wants to merge 1 commit into
Hmbown:mainfrom
h3c-hexin:fix/vllm-openai-chat-template-kwargs

@h3c-hexin h3c-hexin commented May 12, 2026

…thropic-style thinking field

vLLM is an OpenAI-compatible server, so it speaks OpenAI's chat completions protocol and does not understand Anthropic-native extension fields. The Vllm branch of apply_reasoning_effort was previously injecting thinking: {type: "disabled" | "enabled"} at the top level of the request body — but vLLM silently ignores this unknown field.

For Qwen3 (and other reasoning-capable models hosted via vLLM), the canonical way to toggle thinking is the OpenAI-protocol extension chat_template_kwargs.enable_thinking, which vLLM forwards into the model's chat template. See the vLLM docs on chat templating and the enable_thinking parameter of the official Qwen3 template.
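
To make the request shape concrete, here is a minimal std-only sketch of a body carrying the toggle. The helper names (`enable_thinking_for`, `vllm_request_body`) are hypothetical and are not the functions in crates/tui/src/client.rs. Only the "off" to `false` mapping comes from this PR; treating every other effort value as `true` is an assumption made to keep the example short.

```rust
/// Hypothetical helper: maps a reasoning-effort string to the flag.
/// "off" -> false follows the PR description; everything else -> true
/// is an assumption of this sketch.
fn enable_thinking_for(effort: &str) -> bool {
    effort != "off"
}

/// Builds a minimal chat-completions body for a vLLM server, with the
/// toggle nested under `chat_template_kwargs` rather than a top-level
/// `thinking` object. The inputs are assumed to need no JSON escaping.
fn vllm_request_body(model: &str, user_text: &str, effort: &str) -> String {
    format!(
        r#"{{"model":"{}","messages":[{{"role":"user","content":"{}"}}],"chat_template_kwargs":{{"enable_thinking":{}}}}}"#,
        model,
        user_text,
        enable_thinking_for(effort)
    )
}

fn main() {
    // "my-model" is a placeholder model name.
    println!("{}", vllm_request_body("my-model", "hi", "off"));
}
```

A real client would build the body with a JSON library instead of string formatting; the point here is only where the key lives in the request.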

Symptom before this fix

With reasoning_effort="off" against a Qwen3 deployment, the model still generated 10+ seconds of reasoning tokens, which vLLM placed into the non-OpenAI-standard reasoning field. This client does not consume reasoning, so the user observes a ~13s "freeze" before any content delta arrives. Manual curl reproduction:

$ curl http://vllm-host:8000/v1/chat/completions -d '{
    "model": "...", "messages":[{"role":"user","content":"hi"}],
    "max_tokens": 100
  }'
-> reasoning: "Thinking Process: 1. Analyze the User's Input..."
-> content: null   (max_tokens spent in reasoning)

$ curl http://vllm-host:8000/v1/chat/completions -d '{
    "model": "...", "messages":[{"role":"user","content":"hi"}],
    "max_tokens": 100,
    "chat_template_kwargs": {"enable_thinking": false}
  }'
-> reasoning: null
-> content: "你好!我是 Qwen…"   (full answer, < 5s)
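
The "freeze" can be illustrated with a small sketch of a client that renders only `content`. The `Delta` struct and `first_text_index` function are hypothetical simplifications, not this client's actual stream types; the assumption is that each streamed chunk carries either `reasoning` or `content`.

```rust
/// Hypothetical, simplified view of one streamed chunk.
struct Delta {
    content: Option<String>,
    reasoning: Option<String>,
}

/// Index of the first chunk that a content-only client would render.
/// Chunks that carry only `reasoning` are skipped, which is why the
/// user sees nothing while the model is thinking.
fn first_text_index(deltas: &[Delta]) -> Option<usize> {
    deltas
        .iter()
        .position(|d| d.content.as_deref().map_or(false, |c| !c.is_empty()))
}

fn main() {
    let deltas = vec![
        Delta { content: None, reasoning: Some("Thinking Process: ...".to_string()) },
        Delta { content: None, reasoning: Some("1. Analyze ...".to_string()) },
        Delta { content: Some("Hello".to_string()), reasoning: None },
    ];
    let hidden = deltas.iter().filter(|d| d.reasoning.is_some()).count();
    println!("reasoning-only chunks: {}", hidden);
    println!("first visible chunk index: {:?}", first_text_index(&deltas));
}
```

If the reasoning budget consumes all of `max_tokens`, as in the first curl call above, no chunk ever carries `content` and the function returns `None`.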

After this fix

Vllm branch now mirrors NvidiaNim (which already uses chat_template_kwargs, but with the NVIDIA-specific thinking key rather than the Qwen-standard enable_thinking).
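
A rough sketch of the per-provider difference follows. `ApiProvider::Vllm` and `ApiProvider::NvidiaNim` are named in this PR; the `OpenAi` variant spelling, the function name, and the assumption that NVIDIA's `thinking` key takes a plain boolean are illustrative guesses, not taken from the code.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ApiProvider {
    Vllm,
    NvidiaNim,
    OpenAi, // stands in for the no-op providers; spelling is hypothetical
}

/// Returns the JSON value to place under `chat_template_kwargs`, or
/// `None` when the provider gets no such field.
fn chat_template_kwargs(provider: ApiProvider, effort: &str) -> Option<String> {
    // Assumption: only "off" disables thinking.
    let enabled = effort != "off";
    match provider {
        // Qwen-standard key, per this PR.
        ApiProvider::Vllm => Some(format!("{{\"enable_thinking\":{}}}", enabled)),
        // NVIDIA-specific key; the boolean value type is assumed here.
        ApiProvider::NvidiaNim => Some(format!("{{\"thinking\":{}}}", enabled)),
        ApiProvider::OpenAi => None,
    }
}

fn main() {
    for provider in [ApiProvider::Vllm, ApiProvider::NvidiaNim, ApiProvider::OpenAi] {
        println!("{:?}: {:?}", provider, chat_template_kwargs(provider, "off"));
    }
}
```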

End-to-end measurement against vLLM hosting Qwen3.6-35B-A3B-FP8:

  • TTFT (time-to-first-text-delta): 13039ms -> 274ms
  • Total LLM call: 13s -> 5.7s
  • Effective output rate: 3 chars/s (most time in hidden reasoning) -> 46 chars/s
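
For readers who prefer ratios, the figures above work out as follows. This is plain arithmetic on the reported numbers; the `speedup` helper is only for illustration.

```rust
/// Ratio of a "before" figure to an "after" figure.
fn speedup(before: f64, after: f64) -> f64 {
    before / after
}

fn main() {
    // TTFT: 13039 ms -> 274 ms
    println!("TTFT: {:.1}x faster", speedup(13039.0, 274.0)); // 47.6x
    // Total call: 13 s -> 5.7 s
    println!("Total call: {:.1}x faster", speedup(13.0, 5.7)); // 2.3x
    // Output rate went up, so the ratio is after / before: 46 / 3
    println!("Output rate: {:.1}x higher", speedup(46.0, 3.0)); // 15.3x
}
```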

Note: Sglang and other OpenAI-compatible servers (Fireworks, Novita) likely have the same issue but I have not verified the right field for each. They are left unchanged in this PR; the Vllm branch is the minimal, verified fix.

Summary

Testing

  • cargo test --all-features
  • cargo fmt --all -- --check
  • cargo clippy --all-targets --all-features

Checklist

  • Updated docs or comments as needed
  • Added or updated tests where relevant
  • Verified TUI behavior manually if UI changes

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request modifies crates/tui/src/client.rs to correctly configure reasoning effort for vLLM providers by using chat_template_kwargs.enable_thinking instead of Anthropic-specific fields. This change addresses an issue where vLLM would ignore the Anthropic-style configuration, leading to performance delays. Feedback suggests extending this implementation to ApiProvider::Sglang for consistency, as it is also an OpenAI-compatible server that supports these keyword arguments.

Comment thread crates/tui/src/client.rs

@h3c-hexin h3c-hexin left a comment

good

@github-actions

Thanks @h3c-hexin — your contribution landed in dcc2c448ebe3 on main:

fix(client): vLLM uses chat_template_kwargs to toggle reasoning, not the Anthropic field

Closing this PR now that the code is on main. Credit lives in the commit message and (where applicable) the CHANGELOG.md entry for the next release. Apologies for not closing this at the time of the merge — the auto-close workflow is new in v0.8.31.

If you want to land more work and would prefer your future PRs merge cleanly without a harvest step, the CONTRIBUTING.md doc has a short note on what makes a contribution mergeable as-is.

@github-actions github-actions Bot closed this May 12, 2026
mars-base pushed a commit to mars-base/CodeWhale that referenced this pull request May 12, 2026
…the Anthropic field

`apply_reasoning_effort`'s vLLM branch was injecting
`thinking: {type: "disabled"}` at the top of the request body to
turn off model reasoning. But vLLM speaks OpenAI's
chat-completions protocol, not Anthropic-native extension fields,
and silently ignored that directive — the model emitted a full
hidden reasoning trace into the non-OpenAI-standard `reasoning`
field (which this client does not surface), so users saw a
~13-second perceived freeze before the first content token
arrived.

The vLLM branch now emits the OpenAI extension
`chat_template_kwargs.enable_thinking` — the canonical way to
toggle Qwen3's `<think>` mode, DeepSeek-R1's reasoning trace, and
any other reasoning-capable model served via vLLM. End-to-end
measurement against vLLM hosting Qwen3.6-35B-A3B-FP8:

  - TTFT:           13039ms → 274ms
  - Total LLM call: 13s     → 5.7s
  - Output rate:    3 ch/s  → 46 ch/s

The `high` / `max` reasoning levels likewise route through
`chat_template_kwargs` so the toggle is consistent across effort
levels. No change for any non-vLLM provider (NVIDIA NIM continues
to use the NVIDIA-specific `chat_template_kwargs.thinking` key;
Anthropic-native providers keep the Anthropic-native field).

Resolved a 3-way merge conflict against the v0.8.32 AtlasCloud
harvest (PR Hmbown#1436) so AtlasCloud stays in the no-op match arm
alongside OpenAI / Ollama while the new vLLM arm gets its own
branch. Note for future Sglang / Fireworks / Novita work: those
servers likely have the same bug but each has its own
chat_template_kwargs schema; this PR is intentionally minimal
to the verified-fix scope.

Harvested from PR Hmbown#1480 by @h3c-hexin

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@h3c-hexin h3c-hexin deleted the fix/vllm-openai-chat-template-kwargs branch May 14, 2026 01:16
Biilow-Bailang pushed a commit to h3c-hexin/DeepSeek-TUI that referenced this pull request Jun 12, 2026
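session.reasoning_effort was previously set only when the first
SendMessage arrived; sessions driven purely by SpawnSubAgent (the
pinvou3 workflow host, which sends no SendMessage at all now that the
conversational 品悟 mode has been removed) stayed at None, so
vLLM/Qwen3.6 fell back to its default of thinking fully enabled. Each
step then produced several thousand characters of thinking, which
slowed TTFT and introduced a runaway-thinking failure mode (observed on
6/12, when taizi's thinking ran out of control and hit the 16384 cap
several times in a row).

Added EngineConfig.reasoning_effort (default None, which keeps the
previous behavior). The engine initializes the session's value from it
when the session is created, and later SendMessage ops can still
override it. The host sets "off" for local vLLM, which the existing
apply_reasoning_effort translates into chat_template_kwargs
enable_thinking=false (the PR Hmbown#1480 path). All 4222 tests pass.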

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
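
The initialization order described in the commit message above can be sketched as follows. `EngineConfig.reasoning_effort` and `session.reasoning_effort` are named in the commit; the struct layouts, the `Session::new` constructor, and the `on_send_message` method are hypothetical simplifications.

```rust
#[derive(Debug, Clone, Default)]
struct EngineConfig {
    /// Default None keeps the earlier behavior.
    reasoning_effort: Option<String>,
}

#[derive(Debug)]
struct Session {
    reasoning_effort: Option<String>,
}

impl Session {
    /// The session takes its initial value from the engine config, so a
    /// session that never receives a SendMessage still has one.
    fn new(config: &EngineConfig) -> Self {
        Session { reasoning_effort: config.reasoning_effort.clone() }
    }

    /// A later SendMessage may override the value; an op that carries
    /// no effort leaves the current value untouched (assumed behavior).
    fn on_send_message(&mut self, op_effort: Option<&str>) {
        if let Some(effort) = op_effort {
            self.reasoning_effort = Some(effort.to_string());
        }
    }
}

fn main() {
    let config = EngineConfig { reasoning_effort: Some("off".to_string()) };
    let mut session = Session::new(&config);
    println!("at creation: {:?}", session.reasoning_effort);
    session.on_send_message(Some("high"));
    println!("after override: {:?}", session.reasoning_effort);
}
```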