What happened?
I configured my main model and my fast model separately under modelProviders in ~/.qwen/settings.json, including model-specific options like extra_body.enable_thinking. My main model is a thinking model and I want it to think; my fast model is configured to never think (its entry sets extra_body.enable_thinking: false and extra_body.thinking.type: "disabled").
Despite that, side queries that run on the fast model (e.g. session-title generation, recap, tool-use summary) come back with reasoning_content populated. The fast model is clearly thinking even though my settings tell it not to.
When I checked the captured request body, I noticed that the per-model settings configured for my fast model are not the ones being applied. The fast model's request inherits the main model's settings instead: only the model id in the URL/body is swapped, and everything else (extra_body, sampling params, reasoning config, base URL, API key) carries over from the main model.
In my case, that means side queries to deepseek-v4-flash are sent with my main model's enable_thinking: true, which is the exact opposite of what deepseek-v4-flash's own entry says. The model dutifully thinks, returns reasoning_content, and the side query takes longer and costs more than it should.
What did you expect to happen?
When a side query runs on the fast model, the request should use the fast model's per-model settings — the same extra_body, sampling params, and reasoning config that the fast model would use if I selected it as my main model. The fast model entry in modelProviders should be the source of truth for how the fast model is called, regardless of whether the call originates from a main turn, a title generator, a recap, or any other side query.
In other words: each model's modelProviders entry should fully describe how that model is called. Today, only the main model's entry is consulted; the fast model's entry is effectively ignored except for its model id.
This also matters beyond the thinking/no-thinking case:
- If the fast model lives on a different provider/base URL than the main model, side queries today would silently go to the wrong endpoint with the wrong key. It only "works" if both models happen to share the same base URL and API key (as in my setup, since both are on Dashscope).
- If a user sets different samplingParams per model (e.g. lower temperature for the fast model to reduce variance on summaries), those settings are ignored on side queries.
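The difference between the observed and the expected resolution can be sketched in TypeScript as follows. This is a minimal illustration under stated assumptions, not the actual implementation: the interfaces, the `baseUrl` and `envKey` fields, both helper functions, the `main-thinking-model` id, and the example URL are hypothetical, and the real settings schema may differ.

```typescript
// Hypothetical shapes. Field names follow the ones mentioned in this report
// (extra_body, samplingParams, generationConfig); everything else is assumed.
interface GenerationConfig {
  extra_body?: Record<string, unknown>;
  samplingParams?: { temperature?: number };
}

interface ModelEntry {
  id: string;
  baseUrl: string; // assumed field name
  envKey: string; // assumed: name of the env var that holds the API key
  generationConfig?: GenerationConfig;
}

interface ResolvedRequest {
  model: string;
  baseUrl: string;
  envKey: string;
  generationConfig: GenerationConfig;
}

// Observed behavior: reuse the main model's settings and swap only the id.
function resolveBySwappingId(main: ModelEntry, fastId: string): ResolvedRequest {
  return {
    model: fastId,
    baseUrl: main.baseUrl,
    envKey: main.envKey,
    generationConfig: main.generationConfig ?? {},
  };
}

// Expected behavior: look the model up and use its own entry for everything.
function resolveFromOwnEntry(entries: ModelEntry[], modelId: string): ResolvedRequest {
  const entry = entries.find((e) => e.id === modelId);
  if (!entry) {
    throw new Error(`No modelProviders entry for ${modelId}`);
  }
  return {
    model: entry.id,
    baseUrl: entry.baseUrl,
    envKey: entry.envKey,
    generationConfig: entry.generationConfig ?? {},
  };
}

// Placeholder entries for illustration only.
const mainEntry: ModelEntry = {
  id: "main-thinking-model",
  baseUrl: "https://example.invalid/v1",
  envKey: "EXAMPLE_API_KEY",
  generationConfig: { extra_body: { enable_thinking: true } },
};

const fastEntry: ModelEntry = {
  id: "deepseek-v4-flash",
  baseUrl: "https://example.invalid/v1",
  envKey: "EXAMPLE_API_KEY",
  generationConfig: {
    extra_body: { enable_thinking: false },
    samplingParams: { temperature: 0.2 },
  },
};

const entries: ModelEntry[] = [mainEntry, fastEntry];
```

With these placeholder entries, `resolveBySwappingId(mainEntry, "deepseek-v4-flash")` carries the main model's `enable_thinking: true` into the side query, while `resolveFromOwnEntry(entries, "deepseek-v4-flash")` picks up the fast model's own `enable_thinking: false` and its lower temperature.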
Steps to reproduce
- In ~/.qwen/settings.json, set model.name to a thinking model that defaults to thinking-on (e.g. a Qwen3 or DeepSeek thinking-mode model). Give its modelProviders entry extra_body: { enable_thinking: true }.
- Set fastModel to a different model whose modelProviders entry has extra_body: { enable_thinking: false } (or thinking: { type: "disabled" }).
- Run an interactive session, send two short messages so the auto-title generator triggers, then exit. (Or run any other side-query path: /rename --auto, recap, etc.)
- Observe: the fast model's response for the title call contains reasoning_content, despite that model's entry disabling thinking.
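To make the last step easier to verify, a captured response body can be checked for reasoning output with a small helper. The sketch below assumes the response follows the common OpenAI-compatible chat-completion shape, with `reasoning_content` sitting next to `content` on each choice's message; the `CapturedResponse` type and `hasReasoningContent` function are hypothetical names, not part of any existing tool.

```typescript
// Assumed shape of a captured chat-completion response body.
interface CapturedResponse {
  choices?: Array<{
    message?: {
      content?: string | null;
      reasoning_content?: string | null;
    };
  }>;
}

// Returns true if any choice carries non-empty reasoning output.
function hasReasoningContent(body: CapturedResponse): boolean {
  return (body.choices ?? []).some((choice) => {
    const reasoning = choice.message?.reasoning_content;
    return typeof reasoning === "string" && reasoning.trim().length > 0;
  });
}

// Placeholder bodies for illustration only.
const thinkingResponse: CapturedResponse = {
  choices: [{ message: { content: "Fix login bug", reasoning_content: "The user asked about..." } }],
};

const nonThinkingResponse: CapturedResponse = {
  choices: [{ message: { content: "Fix login bug" } }],
};
```

If the fast model's entry were respected, the title call would be expected to look like `nonThinkingResponse`; in the reproduction it looks like `thinkingResponse`.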
Why this matters
Two practical consequences:
- Cost and latency. Side queries are intended to be cheap and fast. When they accidentally inherit "think first" from the main model, the fast model burns its small token budget on reasoning, which can make the request slower than the main turn it's supporting and occasionally truncate the actual answer.
- User trust in settings. Putting per-model config under modelProviders.<authType>[].generationConfig strongly implies that those settings apply whenever that model is used. The current behavior silently overrides them on every side query, which is hard to discover without inspecting outgoing traffic.