[Bugfix] Honor tool_choice="none" in Chat Completions streaming #42752

Merged
sfeng33 merged 3 commits into vllm-project:main from hoobnn:fix/issue-42747-tool-choice-none-streaming on Jun 3, 2026

Conversation

@hoobnn

@hoobnn hoobnn commented May 15, 2026

Contributor

Summary

Fixes #42747.

Streaming Chat Completions with tool_choice="none" — or explicitly disabled via JSON null, where request.tool_choice resolves to None — could still produce delta.tool_calls and finish with finish_reason="tool_calls" whenever the server was launched with a --tool-call-parser and the model output happened to match that parser's tool-call format. Non-streaming Chat Completions already handles both cases correctly.
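
One way to observe the symptom from the client side is to scan the streamed chunks for tool-call deltas. The sketch below works on plain dictionaries shaped like Chat Completions stream chunks; the helper name `summarize_stream` and the sample chunks are illustrative assumptions, not part of vLLM or this PR.

```python
from typing import Any, Iterable


def summarize_stream(chunks: Iterable[dict[str, Any]]) -> dict[str, Any]:
    """Collect content, tool-call presence, and finish_reason from stream chunks.

    Hypothetical helper: each chunk is assumed to look like
    {"choices": [{"delta": {...}, "finish_reason": ...}]}.
    """
    content_parts: list[str] = []
    saw_tool_calls = False
    finish_reason = None
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            if delta.get("content"):
                content_parts.append(delta["content"])
            if delta.get("tool_calls"):
                saw_tool_calls = True
            # Keep the last non-null finish_reason seen on the stream.
            finish_reason = choice.get("finish_reason") or finish_reason
    return {
        "content": "".join(content_parts),
        "saw_tool_calls": saw_tool_calls,
        "finish_reason": finish_reason,
    }


# Made-up chunks for a tool_choice="none" request: the reported symptom ...
buggy = [
    {"choices": [{"delta": {"tool_calls": [{"index": 0}]}, "finish_reason": None}]},
    {"choices": [{"delta": {}, "finish_reason": "tool_calls"}]},
]
# ... and the behavior this PR aims for (text stays in content).
expected = [
    {"choices": [{"delta": {"content": '{"name": "f"}'}, "finish_reason": None}]},
    {"choices": [{"delta": {}, "finish_reason": "stop"}]},
]
```

With a real server, the same helper could be fed chunks converted to dictionaries from whichever client library is in use.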

Root cause

DelegatingParser.parse_delta in vllm/parser/abstract_parser.py invoked _extract_tool_calls_streaming unconditionally once the stream entered the tool-call phase, without inspecting request.tool_choice. The non-streaming path at vllm/entrypoints/openai/chat_completion/serving.py already short-circuits both cases:

elif not request.tool_choice or request.tool_choice == "none":
    message = ChatMessage(role=role, reasoning=reasoning, content=content)

The streaming path was missing the equivalent guard.

Fix

In DelegatingParser.parse_delta, when not request.tool_choice or request.tool_choice == "none", skip _extract_tool_calls_streaming and surface any remaining (post-reasoning) text as plain content. Because the tool parser is never invoked, state.function_name_returned stays untouched and the downstream tools_streamed[i] flag stays False, so finish_reason naturally falls back to "stop". Reasoning extraction on boundary deltas (introduced by #42691) is preserved.
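
The guard and its effect on finish_reason can be sketched roughly as follows. This is a simplified stand-in, not the code in vllm/parser/abstract_parser.py: `Request`, `tool_parser_disabled`, `stream`, and the toy `<tool>` marker are assumptions made for illustration, and reasoning extraction is omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    # Hypothetical stand-in for the chat request; only tool_choice matters here.
    tool_choice: Optional[str] = None


def tool_parser_disabled(request: Request) -> bool:
    """Guard as worded in the description above: None or "none" disables parsing."""
    return not request.tool_choice or request.tool_choice == "none"


def stream(request: Request, deltas: list[str]) -> tuple[list[dict], str]:
    """Toy streaming loop: returns emitted deltas and the final finish_reason."""
    emitted: list[dict] = []
    tools_streamed = False  # plays the role of the downstream tools_streamed flag
    for text in deltas:
        if tool_parser_disabled(request):
            # Tool parser is skipped; text is surfaced as plain content.
            emitted.append({"content": text})
        elif text.startswith("<tool>"):
            # Toy "tool parser": anything starting with <tool> counts as a tool call.
            emitted.append({"tool_calls": [text]})
            tools_streamed = True
        else:
            emitted.append({"content": text})
    # finish_reason only becomes "tool_calls" if the parser actually streamed one.
    return emitted, "tool_calls" if tools_streamed else "stop"
```

Because the flag is only set inside the parser branch, skipping that branch is enough to make the finish reason fall back to "stop" without any extra bookkeeping.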

Update — broadened guard per review feedback

The first revision of this PR only guarded request.tool_choice == "none". Following review feedback from @gemini-code-assist to also cover request.tool_choice is None (the explicit-null / tools-disabled case raised under #42747), the guard now reads not request.tool_choice or request.tool_choice == "none", matching the non-streaming semantics exactly.

Thanks to @FutureSkyFly, whose #44102 independently implemented the same broader guard and validated the direction (also cross-checked downstream in vllm-project/vllm-ascend#9776). This PR folds that broader guard into the original change, so #44102 can be closed as covered here.

Duplicate-PR check

gh issue view 42747 --repo vllm-project/vllm --comments
gh pr list --repo vllm-project/vllm --state open --search "42747 in:body"

Test plan

In tests/entrypoints/openai/test_tool_choice_content_none.py:

  • test_parse_delta_with_tool_choice_none_skips_tool_parser — explicit tool_choice="none": parser is not invoked, raw delta text surfaces as DeltaMessage.content.
  • test_parse_delta_with_tool_choice_null_skips_tool_parser — explicit tool_choice: null (request.tool_choice is None): parser is not invoked, content surfaces. The additional case beyond the original revision.
  • test_parse_delta_with_tool_choice_auto_still_runs_tool_parser — sanity: tool_choice="auto" still hits the tool parser (no regression).
  • test_parse_delta_tool_choice_none_multiple_chunks_remain_content — multi-chunk streaming stays in content mode across deltas.
.venv/bin/python -m pytest tests/entrypoints/openai/test_tool_choice_content_none.py -v
# 6 passed

Verified the null test fails when the guard is narrowed back to request.tool_choice == "none" only, confirming it genuinely exercises the broadened guard. ruff check / ruff format clean on both files.
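
The shape of these tests can be approximated with a stub parser that counts invocations. The sketch below is not the contents of test_tool_choice_content_none.py; `SpyToolParser`, `parse_delta`, and `run_chunks` are hypothetical stand-ins that only model the guard as described at this revision.

```python
from typing import Optional


class SpyToolParser:
    """Hypothetical stub: records how often the tool parser is invoked."""

    def __init__(self) -> None:
        self.calls = 0

    def extract(self, text: str) -> dict:
        self.calls += 1
        return {"tool_calls": [text]}


def parse_delta(tool_choice: Optional[str], text: str, parser: SpyToolParser) -> dict:
    """Simplified stand-in for the streaming path with the broadened guard."""
    if not tool_choice or tool_choice == "none":
        return {"content": text}  # parser is never touched
    return parser.extract(text)


def run_chunks(tool_choice: Optional[str], chunks: list[str]) -> tuple[list[dict], int]:
    """Feed chunks through parse_delta and report outputs plus parser call count."""
    parser = SpyToolParser()
    outputs = [parse_delta(tool_choice, chunk, parser) for chunk in chunks]
    return outputs, parser.calls
```

Asserting on the call count, rather than only on the output, is what distinguishes "the parser ran and found nothing" from "the parser was skipped".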

AI assistance (Claude) was used to draft the patch and tests; the submitter reviewed every changed line and ran the tests above.

@github-actions


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

Agent Guidelines

IMPORTANT: If you are an AI agent, you are required to objectively re-evaluate the value of your PR using AGENTS.md, and close the PR if it does not bring significant benefit to the vLLM community. Failure to do so may result in an immediate ban.

🚀

@mergify mergify Bot added the tool-calling and bug labels May 15, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces logic to bypass the tool parser during streaming when tool_choice is set to "none", ensuring that model output is correctly surfaced as plain content. This change aligns the streaming behavior with the existing non-streaming implementation. The PR also includes comprehensive unit tests using a stub tool parser to verify that the bypass works as expected across multiple chunks. Review feedback suggests broadening the check to include cases where tool_choice is None to ensure full consistency with the non-streaming path.

Comment thread vllm/parser/abstract_parser.py Outdated
alexeldeib added a commit to alexeldeib/vllm that referenced this pull request May 19, 2026
Kimi K2.6 can emit untagged machine-readable output when a request requires JSON, structured text, Responses text.format JSON/schema output, or a forced tool payload. The Kimi reasoning parser previously treated that untagged output as implicit reasoning until it saw a visible reasoning end token, so valid payloads such as {"answer": 42} or required tool-call JSON could be hidden from the OpenAI/Responses stream or handed to the wrong parser phase.

Make the request contract explicit and preserve it across parser request rewrites. Structured text contracts bypass implicit reasoning immediately, while forced tool contracts only move into content/tool parsing when the prefix is a plausible tool payload. This avoids treating ordinary assistant text that happens to contain JSON as a tool call under auto tools, and prevents tool-parser generated grammars from being mistaken for caller requested structured text.

Keep visible Kimi reasoning delimiters meaningful: complete <think>...</think> regions and implicit Kimi tool-section boundaries are still stripped as reasoning. The one intentionally ambiguous edge we handle is a constrained structured choice literal that itself starts with <think>, where the allowed choice lets us preserve literal content without changing generic JSON/schema semantics.

Render/disaggregated serving now carries request-scoped reasoning state through GenerateRequest: render marks machine-output contracts as reasoning_ended and forwards effective chat_template_kwargs; disagg passes those values to engine.generate so structured decoding in the worker uses the same Kimi thinking configuration as render.

Also keep tool_choice=none streaming out of tool-call parsing. This overlaps semantically with upstream PRs vllm-project#42752 and vllm-project#42868, which are narrower generic fixes for tool_choice=none; if either lands first, future rebases should drop the duplicate guard but keep the Kimi machine-output/request-contract handling.

Co-authored-by: OpenAI Codex <codex@openai.com>
alexeldeib added a commit to alexeldeib/vllm that referenced this pull request May 19, 2026
Kimi K2.6 can emit untagged machine-readable output when a request requires JSON, structured text, Responses text.format JSON/schema output, or a forced tool payload. The Kimi reasoning parser previously treated that untagged output as implicit reasoning until it saw a visible reasoning end token, so valid payloads such as {"answer": 42} or required tool-call JSON could be hidden from the OpenAI/Responses stream or handed to the wrong parser phase.

Make the request contract explicit and preserve it across parser request rewrites. Structured text contracts bypass implicit reasoning immediately, while forced tool contracts only move into content/tool parsing when the prefix is a plausible tool payload. Preserve literal structured choices across rewrite as well, so a constrained choice such as <think>literal is not mistaken for hidden reasoning after structured decoding rewrites the request.

Keep visible Kimi reasoning delimiters meaningful: complete <think>...</think> regions and implicit Kimi tool-section boundaries are still stripped as reasoning. The intentionally ambiguous delimiter-literal edge is only handled when a constrained structured choice proves the literal is allowed, which avoids changing generic JSON/schema semantics.

Render/disaggregated serving now carries request-scoped reasoning state through GenerateRequest: render marks machine-output contracts as reasoning_ended and forwards effective chat_template_kwargs; disagg passes those values to engine.generate so structured decoding in the worker uses the same Kimi thinking configuration as render.

Also keep tool_choice=none streaming out of tool-call parsing. This overlaps semantically with upstream PRs vllm-project#42752 and vllm-project#42868, which are narrower generic fixes for tool_choice=none; if either lands first, future rebases should drop the duplicate guard but keep the Kimi machine-output/request-contract handling.

Co-authored-by: OpenAI Codex <codex@openai.com>
alexeldeib added a commit to alexeldeib/vllm that referenced this pull request May 22, 2026
Kimi K2 emits tool calls with native structural markers like <|tool_calls_section_begin|> and <|tool_call_begin|> functions.<name>:<id>, not the generic JSON payload used by the default required/named tool-choice path. When forced tool choices are guided and parsed as generic JSON, streamed responses can lose parsed tool calls or prevent visible reasoning before the native tool section.

Add a Kimi structural tag so required and named tool choices constrain generation to the same native format that KimiK2ToolParser already understands, and mark the parser as not supporting the generic required/named parser. The tag allows optional whitespace at the separator positions seen in Kimi K2.6 e2e output and already accepted by the parser regex, so guidance does not force the model away from its native distribution.

When structured outputs are enabled during reasoning, include a reasoning prefix that allows Kimi to complete its template-opened <think> block before the native tool-call section. Gate that prefix on the engine enable_in_reasoning setting and Kimi's thinking chat-template knob, not include_reasoning, because include_reasoning only controls response visibility.

Keep auto/none/no-tool behavior unchanged unless VLLM_ENFORCE_STRICT_TOOL_CALLING routes auto through structural tags, in which case Kimi now uses the same native tag builder as required/named. This change does not address the separate generic streaming parser issue where tool_choice="none" can still enter tool-call parsing; that is covered by vLLM PRs vllm-project#42752 and vllm-project#42868. Preserve strict=false tool definitions by disabling argument-schema guidance for that tool, and reject xgrammar-unsupported JSON schema features before installing the structural tag so unsupported schemas fail consistently with plain JSON structured outputs.

Tests cover Kimi structural-tag request adjustment, strict auto routing, strict=false tool schemas, xgrammar-unsupported schema rejection, opt-out from generic required/named parsing, replacement of conflicting structured-output constraints, structural-tag validation, reasoning-prefix gating by bitmask phase and Kimi thinking mode, and include_reasoning visibility not changing the grammar shape.

Co-authored-by: OpenAI Codex <codex@openai.com>

Signed-off-by: Ace Eldeib <aeldeib@coreweave.com>
@mergify mergify Bot added the needs-rebase label May 23, 2026
FutureSkyFly pushed a commit to FutureSkyFly/vllm-ascend that referenced this pull request May 31, 2026
Mirror of upstream vllm-project/vllm#42752 (fixes vllm-project/vllm#42747).

Streaming Chat Completions with tool_choice="none" (or omitted on a
no-tools request) could still produce delta.tool_calls and finish with
finish_reason="tool_calls" because DelegatingParser.parse_delta invokes
_extract_tool_calls_streaming unconditionally once the stream enters the
tool-call phase, ignoring request.tool_choice. Non-streaming already
short-circuits this in chat_completion/serving.py:1250:

    elif not request.tool_choice or request.tool_choice == "none":
        message = ChatMessage(role=role, reasoning=reasoning, content=content)

Replicate the same semantics on the streaming path: when tool_choice is
None or "none", skip the tool parser inside the tool-call phase and
surface the (post-reasoning) delta_text as plain DeltaMessage.content.

Effect: DSV4 DSML markup (and any other parser's tool-call-looking
output) stays in delta.content, matching the non-streaming behavior,
and finish_reason falls back to "stop".

Replaces the previous patch_dsv4_dsml_tool_choice_none.py approach,
which incorrectly stripped DSML markup from non-streaming content.
The new direction follows the upstream consensus in issue #42747:
both modes leave the markup in content, neither strips it.

Signed-off-by: liuchenbing <chenliumail@163.com>
hoobnn added a commit to hoobnn/vllm that referenced this pull request May 31, 2026
DelegatingParser.parse_delta unconditionally invoked the configured
tool parser once the stream entered the tool-call phase, so streaming
Chat Completions could still emit delta.tool_calls and finish with
finish_reason="tool_calls" whenever a --tool-call-parser was configured
and the model output happened to match that parser format -- even when
the client disabled tools. The non-streaming path already short-circuits
both cases in chat_completion/serving.py:

    elif not request.tool_choice or request.tool_choice == "none":

Mirror that guard here. When `not request.tool_choice or
request.tool_choice == "none"` -- i.e. tool_choice="none" OR explicitly
disabled via JSON null (request.tool_choice is None) -- skip
extract_tool_calls_streaming and surface the accumulated post-reasoning
text as plain content. The tool parser is never invoked, so
function_name_returned/tools_streamed stay False and finish_reason falls
back to "stop". Reasoning extraction on boundary deltas is preserved.

This broadens the original tool_choice=="none"-only guard to also cover
request.tool_choice is None, per review feedback on vllm-project#42752, so streaming
matches the non-streaming semantics exactly.

Fixes vllm-project#42747

Signed-off-by: hoobnn <111053672+hoobnn@users.noreply.github.com>
@hoobnn hoobnn force-pushed the fix/issue-42747-tool-choice-none-streaming branch from 6af269f to 06e14e6 Compare May 31, 2026 09:05
@hoobnn hoobnn requested a review from AndreasKaratzas as a code owner May 31, 2026 09:05
@hoobnn hoobnn changed the title [Bugfix] Honor tool_choice="none" in Chat Completions streaming [Bugfix] Honor tool_choice=None / "none" in Chat Completions streaming May 31, 2026
@mergify mergify Bot removed the needs-rebase label May 31, 2026
@hoobnn

hoobnn commented May 31, 2026

Contributor Author

Addressed the review feedback: broadened the streaming guard to not request.tool_choice or request.tool_choice == "none" so it also covers request.tool_choice is None (explicit null / tools disabled), matching the non-streaming path. Rebased onto current main (reconciled with #42691's boundary-delta reasoning handling), added a dedicated regression test for the None path, and signed off for DCO. Folds in the same broader guard as #44102 — thanks @FutureSkyFly for the cross-validation.

@hoobnn hoobnn force-pushed the fix/issue-42747-tool-choice-none-streaming branch from 06845e1 to 86bbc5c Compare June 3, 2026 00:45
@hoobnn hoobnn requested a review from sfeng33 June 3, 2026 00:47
@mergify mergify Bot removed the needs-rebase label Jun 3, 2026
hoobnn and others added 3 commits June 3, 2026 17:43
DelegatingParser.parse_delta unconditionally invoked the configured
tool parser once the stream entered the tool-call phase, so streaming
Chat Completions could still emit delta.tool_calls and finish with
finish_reason="tool_calls" whenever a --tool-call-parser was configured
and the model output happened to match that parser format -- even when
the client disabled tools. The non-streaming path already short-circuits
both cases in chat_completion/serving.py:

    elif not request.tool_choice or request.tool_choice == "none":

Mirror that guard here. When `not request.tool_choice or
request.tool_choice == "none"` -- i.e. tool_choice="none" OR explicitly
disabled via JSON null (request.tool_choice is None) -- skip
extract_tool_calls_streaming and surface the accumulated post-reasoning
text as plain content. The tool parser is never invoked, so
function_name_returned/tools_streamed stay False and finish_reason falls
back to "stop". Reasoning extraction on boundary deltas is preserved.

This broadens the original tool_choice=="none"-only guard to also cover
request.tool_choice is None, per review feedback on vllm-project#42752, so streaming
matches the non-streaming semantics exactly.

Fixes vllm-project#42747

Signed-off-by: hoobnn <111053672+hoobnn@users.noreply.github.com>
Relocate the streaming guard from parse_delta into
_extract_tool_calls_streaming next to the required/named dispatch, so
parse_delta reverts to a single unconditional call. The early return
surfaces remaining content as a DeltaMessage rather than None to avoid
dropping it when the pass-through-as-content fallback is skipped. Also
add a ResponsesRequest(tool_choice="none") parser-level regression
test.

Signed-off-by: hoobnn <111053672+hoobnn@users.noreply.github.com>
Signed-off-by: sfeng33 <4florafeng@gmail.com>
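
A rough sketch of why the early return carries content instead of None follows. The names mirror those in the commit message, but the bodies are simplified assumptions rather than the merged code; in particular, the caller treating None as "nothing to emit" is assumed, the `return_none` switch exists only to show the alternative, and the guard uses the explicit "none" condition the thread later settled on.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeltaMessage:
    # Minimal stand-in for the real DeltaMessage; only content is modeled.
    content: Optional[str] = None


def extract_tool_calls_streaming(
    tool_choice: Optional[str], remaining_text: str, return_none: bool = False
) -> Optional[DeltaMessage]:
    """Hypothetical helper: early return when tools are disabled."""
    if tool_choice == "none":
        # Returning None here would lose remaining_text, because the caller
        # below has no later fallback that re-emits it as content.
        return None if return_none else DeltaMessage(content=remaining_text)
    # Real tool-call extraction would run here; omitted in this sketch.
    return DeltaMessage(content=None)


def parse_delta(tool_choice: Optional[str], text: str, return_none: bool = False) -> str:
    """Single unconditional call, as in the relocated design; returns emitted text."""
    delta = extract_tool_calls_streaming(tool_choice, text, return_none)
    return (delta.content or "") if delta is not None else ""
```

Calling `parse_delta("none", "hello")` yields the text, while the `return_none=True` variant yields an empty string, which is the dropped-content case the commit message describes.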
@sfeng33 sfeng33 force-pushed the fix/issue-42747-tool-choice-none-streaming branch from 86bbc5c to 0845046 Compare June 3, 2026 18:36
@sfeng33 sfeng33 added the ready label Jun 3, 2026
@sfeng33 sfeng33 changed the title [Bugfix] Honor tool_choice=None / "none" in Chat Completions streaming [Bugfix] Honor tool_choice="none" in Chat Completions streaming Jun 3, 2026
@vllm-project vllm-project deleted a comment from mergify Bot Jun 3, 2026

@sfeng33 sfeng33 left a comment

Collaborator

Thank you! I updated the PR to narrow the guard condition to request.tool_choice == "none"; when it is None, it should be treated as auto per the OpenAI spec.
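
The defaulting rule behind this narrowing can be sketched as below. In the OpenAI Chat Completions API, an omitted tool_choice defaults to "none" when no tools are supplied and to "auto" when tools are present. The function names `effective_tool_choice` and `skip_tool_parser` are assumptions for illustration and are not taken from vLLM, and named (object-valued) tool choices are left out for brevity.

```python
from typing import Optional, Sequence


def effective_tool_choice(
    tool_choice: Optional[str], tools: Optional[Sequence[dict]]
) -> str:
    """Resolve an omitted tool_choice the way the OpenAI API documents its default."""
    if tool_choice is not None:
        return tool_choice
    return "auto" if tools else "none"


def skip_tool_parser(tool_choice: Optional[str]) -> bool:
    """Narrowed guard from this review: only an explicit "none" skips parsing."""
    return tool_choice == "none"
```

Under this reading, a request that carries tools but leaves tool_choice unset should still reach the tool parser, which is why None is no longer part of the guard.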

Comment thread vllm/parser/abstract_parser.py
@sfeng33 sfeng33 enabled auto-merge (squash) June 3, 2026 18:44
@vllm-project vllm-project deleted a comment from mergify Bot Jun 3, 2026
@sfeng33 sfeng33 merged commit 2b237c7 into vllm-project:main Jun 3, 2026
36 of 37 checks passed
@hoobnn

hoobnn commented Jun 3, 2026

Contributor Author

Thanks @sfeng33 for the patient review!

@hoobnn hoobnn deleted the fix/issue-42747-tool-choice-none-streaming branch June 3, 2026 22:36
mvanhorn pushed a commit to mvanhorn/vllm that referenced this pull request Jun 4, 2026
…-project#42752)

Signed-off-by: hoobnn <111053672+hoobnn@users.noreply.github.com>
Signed-off-by: sfeng33 <4florafeng@gmail.com>
Co-authored-by: sfeng33 <4florafeng@gmail.com>
Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
JisoLya pushed a commit to JisoLya/vllm that referenced this pull request Jun 5, 2026
…-project#42752)

Signed-off-by: hoobnn <111053672+hoobnn@users.noreply.github.com>
Signed-off-by: sfeng33 <4florafeng@gmail.com>
Co-authored-by: sfeng33 <4florafeng@gmail.com>
Signed-off-by: JisoLya <523420504@qq.com>
knight0528 pushed a commit to knight0528/vllm that referenced this pull request Jun 8, 2026
…-project#42752)

Signed-off-by: hoobnn <111053672+hoobnn@users.noreply.github.com>
Signed-off-by: sfeng33 <4florafeng@gmail.com>
Co-authored-by: sfeng33 <4florafeng@gmail.com>
alexbi29 added a commit to alexbi29/vllm that referenced this pull request Jun 8, 2026
… API

Three adaptations required after upstream refactors:

1. _WrappedParser removed (vllm-project#44279): replaced with an inline subclass
   _Gemma4Parser(DelegatingParser) with reasoning_parser_cls and
   tool_parser_cls set as class attributes directly.

2. parse_delta() gained a required `finished` kwarg (vllm-project#44017): updated
   _run_streaming to pass finished=(last token), _run_single_delta to
   pass finished=True, and the multi-turn loop to pass finished=False.

3. tool_choice="none" short-circuit added (vllm-project#42752): parse_delta now
   returns raw content immediately when request.tool_choice is "none",
   which is the default when no tools are specified. Fixed _make_request
   to include a dummy tool so tool_choice stays "auto" and the parser
   exercises its actual tool-extraction logic.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
waqahmed-amd-fi pushed a commit to waqahmed-amd-fi/vllm that referenced this pull request Jun 10, 2026
…-project#42752)

Signed-off-by: hoobnn <111053672+hoobnn@users.noreply.github.com>
Signed-off-by: sfeng33 <4florafeng@gmail.com>
Co-authored-by: sfeng33 <4florafeng@gmail.com>
Signed-off-by: Waqar Ahmed <waqar.ahmed@amd.com>
alexbi29 added a commit to alexbi29/vllm that referenced this pull request Jun 12, 2026
… API

Three adaptations required after upstream refactors:

1. _WrappedParser removed (vllm-project#44279): replaced with an inline subclass
   _Gemma4Parser(DelegatingParser) with reasoning_parser_cls and
   tool_parser_cls set as class attributes directly.

2. parse_delta() gained a required `finished` kwarg (vllm-project#44017): updated
   _run_streaming to pass finished=(last token), _run_single_delta to
   pass finished=True, and the multi-turn loop to pass finished=False.

3. tool_choice="none" short-circuit added (vllm-project#42752): parse_delta now
   returns raw content immediately when request.tool_choice is "none",
   which is the default when no tools are specified. Fixed _make_request
   to include a dummy tool so tool_choice stays "auto" and the parser
   exercises its actual tool-extraction logic.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit c37293e)

Labels

bug, ready, tool-calling

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Bug]: Chat Completions streaming invokes tool parser despite tool_choice="none"

3 participants