[Bugfix] Honor tool_choice=None / "none" in Chat Completions streaming #44102
Fixes vllm-project#42747 alongside the existing vllm-project#42752 attempt.

Streaming Chat Completions with `tool_choice="none"` -- or omitted on a no-tools request, where `request.tool_choice` ends up as `None` -- could still produce `delta.tool_calls` and finish with `finish_reason="tool_calls"` whenever the server was launched with a `--tool-call-parser` and the model output happened to match that parser's tool-call format.

`DelegatingParser.parse_delta` in `vllm/parser/abstract_parser.py` invokes `_extract_tool_calls_streaming` unconditionally once the stream enters the tool-call phase, without inspecting `request.tool_choice`. The non-streaming path at `vllm/entrypoints/openai/chat_completion/serving.py` already short-circuits both cases:

    elif not request.tool_choice or request.tool_choice == "none":
        message = ChatMessage(role=role, reasoning=reasoning, content=content)

The streaming path was missing the equivalent guard.

Fix
---
In `DelegatingParser.parse_delta`, when `not request.tool_choice or request.tool_choice == "none"`, skip `_extract_tool_calls_streaming` and surface any remaining (post-reasoning) text as plain `content`. Because the tool parser is never invoked, `state.function_name_returned` stays untouched and the downstream `tools_streamed[i]` flag stays `False`, so `finish_reason` naturally falls back to `"stop"`. Reasoning extraction is untouched.

Difference from vllm-project#42752
----------------------
vllm-project#42752 (open, currently CONFLICTING with main) only guards on `request.tool_choice == "none"`. The pending review feedback on that PR (gemini-code-assist, 2026-05-15) explicitly asks to broaden the check to also cover the `request.tool_choice is None` case (no-tools request without an explicit tool_choice). This PR implements that broader guard so the streaming behavior matches the non-streaming `not request.tool_choice or request.tool_choice == "none"` semantics exactly.

This logic has been independently validated downstream in vllm-project/vllm-ascend#9776 against vllm v0.20.2.

Tests
-----
Added in tests/entrypoints/openai/test_tool_choice_content_none.py:

- test_parse_delta_with_tool_choice_none_skips_tool_parser -- explicit tool_choice="none": parser is not invoked, raw delta text surfaces as content.
- test_parse_delta_with_omitted_tool_choice_skips_tool_parser -- omitted tool_choice on a no-tools request (request.tool_choice is None): parser is not invoked, raw delta text surfaces as content. This is the additional case beyond vllm-project#42752.
- test_parse_delta_without_tool_choice_none_still_runs_tool_parser -- sanity: tool_choice="auto" still hits the tool parser (no regression).
- test_parse_delta_tool_choice_none_multiple_chunks_remain_content -- multi-chunk streaming stays in content mode across deltas.

Note: a local pytest run was not possible in the contributing environment (macos-aarch64, no torch available). The change mirrors the already-reviewed approach of vllm-project#42752, and CI on this PR will exercise the new tests.

Signed-off-by: liuchenbing <chenliumail@163.com>
Thanks @FutureSkyFly for picking this up and for independently validating the broader guard (and the cross-check in vllm-project/vllm-ascend#9776) 🙏 I've folded the broader guard into the original PR #42752. Since #42752 now covers the same fix, I think this one can be closed as duplicate, but very happy to defer if you'd prefer to drive it. Either way, appreciate the push to broaden the guard.
Thanks @hoobnn for picking it up and folding the broader guard into #42752; appreciate the quick turnaround. Closing this as duplicate; #42752 now covers the same scope.
Closing as duplicate of #42752 (now updated with the broader guard).
Summary
Fixes #42747.
Streaming Chat Completions with `tool_choice="none"` (or with `tool_choice` omitted on a no-tools request, where `request.tool_choice` ends up as `None`) could still produce `delta.tool_calls` and finish with `finish_reason="tool_calls"` whenever the server was launched with a `--tool-call-parser` and the model output happened to match that parser's tool-call format. Non-streaming Chat Completions already handles both cases correctly.
Root cause
`DelegatingParser.parse_delta` in `vllm/parser/abstract_parser.py` invokes `_extract_tool_calls_streaming` unconditionally once the stream enters the tool-call phase, without inspecting `request.tool_choice`. The non-streaming path at `vllm/entrypoints/openai/chat_completion/serving.py` already short-circuits both cases with `elif not request.tool_choice or request.tool_choice == "none":`, building a plain `ChatMessage` that carries only role, reasoning and content. The streaming path was missing the equivalent guard.
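To make "both cases" concrete, here is a minimal sketch of the non-streaming condition as a standalone predicate. The helper name `tool_calls_disabled` and the `get_weather` function name are hypothetical and used only for illustration; the PR inlines the condition rather than naming it.

```python
from typing import Optional, Union


def tool_calls_disabled(tool_choice: Optional[Union[str, dict]]) -> bool:
    """Hypothetical helper mirroring the non-streaming condition.

    True means the response should carry plain content and no tool calls.
    """
    return not tool_choice or tool_choice == "none"


# A named-function choice in the OpenAI request shape; "get_weather" is made up.
named_choice = {"type": "function", "function": {"name": "get_weather"}}

for choice in (None, "none", "auto", "required", named_choice):
    print(repr(choice), "->", tool_calls_disabled(choice))
```

Only `None` and `"none"` evaluate to `True`; `"auto"`, `"required"` and a named-function choice all leave tool calling enabled.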
Fix
In `DelegatingParser.parse_delta`, when `not request.tool_choice or request.tool_choice == "none"`, skip `_extract_tool_calls_streaming` and surface any remaining (post-reasoning) text as plain `content`. Because the tool parser is never invoked, `state.function_name_returned` stays untouched and the downstream `tools_streamed[i]` flag stays `False`, so `finish_reason` naturally falls back to `"stop"`. Reasoning extraction is untouched.
Difference from #42752
#42752 was the original attempt at this fix. At the time this PR was opened it was OPEN, `CONFLICTING` with main, last updated 2026-05-23, and only guarded on `request.tool_choice == "none"`. The review feedback on that PR (gemini-code-assist, 2026-05-15) explicitly asked to broaden the check to also cover `request.tool_choice is None`, the no-tools request case raised by @QwertyJack in the comment thread under #42747.
This PR implements that broader guard so streaming matches the non-streaming `not request.tool_choice or request.tool_choice == "none"` semantics exactly. Acknowledging @hoobnn for the original direction; happy to close this if #42752 is updated with the same broader guard.
This logic has been independently validated downstream in vllm-project/vllm-ascend#9776 against vLLM v0.20.2.
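The difference between the two guard scopes can be seen in a toy model of the streaming dispatch. This is a sketch under stated assumptions, not vLLM code: `SketchRequest`, `SketchDelta`, `StubToolParser`, `parse_delta_sketch` and `finish_reason_sketch` are all hypothetical stand-ins, and the `finish_reason` rule is a simplification of the `tools_streamed[i]` bookkeeping described above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SketchRequest:
    """Hypothetical stand-in for the chat completion request."""
    tool_choice: Optional[str] = None


@dataclass
class SketchDelta:
    """Hypothetical stand-in for a streamed delta message."""
    content: Optional[str] = None
    tool_calls: List[dict] = field(default_factory=list)


class StubToolParser:
    """Toy parser: treats every delta as a tool call and counts invocations."""

    def __init__(self) -> None:
        self.calls = 0

    def extract_tool_calls_streaming(self, delta_text: str) -> SketchDelta:
        self.calls += 1
        return SketchDelta(tool_calls=[{"arguments": delta_text}])


def parse_delta_sketch(request, delta_text, parser, broad_guard=True):
    """Skip the tool parser when the guard says tool calls are disabled."""
    if broad_guard:
        # Broader guard: covers both None and "none".
        skip = not request.tool_choice or request.tool_choice == "none"
    else:
        # Narrow guard: only the explicit "none" string.
        skip = request.tool_choice == "none"
    if skip:
        return SketchDelta(content=delta_text)
    return parser.extract_tool_calls_streaming(delta_text)


def finish_reason_sketch(deltas):
    """Simplified rule: "tool_calls" only if some delta streamed a tool call."""
    return "tool_calls" if any(d.tool_calls for d in deltas) else "stop"


def run(tool_choice, broad_guard):
    """Stream two chunks and return (parser invocations, finish reason)."""
    parser = StubToolParser()
    request = SketchRequest(tool_choice=tool_choice)
    chunks = ['{"name": "get_weather", ', '"arguments": {}}']
    deltas = [parse_delta_sketch(request, c, parser, broad_guard) for c in chunks]
    return parser.calls, finish_reason_sketch(deltas)


print(run(None, broad_guard=False))  # (2, 'tool_calls')
print(run(None, broad_guard=True))   # (0, 'stop')
```

With `tool_choice` omitted, the narrow guard still reaches the parser and the stream ends with `"tool_calls"`, while the broader guard keeps everything as content and ends with `"stop"`.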
Duplicate-PR check (per AGENTS.md)
- `chat_completion/serving.py`, not `abstract_parser.py`; touches a different surface area.
- This PR is materially different from #42752 in the guard scope.
Test plan
Added in `tests/entrypoints/openai/test_tool_choice_content_none.py`:
- `test_parse_delta_with_tool_choice_none_skips_tool_parser` -- explicit `tool_choice="none"`: parser is not invoked, raw delta text surfaces as `DeltaMessage.content`.
- `test_parse_delta_with_omitted_tool_choice_skips_tool_parser` -- omitted `tool_choice` on a no-tools request (`request.tool_choice is None`): parser is not invoked, raw delta text surfaces as content. This is the additional case beyond #42752.
- `test_parse_delta_without_tool_choice_none_still_runs_tool_parser` -- sanity: `tool_choice="auto"` still hits the tool parser (no regression).
- `test_parse_delta_tool_choice_none_multiple_chunks_remain_content` -- multi-chunk streaming stays in content mode across deltas.
A local `pytest` run was not possible in the contributing environment (macOS aarch64, no `torch` available). The new tests rely only on `vllm.entrypoints.openai.chat_completion.protocol`, `vllm.parser.abstract_parser` and a stub parser; they will run on CI on this PR.
AI assistance was used (Claude) to draft the patch and the test stub, mirroring the already-reviewed approach in #42752.
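The stub-parser style of test listed above might look roughly like the following. This is a sketch that runs without vLLM or `torch`, not the PR's test file: `guarded_parse_delta` is a hypothetical stand-in for the guarded branch of `parse_delta`, and `extract` is a made-up method name on a `MagicMock` parser.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock


def guarded_parse_delta(request, delta_text, tool_parser):
    """Hypothetical stand-in for the guarded branch of parse_delta."""
    if not request.tool_choice or request.tool_choice == "none":
        return SimpleNamespace(content=delta_text, tool_calls=[])
    return tool_parser.extract(delta_text)


def run_chunks(tool_choice, chunks):
    """Feed chunks through the stand-in; return the deltas and the mock parser."""
    parser = MagicMock()
    request = SimpleNamespace(tool_choice=tool_choice)
    deltas = [guarded_parse_delta(request, c, parser) for c in chunks]
    return deltas, parser


def test_none_and_omitted_skip_parser_across_chunks():
    chunks = ['{"name": ', '"f"}']
    for tool_choice in ("none", None):
        deltas, parser = run_chunks(tool_choice, chunks)
        # The parser must never be reached, and every chunk stays plain content.
        parser.extract.assert_not_called()
        assert [d.content for d in deltas] == chunks


def test_auto_still_runs_parser():
    _, parser = run_chunks("auto", ["<tool_call>"])
    parser.extract.assert_called_once_with("<tool_call>")
```

Using a mock rather than a real parser keeps the assertion on the dispatch decision itself, which is the behaviour the guard changes.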
Closing note (2026-05-31): @hoobnn has folded the broader guard into #42752 and rebased it onto current `main` with the same `request.tool_choice is None` regression coverage. Closing this PR as duplicate of #42752 per its now-equivalent scope. Thanks @hoobnn for picking it up.