Your current environment
The output of python collect_env.py
OS : Ubuntu 22.04.5 LTS (x86_64)
PyTorch version : 2.11.0+cu130
CUDA used to build PyTorch : 13.0
Python version : 3.12.13
Is CUDA available : True
CUDA runtime version : 13.0.88
GPU 0-3: NVIDIA RTX PRO 6000 Blackwell Server Edition
vLLM Version : 0.21.0
transformers : 5.8.1
🐛 Describe the bug
With --reasoning-parser gemma4 enabled on vLLM v0.21.0, the Chat Completions API correctly surfaces model reasoning in a reasoning field alongside tool calls. However, the Responses API (/v1/responses) does not surface reasoning output in any form — reasoning_tokens is always 0 and no ResponseReasoningItem output item appears — even when reasoning: {"effort": "high"} is passed in the request.
This is specific to the gemma4 reasoning parser: Qwen3 with the deepseek_r1 parser works correctly on the same image.
Server start command:
docker run -d \
  --name vllm-gemma4 \
  --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -e VLLM_ENABLE_RESPONSES_API_STORE=1 \
  vllm/vllm-openai:v0.21.0 \
  --model google/gemma-4-26B-A4B-it \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 4 \
  --max-num-batched-tokens 14336 \
  --tool-call-parser functiongemma \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4
Reproduction script:
import requests

BASE = "http://localhost:8000/v1"
MODEL = "google/gemma-4-26B-A4B-it"

# Tool definition in the flat format used by the Responses API
tools = [
    {
        "type": "function",
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }
]

# Test 1: Chat Completions (reasoning WORKS)
print("=== Chat Completions (works) ===")
r1 = requests.post(f"{BASE}/chat/completions", json={
    "model": MODEL,
    "messages": [{"role": "user", "content": "What is the weather in NYC?"}],
    # Chat Completions nests the tool definition under "function"
    "tools": [{"type": "function", "function": {"name": "get_weather", "description": "Get weather", "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}}}],
    "tool_choice": "required",
    "max_tokens": 1024,
    "chat_template_kwargs": {"enable_thinking": True},
})
d1 = r1.json()
msg = d1["choices"][0]["message"]
# Guard against an explicit null so slicing cannot raise a TypeError
print(f" reasoning: {(msg.get('reasoning') or '')[:200]}")
print(f" tool_calls: {msg.get('tool_calls')}")
print(f" finish_reason: {d1['choices'][0]['finish_reason']}")

# Test 2: Responses API (reasoning NOT surfaced)
print("\n=== Responses API (broken) ===")
r2 = requests.post(f"{BASE}/responses", json={
    "model": MODEL,
    "input": [{"role": "user", "content": "What is the weather in NYC?"}],
    "tools": tools,
    "tool_choice": "required",
    "reasoning": {"effort": "high"},
})
d2 = r2.json()
usage = d2.get("usage", {})
output_details = usage.get("output_tokens_details", {})
print(f" reasoning_tokens: {output_details.get('reasoning_tokens', 0)}")
print(f" output types: {[item.get('type') for item in d2.get('output', [])]}")
print(f" top-level reasoning field: {d2.get('reasoning')}")

# Test 3: Responses API text-only (still no reasoning)
print("\n=== Responses API text-only (also broken) ===")
r3 = requests.post(f"{BASE}/responses", json={
    "model": MODEL,
    "input": [{"role": "user", "content": "What is 2+2? Think step by step."}],
    "reasoning": {"effort": "high"},
    "max_output_tokens": 500,
})
d3 = r3.json()
usage3 = d3.get("usage", {})
print(f" reasoning_tokens: {usage3.get('output_tokens_details', {}).get('reasoning_tokens', 0)}")
print(f" output types: {[item.get('type') for item in d3.get('output', [])]}")
Expected behavior:
The Responses API should surface reasoning output when --reasoning-parser gemma4 is active and reasoning: {"effort": "high"} is passed. Expected:
- A "reasoning" output item (ResponseReasoningItem) containing the model's thinking
- reasoning_tokens > 0 in usage details
The Chat Completions API demonstrates this works at the model/parser level — the Responses API just doesn't wire it through to output items.
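To make the two expectations above mechanically checkable, here is a minimal sketch of a helper that inspects a parsed Responses API payload. The function name `check_reasoning_surfaced` is hypothetical, and the two sample payloads are hand-written for illustration; they are not captured server output.

```python
def check_reasoning_surfaced(response: dict) -> list[str]:
    """Return a list of problems; an empty list means reasoning was surfaced."""
    problems = []

    # An output item with type "reasoning" corresponds to ResponseReasoningItem.
    output_types = [item.get("type") for item in response.get("output") or []]
    if "reasoning" not in output_types:
        problems.append(f"no 'reasoning' output item (got {output_types})")

    # usage.output_tokens_details.reasoning_tokens should be positive.
    usage = response.get("usage") or {}
    details = usage.get("output_tokens_details") or {}
    reasoning_tokens = details.get("reasoning_tokens") or 0
    if reasoning_tokens <= 0:
        problems.append(f"reasoning_tokens is {reasoning_tokens}")

    return problems


# Hand-written payloads for illustration only (not real server output).
broken = {
    "output": [{"type": "function_call"}],
    "usage": {"output_tokens_details": {"reasoning_tokens": 0}},
}
expected = {
    "output": [{"type": "reasoning"}, {"type": "function_call"}],
    "usage": {"output_tokens_details": {"reasoning_tokens": 42}},
}

print(check_reasoning_surfaced(broken))
print(check_reasoning_surfaced(expected))
```

Passing `d2` or `d3` from the reproduction script to this helper should return an empty list once the bug is fixed.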
Actual behavior:
=== Chat Completions (works) ===
reasoning: The user is asking about the weather in NYC. I should look at the available tools... The `get_weather` tool seems appropriate for this task.
tool_calls: [{'id': 'chatcmpl-tool-...', 'type': 'function', 'function': {'name': 'get_weather', 'arguments': '{"city": "NYC"}'}}]
finish_reason: tool_calls
=== Responses API (broken) ===
reasoning_tokens: 0
output types: ['function_call']
top-level reasoning field: None
=== Responses API text-only (also broken) ===
reasoning_tokens: 0
output types: ['message']
Reproduction Evidence
Confirmed on vllm/vllm-openai:latest (v0.21.0) with google/gemma-4-26B-A4B-it, TP=4, --reasoning-parser gemma4, --tool-call-parser functiongemma:
=== Chat Completions with reasoning ===
reasoning present: True
reasoning preview: The user wants to know the product of 15 × 37...
finish_reason: stop
=== Chat Completions with tools + reasoning ===
reasoning present: True
reasoning preview: The user is asking about the weather in NYC. I should look for a tool...
tool_calls: [{"function": {"name": "get_weather", "arguments": "{\"city\": \"NYC\"}"}}]
finish_reason: tool_calls
=== Responses API with reasoning={'effort': 'high'} ===
reasoning_tokens: 0
output types: ['message']
has ResponseReasoningItem: False
=== Responses API with tools + reasoning ===
reasoning_tokens: 0
output types: ['function_call']
has ResponseReasoningItem: False
Note: This bug is parser-specific. Tested with Qwen3-1.7B + --reasoning-parser deepseek_r1 on the same image — the Responses API correctly returns ResponseReasoningItem with reasoning_tokens: 1023. The issue is specific to the gemma4 reasoning parser integration with the Responses API.
Analysis
The root cause appears to be in the gemma4 reasoning parser's interaction with the Responses API serving path:
- Chat Completions path works: The gemma4 reasoning parser correctly extracts <think>...</think> blocks and surfaces them in the reasoning field of the chat completion response.
- Responses API non-harmony path (_make_response_output_items): Delegates to parser.extract_response_outputs() at serving.py:1044. The gemma4 parser's implementation of this method appears not to construct ResponseReasoningItem objects from the parsed reasoning content; it may be stripping the reasoning and returning only the final content/tool calls.
- Contrast with working parsers: The deepseek_r1 parser correctly produces ResponseReasoningItem in its extract_response_outputs() implementation, which is why Qwen3 works fine.
- reasoning_tokens counting also fails: The fallback at serving.py:866 that tries to count reasoning tokens from accumulated token IDs doesn't trigger for the gemma4 parser context.
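The behavior this analysis expects from the Responses path can be sketched in plain Python. This is an illustration under stated assumptions, not vLLM's implementation: the names `split_reasoning`, `build_output_items`, and `count_reasoning_tokens` are hypothetical, the `<think>`/`</think>` delimiters are taken from the description above, the dict shapes only approximate Responses API output items, and the token IDs in the usage note are made up.

```python
from __future__ import annotations


def split_reasoning(text: str, start: str = "<think>", end: str = "</think>") -> tuple[str | None, str]:
    """Split raw model text into (reasoning, final_content)."""
    start_idx = text.find(start)
    if start_idx == -1:
        return None, text
    body_start = start_idx + len(start)
    end_idx = text.find(end, body_start)
    if end_idx == -1:
        # Unterminated block: everything after the start marker is reasoning.
        return text[body_start:].strip(), text[:start_idx].strip()
    reasoning = text[body_start:end_idx].strip()
    content = (text[:start_idx] + text[end_idx + len(end):]).strip()
    return reasoning, content


def build_output_items(text: str) -> list[dict]:
    """Emit a reasoning item (if any) before the final message item."""
    reasoning, content = split_reasoning(text)
    items: list[dict] = []
    if reasoning:
        # Assumed shape, loosely modeled on a Responses API reasoning item.
        items.append({
            "type": "reasoning",
            "summary": [],
            "content": [{"type": "reasoning_text", "text": reasoning}],
        })
    if content:
        items.append({
            "type": "message",
            "role": "assistant",
            "content": [{"type": "output_text", "text": content}],
        })
    return items


def count_reasoning_tokens(token_ids: list[int], start_id: int, end_id: int) -> int:
    """Count tokens strictly between the first start_id and the next end_id."""
    count = 0
    inside = False
    for tok in token_ids:
        if not inside:
            inside = tok == start_id
        elif tok == end_id:
            break
        else:
            count += 1
    return count
```

For example, `build_output_items("<think>Need the weather tool.</think>Calling get_weather.")` yields a reasoning item followed by a message item, and `count_reasoning_tokens([5, 100, 7, 8, 9, 101, 3], start_id=100, end_id=101)` counts the three tokens between the two markers. The observed gemma4 output contains neither a reasoning item nor a non-zero count.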
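Related Issues / PRs
- … (chat_template_kwargs + thinking_token_budget passthrough; partial fix for a different layer)
- #33915: "Support include_reasoning request parameter for non-harmony models"
This issue is distinct: it concerns the gemma4 reasoning parser not producing ResponseReasoningItem in its extract_response_outputs() path, while other parsers (like deepseek_r1) work correctly.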