[Bug]: Responses API does not surface reasoning output with --reasoning-parser gemma4 (works with deepseek_r1) #43395

@ashwing

Your current environment

The output of python collect_env.py
OS                           : Ubuntu 22.04.5 LTS (x86_64)
PyTorch version              : 2.11.0+cu130
CUDA used to build PyTorch   : 13.0
Python version               : 3.12.13
Is CUDA available            : True
CUDA runtime version         : 13.0.88
GPU 0-3: NVIDIA RTX PRO 6000 Blackwell Server Edition
vLLM Version                 : 0.21.0
transformers                 : 5.8.1

🐛 Describe the bug

With --reasoning-parser gemma4 enabled on vLLM v0.21.0, the Chat Completions API correctly surfaces model reasoning in a reasoning field alongside tool calls. However, the Responses API (/v1/responses) does not surface reasoning output in any form: reasoning_tokens is always 0 and no ResponseReasoningItem output item appears, even when reasoning: {"effort": "high"} is passed in the request.

This is specific to the gemma4 reasoning parser: Qwen3 with --reasoning-parser deepseek_r1 works correctly on the same image.

Server start command:

docker run -d \
  --name vllm-gemma4 \
  --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -e VLLM_ENABLE_RESPONSES_API_STORE=1 \
  vllm/vllm-openai:v0.21.0 \
  --model google/gemma-4-26B-A4B-it \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 4 \
  --max-num-batched-tokens 14336 \
  --tool-call-parser functiongemma \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4

Reproduction script:

import requests

BASE = "http://localhost:8000/v1"
MODEL = "google/gemma-4-26B-A4B-it"

tools = [
    {
        "type": "function",
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }
]

# Test 1: Chat Completions — reasoning WORKS
print("=== Chat Completions (works) ===")
r1 = requests.post(f"{BASE}/chat/completions", json={
    "model": MODEL,
    "messages": [{"role": "user", "content": "What is the weather in NYC?"}],
    "tools": [{"type": "function", "function": {"name": "get_weather", "description": "Get weather", "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}}}],
    "tool_choice": "required",
    "max_tokens": 1024,
    "chat_template_kwargs": {"enable_thinking": True},
})
d1 = r1.json()
msg = d1["choices"][0]["message"]
# `reasoning` may be present but null, so fall back to "" before slicing
print(f"  reasoning: {(msg.get('reasoning') or '')[:200]}")
print(f"  tool_calls: {msg.get('tool_calls')}")
print(f"  finish_reason: {d1['choices'][0]['finish_reason']}")

# Test 2: Responses API — reasoning NOT surfaced
print("\n=== Responses API (broken) ===")
r2 = requests.post(f"{BASE}/responses", json={
    "model": MODEL,
    "input": [{"role": "user", "content": "What is the weather in NYC?"}],
    "tools": tools,
    "tool_choice": "required",
    "reasoning": {"effort": "high"},
})
d2 = r2.json()
usage = d2.get("usage", {})
output_details = usage.get("output_tokens_details", {})
print(f"  reasoning_tokens: {output_details.get('reasoning_tokens', 0)}")
print(f"  output types: {[item.get('type') for item in d2.get('output', [])]}")
print(f"  top-level reasoning field: {d2.get('reasoning')}")

# Test 3: Responses API text-only — still no reasoning
print("\n=== Responses API text-only (also broken) ===")
r3 = requests.post(f"{BASE}/responses", json={
    "model": MODEL,
    "input": [{"role": "user", "content": "What is 2+2? Think step by step."}],
    "reasoning": {"effort": "high"},
    "max_output_tokens": 500,
})
d3 = r3.json()
usage3 = d3.get("usage", {})
print(f"  reasoning_tokens: {usage3.get('output_tokens_details', {}).get('reasoning_tokens', 0)}")
print(f"  output types: {[item.get('type') for item in d3.get('output', [])]}")

Expected behavior:

The Responses API should surface reasoning output when --reasoning-parser gemma4 is active and reasoning: {"effort": "high"} is passed. Expected:

  • A "reasoning" output item (ResponseReasoningItem) containing the model's thinking
  • reasoning_tokens > 0 in usage details

The Chat Completions API demonstrates this works at the model/parser level — the Responses API just doesn't wire it through to output items.
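
The expected behavior above can be turned into a small pass/fail check. The sketch below assumes only the response shape the reproduction script already reads (an `output` list of typed items and `usage.output_tokens_details.reasoning_tokens`); the helper names `summarize_reasoning` and `reasoning_surfaced` are hypothetical and not part of vLLM or any client library.

```python
from typing import Any


def summarize_reasoning(response: dict[str, Any]) -> dict[str, Any]:
    """Summarize whether a /v1/responses payload surfaced any reasoning.

    Hypothetical helper: it only inspects the fields that the
    reproduction script in this issue already reads.
    """
    output = response.get("output") or []
    usage = response.get("usage") or {}
    details = usage.get("output_tokens_details") or {}
    types = [item.get("type") for item in output]
    return {
        "output_types": types,
        "has_reasoning_item": "reasoning" in types,
        "reasoning_tokens": details.get("reasoning_tokens") or 0,
    }


def reasoning_surfaced(response: dict[str, Any]) -> bool:
    """True only if both expectations hold: a reasoning item and tokens > 0."""
    summary = summarize_reasoning(response)
    return summary["has_reasoning_item"] and summary["reasoning_tokens"] > 0
```

Calling `reasoning_surfaced(r2.json())` on the Test 2 response would give a single boolean to compare across parsers, which may be easier to drop into a regression test than the printed output.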

Actual behavior:

=== Chat Completions (works) ===
  reasoning: The user is asking about the weather in NYC. I should look at the available tools... The `get_weather` tool seems appropriate for this task.
  tool_calls: [{'id': 'chatcmpl-tool-...', 'type': 'function', 'function': {'name': 'get_weather', 'arguments': '{"city": "NYC"}'}}]
  finish_reason: tool_calls

=== Responses API (broken) ===
  reasoning_tokens: 0
  output types: ['function_call']
  top-level reasoning field: None

=== Responses API text-only (also broken) ===
  reasoning_tokens: 0
  output types: ['message']

Reproduction Evidence

Confirmed on vllm/vllm-openai:latest (v0.21.0) with google/gemma-4-26B-A4B-it, TP=4, --reasoning-parser gemma4, --tool-call-parser functiongemma:

=== Chat Completions with reasoning ===
  reasoning present: True
  reasoning preview: The user wants to know the product of 15 × 37...
  finish_reason: stop

=== Chat Completions with tools + reasoning ===
  reasoning present: True
  reasoning preview: The user is asking about the weather in NYC. I should look for a tool...
  tool_calls: [{"function": {"name": "get_weather", "arguments": "{\"city\": \"NYC\"}"}}]
  finish_reason: tool_calls

=== Responses API with reasoning={'effort': 'high'} ===
  reasoning_tokens: 0
  output types: ['message']
  has ResponseReasoningItem: False

=== Responses API with tools + reasoning ===
  reasoning_tokens: 0
  output types: ['function_call']
  has ResponseReasoningItem: False

Note: This bug is parser-specific. With Qwen3-1.7B and --reasoning-parser deepseek_r1 on the same image, the Responses API correctly returns a ResponseReasoningItem with reasoning_tokens: 1023, so the problem lies in how the gemma4 reasoning parser integrates with the Responses API.

Analysis

The root cause is in the gemma4 reasoning parser's interaction with the Responses API serving path:

  1. Chat Completions path works: The gemma4 reasoning parser correctly extracts <think>...</think> blocks and surfaces them in the reasoning field of the chat completion response.

  2. Responses API non-harmony path (_make_response_output_items): Delegates to parser.extract_response_outputs() at serving.py:1044. The gemma4 parser's implementation of this method appears to not construct ResponseReasoningItem objects from the parsed reasoning content — it may be stripping the reasoning and only returning the final content/tool calls.

  3. Contrast with working parsers: The deepseek_r1 parser correctly produces ResponseReasoningItem in its extract_response_outputs() implementation, which is why Qwen3 works fine.

  4. reasoning_tokens counting also fails: The fallback at serving.py:866 that tries to count reasoning tokens from accumulated token IDs doesn't trigger for the gemma4 parser context.
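
To make the expectation in points 2 and 4 concrete, here is a minimal sketch of the behavior being asked for. It is not vLLM's code: `split_reasoning`, `build_output_items`, and `count_reasoning_tokens` are hypothetical names, the item dicts are simplified stand-ins for the real response types, and the `<think>`/`</think>` delimiters are taken from the description above (the markers Gemma 4 actually emits may differ, so they are parameters).

```python
from typing import Any, Optional


def split_reasoning(
    text: str, start: str = "<think>", end: str = "</think>"
) -> tuple[Optional[str], str]:
    """Split raw model text into (reasoning, final_content).

    Delimiters are assumptions taken from the issue text.
    """
    head, found_start, rest = text.partition(start)
    if not found_start:
        return None, text
    reasoning, found_end, tail = rest.partition(end)
    if not found_end:
        # Unterminated block: treat everything after `start` as reasoning.
        return reasoning.strip() or None, head.strip()
    return reasoning.strip() or None, (head + tail).strip()


def build_output_items(text: str) -> list[dict[str, Any]]:
    """Emit a reasoning item before the message item when reasoning exists.

    The dict shapes are illustrative, not the exact Responses API schema.
    """
    reasoning, content = split_reasoning(text)
    items: list[dict[str, Any]] = []
    if reasoning:
        items.append({
            "type": "reasoning",
            "content": [{"type": "reasoning_text", "text": reasoning}],
        })
    if content:
        items.append({
            "type": "message",
            "role": "assistant",
            "content": [{"type": "output_text", "text": content}],
        })
    return items


def count_reasoning_tokens(token_ids: list[int], start_id: int, end_id: int) -> int:
    """Count tokens strictly between start/end marker token IDs.

    A stand-in for the fallback counting described in point 4;
    the marker IDs are hypothetical inputs.
    """
    count = 0
    inside = False
    for tok in token_ids:
        if tok == start_id:
            inside = True
        elif tok == end_id:
            inside = False
        elif inside:
            count += 1
    return count
```

With this shape, a completion that contains a thinking block would yield output types ['reasoning', 'message'] and a nonzero reasoning token count, which matches what the deepseek_r1 path already returns for Qwen3.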

Related Issues / PRs

This issue is distinct: it's about the gemma4 reasoning parser not producing ResponseReasoningItem in its extract_response_outputs() path, while other parsers (like deepseek_r1) work correctly.

