
[Feature][Response API] Add streaming support for non-harmony#23741

Merged

DarkLight1337 merged 3 commits into vllm-project:main from kebe7jun:feature/responses-api-streaming on Sep 4, 2025

Conversation

@kebe7jun
Contributor

@kebe7jun kebe7jun commented Aug 27, 2025

Purpose

Add streaming support for non-harmony models in the Responses API.

Related issue #23225

Test Plan

Unit tests and self-tests (see results below).
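For reference, the self-test outputs below can be reproduced with a short client script against a running vLLM server. This is a minimal sketch, not the PR's actual test code; the base URL, model name, and prompt are assumptions:

```python
import os


def stream_response(base_url: str = "http://localhost:8000/v1",
                    prompt: str = "Say 'double bubble bath' ten times fast."):
    """Stream a Responses API request and print each event.

    Requires a running vLLM OpenAI-compatible server and the `openai` package.
    """
    from openai import OpenAI  # pip install openai

    client = OpenAI(base_url=base_url, api_key="EMPTY")
    # stream=True yields typed events (response.created, *.delta, *.done, ...)
    stream = client.responses.create(model="model", input=prompt, stream=True)
    for event in stream:
        print(event)


# Only contact a server when one is explicitly configured
if os.environ.get("VLLM_BASE_URL"):
    stream_response(os.environ["VLLM_BASE_URL"])
```

Running this against the server produces the event sequences shown in the test results.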

Test Result

GPT-OSS Stream output
ResponseCreatedEvent(response=Response(id='resp_3bc9f13acb90485daa3d1694ac9ea14c', created_at=1756274867.0, error=None, incomplete_details=None, instructions=None, metadata=None, model='model', object='response', output=[], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, max_output_tokens=1000, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=None, safety_identifier=None, service_tier='auto', status='in_progress', text=None, top_logprobs=None, truncation='disabled', usage=None, user=None), sequence_number=0, type='response.created')
ResponseInProgressEvent(response=Response(id='resp_3bc9f13acb90485daa3d1694ac9ea14c', created_at=1756274867.0, error=None, incomplete_details=None, instructions=None, metadata=None, model='model', object='response', output=[], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, max_output_tokens=1000, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=None, safety_identifier=None, service_tier='auto', status='in_progress', text=None, top_logprobs=None, truncation='disabled', usage=None, user=None), sequence_number=1, type='response.in_progress')
ResponseOutputItemAddedEvent(item=ResponseReasoningItem(id='', summary=[], type='reasoning', content=None, encrypted_content=None, status='in_progress'), output_index=0, sequence_number=2, type='response.output_item.added')
ResponseContentPartAddedEvent(content_index=0, item_id='', output_index=0, part=ResponseOutputText(annotations=[], text='', type='output_text', logprobs=[]), sequence_number=3, type='response.content_part.added')
ResponseReasoningTextDeltaEvent(content_index=0, delta='User', item_id='', output_index=0, sequence_number=4, type='response.reasoning_text.delta')
ResponseReasoningTextDeltaEvent(content_index=0, delta=' wants', item_id='', output_index=0, sequence_number=5, type='response.reasoning_text.delta')
ResponseReasoningTextDeltaEvent(content_index=0, delta=' us', item_id='', output_index=0, sequence_number=6, type='response.reasoning_text.delta')
...
ResponseReasoningTextDeltaEvent(content_index=0, delta=' but', item_id='', output_index=0, sequence_number=110, type='response.reasoning_text.delta')
ResponseReasoningTextDeltaEvent(content_index=0, delta=' okay', item_id='', output_index=0, sequence_number=111, type='response.reasoning_text.delta')
ResponseReasoningTextDeltaEvent(content_index=0, delta='.', item_id='', output_index=0, sequence_number=112, type='response.reasoning_text.delta')
ResponseReasoningTextDoneEvent(content_index=0, item_id='', output_index=1, sequence_number=113, text='User wants us to say \'double bubble bath\' ten times fast. We need to comply? It\'s a nonsensical request but presumably no policy violation. It\'s a benign language request. We can comply by repeating phrase 10 times quickly. Should we maybe output a line like "double bubble bath" repeated 10 times quickly. That\'s fine.\n\nNo policy conflicts. The phrase is not disallowed. So we comply.\n\nWe should produce "double bubble bath double bubble bath ... " repeated 10 times. be mindful it\'s too much but okay.', type='response.reasoning_text.done')
ResponseOutputItemDoneEvent(item=ResponseReasoningItem(id='', summary=[], type='reasoning', content=[Content(text='User wants us to say \'double bubble bath\' ten times fast. We need to comply? It\'s a nonsensical request but presumably no policy violation. It\'s a benign language request. We can comply by repeating phrase 10 times quickly. Should we maybe output a line like "double bubble bath" repeated 10 times quickly. That\'s fine.\n\nNo policy conflicts. The phrase is not disallowed. So we comply.\n\nWe should produce "double bubble bath double bubble bath ... " repeated 10 times. be mindful it\'s too much but okay.', type='reasoning_text')], encrypted_content=None, status='completed'), output_index=1, sequence_number=114, type='response.output_item.done')
ResponseOutputItemAddedEvent(item=ResponseOutputMessage(id='', content=[], role='assistant', status='in_progress', type='message'), output_index=1, sequence_number=115, type='response.output_item.added')
ResponseContentPartAddedEvent(content_index=0, item_id='', output_index=1, part=ResponseOutputText(annotations=[], text='', type='output_text', logprobs=[]), sequence_number=116, type='response.content_part.added')
ResponseTextDeltaEvent(content_index=0, delta='double', item_id='', logprobs=[], output_index=1, sequence_number=117, type='response.output_text.delta')
ResponseTextDeltaEvent(content_index=0, delta=' bubble', item_id='', logprobs=[], output_index=1, sequence_number=118, type='response.output_text.delta')
...
ResponseTextDeltaEvent(content_index=0, delta=' bubble', item_id='', logprobs=[], output_index=1, sequence_number=145, type='response.output_text.delta')
ResponseTextDeltaEvent(content_index=0, delta=' bath', item_id='', logprobs=[], output_index=1, sequence_number=146, type='response.output_text.delta')
ResponseTextDoneEvent(content_index=0, item_id='', logprobs=[], output_index=2, sequence_number=147, text='double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath', type='response.output_text.done')
ResponseContentPartDoneEvent(content_index=0, item_id='', output_index=2, part=ResponseOutputText(annotations=[], text='double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath', type='output_text', logprobs=None), sequence_number=148, type='response.content_part.done')
ResponseOutputItemDoneEvent(item=ResponseOutputMessage(id='', content=[ResponseOutputText(annotations=[], text='double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath double bubble bath', type='output_text', logprobs=None)], role='assistant', status='completed', type='message'), output_index=2, sequence_number=149, type='response.output_item.done')
ResponseCompletedEvent(response=Response(id='resp_3bc9f13acb90485daa3d1694ac9ea14c', created_at=1756274867.0, error=None, incomplete_details=None, instructions=None, metadata=None, model='model', object='response', output=[], parallel_tool_calls=True, temperature=1.0, tool_choice='auto', tools=[], top_p=1.0, background=False, max_output_tokens=1000, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=None, safety_identifier=None, service_tier='auto', status='completed', text=None, top_logprobs=None, truncation='disabled', usage=ResponseUsage(input_tokens=81, input_tokens_details=InputTokensDetails(cached_tokens=0), output_tokens=149, output_tokens_details=OutputTokensDetails(reasoning_tokens=0), total_tokens=230), user=None), sequence_number=150, type='response.completed')
Qwen3 30B A3B Stream output
ResponseCreatedEvent(response=Response(id='resp_a01680e6fda64355bdb4eccd95db366a', created_at=1756866839.0, error=None, incomplete_details=None, instructions=None, metadata=None, model='model', object='response', output=[], parallel_tool_calls=True, temperature=0.6, tool_choice='auto', tools=[], top_p=0.95, background=False, max_output_tokens=1000, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=None, safety_identifier=None, service_tier='auto', status='in_progress', text=None, top_logprobs=None, truncation='disabled', usage=None, user=None), sequence_number=0, type='response.created')
ResponseInProgressEvent(response=Response(id='resp_a01680e6fda64355bdb4eccd95db366a', created_at=1756866839.0, error=None, incomplete_details=None, instructions=None, metadata=None, model='model', object='response', output=[], parallel_tool_calls=True, temperature=0.6, tool_choice='auto', tools=[], top_p=0.95, background=False, max_output_tokens=1000, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=None, safety_identifier=None, service_tier='auto', status='in_progress', text=None, top_logprobs=None, truncation='disabled', usage=None, user=None), sequence_number=1, type='response.in_progress')
ResponseOutputItemAddedEvent(item=ResponseReasoningItem(id='', summary=[], type='reasoning', content=None, encrypted_content=None, status='in_progress'), output_index=0, sequence_number=2, type='response.output_item.added')
ResponseContentPartAddedEvent(content_index=0, item_id='', output_index=0, part=ResponseOutputText(annotations=[], text='', type='output_text', logprobs=[]), sequence_number=3, type='response.content_part.added')
ResponseReasoningTextDeltaEvent(content_index=1, delta='\n', item_id='', output_index=0, sequence_number=4, type='response.reasoning_text.delta')
ResponseReasoningTextDeltaEvent(content_index=2, delta='Okay', item_id='', output_index=0, sequence_number=5, type='response.reasoning_text.delta')
...
ResponseReasoningTextDeltaEvent(content_index=256, delta='.\n', item_id='', output_index=0, sequence_number=259, type='response.reasoning_text.delta')
ResponseReasoningTextDoneEvent(content_index=257, item_id='', output_index=0, sequence_number=260, text='\nOkay, the user wants me to say "double bubble bath" ten times fast. Let me start by repeating it as instructed. I need to make sure I do it quickly but still clearly. Let me count each repetition. First time: double bubble bath. Second: double bubble bath. Third... Wait, maybe I should check if there\'s a specific way to pronounce it. "Double" is pronounced like "dub-uhl", "bubble" is "buh-buhl", and "bath" is "bath". So putting it all together: "dub-uhl buh-buhl bath". I should make sure each word is distinct but the whole phrase flows smoothly. Let me try again, faster. Double bubble bath, double bubble bath... Hmm, maybe I can practice a few times to get the rhythm right. Also, the user might be testing my ability to follow instructions or maybe it\'s a fun exercise. I should keep it light and not overthink it. Just repeat it ten times as fast as possible without making mistakes. Let me count: 1, 2, 3... up to 10. Okay, that should do it. I\'ll make sure the response is clear and matches the user\'s request.\n', type='response.reasoning_text.done')
ResponseOutputItemDoneEvent(item=ResponseReasoningItem(id='', summary=[], type='reasoning', content=[Content(text='\nOkay, the user wants me to say "double bubble bath" ten times fast. Let me start by repeating it as instructed. I need to make sure I do it quickly but still clearly. Let me count each repetition. First time: double bubble bath. Second: double bubble bath. Third... Wait, maybe I should check if there\'s a specific way to pronounce it. "Double" is pronounced like "dub-uhl", "bubble" is "buh-buhl", and "bath" is "bath". So putting it all together: "dub-uhl buh-buhl bath". I should make sure each word is distinct but the whole phrase flows smoothly. Let me try again, faster. Double bubble bath, double bubble bath... Hmm, maybe I can practice a few times to get the rhythm right. Also, the user might be testing my ability to follow instructions or maybe it\'s a fun exercise. I should keep it light and not overthink it. Just repeat it ten times as fast as possible without making mistakes. Let me count: 1, 2, 3... up to 10. Okay, that should do it. I\'ll make sure the response is clear and matches the user\'s request.\n', type='reasoning_text')], encrypted_content=None, status='completed'), output_index=0, sequence_number=261, type='response.output_item.done')
ResponseOutputItemAddedEvent(item=ResponseOutputMessage(id='', content=[], role='assistant', status='in_progress', type='message'), output_index=0, sequence_number=262, type='response.output_item.added')
ResponseContentPartAddedEvent(content_index=0, item_id='', output_index=1, part=ResponseOutputText(annotations=[], text='', type='output_text', logprobs=[]), sequence_number=263, type='response.content_part.added')
ResponseTextDeltaEvent(content_index=1, delta='\n\n', item_id='', logprobs=[], output_index=1, sequence_number=264, type='response.output_text.delta')
ResponseTextDeltaEvent(content_index=2, delta='Double', item_id='', logprobs=[], output_index=1, sequence_number=265, type='response.output_text.delta')
...
ResponseTextDeltaEvent(content_index=42, delta='', item_id='', logprobs=[], output_index=1, sequence_number=305, type='response.output_text.delta')
ResponseTextDoneEvent(content_index=43, item_id='', logprobs=[], output_index=1, sequence_number=306, text='\n\nDouble bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath.', type='response.output_text.done')
ResponseContentPartDoneEvent(content_index=44, item_id='', output_index=1, part=ResponseOutputText(annotations=[], text='\n\nDouble bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath.', type='output_text', logprobs=None), sequence_number=307, type='response.content_part.done')
ResponseOutputItemDoneEvent(item=ResponseOutputMessage(id='', content=[ResponseOutputText(annotations=[], text='\n\nDouble bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath, double bubble bath.', type='output_text', logprobs=None)], role='assistant', status='completed', type='message', summary=[]), output_index=1, sequence_number=308, type='response.output_item.done')
ResponseCompletedEvent(response=Response(id='resp_a01680e6fda64355bdb4eccd95db366a', created_at=1756866839.0, error=None, incomplete_details=None, instructions=None, metadata=None, model='model', object='response', output=[], parallel_tool_calls=True, temperature=0.6, tool_choice='auto', tools=[], top_p=0.95, background=False, max_output_tokens=1000, max_tool_calls=None, previous_response_id=None, prompt=None, prompt_cache_key=None, reasoning=None, safety_identifier=None, service_tier='auto', status='completed', text=None, top_logprobs=None, truncation='disabled', usage=ResponseUsage(input_tokens=18, input_tokens_details=InputTokensDetails(cached_tokens=0), output_tokens=300, output_tokens_details=OutputTokensDetails(reasoning_tokens=0), total_tokens=318), user=None), sequence_number=309, type='response.completed')
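The event logs above follow the Responses API streaming protocol: each event carries a monotonically increasing `sequence_number`, and the `response.output_text.delta` payloads concatenate to the `text` of the final `response.output_text.done` event. A minimal sketch that checks these two invariants against mock events modeled on the logs (the `Event` dataclass and `check_stream` helper are hypothetical, not vLLM code):

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Simplified stand-in for an OpenAI Responses streaming event."""
    type: str
    sequence_number: int
    delta: str = ""
    text: str = ""


# Mock stream modeled on the GPT-OSS output above
events = [
    Event("response.output_text.delta", 0, delta="double"),
    Event("response.output_text.delta", 1, delta=" bubble"),
    Event("response.output_text.delta", 2, delta=" bath"),
    Event("response.output_text.done", 3, text="double bubble bath"),
]


def check_stream(events):
    """Assert sequence numbers are contiguous and deltas rebuild the final text."""
    accumulated = ""
    for expected_seq, event in enumerate(events):
        assert event.sequence_number == expected_seq, "gap in sequence_number"
        if event.type == "response.output_text.delta":
            accumulated += event.delta
        elif event.type == "response.output_text.done":
            assert event.text == accumulated, "deltas do not rebuild final text"
    return accumulated


print(check_stream(events))  # double bubble bath
```

The same accumulation logic applies to `response.reasoning_text.delta`/`.done` pairs.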

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@kebe7jun kebe7jun force-pushed the feature/responses-api-streaming branch from 8dc2da4 to b65638e Compare August 27, 2025 11:32
@mergify mergify Bot added the frontend label Aug 27, 2025
@kebe7jun kebe7jun force-pushed the feature/responses-api-streaming branch from b65638e to 3bb6902 Compare August 27, 2025 11:46
@mergify mergify Bot added the v1 label Aug 27, 2025
@kebe7jun kebe7jun marked this pull request as ready for review August 27, 2025 11:55
@kebe7jun kebe7jun requested a review from aarnphm as a code owner August 27, 2025 11:55
@kebe7jun kebe7jun force-pushed the feature/responses-api-streaming branch 2 times, most recently from 6d9fe9c to af25d9a Compare August 28, 2025 01:37
@kebe7jun
Contributor Author

@heheda12345 PTAL

Collaborator

@heheda12345 heheda12345 left a comment


Thanks for your contribution. Some small comments.

Comment thread vllm/entrypoints/context.py Outdated
Comment thread vllm/entrypoints/openai/serving_responses.py Outdated
Collaborator

Can you fix these indexes? Reference: #23382

Contributor Author

Fixed.

@kebe7jun kebe7jun force-pushed the feature/responses-api-streaming branch 3 times, most recently from 77bd0aa to bc4c5ae Compare September 3, 2025 03:18
Collaborator

Thanks for the quick update. Can you also update the "current_item_id"?

Contributor Author

Thank you for the reminder; my apologies for the oversight. Fixed.

@kebe7jun kebe7jun force-pushed the feature/responses-api-streaming branch from bc4c5ae to cf993d1 Compare September 3, 2025 07:51
Collaborator

@heheda12345 heheda12345 left a comment

LGTM! Thanks for your contribution.

@heheda12345 heheda12345 enabled auto-merge (squash) September 3, 2025 18:11
@github-actions github-actions Bot added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 3, 2025
@heheda12345
Collaborator

@kebe7jun The v1-test-entrypoints CI failure seems to be related to this PR. Can you take a look?

 v1/entrypoints/openai/responses/test_basic.py::test_streaming - TypeError: 'AsyncStream' object is not iterable
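This `TypeError` arises because the SDK's `AsyncStream` supports only asynchronous iteration: consuming it with a plain `for` loop fails, and `async for` is required. A toy reproduction using a bare async generator in place of the real stream (names hypothetical, not vLLM's actual test code):

```python
import asyncio


async def fake_stream():
    # Stand-in for an AsyncStream of response events
    for delta in ["double", " bubble", " bath"]:
        yield delta


async def consume():
    chunks = []
    async for delta in fake_stream():  # `async for`, not `for`
        chunks.append(delta)
    return "".join(chunks)


# A plain `for` over an async iterator raises TypeError, as in the CI failure
try:
    for _ in fake_stream():
        pass
except TypeError as e:
    print(type(e).__name__)  # TypeError

print(asyncio.run(consume()))  # double bubble bath
```

Switching the test to `async for` over the stream resolves failures of this shape.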

auto-merge was automatically disabled September 4, 2025 01:14

Head branch was pushed to by a user without write access

@kebe7jun kebe7jun force-pushed the feature/responses-api-streaming branch 3 times, most recently from 77ef2de to 30d435e Compare September 4, 2025 04:37
Signed-off-by: Kebe <mail@kebe7jun.com>
Signed-off-by: Kebe <mail@kebe7jun.com>
Signed-off-by: Kebe <mail@kebe7jun.com>
@kebe7jun kebe7jun force-pushed the feature/responses-api-streaming branch from 30d435e to 3e604da Compare September 4, 2025 04:38
@DarkLight1337 DarkLight1337 merged commit 8f423e5 into vllm-project:main Sep 4, 2025
39 checks passed
@kebe7jun kebe7jun deleted the feature/responses-api-streaming branch September 4, 2025 09:49
eicherseiji pushed a commit to eicherseiji/vllm that referenced this pull request Sep 9, 2025
ABC12345anouys pushed a commit to ABC12345anouys/vllm that referenced this pull request Sep 25, 2025
mystous pushed a commit to mystous/vllm_hybrid that referenced this pull request May 10, 2026

Labels

frontend · ready (ONLY add when PR is ready to merge/full CI is needed) · v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants