
Add --stream-response-default-include-usage server flag#16711

Merged
hnyls2002 merged 32 commits into sgl-project:main from syd520zy:main
Apr 4, 2026

Conversation

@syd520zy
Contributor

@syd520zy syd520zy commented Jan 8, 2026

Motivation

When streaming is enabled, usage info is only returned if the client sets stream_options.include_usage = true. Server operators who need token-level monitoring metrics cannot rely on clients to set this. This PR adds a server-side flag to force usage inclusion in streaming responses.

Modifications

  • Add --stream-response-default-include-usage server arg that forces a final usage chunk in streaming responses even when stream_options is not specified by the client
  • Extract repeated stream_options checks into a shared should_include_usage() utility in utils.py
  • Remove dead enable_force_include_usage param from OpenAIServingResponses and unused stream_output field from ServerArgs
  • Fix test mock to include the new server arg attribute
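The extracted helper can be sketched as follows. This is a hypothetical reconstruction: the PR only names `should_include_usage()` in `utils.py`, so the actual signature and call sites in sglang may differ. The key property is that an explicit client choice always overrides the server-wide default.

```python
from types import SimpleNamespace

def should_include_usage(stream_options, default_include_usage: bool) -> bool:
    """Decide whether a streaming response should end with a usage chunk.

    An explicit client setting via stream_options.include_usage always wins;
    the server-wide default only applies when the client did not specify it.
    """
    if stream_options is not None and getattr(stream_options, "include_usage", None) is not None:
        return bool(stream_options.include_usage)
    return bool(default_include_usage)

# Client said nothing -> the server default applies.
print(should_include_usage(None, True))   # True
# Client explicitly opted out -> the client's choice wins over the default.
print(should_include_usage(SimpleNamespace(include_usage=False), True))  # False
```

This "default, not force" shape is what the review discussion below converges on: the server flag fills in behavior only for requests that leave `stream_options` unset.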

Checklist

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

alisonshao added a commit that referenced this pull request Jan 8, 2026
Update issue reference from #16711 to #16714 which is the correct
tracking issue for the Triton causal_conv1d_update bug with padded batches.
@syd520zy
Contributor Author

syd520zy commented Jan 8, 2026

@JustinTong0323 @ispobock @slin1237 @CatherineSue @merrymercy Please review my PR, thank you very much


Collaborator

@hnyls2002 hnyls2002 left a comment


Why is force?

@syd520zy
Contributor Author

syd520zy commented Mar 5, 2026

Please review my PR, thank you very much @CatherineSue @merrymercy @slin1237 @ispobock @JustinTong0323

@syd520zy syd520zy requested a review from hnyls2002 March 10, 2026 07:42
@syd520zy
Contributor Author

Why is force?

In the current framework, whether usage information is returned depends on whether the user explicitly passes this parameter. When the user does not, the server has no way to record the actual usage of that request. With this flag, the server can require all requests to output usage information, so that token metrics for the business can be monitored and analyzed.

@syd520zy
Contributor Author

@syd520zy That does mean "force"; what you actually want to do is just set a default value for this. Please rewrite the confusing "force" logic.

I cannot simply change the default value, because in the current implementation the client side controls whether usage information is returned: clients can set include_usage in stream_options to make responses carry usage information. The intent here is that this behavior is only used when the server needs every request to return usage information for management or auditing purposes. Therefore, to preserve compatibility, the cleanest approach is a server startup parameter (or an environment variable) that controls whether the server enables this behavior.
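With the flag merged, an operator would enable the server-side default at startup. A sketch of such an invocation, assuming the usual `sglang.launch_server` entry point; the flag name comes from this PR's title, while the model path is a placeholder:

```shell
# Enable the usage-chunk default for all streaming responses on this server.
# Clients that explicitly set stream_options.include_usage still override it.
python3 -m sglang.launch_server \
  --model-path /path/to/model \
  --stream-response-default-include-usage
```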

@hnyls2002
Collaborator

/tag-and-rerun-ci

@github-actions github-actions Bot added the run-ci label Apr 4, 2026
@hnyls2002
Collaborator

/rerun-test registered/openai_server/basic/test_serving_chat.py

@github-actions
Contributor

github-actions Bot commented Apr 4, 2026

1-gpu-5090: View workflow run

cd test/ && python3 registered/openai_server/basic/test_serving_chat.py

@hnyls2002
Collaborator

/rerun-test registered/openai_server/basic/test_serving_completions.py

@hnyls2002 hnyls2002 changed the title from "Add force-include-usage Support for stream" to "Add --stream-response-default-include-usage server flag" Apr 4, 2026
@github-actions
Contributor

github-actions Bot commented Apr 4, 2026

1-gpu-5090: View workflow run

cd test/ && python3 registered/openai_server/basic/test_serving_completions.py

@hnyls2002 hnyls2002 merged commit de98590 into sgl-project:main Apr 4, 2026
100 of 192 checks passed
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
xiezhq-hermann pushed a commit to antgroup/sglang that referenced this pull request Apr 7, 2026
carlosfundora pushed a commit to carlosfundora/sglang-1-bit-turbo that referenced this pull request Apr 8, 2026
Add --stream-response-default-include-usage server flag (sgl-project#16711)

Upstream SHA: de98590
Cherry-picked from sgl-project/sglang

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026