Add --stream-response-default-include-usage server flag #16711
hnyls2002 merged 32 commits into sgl-project:main
|
…eques -- lint fix
|
@JustinTong0323 @ispobock @slin1237 @CatherineSue @merrymercy Please review my pr, thank you very much |
|
@CatherineSue @JustinTong0323 @ispobock @merrymercy @slin1237 Please review my pr, thank you very much |
|
@slin1237 @merrymercy @ispobock @JustinTong0323 @CatherineSue Please review my pr, thank you very much |
|
Please review my pr, thank you very much @CatherineSue @merrymercy @slin1237 @ispobock @JustinTong0323 |
In the current framework, usage information is returned only when the client explicitly requests it. If the client does not, the server cannot record the actual token usage of that request. With this flag, the server operator can make every streaming request return usage information, so that the token metrics of the business can be monitored and analyzed. |
I cannot simply change the default value, because in the current implementation whether to return usage information is controlled on the client side: clients set stream_options.include_usage in the request body to make responses carry usage information. What I want is for this forced behavior to apply only when the server operator needs every request to return usage information for statistics, for example for management or auditing. To stay backward compatible, the best approach is therefore a server startup parameter or an environment variable that controls whether the server enables this behavior.
|
/tag-and-rerun-ci |
|
/rerun-test registered/openai_server/basic/test_serving_chat.py |
|
✅ |
|
/rerun-test registered/openai_server/basic/test_serving_completions.py |
|
✅ |
…erver flag (sgl-project#16711) Upstream SHA: de98590 Cherry-picked from sgl-project/sglang Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Motivation
When streaming is enabled, usage info is only returned if the client sets stream_options.include_usage = true. Server operators who need token-level monitoring metrics cannot rely on clients to set this. This PR adds a server-side flag to force usage inclusion in streaming responses.
Modifications
- Add a --stream-response-default-include-usage server arg that forces a final usage chunk in streaming responses even when stream_options is not specified by the client
- Move the stream_options checks into a shared should_include_usage() utility in utils.py
- Remove the enable_force_include_usage param from OpenAIServingResponses and the unused stream_output field from ServerArgs
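For readers who want a concrete picture of the shared check, here is a minimal sketch of what a should_include_usage() helper could look like. It is not the code from this PR: the StreamOptions dataclass, the parameter names, and the exact precedence (the server default wins even when the client sends include_usage = false) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreamOptions:
    """Hypothetical stand-in for the request's stream_options object."""

    include_usage: Optional[bool] = None


def should_include_usage(
    stream_options: Optional[StreamOptions],
    server_default_include_usage: bool,
) -> bool:
    """Decide whether a streaming response should end with a usage chunk.

    Assumed semantics for this sketch: usage is included if the client
    asked for it OR the server-side default flag is enabled.
    """
    # stream_options may be missing entirely, or present with include_usage unset.
    client_requested = bool(stream_options and stream_options.include_usage)
    return client_requested or server_default_include_usage


if __name__ == "__main__":
    # Flag off, client silent: no usage chunk.
    print(should_include_usage(None, False))
    # Flag off, client opts in: usage chunk.
    print(should_include_usage(StreamOptions(include_usage=True), False))
    # Flag on, client silent: usage chunk is still emitted.
    print(should_include_usage(None, True))
```

Keeping the decision in one function means the chat and completions handlers can call the same check instead of each inspecting stream_options on its own.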