[Feature] Add Reasoning Tokens Usage #15562
Conversation
Signed-off-by: Muqi Li <muqi1029@gmail.com>
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

/tag-and-rerun-ci
Note: in the current code path, if the server is launched without a reasoning parser, `require_reasoning` is always False. Is this intuitive?
We may also need to update the docs.
Do you mean we should get rid of "We should have a flag to know whether reqs require reasoning and its…"?
I mean, in certain cases, the user might want to obtain the reasoning tokens without enabling the reasoning parser? |
Okay, I will cherry-pick your tests right away. Thanks for the reminder!
Signed-off-by: Muqi Li <muqi1029@gmail.com> Co-authored-by: Mufeez Amjad <mufeez.amjad@outlook.com>
Found another duplicate PR: #14404
JustinTong0323 left a comment
Verified, thanks for the contribution~
Very useful change. I ran into this problem myself and am eagerly waiting for the merge 😊. Thank you 👍

Hi @Muqi1029, @JustinTong0323, thanks for this great PR; it addresses exactly what we need. We're running GLM-5 (754B FP8) in production. I noticed this PR has merge conflicts with the current main branch. Since we need this feature urgently, I'd be happy to help resolve the conflicts and push this forward, either by collaborating on this PR or opening a new one based on your work (with full credit, of course). Would you be open to that? Or if you're planning to rebase soon, I'm happy to wait as well. Just want to make sure this doesn't stay stalled; there's clear community demand (4 duplicate PRs + multiple issues). Thanks again for the work here!
@MLKoz2 @anencore94 Thanks for the attention and kind words. I have now resolved the conflicts, and the corresponding CIs pass locally. @Fridge003 @hnyls2002 please take a look.
Signed-off-by: Muqi Li <muqi1029@gmail.com> Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com> Co-authored-by: Mufeez Amjad <mufeez.amjad@outlook.com> Co-authored-by: cklxx <1293822641@qq.com> Co-authored-by: hnyls2002 <lsyincs@gmail.com> Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Motivation
SGLang currently returns token usage information, but the `reasoning_tokens` field is always `0`, which makes it unusable as a statistical metric. This is problematic since `reasoning_tokens` is an important signal for analysis and monitoring. You can see the following result using the latest (main branch) SGLang:
Server Launching Script

```bash
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-8B \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen \
  --port 8888 \
  --log-requests \
  --log-requests-level 3
```

Request

```bash
curl -X POST http://127.0.0.1:8888/v1/chat/completions \
  -H "Authorization: Bearer None" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Who are you?"}
    ]
  }' | jq
```

After this PR:
{ "id": "40d09b4d150345b88a75945b1b7bb059", "object": "chat.completion", "created": 1766290957, "model": "default", "choices": [ { "index": 0, "message": { "role": "assistant", "content": "Hello! I'm Qwen, a large language model developed by Alibaba Cloud. I'm designed to assist with a wide range of tasks, such as answering questions, creating content, writing code, solving problems, and engaging in conversations. I aim to be helpful, friendly, and knowledgeable, and I'm here to learn and grow through our interactions. How can I assist you today? 😊", "reasoning_content": "Okay, the user asked, \"Who are you?\" I need to provide a clear and concise answer. Let me start by stating my name, Qwen. I should mention that I'm a large language model developed by Alibaba Cloud. It's important to highlight my capabilities, like answering questions, creating content, and engaging in conversations. I should also note that I can assist with various tasks such as writing, coding, and problem-solving. But I need to keep it friendly and approachable. Maybe add something about being here to help and learn from interactions. Let me check if I need to mention any specific features or limitations. Oh, right, I should avoid giving false information and encourage the user to ask questions. Let me structure this in a natural, conversational way without any markdown. Keep it simple and welcoming.\n", "tool_calls": null }, "logprobs": null, "finish_reason": "stop", "matched_stop": 151645 } ], "usage": { "prompt_tokens": 12, "total_tokens": 261, "completion_tokens": 249, "prompt_tokens_details": null, "reasoning_tokens": 168 }, "metadata": { "weight_version": "default" } }This also works in streaming situations:
data: {"id":"b8c4b28ecc6f48d5bffeb073e209ab22","object":"chat.completion.chunk","created":1766291188,"model":"default","choices":[{"index":0,"delta":{"role":null,"content":"?","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":null,"matched_stop":null}],"usage":{"prompt_tokens":12,"total_tokens":191,"completion_tokens":179,"prompt_tokens_details":null,"reasoning_tokens":114}} data: {"id":"b8c4b28ecc6f48d5bffeb073e209ab22","object":"chat.completion.chunk","created":1766291188,"model":"default","choices":[{"index":0,"delta":{"role":null,"content":" ","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":null,"matched_stop":null}],"usage":{"prompt_tokens":12,"total_tokens":192,"completion_tokens":180,"prompt_tokens_details":null,"reasoning_tokens":114}} data: {"id":"b8c4b28ecc6f48d5bffeb073e209ab22","object":"chat.completion.chunk","created":1766291188,"model":"default","choices":[{"index":0,"delta":{"role":null,"content":"😊","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":null,"matched_stop":null}],"usage":{"prompt_tokens":12,"total_tokens":193,"completion_tokens":181,"prompt_tokens_details":null,"reasoning_tokens":114}} data: {"id":"b8c4b28ecc6f48d5bffeb073e209ab22","object":"chat.completion.chunk","created":1766291188,"model":"default","choices":[{"index":0,"delta":{"role":null,"content":null,"reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":null} data: {"id":"b8c4b28ecc6f48d5bffeb073e209ab22","object":"chat.completion.chunk","created":1766291188,"model":"default","choices":[],"usage":{"prompt_tokens":12,"total_tokens":194,"completion_tokens":182,"prompt_tokens_details":null,"reasoning_tokens":114}} data: [DONE]Modifications
Modifications

Compute `reasoning_tokens` based on `req.require_reasoning` and `next_token_id` during both the extend and decode stages.

The logic is intentionally NOT placed in the server process, because doing it there may introduce complexity related to potential re-tokenization. I think implementing this logic in the `output_processor` is simpler and adds nearly no overhead.
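A minimal sketch of that idea (hypothetical names, not the actual SGLang code; the end-of-think token id is model-specific and assumed here):

```python
# Hypothetical sketch of per-token reasoning accounting in the output
# processor; the class name and the token id below are illustrative.
THINK_END_TOKEN_ID = 151668  # assumed "</think>" id for Qwen3; model-specific


class ReasoningTokenCounter:
    """Counts reasoning tokens for one request, updated once per new token."""

    def __init__(self, require_reasoning: bool):
        # Reasoning models emit the think section first, so a request that
        # requires reasoning starts inside the reasoning span.
        self.in_reasoning = require_reasoning
        self.reasoning_tokens = 0

    def on_token(self, next_token_id: int) -> None:
        if not self.in_reasoning:
            return
        if next_token_id == THINK_END_TOKEN_ID:
            self.in_reasoning = False  # "</think>" closes the reasoning span
        else:
            self.reasoning_tokens += 1


# Fed during both the extend and decode stages, one call per generated token:
counter = ReasoningTokenCounter(require_reasoning=True)
for token_id in [101, 102, THINK_END_TOKEN_ID, 7, 8]:
    counter.on_token(token_id)
print(counter.reasoning_tokens)  # -> 2 (tokens before "</think>")
```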
Checklist