Fix flaky streaming logprobs test by handling detokenizer text buffering #17687

Kangyan-Zhou merged 4 commits into sgl-project:main from
Conversation
Summary of Changes

Hello @Kangyan-Zhou, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces targeted debug logging to diagnose a persistent and flaky issue where streaming responses from the OAI server sometimes lack associated log probabilities. By adding detailed warning messages at critical points in both the completion serving and scheduler output processing, the changes aim to provide clearer insight into when and why these logprob discrepancies occur, facilitating a quicker resolution of the underlying problem.
/rerun-stage stage-b-test-small-1-gpu
Code Review
This pull request adds debug logging to help investigate a flaky issue with streaming logprobs in the OpenAI server. The changes in serving_completions.py and scheduler_output_processor_mixin.py introduce warnings that trigger when logprobs are empty but the generated text is not, which should be very helpful for debugging.
I've made a couple of suggestions to refactor small code duplications introduced by the new logging logic. These are minor points aimed at improving code clarity. Overall, the changes look good and are well targeted for their debugging purpose.
```python
# Debug logging for flaky streaming logprobs issue
# See: https://github.com/sgl-project/sglang/actions/runs/21319310492/job/61366797740
delta_for_log = text[len(stream_buffer) :]
```
The calculation `text[len(stream_buffer):]` is also performed later at line 277 to define `delta`. To avoid this duplication, you could consider calculating this value once and reusing it. For example, you could calculate `delta` before the `if request.logprobs is not None:` block and use it in both places.
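As a rough illustration of the suggested refactor, a minimal sketch is below; the helper name, `want_logprobs` flag, and warning text are illustrative placeholders rather than the actual `serving_completions.py` code.

```python
import logging

logger = logging.getLogger(__name__)


def compute_stream_delta(
    text: str, stream_buffer: str, want_logprobs: bool, output_logprobs: list
) -> str:
    """Hypothetical helper: compute the new text exactly once and reuse it
    for both the debug warning and the streamed chunk."""
    delta = text[len(stream_buffer):]  # single computation of the delta
    if want_logprobs and delta and not output_logprobs:
        # Same condition the debug logging targets: the text advanced,
        # but no logprobs arrived for this chunk.
        logger.warning("Streaming chunk has text %r but empty logprobs", delta)
    return delta
```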
```python
# Slice off only the logprobs and decode ids that have not been sent yet,
# using the per-request offsets recorded for this stream.
logprob_slice = req.output_token_logprobs_val[
    send_output_token_logprobs_offset:
]
decode_ids_slice = decode_ids[req.send_decode_id_offset :]
```
/rerun-stage stage-b-test-small-1-gpu

/rerun-stage stage-b-test-small-1-gpu

✅ Triggered
The detokenizer holds back text at word boundaries during streaming to avoid showing incomplete words. On the final chunk, this buffered text is flushed. However, by then all logprobs have already been sent, causing the final chunk to have text but empty logprobs.

Fix: Return None for logprobs when finish_reason is set and all logprobs have been sent. This is correct since no new tokens were generated; the text is just buffered text being flushed.

Changes:
- serving_completions.py: Return None for logprobs on final flush chunk
- serving_chat.py: Apply same fix to chat completions streaming
- test_openai_server.py: Update test to handle logprobs=None on final chunk
- scheduler_output_processor_mixin.py: Remove debug logging

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
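In sketch form, the check this commit describes might look like the following; the function name, `output_logprobs_slice`, and the returned dict shape are illustrative assumptions, not the actual sglang code.

```python
from typing import Optional


def logprobs_for_final_chunk(
    output_logprobs_slice: list, finish_reason: Optional[str]
) -> Optional[dict]:
    """Sketch of the described fix: when the request is finishing and every
    logprob has already been streamed, the chunk only carries text that the
    detokenizer buffered earlier, so it reports logprobs=None."""
    if finish_reason is not None and not output_logprobs_slice:
        return None  # no new tokens were generated for this flush chunk
    # Otherwise, convert the new logprobs into the response format as usual
    # (assuming each entry's first element is the logprob value).
    return {"token_logprobs": [item[0] for item in output_logprobs_slice]}
```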
/rerun-stage stage-b-test-small-1-gpu

✅ Triggered
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
/rerun-stage stage-b-test-small-1-gpu

✅ Triggered

/rerun-stage stage-b-test-small-1-gpu

✅ Triggered
PR sgl-project#17687 fixed the case where empty logprobs were returned on the final chunk when finish_reason was set. However, the detokenizer can also flush buffered text mid-stream (when finish_reason is still None), causing the same issue.

Changes:
- serving_completions.py: Return logprobs=None when output_logprobs_slice is empty, regardless of finish_reason
- serving_chat.py: Remove the "or finish_reason is None" condition that was causing empty logprobs to be processed mid-stream
- utils.py: Remove debug logging added in previous commit
- test_openai_server.py: Remove debug logging and update comments to clarify that logprobs=None can happen both mid-stream and on final chunk

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
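The follow-up broadens that condition so that only the emptiness of the slice matters; a sketch with the same illustrative (non-actual) names as before:

```python
from typing import Optional


def logprobs_for_chunk(output_logprobs_slice: list) -> Optional[dict]:
    """Sketch of the broadened check: applies both mid-stream and on the
    final chunk, so it no longer looks at finish_reason at all."""
    if not output_logprobs_slice:
        # The detokenizer flushed buffered text without new tokens, so there
        # are no new logprobs to attach to this chunk.
        return None
    return {"token_logprobs": [item[0] for item in output_logprobs_slice]}
```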
…ing (sgl-project#17687) Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
…ing (sgl-project#17687) Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
…ence

When multiple streaming chunks queue up before the consumer drains them (streaming backlog), all chunks' meta_info["output_token_logprobs"] point to the same list object in tokenizer_manager.py. Later chunks extend the list, causing earlier chunks to see logprobs that belong to later chunks. This makes the first chunk "steal" all logprobs and leaves subsequent chunks with empty logprobs, triggering IndexError in the test.

Root fix: record output_token_logprobs_length as an immutable int snapshot in meta_info at chunk creation time. Downstream consumers use this length to slice the shared list correctly, so each chunk sees only its own logprobs regardless of later mutations.

This reverts the workaround from PR #17687, which only handled the finish_reason case but missed the mid-stream backlog scenario.

Made-with: Cursor
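The aliasing this commit describes can be reproduced with a small toy example; the `meta_info` keys mirror the commit message, while the chunk dicts and the consumer loop are purely illustrative.

```python
# Toy reproduction of the backlog aliasing: every chunk's meta_info points at
# the same list, so chunks queued before the consumer drains them all "see"
# logprobs appended later.
shared_logprobs = []
chunks = []

for logprob in [-0.1, -0.2, -0.3]:
    shared_logprobs.append(logprob)
    chunks.append({
        "output_token_logprobs": shared_logprobs,               # aliased list
        "output_token_logprobs_length": len(shared_logprobs),   # int snapshot
    })

# Without the snapshot, the first chunk now appears to own all three logprobs:
assert len(chunks[0]["output_token_logprobs"]) == 3

# With the snapshot, each consumer slices only its own portion of the list:
prev = 0
for chunk in chunks:
    end = chunk["output_token_logprobs_length"]
    own_logprobs = chunk["output_token_logprobs"][prev:end]
    assert len(own_logprobs) == 1  # each chunk sees exactly its own logprob
    prev = end
```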
Summary

Fixes the flaky `test_completion_stream` test in `test_openai_server.py`.

Root cause: The detokenizer holds back text at word boundaries during streaming (via `find_printable_text()`) to avoid showing incomplete words. On the final chunk, when `finish_reason` is set, this buffered text is flushed. However, by then all logprobs have already been sent in previous chunks, causing the final chunk to have text content but empty logprobs, which broke the test.

Fix: Return `None` for logprobs when `finish_reason` is set and all logprobs have been sent. This is semantically correct since no new tokens were generated; the text is just previously buffered text being flushed.

Changes

- `serving_completions.py`: Return `None` for logprobs on the final flush chunk
- `serving_chat.py`: Apply the same fix to the chat completions streaming endpoint
- `test_openai_server.py`: Update the test to handle `logprobs=None` on the final chunk (a sketch of this handling follows the list)
- `scheduler_output_processor_mixin.py`: Remove debug logging from the previous commit
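As a rough illustration of the test-side handling, here is a minimal sketch assuming the standard `openai` Python client pointed at a local server; the base URL, model name, prompt, and assertion are placeholders, not the actual code in `test_openai_server.py`.

```python
import openai

# Hypothetical local endpoint; adjust to wherever the sglang server is running.
client = openai.OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

stream = client.completions.create(
    model="default",
    prompt="The capital of France is",
    max_tokens=16,
    logprobs=1,
    stream=True,
)

n_tokens_with_logprobs = 0
for chunk in stream:
    choice = chunk.choices[0]
    if choice.logprobs is None:
        # Detokenizer flush: the chunk may carry buffered text, but no new
        # tokens were generated, so there are no new logprobs to check.
        continue
    n_tokens_with_logprobs += len(choice.logprobs.tokens)

# At least some streamed tokens should have carried logprobs.
assert n_tokens_with_logprobs > 0
```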