
Fix flaky streaming logprobs test by handling detokenizer text buffering #17687

Merged
Kangyan-Zhou merged 4 commits into sgl-project:main from Kangyan-Zhou:fix_flaky_test on Jan 25, 2026
Conversation

@Kangyan-Zhou (Collaborator) commented on Jan 24, 2026

Summary

Fixes the flaky test_completion_stream test in test_openai_server.py.

Root cause: The detokenizer holds back text at word boundaries during streaming (via find_printable_text()) to avoid showing incomplete words. On the final chunk when finish_reason is set, this buffered text is flushed. However, by then all logprobs have already been sent in previous chunks, causing the final chunk to have text content but empty logprobs - which broke the test.

Fix: Return None for logprobs when finish_reason is set and all logprobs have been sent. This is semantically correct since no new tokens were generated - the text is just previously buffered text being flushed.
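A minimal sketch of the decision described above (names are illustrative, not the exact serving_completions.py code):

```python
from typing import Optional, Sequence

def logprobs_for_chunk(
    unsent_logprobs: Sequence[float],
    finish_reason: Optional[str],
) -> Optional[Sequence[float]]:
    """Return the logprobs to attach to a streaming chunk.

    On the final chunk the detokenizer may only be flushing text it buffered
    at a word boundary; no new tokens were generated, so there are no
    logprobs left to send and None is the semantically correct value.
    """
    if finish_reason is not None and not unsent_logprobs:
        return None
    return unsent_logprobs

# Final chunk that only flushes buffered text: logprobs should be None.
assert logprobs_for_chunk([], "stop") is None
# Normal chunk with a newly generated token keeps its logprobs.
assert logprobs_for_chunk([-0.25], None) == [-0.25]
```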

Changes

  • serving_completions.py: Return None for logprobs on final flush chunk
  • serving_chat.py: Apply same fix to chat completions streaming endpoint
  • test_openai_server.py: Update test to handle logprobs=None on final chunk
  • scheduler_output_processor_mixin.py: Remove debug logging from previous commit

@gemini-code-assist (Contributor)

Summary of Changes

Hello @Kangyan-Zhou, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces targeted debug logging to diagnose a persistent and flaky issue where streaming responses from the OAI server sometimes lack associated log probabilities. By adding detailed warning messages at critical points in both the completion serving and scheduler output processing, the changes aim to provide clearer insights into when and why these logprob discrepancies occur, facilitating a quicker resolution of the underlying problem.

Highlights

  • Enhanced Debugging for Streaming Logprobs: Added specific warning logs in serving_completions.py to detect instances where text is streamed without corresponding log probabilities, addressing a flaky issue.
  • Scheduler Logprob Consistency Check: Implemented debug logging in scheduler_output_processor_mixin.py to flag situations where decoded IDs are present but log probability slices are unexpectedly empty during streaming.
  • Improved Logprob Slicing Clarity: Refactored the calculation and slicing of output token logprobs in serving_completions.py for better readability and explicit variable assignment.
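A rough sketch of the kind of check these highlights describe (function and variable names are assumptions, not the actual sglang code):

```python
import logging

logger = logging.getLogger(__name__)

def warn_if_text_without_logprobs(delta: str, logprobs: list, finish_reason) -> None:
    """Warn when a streamed delta carries text but no corresponding logprobs."""
    if delta and not logprobs:
        logger.warning(
            "streaming chunk has text but empty logprobs: delta=%r, finish_reason=%s",
            delta,
            finish_reason,
        )

# A flushed-text-only chunk would trigger the warning.
warn_if_text_without_logprobs(" world", [], "stop")
```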


@Kangyan-Zhou (Collaborator, Author)

/rerun-stage stage-b-test-small-1-gpu

@gemini-code-assist (Contributor) left a comment


Code Review

This pull request adds debug logging to help investigate a flaky issue with streaming logprobs in the OpenAI server. The changes in serving_completions.py and scheduler_output_processor_mixin.py introduce warnings that trigger when logprobs are empty but the generated text is not, which should be very helpful for debugging.

I've made a couple of suggestions to refactor small code duplications introduced with the new logging logic. These are minor points aimed at improving code clarity. Overall, the changes look good and are well-targeted for the debugging purpose.


# Debug logging for flaky streaming logprobs issue
# See: https://github.com/sgl-project/sglang/actions/runs/21319310492/job/61366797740
delta_for_log = text[len(stream_buffer) :]

medium

The calculation text[len(stream_buffer) :] is also performed later at line 277 to define delta. To avoid this duplication, you could consider calculating this value once and reusing it. For example, you could calculate delta before the if request.logprobs is not None: block and use it in both places.
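A sketch of the suggested refactor, with placeholder values standing in for the streaming state (the real code slices `text` inside the streaming loop):

```python
# Illustrative values standing in for the streaming state.
text = "Hello world"        # full decoded text so far
stream_buffer = "Hello"     # text already sent in earlier chunks
output_logprobs_slice = []  # logprobs not yet sent

# Compute the delta once, before the logprobs branch, and reuse it.
delta = text[len(stream_buffer):]

if delta and not output_logprobs_slice:
    print(f"warning: text streamed without logprobs: {delta!r}")

# ... later, the same `delta` builds the streamed chunk instead of
# slicing text[len(stream_buffer):] a second time.
```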

logprob_slice = req.output_token_logprobs_val[
send_output_token_logprobs_offset:
]
decode_ids_slice = decode_ids[req.send_decode_id_offset :]

medium

The slice decode_ids[req.send_decode_id_offset :] is also calculated at line 947 within the same loop. To avoid this duplication, you could consider calculating it once after decode_ids is initialized and storing it in a variable for reuse in both places.
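A sketch of that suggestion with placeholder data; `decode_ids_slice` is computed once and reused wherever the loop needs it:

```python
# Illustrative values standing in for one request in the scheduler loop.
decode_ids = [101, 102, 103, 104]   # all decoded token ids so far
send_decode_id_offset = 2           # ids already streamed out
logprob_slice = []                  # logprobs not yet sent (empty in the flaky case)

# Slice once after decode_ids is initialized, then reuse the result.
decode_ids_slice = decode_ids[send_decode_id_offset:]

if decode_ids_slice and not logprob_slice:
    print("warning: decoded ids present but logprob slice is empty")

# ... `decode_ids_slice` is reused later in the loop instead of
# repeating decode_ids[req.send_decode_id_offset:].
```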

@Kangyan-Zhou (Collaborator, Author)

/rerun-stage stage-b-test-small-1-gpu

1 similar comment
@Kangyan-Zhou (Collaborator, Author)

/rerun-stage stage-b-test-small-1-gpu

@github-actions

✅ Triggered stage-b-test-small-1-gpu to run independently (skipping dependencies).

@github-actions

🔗 View workflow run

The detokenizer holds back text at word boundaries during streaming to
avoid showing incomplete words. On the final chunk, this buffered text
is flushed. However, by then all logprobs have already been sent, causing
the final chunk to have text but empty logprobs.

Fix: Return None for logprobs when finish_reason is set and all logprobs
have been sent. This is correct since no new tokens were generated - the
text is just buffered text being flushed.

Changes:
- serving_completions.py: Return None for logprobs on final flush chunk
- serving_chat.py: Apply same fix to chat completions streaming
- test_openai_server.py: Update test to handle logprobs=None on final chunk
- scheduler_output_processor_mixin.py: Remove debug logging

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@Kangyan-Zhou changed the title from "Add more loggings for oai server output debug" to "Fix flaky streaming logprobs test by handling detokenizer text buffering" on Jan 25, 2026
@Kangyan-Zhou (Collaborator, Author)

/rerun-stage stage-b-test-small-1-gpu

@github-actions

✅ Triggered stage-b-test-small-1-gpu to run independently (skipping dependencies).

@github-actions

🔗 View workflow run

Comment thread on python/sglang/srt/entrypoints/openai/serving_completions.py
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
@Kangyan-Zhou (Collaborator, Author)

/rerun-stage stage-b-test-small-1-gpu

@github-actions

✅ Triggered stage-b-test-small-1-gpu to run independently (skipping dependencies).

@github-actions

🔗 View workflow run

@Kangyan-Zhou (Collaborator, Author)

/rerun-stage stage-b-test-small-1-gpu

@github-actions

✅ Triggered stage-b-test-small-1-gpu to run independently (skipping dependencies).

@github-actions

🔗 View workflow run

@Kangyan-Zhou merged commit 592603d into sgl-project:main on Jan 25, 2026
66 of 73 checks passed
Kangyan-Zhou added a commit to Kangyan-Zhou/sglang that referenced this pull request Jan 26, 2026
PR sgl-project#17687 fixed the case where empty logprobs were returned on the final
chunk when finish_reason was set. However, the detokenizer can also flush
buffered text mid-stream (when finish_reason is still None), causing the
same issue.

Changes:
- serving_completions.py: Return logprobs=None when output_logprobs_slice
  is empty, regardless of finish_reason
- serving_chat.py: Remove the "or finish_reason is None" condition that
  was causing empty logprobs to be processed mid-stream
- utils.py: Remove debug logging added in previous commit
- test_openai_server.py: Remove debug logging and update comments to
  clarify that logprobs=None can happen both mid-stream and on final chunk

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
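A minimal sketch of the follow-up behaviour this commit describes (the helper name is illustrative): any chunk whose remaining logprob slice is empty gets logprobs=None, whether or not finish_reason is set.

```python
from typing import Optional, Sequence

def logprobs_or_none(unsent_logprobs: Sequence[float]) -> Optional[Sequence[float]]:
    """Return None whenever there are no new logprobs for this chunk,
    covering both the mid-stream flush and the final-chunk flush."""
    return unsent_logprobs if unsent_logprobs else None

assert logprobs_or_none([]) is None          # buffered text flushed mid-stream
assert logprobs_or_none([-0.3]) == [-0.3]    # normal chunk with a new token
```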
Chen-0210 pushed a commit to Chen-0210/sglang that referenced this pull request Jan 30, 2026
…ing (sgl-project#17687)

Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026
…ing (sgl-project#17687)

Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
merrymercy added a commit that referenced this pull request Mar 20, 2026
…ence

When multiple streaming chunks queue up before the consumer drains them
(streaming backlog), all chunks' meta_info["output_token_logprobs"] point
to the same list object in tokenizer_manager.py. Later chunks extend the
list, causing earlier chunks to see logprobs that belong to later chunks.
This makes the first chunk "steal" all logprobs and leaves subsequent
chunks with empty logprobs, triggering IndexError in the test.

Root fix: record output_token_logprobs_length as an immutable int snapshot
in meta_info at chunk creation time. Downstream consumers use this length
to slice the shared list correctly, so each chunk sees only its own
logprobs regardless of later mutations.

This reverts the workaround from PR #17687 which only handled the
finish_reason case but missed the mid-stream backlog scenario.

Made-with: Cursor
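A self-contained sketch of the snapshot idea described above (field and variable names are assumptions, not the exact tokenizer_manager.py code): each queued chunk records the list length at creation time, so a consumer draining the backlog later can slice the shared list correctly.

```python
shared_logprobs: list = []  # one list object shared by all queued chunks
chunks = []

for new_lp in ([-0.1], [-0.4], []):  # three chunks queue up before the consumer runs
    shared_logprobs.extend(new_lp)
    chunks.append({
        "output_token_logprobs": shared_logprobs,               # shared and still mutable
        "output_token_logprobs_length": len(shared_logprobs),   # immutable int snapshot
    })

# The consumer slices by the recorded length, so each chunk only sees the
# logprobs that existed when it was created, not ones appended later.
prev = 0
for meta in chunks:
    n = meta["output_token_logprobs_length"]
    print(meta["output_token_logprobs"][prev:n])  # [-0.1], then [-0.4], then []
    prev = n
```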
