[CI] Enable retry logic for flaky CI tests#14983
Conversation
Summary of ChangesHello @alisonshao, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves a Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request correctly fixes a UnicodeDecodeError in the CI retry logic by adding errors="ignore" to the subprocess.Popen call. This is an effective solution to prevent the CI from crashing due to non-UTF-8 output from tests. I have added one comment to highlight a maintainability trade-off: using ignore results in the silent loss of data in logs, which could complicate debugging the underlying test failures. Despite this, the fix is a reasonable approach to ensure CI stability.
| stderr=subprocess.STDOUT, | ||
| env=os.environ, | ||
| text=True, | ||
| errors="ignore", # Ignore non-UTF-8 bytes to prevent UnicodeDecodeError |
There was a problem hiding this comment.
Using errors="ignore" effectively prevents the UnicodeDecodeError, but it introduces a maintainability trade-off. Any non-UTF-8 bytes in the test output will be silently discarded, leading to incomplete logs. This loss of information could make it more difficult to debug the root cause of why a test is producing invalid byte sequences.
While this is a reasonable trade-off to ensure CI stability, a more robust solution (though requiring a larger change) would involve reading the output as bytes and decoding it separately for logging (e.g., with errors='backslashreplace') versus for the retry logic (errors='ignore').
|
Verified locally that this fix works. Test setup - created a script that outputs the same invalid UTF-8 byte (0x9e) that caused the CI failure: import sys
print("Starting test...")
sys.stdout.buffer.write(b"Binary data: \x9e\x9f\xa0 here\n") # Invalid UTF-8 bytes
print("Test complete")
sys.exit(1)Without Crashes immediately when iterating over With Reads the output fine. Invalid bytes get dropped but all the ASCII text we need for error pattern matching ( This is safe because the retry logic only matches ASCII error patterns anyway - the non-UTF-8 bytes are just binary noise from GPU libraries or corrupted output that we don't need. |
692f29d to
8bbd381
Compare
|
/tag-and-rerun-ci |
5bcae2b to
6f56b74
Compare
Re-introduces the smart retry logic from PR #14771 that was reverted, with a fix for the UnicodeDecodeError that caused CI failures. Changes: - Add retry logic for accuracy/performance assertion failures - Add `errors="ignore"` to subprocess.Popen to handle non-UTF-8 bytes in test output, preventing crashes when tests output binary data - Add CLI flags: --enable-retry, --max-attempts, --retry-wait-seconds - Update both test/srt/run_suite.py and test/run_suite.py The UnicodeDecodeError occurred because the retry logic reads stdout line-by-line with text=True (UTF-8 decoding), and some tests output non-UTF-8 bytes. Using errors="ignore" safely drops invalid bytes while preserving all ASCII error messages needed for retry classification.
Complete the workflow changes from #14771: - Add SGLANG_CI_ENABLE_RETRY env variable - Add --enable-retry flag to test suite commands - Add SGLANG_CI_ENABLE_RETRY env to accuracy tests Timeout changes are omitted - retry logic handles timing internally.
…ry enabled When --enable-retry is passed, the test runner now adds extra timeout (default 600s) to accommodate retries. This avoids needing to manually increase workflow timeouts. New parameter: - --retry-timeout-increase: Additional seconds added to timeout when retry is enabled (default: 600)
d147b3a to
64b3c09
Compare
- Add enable_retry output to check-changes job based on PR label detection - Convert all hardcoded --enable-retry flags to conditional RETRY_FLAG - Update accuracy tests to use label-based enable_retry output - Remove global SGLANG_CI_ENABLE_RETRY env variable When the 'enable-retry' label is added to a PR, the retry logic will be enabled for flaky test failures.
|
https://github.com/sgl-project/sglang/actions/runs/20403159306/job/58628684454?pr=14983 It seems the log is completely missing |
| python3 test_eval_accuracy_large.py | ||
| env: | ||
| SGLANG_CI_ENABLE_RETRY: ${{ needs.check-changes.outputs.enable_retry }} | ||
| # Retry logic for flaky accuracy tests |
There was a problem hiding this comment.
Can we not have bash logic in the pr-test.yml file?
Summary
errors="ignore"tosubprocess.Popenin CI retry logic to handle non-UTF-8 bytes in test outputMotivation
PR #14771 introduced retry logic that reads subprocess stdout to classify errors. When tests output binary data or non-standard encoded characters, this causes:
This fix ignores invalid bytes instead of crashing, since the error classification only needs to match ASCII patterns like
AssertionError,RuntimeError, etc.Test plan