Skip to content

[CI] Enable retry logic for flaky CI tests#14983

Merged
Kangyan-Zhou merged 16 commits intomainfrom
fix-retry-unicode-error
Dec 23, 2025
Merged

[CI] Enable retry logic for flaky CI tests#14983
Kangyan-Zhou merged 16 commits intomainfrom
fix-retry-unicode-error

Conversation

@alisonshao
Copy link
Copy Markdown
Collaborator

Summary

  • Add errors="ignore" to subprocess.Popen in CI retry logic to handle non-UTF-8 bytes in test output

Motivation

PR #14771 introduced retry logic that reads subprocess stdout to classify errors. When tests output binary data or non-standard encoded characters, this causes:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9e in position 6489: invalid start byte

This fix ignores invalid bytes instead of crashing, since the error classification only needs to match ASCII patterns like AssertionError, RuntimeError, etc.

Test plan

  • CI should no longer crash with UnicodeDecodeError when tests output non-UTF-8 bytes

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello @alisonshao, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a UnicodeDecodeError that was occurring in the CI retry mechanism. The error was triggered when subprocess.Popen attempted to read test output containing non-UTF-8 encoded characters. By introducing an errors="ignore" parameter, the CI process can now bypass these problematic characters, ensuring that test runs complete without interruption and that the system can still effectively classify errors based on ASCII patterns.

Highlights

  • Error Handling: Add errors="ignore" to subprocess.Popen in the CI retry logic to prevent UnicodeDecodeError when processing test output containing non-UTF-8 bytes.
  • CI Stability: Enhance the robustness of the CI system by allowing it to gracefully handle diverse test outputs without crashing, ensuring that error classification can proceed as intended for ASCII patterns.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request correctly fixes a UnicodeDecodeError in the CI retry logic by adding errors="ignore" to the subprocess.Popen call. This is an effective solution to prevent the CI from crashing due to non-UTF-8 output from tests. I have added one comment to highlight a maintainability trade-off: using ignore results in the silent loss of data in logs, which could complicate debugging the underlying test failures. Despite this, the fix is a reasonable approach to ensure CI stability.

stderr=subprocess.STDOUT,
env=os.environ,
text=True,
errors="ignore", # Ignore non-UTF-8 bytes to prevent UnicodeDecodeError
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using errors="ignore" effectively prevents the UnicodeDecodeError, but it introduces a maintainability trade-off. Any non-UTF-8 bytes in the test output will be silently discarded, leading to incomplete logs. This loss of information could make it more difficult to debug the root cause of why a test is producing invalid byte sequences.

While this is a reasonable trade-off to ensure CI stability, a more robust solution (though requiring a larger change) would involve reading the output as bytes and decoding it separately for logging (e.g., with errors='backslashreplace') versus for the retry logic (errors='ignore').

@alisonshao
Copy link
Copy Markdown
Collaborator Author

Verified locally that this fix works.

Test setup - created a script that outputs the same invalid UTF-8 byte (0x9e) that caused the CI failure:

import sys
print("Starting test...")
sys.stdout.buffer.write(b"Binary data: \x9e\x9f\xa0 here\n")  # Invalid UTF-8 bytes
print("Test complete")
sys.exit(1)

Without errors='ignore':

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9e in position 13: invalid start byte

Crashes immediately when iterating over process.stdout, same as the CI failure.

With errors='ignore':

Got line: Binary data:  here
Got line: Starting test...
Got line: Test complete
Result: Success! Exit code: 1

Reads the output fine. Invalid bytes get dropped but all the ASCII text we need for error pattern matching (AssertionError, RuntimeError, etc.) is preserved.

This is safe because the retry logic only matches ASCII error patterns anyway - the non-UTF-8 bytes are just binary noise from GPU libraries or corrupted output that we don't need.

@alisonshao alisonshao force-pushed the fix-retry-unicode-error branch 2 times, most recently from 692f29d to 8bbd381 Compare December 12, 2025 22:40
@alisonshao
Copy link
Copy Markdown
Collaborator Author

/tag-and-rerun-ci

@alisonshao alisonshao changed the title Fix UnicodeDecodeError in CI retry logic (#14771) [CI] Enable retry logic for flaky CI tests (followup to #14771) Dec 17, 2025
@alisonshao alisonshao force-pushed the fix-retry-unicode-error branch from 5bcae2b to 6f56b74 Compare December 17, 2025 23:26
Re-introduces the smart retry logic from PR #14771 that was reverted,
with a fix for the UnicodeDecodeError that caused CI failures.

Changes:
- Add retry logic for accuracy/performance assertion failures
- Add `errors="ignore"` to subprocess.Popen to handle non-UTF-8 bytes
  in test output, preventing crashes when tests output binary data
- Add CLI flags: --enable-retry, --max-attempts, --retry-wait-seconds
- Update both test/srt/run_suite.py and test/run_suite.py

The UnicodeDecodeError occurred because the retry logic reads stdout
line-by-line with text=True (UTF-8 decoding), and some tests output
non-UTF-8 bytes. Using errors="ignore" safely drops invalid bytes
while preserving all ASCII error messages needed for retry classification.
Complete the workflow changes from #14771:
- Add SGLANG_CI_ENABLE_RETRY env variable
- Add --enable-retry flag to test suite commands
- Add SGLANG_CI_ENABLE_RETRY env to accuracy tests

Timeout changes are omitted - retry logic handles timing internally.
…ry enabled

When --enable-retry is passed, the test runner now adds extra timeout
(default 600s) to accommodate retries. This avoids needing to manually
increase workflow timeouts.

New parameter:
- --retry-timeout-increase: Additional seconds added to timeout when
  retry is enabled (default: 600)
@alisonshao alisonshao force-pushed the fix-retry-unicode-error branch from d147b3a to 64b3c09 Compare December 18, 2025 00:16
- Add enable_retry output to check-changes job based on PR label detection
- Convert all hardcoded --enable-retry flags to conditional RETRY_FLAG
- Update accuracy tests to use label-based enable_retry output
- Remove global SGLANG_CI_ENABLE_RETRY env variable

When the 'enable-retry' label is added to a PR, the retry logic will be
enabled for flaky test failures.
@alisonshao alisonshao added the enable-retry Enable retry logic for CI tests label Dec 19, 2025
@alisonshao
Copy link
Copy Markdown
Collaborator Author

@alisonshao alisonshao changed the title [CI] Enable retry logic for flaky CI tests (followup to #14771) [CI] Enable retry logic for flaky CI tests Dec 20, 2025
@Kangyan-Zhou
Copy link
Copy Markdown
Collaborator

Comment thread .github/workflows/pr-test.yml Outdated
python3 test_eval_accuracy_large.py
env:
SGLANG_CI_ENABLE_RETRY: ${{ needs.check-changes.outputs.enable_retry }}
# Retry logic for flaky accuracy tests
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not have bash logic in the pr-test.yml file?

@Kangyan-Zhou Kangyan-Zhou merged commit ac42797 into main Dec 23, 2025
208 of 214 checks passed
@Kangyan-Zhou Kangyan-Zhou deleted the fix-retry-unicode-error branch December 23, 2025 06:23
jiaming1130 pushed a commit to zhuyijie88/sglang that referenced this pull request Dec 25, 2025
YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enable-retry Enable retry logic for CI tests run-ci

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants