[CI] Enable retry logic for flaky CI tests by alisonshao · Pull Request #14983 · sgl-project/sglang

alisonshao · 2025-12-12T10:20:57Z

Summary

Add errors="ignore" to subprocess.Popen in CI retry logic to handle non-UTF-8 bytes in test output

Motivation

PR #14771 introduced retry logic that reads subprocess stdout to classify errors. When tests output binary data or non-standard encoded characters, this causes:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9e in position 6489: invalid start byte

This fix ignores invalid bytes instead of crashing, since the error classification only needs to match ASCII patterns like AssertionError, RuntimeError, etc.

Test plan

CI should no longer crash with UnicodeDecodeError when tests output non-UTF-8 bytes

gemini-code-assist · 2025-12-12T10:21:10Z

Summary of Changes

Hello @alisonshao, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a UnicodeDecodeError that was occurring in the CI retry mechanism. The error was triggered when subprocess.Popen attempted to read test output containing non-UTF-8 encoded characters. By introducing an errors="ignore" parameter, the CI process can now bypass these problematic characters, ensuring that test runs complete without interruption and that the system can still effectively classify errors based on ASCII patterns.

Highlights

Error Handling: Add errors="ignore" to subprocess.Popen in the CI retry logic to prevent UnicodeDecodeError when processing test output containing non-UTF-8 bytes.
CI Stability: Enhance the robustness of the CI system by allowing it to gracefully handle diverse test outputs without crashing, ensuring that error classification can proceed as intended for ASCII patterns.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request correctly fixes a UnicodeDecodeError in the CI retry logic by adding errors="ignore" to the subprocess.Popen call. This is an effective solution to prevent the CI from crashing due to non-UTF-8 output from tests. I have added one comment to highlight a maintainability trade-off: using ignore results in the silent loss of data in logs, which could complicate debugging the underlying test failures. Despite this, the fix is a reasonable approach to ensure CI stability.

gemini-code-assist · 2025-12-12T10:22:37Z

                    stderr=subprocess.STDOUT,
                    env=os.environ,
                    text=True,
+                    errors="ignore",  # Ignore non-UTF-8 bytes to prevent UnicodeDecodeError


Using errors="ignore" effectively prevents the UnicodeDecodeError, but it introduces a maintainability trade-off. Any non-UTF-8 bytes in the test output will be silently discarded, leading to incomplete logs. This loss of information could make it more difficult to debug the root cause of why a test is producing invalid byte sequences.

While this is a reasonable trade-off to ensure CI stability, a more robust solution (though requiring a larger change) would involve reading the output as bytes and decoding it separately for logging (e.g., with errors='backslashreplace') versus for the retry logic (errors='ignore').

alisonshao · 2025-12-12T10:26:36Z

Verified locally that this fix works.

Test setup - created a script that outputs the same invalid UTF-8 byte (0x9e) that caused the CI failure:

import sys
print("Starting test...")
sys.stdout.buffer.write(b"Binary data: \x9e\x9f\xa0 here\n")  # Invalid UTF-8 bytes
print("Test complete")
sys.exit(1)

Without errors='ignore':

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9e in position 13: invalid start byte

Crashes immediately when iterating over process.stdout, same as the CI failure.

With errors='ignore':

Got line: Binary data:  here
Got line: Starting test...
Got line: Test complete
Result: Success! Exit code: 1

Reads the output fine. Invalid bytes get dropped but all the ASCII text we need for error pattern matching (AssertionError, RuntimeError, etc.) is preserved.

This is safe because the retry logic only matches ASCII error patterns anyway - the non-UTF-8 bytes are just binary noise from GPU libraries or corrupted output that we don't need.

alisonshao · 2025-12-12T23:02:09Z

/tag-and-rerun-ci

Re-introduces the smart retry logic from PR #14771 that was reverted, with a fix for the UnicodeDecodeError that caused CI failures. Changes: - Add retry logic for accuracy/performance assertion failures - Add `errors="ignore"` to subprocess.Popen to handle non-UTF-8 bytes in test output, preventing crashes when tests output binary data - Add CLI flags: --enable-retry, --max-attempts, --retry-wait-seconds - Update both test/srt/run_suite.py and test/run_suite.py The UnicodeDecodeError occurred because the retry logic reads stdout line-by-line with text=True (UTF-8 decoding), and some tests output non-UTF-8 bytes. Using errors="ignore" safely drops invalid bytes while preserving all ASCII error messages needed for retry classification.

Complete the workflow changes from #14771: - Add SGLANG_CI_ENABLE_RETRY env variable - Add --enable-retry flag to test suite commands - Add SGLANG_CI_ENABLE_RETRY env to accuracy tests Timeout changes are omitted - retry logic handles timing internally.

…ry enabled When --enable-retry is passed, the test runner now adds extra timeout (default 600s) to accommodate retries. This avoids needing to manually increase workflow timeouts. New parameter: - --retry-timeout-increase: Additional seconds added to timeout when retry is enabled (default: 600)

- Add enable_retry output to check-changes job based on PR label detection - Convert all hardcoded --enable-retry flags to conditional RETRY_FLAG - Update accuracy tests to use label-based enable_retry output - Remove global SGLANG_CI_ENABLE_RETRY env variable When the 'enable-retry' label is added to a PR, the retry logic will be enabled for flaky test failures.

alisonshao · 2025-12-20T00:43:12Z

ci: https://github.com/sgl-project/sglang/actions/runs/20363783580?pr=14983

Kangyan-Zhou · 2025-12-21T07:29:51Z

https://github.com/sgl-project/sglang/actions/runs/20403159306/job/58628684454?pr=14983 It seems the log is completely missing

Kangyan-Zhou · 2025-12-22T05:34:56Z

-          python3 test_eval_accuracy_large.py
-        env:
-          SGLANG_CI_ENABLE_RETRY: ${{ needs.check-changes.outputs.enable_retry }}
+          # Retry logic for flaky accuracy tests


Can we not have bash logic in the pr-test.yml file?

gemini-code-assist Bot reviewed Dec 12, 2025

View reviewed changes

alisonshao mentioned this pull request Dec 12, 2025

Add retry logic for scheduled CI tests #14771

Merged

alisonshao force-pushed the fix-retry-unicode-error branch 2 times, most recently from 692f29d to 8bbd381 Compare December 12, 2025 22:40

github-actions Bot added the run-ci label Dec 12, 2025

alisonshao requested review from Fridge003, Kangyan-Zhou, ispobock and merrymercy as code owners December 17, 2025 23:18

alisonshao changed the title ~~Fix UnicodeDecodeError in CI retry logic (#14771)~~ [CI] Enable retry logic for flaky CI tests (followup to #14771) Dec 17, 2025

alisonshao force-pushed the fix-retry-unicode-error branch from 5bcae2b to 6f56b74 Compare December 17, 2025 23:26

alisonshao added 3 commits December 17, 2025 16:15

alisonshao force-pushed the fix-retry-unicode-error branch from d147b3a to 64b3c09 Compare December 18, 2025 00:16

alisonshao added the enable-retry Enable retry logic for CI tests label Dec 19, 2025

alisonshao and others added 3 commits December 18, 2025 21:42

Merge branch 'main' into fix-retry-unicode-error

47521d4

Add timeout to retriable patterns

a982a8f

Enable retry on scheduled runs in addition to label

6df9191

alisonshao changed the title ~~[CI] Enable retry logic for flaky CI tests (followup to #14771)~~ [CI] Enable retry logic for flaky CI tests Dec 20, 2025

alisonshao and others added 4 commits December 20, 2025 17:59

Always enable retry in CI environment

7ab53b7

Simplify retry summary to short one-liner

f29f496

Include test names in retry summary

dbf0ddf

Merge branch 'main' into fix-retry-unicode-error

7f7d1e9

alisonshao and others added 4 commits December 20, 2025 23:43

Fix missing logs by configuring logger to output to stdout

9a9d997

Merge branch 'main' into fix-retry-unicode-error

ac60c87

Merge branch 'main' into fix-retry-unicode-error

839b9ce

Add bash retry logic for accuracy tests

661751e

Kangyan-Zhou reviewed Dec 22, 2025

View reviewed changes

Add Python run_with_retry module for accuracy tests

89e7b88

Kangyan-Zhou merged commit ac42797 into main Dec 23, 2025
208 of 214 checks passed

Kangyan-Zhou deleted the fix-retry-unicode-error branch December 23, 2025 06:23

jiaming1130 pushed a commit to zhuyijie88/sglang that referenced this pull request Dec 25, 2025

[CI] Enable retry logic for flaky CI tests (sgl-project#14983)

e2b4c10

YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026

[CI] Enable retry logic for flaky CI tests (sgl-project#14983)

10c5e56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CI] Enable retry logic for flaky CI tests#14983

[CI] Enable retry logic for flaky CI tests#14983
Kangyan-Zhou merged 16 commits intomainfrom
fix-retry-unicode-error

alisonshao commented Dec 12, 2025

Uh oh!

gemini-code-assist Bot commented Dec 12, 2025

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Dec 12, 2025

Uh oh!

alisonshao commented Dec 12, 2025

Uh oh!

alisonshao commented Dec 12, 2025

Uh oh!

alisonshao commented Dec 20, 2025

Uh oh!

Kangyan-Zhou commented Dec 21, 2025

Uh oh!

Kangyan-Zhou Dec 22, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

alisonshao commented Dec 12, 2025

Summary

Motivation

Test plan

Uh oh!

gemini-code-assist Bot commented Dec 12, 2025

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Dec 12, 2025

Choose a reason for hiding this comment

Uh oh!

alisonshao commented Dec 12, 2025

Uh oh!

alisonshao commented Dec 12, 2025

Uh oh!

alisonshao commented Dec 20, 2025

Uh oh!

Kangyan-Zhou commented Dec 21, 2025

Uh oh!

Kangyan-Zhou Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants