Add retry logic for scheduled CI tests by alisonshao · Pull Request #14771 · sgl-project/sglang

alisonshao · 2025-12-10T02:17:13Z

Summary

This PR implements smart retry logic for CI tests that fail due to accuracy/performance assertion errors (not code errors). The implementation uses Python instead of shell scripts for better maintainability and enables per-file fine-grained retry.

Key Features

Pattern-based error classification: Distinguishes retriable failures (accuracy/performance assertions) from non-retriable failures (SyntaxError, ImportError, TypeError, etc.)
Per-file retry: Retries only the specific test file that failed, saving time compared to re-running entire test suites
Output capture: Captures test output to make intelligent retry decisions
GitHub Step Summary: Writes retry information to GitHub Actions summary for visibility

Changes

python/sglang/test/ci/ci_utils.py

Added RETRIABLE_PATTERNS and NON_RETRIABLE_PATTERNS for error classification
Added is_retriable_failure() function to determine if a failure should be retried
Added write_github_step_summary() for GitHub Actions integration
Extended run_unittest_files() with retry support (enable_retry, max_retry_attempts, retry_wait_seconds)

test/srt/run_suite.py

Added --enable-retry flag to enable smart retry
Added --max-retry-attempts flag (default: 3)
Added --retry-wait-seconds flag (default: 60)

.github/workflows/pr-test.yml

Added SGLANG_CI_ENABLE_RETRY: true environment variable
Added --enable-retry to all run_suite.py invocations
Increased step timeouts by ~10 minutes to account for potential retries

Retriable vs Non-Retriable Errors

Will retry:

AssertionError: 0.77 not greater than 0.8
Failures containing "accuracy", "score", "latency", "throughput"
Generic AssertionError (assumes retriable)

Will NOT retry:

SyntaxError, ImportError, ModuleNotFoundError
TypeError, AttributeError, NameError
RuntimeError, CUDA out of memory, OOM
Segmentation fault, ConnectionRefusedError, FileNotFoundError

Example Output

[CI Retry] test_accuracy.py failed with retriable pattern: accuracy
[CI Retry] Waiting 60s before retry...

[CI Retry] Attempt 2/3 for test_accuracy.py

✓ PASSED on retry (attempt 2): test_accuracy.py

gemini-code-assist · 2025-12-10T02:17:16Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

alisonshao · 2025-12-10T02:27:49Z

/tag-and-rerun-ci

alisonshao · 2025-12-10T09:28:56Z

/rerun-stage unit-test-backend-2-gpu

Kangyan-Zhou · 2025-12-10T09:32:57Z

I wonder whether we should incorporate the logic in run_suite.py instead of adding a shell script wrapper. The shell script is not maintainable and hard to extend with additional features.

I agree.

Please use python over shell

In run_suite.py, we can do per-file fine-grained retry, which can save some time

Last thing, when you retry, we also need to think about that it increased the total time so it will likely exceed the default step timeout

changed from shell to python. also please review and check if the change in timeout make sense. @Kangyan-Zhou @merrymercy

alisonshao · 2025-12-11T07:09:20Z

retry example: https://github.com/sgl-project/sglang/actions/runs/20118177217/job/57738032539?pr=14771#step:5:2832

Kangyan-Zhou · 2025-12-11T07:47:37Z

+    arg_parser.add_argument(
+        "--max-retry-attempts",
+        type=int,
+        default=3,


nit: let's change the default to 2, and maybe change the variable name to be max-attempt so that the naming is more clear (people don't have to think about whether the first run count as retry or not)

Kangyan-Zhou · 2025-12-11T17:46:37Z

+        print(f"Success. Time elapsed: {elapsed_total:.2f}s", flush=True)
    else:
-        print(f"Fail. Time elapsed: {time.perf_counter() - tic:.2f}s", flush=True)
+        print(f"Fail. Time elapsed: {elapsed_total:.2f}s", flush=True)


Can we change all the print to logger.info?

Implements Python-based retry logic in run_suite.py and ci_utils.py: - Adds pattern-based error classification to distinguish retriable failures (accuracy/performance assertions) from non-retriable failures (code errors) - Implements per-file retry with configurable attempts and wait times - Captures test output for retry decision making - Adds GitHub Step Summary for retry tracking - Increases step timeouts to account for potential retries New CLI flags for run_suite.py: - --enable-retry: Enable smart retry for assertion failures - --max-retry-attempts: Max retries per file (default: 3) - --retry-wait-seconds: Wait between retries (default: 60) This approach is more maintainable and allows fine-grained per-file retry instead of re-running entire test suites.

fzyzcjy · 2025-12-12T09:04:59Z

https://github.com/sgl-project/sglang/actions/runs/20159565508/job/57870840756?pr=14958

This reverts commit e59435c.

alisonshao · 2025-12-12T10:13:09Z

https://github.com/sgl-project/sglang/actions/runs/20159565508/job/57870840756?pr=14958

i see the problem, the code reads process.stdout assuming it's UTF-8 encoded text.
lets revert it first. Once this pr is reverted, i will test ci on follow up fix at pr #14983 . @fzyzcjy

Add errors="ignore" to subprocess.Popen to handle non-UTF-8 bytes in test output. This fixes the CI crash when tests output binary data or non-standard encoded characters. Fixes the error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9e in position 6489

fzyzcjy · 2025-12-12T12:27:54Z

sure

This reverts commit e59435c.

Re-introduces the smart retry logic from PR #14771 that was reverted, with a fix for the UnicodeDecodeError that caused CI failures. Changes: - Add retry logic for accuracy/performance assertion failures - Add `errors="ignore"` to subprocess.Popen to handle non-UTF-8 bytes in test output, preventing crashes when tests output binary data - Add CLI flags: --enable-retry, --max-attempts, --retry-wait-seconds - Update both test/srt/run_suite.py and test/run_suite.py The UnicodeDecodeError occurred because the retry logic reads stdout line-by-line with text=True (UTF-8 decoding), and some tests output non-UTF-8 bytes. Using errors="ignore" safely drops invalid bytes while preserving all ASCII error messages needed for retry classification.

Complete the workflow changes from #14771: - Add SGLANG_CI_ENABLE_RETRY env variable - Add --enable-retry flag to test suite commands - Increase timeouts to accommodate retry logic

Complete the workflow changes from #14771: - Add SGLANG_CI_ENABLE_RETRY env variable - Add --enable-retry flag to test suite commands - Add SGLANG_CI_ENABLE_RETRY env to accuracy tests Timeout changes are omitted - retry logic handles timing internally.

Re-introduces the smart retry logic from PR #14771 that was reverted, with a fix for the UnicodeDecodeError that caused CI failures. Changes: - Add retry logic for accuracy/performance assertion failures - Add `errors="ignore"` to subprocess.Popen to handle non-UTF-8 bytes in test output, preventing crashes when tests output binary data - Add CLI flags: --enable-retry, --max-attempts, --retry-wait-seconds - Update both test/srt/run_suite.py and test/run_suite.py The UnicodeDecodeError occurred because the retry logic reads stdout line-by-line with text=True (UTF-8 decoding), and some tests output non-UTF-8 bytes. Using errors="ignore" safely drops invalid bytes while preserving all ASCII error messages needed for retry classification.

Complete the workflow changes from #14771: - Add SGLANG_CI_ENABLE_RETRY env variable - Add --enable-retry flag to test suite commands - Add SGLANG_CI_ENABLE_RETRY env to accuracy tests Timeout changes are omitted - retry logic handles timing internally.

alisonshao requested review from Fridge003, Kangyan-Zhou, ispobock and merrymercy as code owners December 10, 2025 02:17

alisonshao force-pushed the test-retry-logic branch from bec6b4b to 6260d0e Compare December 10, 2025 02:23

alisonshao added the enable-retry Enable retry logic for CI tests label Dec 10, 2025

github-actions Bot added the run-ci label Dec 10, 2025

alisonshao force-pushed the test-retry-logic branch 2 times, most recently from b48f01e to 57a7f50 Compare December 10, 2025 02:43

Kangyan-Zhou reviewed Dec 10, 2025

View reviewed changes

This comment was marked as outdated.

Sign in to view

alisonshao force-pushed the test-retry-logic branch from ae2cfcd to 6fd9c22 Compare December 11, 2025 00:49

Kangyan-Zhou reviewed Dec 11, 2025

View reviewed changes

alisonshao and others added 5 commits December 11, 2025 16:57

Rename max-retry-attempts to max-attempts; changed default to 2

689f41e

Fix max_attempts parameter name mismatch and update help text

7c4fb11

Add retry arguments to test/run_suite.py for future migration

f0ef5b6

Replace print statements with logger.info in ci_utils.py

bb9e538

alisonshao force-pushed the test-retry-logic branch from bba962c to bb9e538 Compare December 12, 2025 00:57

Kangyan-Zhou merged commit e59435c into main Dec 12, 2025
18 of 36 checks passed

Kangyan-Zhou deleted the test-retry-logic branch December 12, 2025 01:00

fzyzcjy added a commit that referenced this pull request Dec 12, 2025

Revert "Add retry logic for scheduled CI tests (#14771)"

a1181e2

This reverts commit e59435c.

alisonshao mentioned this pull request Dec 12, 2025

[CI] Enable retry logic for flaky CI tests #14983

Merged

Kangyan-Zhou added a commit to Kangyan-Zhou/sglang that referenced this pull request Dec 12, 2025

Revert "Add retry logic for scheduled CI tests (sgl-project#14771)"

6ac81df

This reverts commit e59435c.

Prozac614 pushed a commit to Prozac614/sglang that referenced this pull request Dec 17, 2025

Add retry logic for scheduled CI tests (sgl-project#14771)

5f8d6ca

YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026

Add retry logic for scheduled CI tests (sgl-project#14771)

9693f3d

Conversation

alisonshao commented Dec 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Key Features

Changes

Retriable vs Non-Retriable Errors

Example Output

Uh oh!

gemini-code-assist Bot commented Dec 10, 2025

Uh oh!

alisonshao commented Dec 10, 2025

Uh oh!

alisonshao commented Dec 10, 2025

Uh oh!

Kangyan-Zhou Dec 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

merrymercy Dec 10, 2025

Choose a reason for hiding this comment

Uh oh!

alisonshao Dec 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

This comment was marked as outdated.

This comment was marked as outdated.

alisonshao commented Dec 11, 2025

Uh oh!

Kangyan-Zhou Dec 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Kangyan-Zhou Dec 11, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

fzyzcjy commented Dec 12, 2025

Uh oh!

alisonshao commented Dec 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fzyzcjy commented Dec 12, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

alisonshao commented Dec 10, 2025 •

edited

Loading

Kangyan-Zhou Dec 10, 2025 •

edited

Loading

alisonshao Dec 11, 2025 •

edited

Loading

Kangyan-Zhou Dec 11, 2025 •

edited

Loading

alisonshao commented Dec 12, 2025 •

edited

Loading