Skip to content

Add retry logic for scheduled CI tests#14771

Merged
Kangyan-Zhou merged 5 commits intomainfrom
test-retry-logic
Dec 12, 2025
Merged

Add retry logic for scheduled CI tests#14771
Kangyan-Zhou merged 5 commits intomainfrom
test-retry-logic

Conversation

@alisonshao
Copy link
Copy Markdown
Collaborator

@alisonshao alisonshao commented Dec 10, 2025

Summary

This PR implements smart retry logic for CI tests that fail due to accuracy/performance assertion errors (not code errors). The implementation uses Python instead of shell scripts for better maintainability and enables per-file fine-grained retry.

Key Features

  • Pattern-based error classification: Distinguishes retriable failures (accuracy/performance assertions) from non-retriable failures (SyntaxError, ImportError, TypeError, etc.)
  • Per-file retry: Retries only the specific test file that failed, saving time compared to re-running entire test suites
  • Output capture: Captures test output to make intelligent retry decisions
  • GitHub Step Summary: Writes retry information to GitHub Actions summary for visibility

Changes

python/sglang/test/ci/ci_utils.py

  • Added RETRIABLE_PATTERNS and NON_RETRIABLE_PATTERNS for error classification
  • Added is_retriable_failure() function to determine if a failure should be retried
  • Added write_github_step_summary() for GitHub Actions integration
  • Extended run_unittest_files() with retry support (enable_retry, max_retry_attempts, retry_wait_seconds)

test/srt/run_suite.py

  • Added --enable-retry flag to enable smart retry
  • Added --max-retry-attempts flag (default: 3)
  • Added --retry-wait-seconds flag (default: 60)

.github/workflows/pr-test.yml

  • Added SGLANG_CI_ENABLE_RETRY: true environment variable
  • Added --enable-retry to all run_suite.py invocations
  • Increased step timeouts by ~10 minutes to account for potential retries

Retriable vs Non-Retriable Errors

Will retry:

  • AssertionError: 0.77 not greater than 0.8
  • Failures containing "accuracy", "score", "latency", "throughput"
  • Generic AssertionError (assumes retriable)

Will NOT retry:

  • SyntaxError, ImportError, ModuleNotFoundError
  • TypeError, AttributeError, NameError
  • RuntimeError, CUDA out of memory, OOM
  • Segmentation fault, ConnectionRefusedError, FileNotFoundError

Example Output

[CI Retry] test_accuracy.py failed with retriable pattern: accuracy
[CI Retry] Waiting 60s before retry...

[CI Retry] Attempt 2/3 for test_accuracy.py

✓ PASSED on retry (attempt 2): test_accuracy.py

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@alisonshao alisonshao added the enable-retry Enable retry logic for CI tests label Dec 10, 2025
@alisonshao
Copy link
Copy Markdown
Collaborator Author

/tag-and-rerun-ci

@alisonshao alisonshao force-pushed the test-retry-logic branch 2 times, most recently from b48f01e to 57a7f50 Compare December 10, 2025 02:43
@alisonshao
Copy link
Copy Markdown
Collaborator Author

/rerun-stage unit-test-backend-2-gpu

Comment thread scripts/ci/ci_run_with_retry.sh Outdated
Copy link
Copy Markdown
Collaborator

@Kangyan-Zhou Kangyan-Zhou Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether we should incorporate the logic in run_suite.py instead of adding a shell script wrapper. The shell script is not maintainable and hard to extend with additional features.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree.

  • Please use python over shell
  • In run_suite.py, we can do per-file fine-grained retry, which can save some time
  • Last thing, when you retry, we also need to think about that it increased the total time so it will likely exceed the default step timeout

Copy link
Copy Markdown
Collaborator Author

@alisonshao alisonshao Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed from shell to python. also please review and check if the change in timeout make sense. @Kangyan-Zhou @merrymercy

@github-actions

This comment was marked as outdated.

@alisonshao

This comment was marked as outdated.

@alisonshao
Copy link
Copy Markdown
Collaborator Author

Comment thread test/srt/run_suite.py Outdated
arg_parser.add_argument(
"--max-retry-attempts",
type=int,
default=3,
Copy link
Copy Markdown
Collaborator

@Kangyan-Zhou Kangyan-Zhou Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: let's change the default to 2, and maybe change the variable name to be max-attempt so that the naming is more clear (people don't have to think about whether the first run count as retry or not)

Comment thread python/sglang/test/ci/ci_utils.py Outdated
print(f"Success. Time elapsed: {elapsed_total:.2f}s", flush=True)
else:
print(f"Fail. Time elapsed: {time.perf_counter() - tic:.2f}s", flush=True)
print(f"Fail. Time elapsed: {elapsed_total:.2f}s", flush=True)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we change all the print to logger.info?

alisonshao and others added 5 commits December 11, 2025 16:57
Implements Python-based retry logic in run_suite.py and ci_utils.py:
- Adds pattern-based error classification to distinguish retriable failures
  (accuracy/performance assertions) from non-retriable failures (code errors)
- Implements per-file retry with configurable attempts and wait times
- Captures test output for retry decision making
- Adds GitHub Step Summary for retry tracking
- Increases step timeouts to account for potential retries

New CLI flags for run_suite.py:
- --enable-retry: Enable smart retry for assertion failures
- --max-retry-attempts: Max retries per file (default: 3)
- --retry-wait-seconds: Wait between retries (default: 60)

This approach is more maintainable and allows fine-grained per-file
retry instead of re-running entire test suites.
@Kangyan-Zhou Kangyan-Zhou merged commit e59435c into main Dec 12, 2025
18 of 36 checks passed
@Kangyan-Zhou Kangyan-Zhou deleted the test-retry-logic branch December 12, 2025 01:00
@fzyzcjy
Copy link
Copy Markdown
Collaborator

fzyzcjy commented Dec 12, 2025

fzyzcjy added a commit that referenced this pull request Dec 12, 2025
@alisonshao
Copy link
Copy Markdown
Collaborator Author

alisonshao commented Dec 12, 2025

https://github.com/sgl-project/sglang/actions/runs/20159565508/job/57870840756?pr=14958

image

i see the problem, the code reads process.stdout assuming it's UTF-8 encoded text.
lets revert it first. Once this pr is reverted, i will test ci on follow up fix at pr #14983 . @fzyzcjy

alisonshao added a commit that referenced this pull request Dec 12, 2025
Add errors="ignore" to subprocess.Popen to handle non-UTF-8 bytes
in test output. This fixes the CI crash when tests output binary
data or non-standard encoded characters.

Fixes the error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9e in position 6489
@fzyzcjy
Copy link
Copy Markdown
Collaborator

fzyzcjy commented Dec 12, 2025

sure

Kangyan-Zhou added a commit to Kangyan-Zhou/sglang that referenced this pull request Dec 12, 2025
alisonshao added a commit that referenced this pull request Dec 12, 2025
Re-introduces the smart retry logic from PR #14771 that was reverted,
with a fix for the UnicodeDecodeError that caused CI failures.

Changes:
- Add retry logic for accuracy/performance assertion failures
- Add `errors="ignore"` to subprocess.Popen to handle non-UTF-8 bytes
  in test output, preventing crashes when tests output binary data
- Add CLI flags: --enable-retry, --max-attempts, --retry-wait-seconds
- Update both test/srt/run_suite.py and test/run_suite.py

The UnicodeDecodeError occurred because the retry logic reads stdout
line-by-line with text=True (UTF-8 decoding), and some tests output
non-UTF-8 bytes. Using errors="ignore" safely drops invalid bytes
while preserving all ASCII error messages needed for retry classification.
alisonshao added a commit that referenced this pull request Dec 12, 2025
Re-introduces the smart retry logic from PR #14771 that was reverted,
with a fix for the UnicodeDecodeError that caused CI failures.

Changes:
- Add retry logic for accuracy/performance assertion failures
- Add `errors="ignore"` to subprocess.Popen to handle non-UTF-8 bytes
  in test output, preventing crashes when tests output binary data
- Add CLI flags: --enable-retry, --max-attempts, --retry-wait-seconds
- Update both test/srt/run_suite.py and test/run_suite.py

The UnicodeDecodeError occurred because the retry logic reads stdout
line-by-line with text=True (UTF-8 decoding), and some tests output
non-UTF-8 bytes. Using errors="ignore" safely drops invalid bytes
while preserving all ASCII error messages needed for retry classification.
Prozac614 pushed a commit to Prozac614/sglang that referenced this pull request Dec 17, 2025
alisonshao added a commit that referenced this pull request Dec 17, 2025
Complete the workflow changes from #14771:
- Add SGLANG_CI_ENABLE_RETRY env variable
- Add --enable-retry flag to test suite commands
- Increase timeouts to accommodate retry logic
alisonshao added a commit that referenced this pull request Dec 17, 2025
Complete the workflow changes from #14771:
- Add SGLANG_CI_ENABLE_RETRY env variable
- Add --enable-retry flag to test suite commands
- Add SGLANG_CI_ENABLE_RETRY env to accuracy tests

Timeout changes are omitted - retry logic handles timing internally.
alisonshao added a commit that referenced this pull request Dec 18, 2025
Re-introduces the smart retry logic from PR #14771 that was reverted,
with a fix for the UnicodeDecodeError that caused CI failures.

Changes:
- Add retry logic for accuracy/performance assertion failures
- Add `errors="ignore"` to subprocess.Popen to handle non-UTF-8 bytes
  in test output, preventing crashes when tests output binary data
- Add CLI flags: --enable-retry, --max-attempts, --retry-wait-seconds
- Update both test/srt/run_suite.py and test/run_suite.py

The UnicodeDecodeError occurred because the retry logic reads stdout
line-by-line with text=True (UTF-8 decoding), and some tests output
non-UTF-8 bytes. Using errors="ignore" safely drops invalid bytes
while preserving all ASCII error messages needed for retry classification.
alisonshao added a commit that referenced this pull request Dec 18, 2025
Complete the workflow changes from #14771:
- Add SGLANG_CI_ENABLE_RETRY env variable
- Add --enable-retry flag to test suite commands
- Add SGLANG_CI_ENABLE_RETRY env to accuracy tests

Timeout changes are omitted - retry logic handles timing internally.
YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enable-retry Enable retry logic for CI tests run-ci

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants