Add retry logic for scheduled CI tests#14771
Conversation
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
bec6b4b to
6260d0e
Compare
|
/tag-and-rerun-ci |
b48f01e to
57a7f50
Compare
|
/rerun-stage unit-test-backend-2-gpu |
There was a problem hiding this comment.
I wonder whether we should incorporate the logic in run_suite.py instead of adding a shell script wrapper. The shell script is not maintainable and hard to extend with additional features.
There was a problem hiding this comment.
I agree.
- Please use python over shell
- In
run_suite.py, we can do per-file fine-grained retry, which can save some time - Last thing, when you retry, we also need to think about that it increased the total time so it will likely exceed the default step timeout
There was a problem hiding this comment.
changed from shell to python. also please review and check if the change in timeout make sense. @Kangyan-Zhou @merrymercy
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
ae2cfcd to
6fd9c22
Compare
| arg_parser.add_argument( | ||
| "--max-retry-attempts", | ||
| type=int, | ||
| default=3, |
There was a problem hiding this comment.
nit: let's change the default to 2, and maybe change the variable name to be max-attempt so that the naming is more clear (people don't have to think about whether the first run count as retry or not)
| print(f"Success. Time elapsed: {elapsed_total:.2f}s", flush=True) | ||
| else: | ||
| print(f"Fail. Time elapsed: {time.perf_counter() - tic:.2f}s", flush=True) | ||
| print(f"Fail. Time elapsed: {elapsed_total:.2f}s", flush=True) |
There was a problem hiding this comment.
Can we change all the print to logger.info?
Implements Python-based retry logic in run_suite.py and ci_utils.py: - Adds pattern-based error classification to distinguish retriable failures (accuracy/performance assertions) from non-retriable failures (code errors) - Implements per-file retry with configurable attempts and wait times - Captures test output for retry decision making - Adds GitHub Step Summary for retry tracking - Increases step timeouts to account for potential retries New CLI flags for run_suite.py: - --enable-retry: Enable smart retry for assertion failures - --max-retry-attempts: Max retries per file (default: 3) - --retry-wait-seconds: Wait between retries (default: 60) This approach is more maintainable and allows fine-grained per-file retry instead of re-running entire test suites.
bba962c to
bb9e538
Compare
i see the problem, the code reads process.stdout assuming it's UTF-8 encoded text. |
Add errors="ignore" to subprocess.Popen to handle non-UTF-8 bytes in test output. This fixes the CI crash when tests output binary data or non-standard encoded characters. Fixes the error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9e in position 6489
|
sure |
This reverts commit e59435c.
Re-introduces the smart retry logic from PR #14771 that was reverted, with a fix for the UnicodeDecodeError that caused CI failures. Changes: - Add retry logic for accuracy/performance assertion failures - Add `errors="ignore"` to subprocess.Popen to handle non-UTF-8 bytes in test output, preventing crashes when tests output binary data - Add CLI flags: --enable-retry, --max-attempts, --retry-wait-seconds - Update both test/srt/run_suite.py and test/run_suite.py The UnicodeDecodeError occurred because the retry logic reads stdout line-by-line with text=True (UTF-8 decoding), and some tests output non-UTF-8 bytes. Using errors="ignore" safely drops invalid bytes while preserving all ASCII error messages needed for retry classification.
Re-introduces the smart retry logic from PR #14771 that was reverted, with a fix for the UnicodeDecodeError that caused CI failures. Changes: - Add retry logic for accuracy/performance assertion failures - Add `errors="ignore"` to subprocess.Popen to handle non-UTF-8 bytes in test output, preventing crashes when tests output binary data - Add CLI flags: --enable-retry, --max-attempts, --retry-wait-seconds - Update both test/srt/run_suite.py and test/run_suite.py The UnicodeDecodeError occurred because the retry logic reads stdout line-by-line with text=True (UTF-8 decoding), and some tests output non-UTF-8 bytes. Using errors="ignore" safely drops invalid bytes while preserving all ASCII error messages needed for retry classification.
Complete the workflow changes from #14771: - Add SGLANG_CI_ENABLE_RETRY env variable - Add --enable-retry flag to test suite commands - Increase timeouts to accommodate retry logic
Complete the workflow changes from #14771: - Add SGLANG_CI_ENABLE_RETRY env variable - Add --enable-retry flag to test suite commands - Add SGLANG_CI_ENABLE_RETRY env to accuracy tests Timeout changes are omitted - retry logic handles timing internally.
Re-introduces the smart retry logic from PR #14771 that was reverted, with a fix for the UnicodeDecodeError that caused CI failures. Changes: - Add retry logic for accuracy/performance assertion failures - Add `errors="ignore"` to subprocess.Popen to handle non-UTF-8 bytes in test output, preventing crashes when tests output binary data - Add CLI flags: --enable-retry, --max-attempts, --retry-wait-seconds - Update both test/srt/run_suite.py and test/run_suite.py The UnicodeDecodeError occurred because the retry logic reads stdout line-by-line with text=True (UTF-8 decoding), and some tests output non-UTF-8 bytes. Using errors="ignore" safely drops invalid bytes while preserving all ASCII error messages needed for retry classification.
Complete the workflow changes from #14771: - Add SGLANG_CI_ENABLE_RETRY env variable - Add --enable-retry flag to test suite commands - Add SGLANG_CI_ENABLE_RETRY env to accuracy tests Timeout changes are omitted - retry logic handles timing internally.


Summary
This PR implements smart retry logic for CI tests that fail due to accuracy/performance assertion errors (not code errors). The implementation uses Python instead of shell scripts for better maintainability and enables per-file fine-grained retry.
Key Features
Changes
python/sglang/test/ci/ci_utils.pyRETRIABLE_PATTERNSandNON_RETRIABLE_PATTERNSfor error classificationis_retriable_failure()function to determine if a failure should be retriedwrite_github_step_summary()for GitHub Actions integrationrun_unittest_files()with retry support (enable_retry,max_retry_attempts,retry_wait_seconds)test/srt/run_suite.py--enable-retryflag to enable smart retry--max-retry-attemptsflag (default: 3)--retry-wait-secondsflag (default: 60).github/workflows/pr-test.ymlSGLANG_CI_ENABLE_RETRY: trueenvironment variable--enable-retryto allrun_suite.pyinvocationsRetriable vs Non-Retriable Errors
Will retry:
AssertionError: 0.77 not greater than 0.8AssertionError(assumes retriable)Will NOT retry:
SyntaxError,ImportError,ModuleNotFoundErrorTypeError,AttributeError,NameErrorRuntimeError,CUDA out of memory,OOMSegmentation fault,ConnectionRefusedError,FileNotFoundErrorExample Output