[core] Fix flaky test_worker_exit_intended_user_exit #53909
Merged
jjyao merged 2 commits into ray-project:master on Jun 19, 2025
Conversation
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Contributor
Pull Request Overview
This PR fixes a flaky test by removing the strict assertion on the task failure error message, thus avoiding intermittent failures due to timing differences in error reporting.
- Removed the hard-coded "Socket closed" error message assertion from the worker exit test.
- Updated the verify_failed_task function to accept a None value for error_message.
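The optional `error_message` behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual body of `verify_failed_task`: the helper names `error_message_matches` and `verify_failed_task_sketch`, the plain-dict task record, and the rule that every listed fragment must appear are all assumptions made for the example.

```python
from typing import List, Union


def error_message_matches(
    actual: str, expected: Union[str, List[str], None] = None
) -> bool:
    """Hypothetical helper: check an optional expectation on an error message."""
    if expected is None:
        # No expectation supplied, so the message check is skipped entirely.
        return True
    if isinstance(expected, str):
        # Normalize a single string to a one-element list.
        expected = [expected]
    # Assumption: every expected fragment must appear in the actual message.
    return all(fragment in actual for fragment in expected)


def verify_failed_task_sketch(
    task: dict,
    name: str,
    error_type: str,
    error_message: Union[str, List[str], None] = None,
) -> bool:
    """Hypothetical stand-in that checks a task record given as a plain dict."""
    return (
        task.get("name") == name
        and task.get("error_type") == error_type
        and error_message_matches(task.get("error_message", ""), error_message)
    )
```

With this shape, a caller that only cares about the failure category passes `error_type="WORKER_DIED"` and leaves `error_message` unset, so a timing-dependent message no longer affects the result.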
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| python/ray/tests/test_exit_observability.py | Removed the assertion on error_message to address flakiness in the test worker exit scenario. |
| python/ray/_private/state_api_test_utils.py | Updated verify_failed_task signature to allow error_message to be optional and None. |
Comments suppressed due to low confidence (2)
python/ray/tests/test_exit_observability.py:236
- Consider adding an inline comment that explains why the error_message assertion was removed to prevent future confusion regarding the flakiness fix.
error_type="WORKER_DIED", # Since it's a force cancel through kill signal.
python/ray/_private/state_api_test_utils.py:393
- Update the function docstring to mention that the error_message parameter is now optional and can be None, clarifying its accepted types.
name: str, error_type: str, error_message: Union[str, List[str], None] = None
Contributor
@codope when will the error message be |
Contributor
Author
The error message depends on exactly when the gRPC layer detects the connection failure. Both are valid connection termination errors; they differ only in when the gRPC connection detects the worker death:
jjyao approved these changes Jun 19, 2025
minerharry pushed a commit to minerharry/ray that referenced this pull request Jun 27, 2025
…3909) Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Why are these changes needed?
Here's one example of this test failing -- https://buildkite.com/ray-project/postmerge/builds/10876#01977721-a652-4d5d-be35-1c0ec445f459/177-1290
The test expects the task error message to contain "Socket closed", but gets different error messages depending on timing:
From the logs, this probably happens because:
Fix
Removed the error message assertion from the task cancellation verification in test_worker_exit_intended_user_exit. This eliminates the flakiness while keeping the essential verification. The test still validates:
Failure Log Analysis
Cancellation initiated
Worker Receives Cancellation (Worker Side)
Worker Clean Shutdown
Raylet Can't Find Failure Cause