Skip to content

[core] Fix flaky test_worker_exit_intended_user_exit#53909

Merged
jjyao merged 2 commits intoray-project:masterfrom
codope:fix-flaky-test_exit_observability
Jun 19, 2025
Merged

[core] Fix flaky test_worker_exit_intended_user_exit#53909
jjyao merged 2 commits intoray-project:masterfrom
codope:fix-flaky-test_exit_observability

Conversation

@codope
Copy link
Copy Markdown
Contributor

@codope codope commented Jun 18, 2025

Why are these changes needed?

Here's one example of this test failing -- https://buildkite.com/ray-project/postmerge/builds/10876#01977721-a652-4d5d-be35-1c0ec445f459/177-1290

The test expects the task error message to contain "Socket closed", but gets different error messages depending on timing:

  • Expected: "Socket closed"
  • Actual: "RpcError: RPC Error message: recvmsg:Connection reset by peer; RPC Error details: rpc_code: 14"

From the logs, this probably happens because:

  1. The worker exit information propagates through the raylet --> GCS --> driver path
  2. The task failure is detected through the direct RPC connection failure
  3. These two information flows have different latencies and can arrive in different orders

Fix

Removed error message assertion from the task cancellation verification in test_worker_exit_intended_user_exit. This eliminates flakiness without compromising test coverage. Maintains essential verification.

The test still validates:

  • Worker exit type: INTENDED_USER_EXIT
  • Exit detail contains: "ray.cancel"
  • Task failure type: WORKER_DIED

Failure Log Analysis

Cancellation initiated

[2025-06-17 00:22:43,508 I 7865 7865] normal_task_submitter.cc:705: Cancelling a task: 71b133a11e1c461cffffffffffffffffffffffff01000000 force_kill: 1 recursive: 1
  • Driver calls ray.cancel(t, force=True)
  • NormalTaskSubmitter::CancelTask sends RPC with force_kill=1

Worker Receives Cancellation (Worker Side)

[2025-06-17 00:22:43,508 I 9326 9510] core_worker.cc:4453: Cancel a normal task task_id=71b133a11e1c461cffffffffffffffffffffffff01000000
[2025-06-17 00:22:43,508 W 9326 9510] core_worker.cc:1262: Force exit the process. Details: The worker exits because the task cancel-f has received a force ray.cancel request.
  • Worker's CoreWorker::HandleCancelTask processes the request
  • Since force_kill=true, it immediately calls ExitWorker()

Worker Clean Shutdown

[2025-06-17 00:22:43,511 I 9326 9510] raylet_client.cc:73: RayletClient::Disconnect, exit_type=INTENDED_USER_EXIT, exit_detail=The worker exits because the task cancel-f has received a force ray.cancel request.
  • Worker cleanly disconnects from raylet with proper exit information
  • Exit type is correctly set to INTENDED_USER_EXIT

Raylet Can't Find Failure Cause

[2025-06-17 00:22:43,521 I 9263 9263] (raylet) node_manager.cc:635: didn't find failure cause for task 986822d6e5815953e511f19e1e76e9bf47154b8101000000

[2025-06-17 00:22:43,526 W 7865 9324] normal_task_submitter.cc:660: Task failure cause for task 71b133a11e1c461cffffffffffffffffffffffff01000000: Error type 0 exception string  fail immedediately: 0
[2025-06-17 00:22:43,526 W 7865 9324] task_manager.cc:1119: Task attempt 71b133a11e1c461cffffffffffffffffffffffff01000000 failed with error WORKER_DIED Fail immediately? 0, status RpcError: RPC Error message: recvmsg:Connection reset by peer; RPC Error details:  rpc_code: 14
  • Driver's task submitter detects the broken connection
  • Reports it as WORKER_DIED with RpcError: recvmsg:Connection reset by peer; rpc_code: 14

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Copilot AI review requested due to automatic review settings June 18, 2025 06:58
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes a flaky test by removing the strict assertion on the task failure error message, thus avoiding intermittent failures due to timing differences in error reporting.

  • Removed the hard-coded "Socket closed" error message assertion from the worker exit test.
  • Updated the verify_failed_task function to accept a None value for error_message.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
python/ray/tests/test_exit_observability.py Removed the assertion on error_message to address flakiness in the test worker exit scenario.
python/ray/_private/state_api_test_utils.py Updated verify_failed_task signature to allow error_message to be optional and None.
Comments suppressed due to low confidence (2)

python/ray/tests/test_exit_observability.py:236

  • Consider adding an inline comment that explains why the error_message assertion was removed to prevent future confusion regarding the flakiness fix.
            error_type="WORKER_DIED",  # Since it's a force cancel through kill signal.

python/ray/_private/state_api_test_utils.py:393

  • Update the function docstring to mention that the error_message parameter is now optional and can be None, clarifying its accepted types.
    name: str, error_type: str, error_message: Union[str, List[str], None] = None

@codope codope added the go add ONLY when ready to merge, run all tests label Jun 18, 2025
@jjyao
Copy link
Copy Markdown
Contributor

jjyao commented Jun 18, 2025

@codope when will the error message be Socket closed

@jjyao jjyao self-assigned this Jun 18, 2025
@codope
Copy link
Copy Markdown
Contributor Author

codope commented Jun 19, 2025

@codope when will the error message be Socket closed

The error message depends on exactly when the grpc layer detects the connection failure. Both are valid connection termination errors, but they represent different timing of when the grpc connection detects the worker death:

  1. "Socket closed" - Connection cleanly closed
  2. "Connection reset by peer" - TCP RST packet received (more abrupt termination)

@jjyao jjyao merged commit 7140cf2 into ray-project:master Jun 19, 2025
5 checks passed
minerharry pushed a commit to minerharry/ray that referenced this pull request Jun 27, 2025
…3909)

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants