[core] Fix race condition in raylet graceful shutdown #53762
edoakes merged 3 commits into ray-project:master
Conversation
Pull Request Overview
This PR fixes a race condition during raylet graceful shutdown by marking intentional shutdowns and enforcing immediate exit on agent hard failures.
- Wraps the graceful shutdown callback to set `is_shutdown_request_received_` before invoking it, preventing unintended fatal crashes.
- Updates `AgentManager` to distinguish zero and non-zero agent exits, logging a fatal error and exiting immediately on hard failures.
- Clarifies log messages to separate successful agent exits from unexpected crashes.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/ray/raylet/node_manager.cc | Wraps `shutdown_raylet_gracefully_` in a lambda that sets `is_shutdown_request_received_` (see the sketch below). |
| src/ray/raylet/agent_manager.cc | Adds a branch for non-zero agent exits to `RAY_LOG(FATAL)` and refines the zero-exit log and reason. |
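A minimal sketch of the wrapped callback described above, assuming the member names from the diff; the surrounding factory code and the exact `rpc::NodeDeathInfo` type are elided for brevity:

```cpp
// Sketch (not the verbatim diff): mark the shutdown as intentional *before*
// delegating, so a later NodeRemoved notification from the GCS is not
// misread as a fatal error.
auto wrapped_shutdown = [this](const rpc::NodeDeathInfo &death_info) {
  is_shutdown_request_received_ = true;
  shutdown_raylet_gracefully_(death_info);
};
```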
Comments suppressed due to low confidence (3)
src/ray/raylet/agent_manager.cc:87
- The message 'failed with exit code 0' can be confusing since exit code 0 usually means success; consider rephrasing to 'exited with code 0' for clarity.
RAY_LOG(ERROR) << "The raylet exited immediately because one Ray agent failed with exit code 0"
src/ray/raylet/agent_manager.cc:111
- [nitpick] Include the non-zero `exit_code` in the FATAL log message to provide more diagnostic context, matching the detailed `reason_message` in the graceful path.
RAY_LOG(FATAL) << "The raylet exited immediately because agent '" << options_.agent_name
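For illustration, the suggestion would make the quoted line read roughly as follows (a hypothetical rendering, assuming an `exit_code` variable is in scope at that point):

```cpp
// Hypothetical: append the non-zero exit code for diagnostic context.
RAY_LOG(FATAL) << "The raylet exited immediately because agent '"
               << options_.agent_name << "' failed with exit code "
               << exit_code << ".";
```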
src/ray/raylet/agent_manager.cc:109
- Add a unit or integration test that simulates a non-zero agent exit to verify that the FATAL logging and immediate shutdown behavior work as expected.
// Agent exited with a non-zero code, indicating a hard failure.
src/ray/raylet/node_manager.cc
Outdated
[nitpick] The lambda wrapping shutdown_raylet_gracefully_ is duplicated in two agent manager factories. Consider extracting it into a shared helper to reduce code duplication and improve readability.
```cpp
is_shutdown_request_received_ = true;
shutdown_raylet_gracefully_(death_info);
HandleNodeDeath(death_info);
```
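One possible shape for the extraction the comment suggests, as a self-contained sketch with stand-in types (the real raylet types and factory signatures differ):

```cpp
#include <functional>

// Stand-in for ray::rpc::NodeDeathInfo; illustrative only.
struct NodeDeathInfo {};

class NodeManagerSketch {
 public:
  // Shared helper returning the wrapped shutdown callback, so both agent
  // manager factories reuse the same lambda instead of duplicating it.
  std::function<void(const NodeDeathInfo &)> MakeGracefulShutdownCallback() {
    return [this](const NodeDeathInfo &death_info) {
      // Explicit this-> makes the member access visible at the call site.
      this->is_shutdown_request_received_ = true;
      this->shutdown_raylet_gracefully_(death_info);
      this->HandleNodeDeath(death_info);
    };
  }

 private:
  void HandleNodeDeath(const NodeDeathInfo &) {}
  bool is_shutdown_request_received_ = false;
  std::function<void(const NodeDeathInfo &)> shutdown_raylet_gracefully_ =
      [](const NodeDeathInfo &) {};
};
```

Using explicit `this->` in the lambda body also addresses the capture-readability comment further down.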
As discussed offline:
So let's split the changes into three PRs:
israbbani left a comment
As far as I can tell, your solution will work. Good work on the find and root cause!
I have a few follow-ups that should be created and prioritized. I think you and @edoakes have covered them already, but I'll add them here for completeness:

- During graceful shutdown, the raylet should (ideally) transition to a DRAIN state and keep heartbeating until the last step of the shutdown process. The principle is that the data plane shuts down before the control plane.
- The GCS cannot reliably send a KILL signal to a node that it thinks is already dead. It's okay as best-effort STONITH, but if a raylet discovers that its heartbeat connection has been terminated by the GCS, it should commit suicide.
🚢 @jjyao should still review since he's most familiar with the GCS -> NodeManager shutdown path.
src/ray/raylet/agent_manager.cc
Outdated
This log line has a few problems:

- @edoakes are grpcio version issues and port conflicts still common root causes for why the dashboard can't start up? There was a customer issue where the user was confused about why they were seeing grpcio in the logs.
- At least a few of the causes here (segfault, OOMed) look like they should not exit with code 0.

I'd suggest removing the misleading suggestions and adding the ungraceful ones to the FATAL error.
src/ray/raylet/node_manager.cc
Outdated
Can you use `this->is_shutdown_request_received_` and `this->shutdown_raylet_gracefully_` to make the capture more readable? I think binding lambdas to `this` is a code smell and this at least makes it explicit (see #53654 (comment)).
src/ray/raylet/agent_manager.cc
Outdated
I glanced at the code inside process.cc and it's not obvious to me that waitpid is actually called correctly. If you haven't already investigated it, it might be worth glancing over.
I did look into that. waitpid is not called in most cases because we launch the agent with a pipe to stdin. Unless the pipe creation failed, Process::Wait() will return status based on pipe-based waiting (fd != -1).
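For context, here is a minimal, self-contained illustration of the pipe-based mechanism (generic POSIX, not Ray's actual `Process::Wait()` implementation): the parent holds the read end of a pipe whose write end the child inherits; when the child exits, every copy of the write end closes and the parent's `read()` returns EOF.

```cpp
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

int main() {
  int fds[2];
  if (pipe(fds) != 0) return 1;
  pid_t pid = fork();
  if (pid == 0) {  // Child: inherit the write end, then exit.
    close(fds[0]);
    _exit(0);      // fds[1] closes implicitly on exit.
  }
  close(fds[1]);   // Parent: drop its own copy of the write end.
  char buf;
  // Blocks until the child exits and the last write end closes (EOF).
  // This detects death but not the exit status; waitpid() is still
  // needed to reap the child and read the status.
  while (read(fds[0], &buf, 1) > 0) {
  }
  printf("child exited\n");
  waitpid(pid, nullptr, 0);  // Reap the zombie.
  return 0;
}
```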
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Force-pushed from c040b1c to 48b6bc4
jjyao left a comment
LGTM.
It makes sense to me that we don't need to gracefully shut down the raylet when the agent dies ungracefully.
Why are these changes needed?

See #53739 for more details. Here's the sequence of events that leads to the flaky behavior:

1. **Agent is Killed:** The test sends a `SIGKILL` to an agent process (e.g., `dashboard_agent`).
2. **Graceful Shutdown Initiated:** Because the raylet and the agent share fate, irrespective of the `waitpid` status, the `AgentManager` triggers a [graceful shutdown](https://github.com/ray-project/ray/blob/d3fd0d255c00755b4eb2e6e2cd5a8f764e6898aa/src/ray/raylet/agent_manager.cc#L105) of the raylet. This is a slow, multi-step process. It can be seen in the raylet logs: `Raylet graceful shutdown triggered, reason = UNEXPECTED_TERMINATION`.
3. **Race Condition:** While the raylet is busy with its slow, graceful shutdown, it may stop responding to heartbeats from the GCS in a timely manner. The GCS eventually times out, marks the raylet node as `DEAD`, and notifies the raylet of this status change.
4. **Fatal Crash:** The raylet, upon receiving the `DEAD` status from the GCS for itself, interprets this as an unexpected and fatal error (since it's not aware it's in a planned shutdown). It then crashes immediately, which can be seen in the raylet logs: `[Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS`.

Proposed Fix

In `node_manager.cc`, when creating the `AgentManager` instances, the `shutdown_raylet_gracefully_` callback should be wrapped in a lambda that first sets `is_shutdown_request_received_ = true`. This will prevent the `NodeRemoved` callback from incorrectly triggering a fatal crash during an intentional graceful shutdown.

Related issue number

#53739
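To make the race concrete, here is an illustrative shape of the self-removal check that the flag feeds into (simplified; the actual raylet code and the handler's signature differ): once `is_shutdown_request_received_` is set, the fatal path is skipped.

```cpp
// Illustrative sketch, not the verbatim raylet code. The log message is
// the one quoted in the description above.
void NodeManager::NodeRemoved(const NodeID &node_id) {
  if (node_id == self_node_id_ && !is_shutdown_request_received_) {
    RAY_LOG(FATAL) << "[Timeout] Exiting because this node manager has "
                   << "mistakenly been marked as dead by the GCS.";
  }
}
```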