
[core] Fix race condition in raylet graceful shutdown#53762

Merged
edoakes merged 3 commits intoray-project:masterfrom
codope:core-53739-race-raylet-shutdown
Jun 16, 2025

Conversation

@codope
Contributor

@codope codope commented Jun 12, 2025

Why are these changes needed?

See #53739 for more details. Here's the sequence of events that leads to the flaky behavior:

  1. Agent is Killed: The test sends a SIGKILL to an agent process (e.g., dashboard_agent).
  2. Graceful Shutdown Initiated: Because the raylet and the agent share fate, irrespective of the waitpid status, the AgentManager triggers a graceful shutdown of the raylet. This is a slow, multi-step process, visible in the raylet logs: Raylet graceful shutdown triggered, reason = UNEXPECTED_TERMINATION.
  3. Race Condition: While the raylet is busy with its slow, graceful shutdown, it may stop responding to heartbeats from the GCS in a timely manner. The GCS eventually times out, marks the raylet node as DEAD, and notifies the raylet of this status change.
  4. Fatal Crash: The raylet, upon receiving the DEAD status from the GCS for itself, interprets this as an unexpected and fatal error (since it's not aware it's in a planned shutdown). It then crashes immediately, which can be seen in the raylet logs: [Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS.

Proposed Fix

In node_manager.cc, when creating the AgentManager instances, the shutdown_raylet_gracefully_ callback should be wrapped in a lambda that first sets is_shutdown_request_received_ = true. This will prevent the NodeRemoved callback from incorrectly triggering a fatal crash during an intentional graceful shutdown.

Related issue number

#53739

Copilot AI review requested due to automatic review settings June 12, 2025 06:52
@codope codope added the go add ONLY when ready to merge, run all tests label Jun 12, 2025
Contributor

Copilot AI left a comment


Pull Request Overview

This PR fixes a race condition during raylet graceful shutdown by marking intentional shutdowns and enforcing immediate exit on agent hard failures.

  • Wraps the graceful shutdown callback to set is_shutdown_request_received_ before invoking it, preventing unintended fatal crashes.
  • Updates AgentManager to distinguish zero and non-zero agent exits, logging a fatal error and exiting immediately on hard failures.
  • Clarifies log messages to separate successful agent exits from unexpected crashes.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Reviewed files:
  • src/ray/raylet/node_manager.cc — wraps shutdown_raylet_gracefully_ in a lambda that sets is_shutdown_request_received_.
  • src/ray/raylet/agent_manager.cc — adds a branch for non-zero agent exit to RAY_LOG(FATAL) and refines the zero-exit log and reason.
Comments suppressed due to low confidence (3)

src/ray/raylet/agent_manager.cc:87

  • The message 'failed with exit code 0' can be confusing since exit code 0 usually means success; consider rephrasing to 'exited with code 0' for clarity.
RAY_LOG(ERROR) << "The raylet exited immediately because one Ray agent failed with exit code 0"

src/ray/raylet/agent_manager.cc:111

  • [nitpick] Include the non-zero exit_code in the FATAL log message to provide more diagnostic context, matching the detailed reason_message in the graceful path.
RAY_LOG(FATAL) << "The raylet exited immediately because agent '" << options_.agent_name

src/ray/raylet/agent_manager.cc:109

  • Add a unit or integration test that simulates a non-zero agent exit to verify that the FATAL logging and immediate shutdown behavior work as expected.
// Agent exited with a non-zero code, indicating a hard failure.

Comment on lines 3253 to 3254

Copilot AI Jun 12, 2025


[nitpick] The lambda wrapping shutdown_raylet_gracefully_ is duplicated in two agent manager factories. Consider extracting it into a shared helper to reduce code duplication and improve readability.

Suggested change
is_shutdown_request_received_ = true;
shutdown_raylet_gracefully_(death_info);
HandleNodeDeath(death_info);

@edoakes edoakes requested a review from a team June 12, 2025 16:56
@edoakes
Collaborator

edoakes commented Jun 12, 2025

As discussed offline:

  • The existing logic to fail health checks during graceful shutdown and then ignore the GCS notification looks like a hack. We should not be failing health checks until the Raylet is fully shut down. The way it's currently written is error prone and also likely to be misleading in terms of observability (GCS thinks the Raylet is unhealthy, it's just shutting down).
  • This change looks fine to me as a patch fix before we fix the above. But we should rename the flag, as it's no longer set only when the RPC is received, but also on the agent exit path. In general we should unify all shutdown paths, similar to core worker :)
  • The change to un-gracefully exit if the agent dies unexpectedly also makes sense to me, but it's an independent behavior change.

So let's split the changes into three PRs:

  1. Patch fix for the bug described in the description, reusing and renaming the flag.
  2. Change the behavior to crash on unexpected agent death.
  3. Properly fix the shutdown sequence to not fail health checks when draining.

Contributor

@israbbani israbbani left a comment


As far as I can tell, your solution will work. Good work on the find and root cause!

I have a few followups that should be created and prioritized. I think @edoakes and you've covered them already, but I'll add them here for completeness:

  1. During graceful shutdown, the raylet should (ideally) transition to a DRAIN state and keep heartbeating until the last step of the shutdown process. The principle being that the data plane shuts down before the control plane.

  2. The GCS cannot reliably send a KILL signal to a node that it thinks is already dead. It's okay as best-effort STONITH, but if a raylet discovers that its heartbeat connection has been terminated by the GCS, it should terminate itself.

🚢 @jjyao should still review since he's most familiar with the GCS -> NodeManager shutdown path.

Comment on lines 87 to 100
Contributor


This log line has a few problems

  1. @edoakes are grpcio version issues and port conflicts still common root causes for why the dashboard can't start up? There was a customer issue where the user was confused about why they were seeing grpcio in the logs.
  2. At least a few of the causes here (segfault, OOMed) look like they should not exit with code 0.

I'd suggest removing the misleading suggestions and adding the ungraceful causes to the FATAL error.

Comment on lines 3252 to 3254
Contributor


Can you use this->is_shutdown_request_received_ and this->shutdown_raylet_gracefully_ to make the capture more readable? I think binding lambdas to this is a code smell and this at least makes it explicit (see #53654 (comment))

Contributor


I glanced at the code inside process.cc and it's not obvious to me that waitpid is actually called correctly. If you haven't already investigated it, it might be worth glancing over.

Contributor Author


I did look into that. waitpid is not called in most cases because we launch the agent with a pipe to its stdin. Unless the pipe creation failed, Process::Wait() returns a status based on pipe-based waiting (fd != -1).

codope added 3 commits June 16, 2025 08:03
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
@edoakes edoakes merged commit 9e25884 into ray-project:master Jun 16, 2025
5 checks passed
Contributor

@jjyao jjyao left a comment


LGTM.

It makes sense to me that we don't need to gracefully shut down the raylet when the agent died ungracefully.

elliot-barn pushed a commit that referenced this pull request Jun 18, 2025
minerharry pushed a commit to minerharry/ray that referenced this pull request Jun 27, 2025
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025


5 participants