[core] Detached actor being killed when its parent actor crashes #40864

@edoakes

Description

While debugging a release test failure, we discovered that in some cases Serve replica actors are being killed due to fate sharing with the controller.

This should never happen: every actor started by Serve (the controller, replicas, and proxies) is detached, so replicas and proxies should not fate-share with the controller (relevant code in the raylet).

Across multiple runs of the Serve long-running failure test case, the raylet logs contain a number of lines like the following:

[2023-11-01 07:10:17,273 I 825 825] (raylet) node_manager.cc:1104: The leased worker dd7d4d82da8fef21e59667dba16f2bce15203c8832039284cbb26461 is killed because the owner process 2b60b506544d378c192b7e1cbf989be4058f41015c00fa7f30e50f91 died.
[2023-11-01 07:10:17,273 I 825 825] (raylet) node_manager.cc:1104: The leased worker 3844620037d1ea4a19c830bb548edd9726cd4521cc78f2c7871367d6 is killed because the owner process 2b60b506544d378c192b7e1cbf989be4058f41015c00fa7f30e50f91 died.

All of the referenced actors are detached actors.
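
To gauge how widespread these kills are in a given raylet log, the quoted message can be matched with a regular expression and grouped by owner process. The following is a minimal sketch, assuming the message format is exactly as in the two lines above; the names `KILL_RE` and `workers_killed_by_owner` are hypothetical helpers and not part of Ray. It does not map worker IDs back to actors, which would still need to be done separately to confirm that each killed worker hosted a detached actor.

```python
import re
from collections import defaultdict

# Matches the raylet message quoted in this issue. The IDs are hex strings.
# Assumption: the wording of the message is stable across the logs scanned.
KILL_RE = re.compile(
    r"The leased worker (?P<worker>[0-9a-f]+) is killed because "
    r"the owner process (?P<owner>[0-9a-f]+) died"
)


def workers_killed_by_owner(lines):
    """Map each dead owner ID to the worker IDs killed along with it."""
    killed = defaultdict(list)
    for line in lines:
        match = KILL_RE.search(line)
        if match:
            killed[match.group("owner")].append(match.group("worker"))
    return dict(killed)


# The two log lines from this issue, used as sample input.
SAMPLE = [
    "[2023-11-01 07:10:17,273 I 825 825] (raylet) node_manager.cc:1104: "
    "The leased worker dd7d4d82da8fef21e59667dba16f2bce15203c8832039284cbb26461 "
    "is killed because the owner process "
    "2b60b506544d378c192b7e1cbf989be4058f41015c00fa7f30e50f91 died.",
    "[2023-11-01 07:10:17,273 I 825 825] (raylet) node_manager.cc:1104: "
    "The leased worker 3844620037d1ea4a19c830bb548edd9726cd4521cc78f2c7871367d6 "
    "is killed because the owner process "
    "2b60b506544d378c192b7e1cbf989be4058f41015c00fa7f30e50f91 died.",
]

if __name__ == "__main__":
    for owner, workers in workers_killed_by_owner(SAMPLE).items():
        print(f"owner {owner} died; {len(workers)} leased worker(s) killed")
        for worker in workers:
            print(f"  worker {worker}")
```

In practice the function could be fed the lines of a raylet log file instead of `SAMPLE`. Both sample lines share one owner process ID, so they end up grouped under a single key.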

Metadata

Assignees

No one assigned

    Labels

    P1: Issue that should be fixed within a few weeks
    bug: Something that is supposed to be working; but isn't
    core: Issues that should be addressed in Ray Core
    core-worker
    stability
