[GCS] The node will never be detected again when gcs server restart #9379

@wumuzi520

What is the problem?

The node is never detected again after the GCS server restarts.
Ray version and other system information (Python version, TensorFlow version, OS):

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

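import time

import pytest
import ray
# Assumption: generate_internal_config_map lives in ray.test_utils in this
# Ray version; the ray_start_cluster_head fixture comes from Ray's test conftest.
from ray.test_utils import generate_internal_config_map


@pytest.mark.parametrize(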
    "ray_start_cluster_head",
    [generate_internal_config_map(num_heartbeats_timeout=20)],
    indirect=True)
def test_node_failure_detector_when_gcs_server_restart(ray_start_cluster_head):
    """Checks that node failure detection still works after a GCS server restart.

    The cluster is configured to time out nodes after 2 seconds without
    heartbeats. We kill the GCS server, remove a node, and restart the GCS
    server, then check that both nodes are still reported as alive. After
    waiting another 2.5 seconds, exactly one node should have timed out.
    """
    cluster = ray_start_cluster_head
    worker = cluster.add_node()
    cluster.wait_for_nodes()

    cluster.head_node.kill_gcs_server()
    cluster.remove_node(worker, allow_graceful=False)
    cluster.head_node.start_gcs_server()

    nodes = ray.nodes()
    assert len(nodes) == 2
    assert nodes[0]["alive"] and nodes[1]["alive"]

    time.sleep(2.5)
    nodes = ray.nodes()
    assert len(nodes) == 2

    dead_count = 0
    for node in nodes:
        if not node["alive"]:
            dead_count += 1
    assert dead_count == 1
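
For reference, here is a minimal sketch of where the 2-second timeout in the docstring comes from and of the final dead-node check. It assumes a heartbeat period of 100 ms, which is not stated in this issue. The helper names `timeout_seconds` and `count_dead` and the sample node list are hypothetical, made up for illustration only.

```python
# Assumption: each raylet reports a heartbeat every 100 ms (not stated in the issue).
HEARTBEAT_PERIOD_MS = 100


def timeout_seconds(num_heartbeats_timeout, period_ms=HEARTBEAT_PERIOD_MS):
    """Return how long a node may stay silent before it is considered dead."""
    return num_heartbeats_timeout * period_ms / 1000.0


def count_dead(nodes):
    """Count entries whose "alive" flag is false, as the loop in the test does."""
    return sum(1 for node in nodes if not node["alive"])


# Illustrative data shaped like the entries the test reads from ray.nodes().
sample_nodes = [{"alive": True}, {"alive": False}]

print(timeout_seconds(20))       # 2.0
print(count_dead(sample_nodes))  # 1
```

Under that assumption, `num_heartbeats_timeout=20` gives a 2-second timeout, so the 2.5-second sleep in the test should be long enough for the removed node to be marked dead.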

If we cannot run your script, we cannot fix your issue.

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

Labels: P2 (Important issue, but not time-critical), bug (Something that is supposed to be working; but isn't)
