What is the problem?
The node will never be detected again when gcs server restart.
Ray version and other system information (Python version, TensorFlow version, OS):
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
@pytest.mark.parametrize(
"ray_start_cluster_head",
[generate_internal_config_map(num_heartbeats_timeout=20)],
indirect=True)
def test_node_failure_detector_when_gcs_server_restart(ray_start_cluster_head):
"""Checks that the node failure detector is correct when gcs server restart.
We set the cluster to timeout nodes after 2 seconds of heartbeats. We
then remove a node and restart gcs server again to check
that the alive node count is 2, then wait another 2.5 seconds to check that
the one of the node is timed out.
"""
cluster = ray_start_cluster_head
worker = cluster.add_node()
cluster.wait_for_nodes()
cluster.head_node.kill_gcs_server()
cluster.remove_node(worker, allow_graceful=False)
cluster.head_node.start_gcs_server()
nodes = ray.nodes()
assert len(nodes) == 2
assert nodes[0]["alive"] and nodes[1]["alive"]
time.sleep(2.5)
nodes = ray.nodes()
assert len(nodes) == 2
dead_count = 0
for node in nodes:
if not node["alive"]:
dead_count += 1
assert dead_count == 1
If we cannot run your script, we cannot fix your issue.
What is the problem?
The node will never be detected again when gcs server restart.
Ray version and other system information (Python version, TensorFlow version, OS):
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
If we cannot run your script, we cannot fix your issue.