-
Notifications
You must be signed in to change notification settings - Fork 7.4k
Closed
Labels
P2Important issue, but not time-criticalImportant issue, but not time-criticalbugSomething that is supposed to be working; but isn'tSomething that is supposed to be working; but isn't
Description
What is the problem?
The node will never be detected again when gcs server restart.
Ray version and other system information (Python version, TensorFlow version, OS):
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
@pytest.mark.parametrize(
"ray_start_cluster_head",
[generate_internal_config_map(num_heartbeats_timeout=20)],
indirect=True)
def test_node_failure_detector_when_gcs_server_restart(ray_start_cluster_head):
"""Checks that the node failure detector is correct when gcs server restart.
We set the cluster to timeout nodes after 2 seconds of heartbeats. We
then remove a node and restart gcs server again to check
that the alive node count is 2, then wait another 2.5 seconds to check that
the one of the node is timed out.
"""
cluster = ray_start_cluster_head
worker = cluster.add_node()
cluster.wait_for_nodes()
cluster.head_node.kill_gcs_server()
cluster.remove_node(worker, allow_graceful=False)
cluster.head_node.start_gcs_server()
nodes = ray.nodes()
assert len(nodes) == 2
assert nodes[0]["alive"] and nodes[1]["alive"]
time.sleep(2.5)
nodes = ray.nodes()
assert len(nodes) == 2
dead_count = 0
for node in nodes:
if not node["alive"]:
dead_count += 1
assert dead_count == 1
If we cannot run your script, we cannot fix your issue.
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
P2Important issue, but not time-criticalImportant issue, but not time-criticalbugSomething that is supposed to be working; but isn'tSomething that is supposed to be working; but isn't