Related to the test failure in #91447. We believe the circumstances behind the failure are rare: they were created by a NullPointerException that has since been fixed, and the scenario that remains is hypothetical.
It's possible for a node-left task to be interrupted before it removes the node from the master's list of faultyNodes. Nodes on the faultyNodes list do not receive cluster state updates, and are eventually removed. When the node later attempts to rejoin, after the test's network disruptions have ceased, the node-join request can succeed, but the node never receives the resulting cluster state update. The node therefore treats the node-join as a failure and keeps resending node-join requests until the LagDetector removes it from the faultyNodes list.
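The failure sequence can be illustrated with a toy model. This is only a sketch: `StaleFaultyNodeDemo` and all of its methods are hypothetical stand-ins, not Elasticsearch classes, and it assumes the interruption happens before any of the node-left steps take effect.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical toy model of the master's bookkeeping; not Elasticsearch code.
class StaleFaultyNodeDemo {
    private final Set<String> clusterNodes = new HashSet<>();
    private final Set<String> faultyNodes = new HashSet<>();

    void nodeJoin(String nodeId) {
        clusterNodes.add(nodeId);
    }

    void markFaulty(String nodeId) {
        faultyNodes.add(nodeId);
    }

    // Assumption: an interrupted node-left completes none of its steps.
    void nodeLeft(String nodeId, boolean interrupted) {
        if (interrupted) {
            return;
        }
        clusterNodes.remove(nodeId);
        faultyNodes.remove(nodeId);
    }

    // Stand-in for the eventual cleanup the issue attributes to the LagDetector.
    void clearFaulty(String nodeId) {
        faultyNodes.remove(nodeId);
    }

    // A node is sent cluster state updates only if it is in the cluster
    // and is not on the faultyNodes list.
    boolean receivesPublication(String nodeId) {
        return clusterNodes.contains(nodeId) && !faultyNodes.contains(nodeId);
    }

    public static void main(String[] args) {
        StaleFaultyNodeDemo master = new StaleFaultyNodeDemo();
        master.nodeJoin("node-1");
        master.markFaulty("node-1");       // network disruption begins
        master.nodeLeft("node-1", true);   // node-left is interrupted
        master.nodeJoin("node-1");         // rejoin is accepted
        System.out.println(master.receivesPublication("node-1"));
        master.clearFaulty("node-1");      // stale entry is finally removed
        System.out.println(master.receivesPublication("node-1"));
    }
}
```

In this model the rejoined node stays cut off from publications until the stale faultyNodes entry is cleared, which mirrors the retry loop described above.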
A solution would be for a node-join request to first run a new node-left request if the node is still present in the cluster state, and to complete that node-left operation before the node-join proceeds. This ensures that all of the node-left logic runs successfully, including removing the node from the list of faultyNodes, so that the node-join request is applied to clean state. A comment on the test failure has further details on this suggestion.
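A minimal sketch of the proposed ordering follows. `JoinAfterLeftSketch` and its methods are hypothetical names used for illustration only; the real change would live in the master's join handling, and the two sets are simplified stand-ins for the cluster state and the faultyNodes list.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of "run node-left before node-join"; not Elasticsearch code.
class JoinAfterLeftSketch {
    private final Set<String> clusterNodes = new HashSet<>();
    private final Set<String> faultyNodes = new HashSet<>();

    // Recreates the leftover state of an interrupted node-left: the node is
    // still in the cluster state and still on the faultyNodes list.
    void simulateInterruptedNodeLeft(String nodeId) {
        clusterNodes.add(nodeId);
        faultyNodes.add(nodeId);
    }

    // Full node-left logic, including the faultyNodes cleanup.
    void nodeLeft(String nodeId) {
        clusterNodes.remove(nodeId);
        faultyNodes.remove(nodeId);
    }

    // If the joining node is still present, complete a node-left first so the
    // join is applied to clean state.
    void nodeJoin(String nodeId) {
        if (clusterNodes.contains(nodeId)) {
            nodeLeft(nodeId);
        }
        clusterNodes.add(nodeId);
    }

    boolean receivesPublication(String nodeId) {
        return clusterNodes.contains(nodeId) && !faultyNodes.contains(nodeId);
    }

    public static void main(String[] args) {
        JoinAfterLeftSketch master = new JoinAfterLeftSketch();
        master.simulateInterruptedNodeLeft("node-1");
        master.nodeJoin("node-1");
        System.out.println(master.receivesPublication("node-1"));
    }
}
```

Running the node-left inside the join path keeps all cleanup in one place, so the join does not need to know which individual steps the interrupted task skipped.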