@dperny dperny commented May 23, 2024

- What I did

Fix a minor race condition that could cause a node promotion to fail if it happened right after another node was demoted.

- How I did it

If a node is promoted right after another node is demoted, a race is possible: the newly promoted manager may attempt to connect to the newly demoted manager for its initial Raft membership. This connection fails, and the whole swarm Node object exits.

At this point, the daemon nodeRunner sees the exit and restarts the Node.

However, if the address of the no-longer-manager is recorded in the nodeRunner's config.joinAddr, the Node again attempts to connect to the no-longer-manager, and crashes again. This repeats. The workaround is to remove the node entirely and rejoin the Swarm as a new node.

This change clears config.joinAddr when the nodeRunner restarts the Node, if the node has previously become Ready. A node becoming Ready indicates that it did, at some point, successfully join the cluster. Once it has joined the cluster, Swarm has its own persistent record of known manager addresses.

If no joinAddr is provided, then Swarm will choose from its persisted list of managers to join, and will join a functioning manager.
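To illustrate the restart logic described above, here is a minimal, self-contained sketch. It is not the code from this PR: runnerConfig, startNode, restart, and the addresses are hypothetical stand-ins, and the manager lookup is simulated with a map rather than a real Raft connection.

```go
package main

import (
	"errors"
	"fmt"
)

// runnerConfig is a hypothetical stand-in for the nodeRunner's config.
type runnerConfig struct {
	joinAddr string // address recorded from the original join command
}

// startNode is a hypothetical stand-in for starting the swarm Node.
// It fails when asked to join through an address that is no longer a manager.
func startNode(joinAddr string, managers map[string]bool, persisted []string) (string, error) {
	if joinAddr != "" {
		if !managers[joinAddr] {
			return "", errors.New("join address is not a manager: " + joinAddr)
		}
		return joinAddr, nil
	}
	// No explicit address: fall back to the persisted manager list.
	for _, addr := range persisted {
		if managers[addr] {
			return addr, nil
		}
	}
	return "", errors.New("no functioning manager in persisted list")
}

// restart models the behaviour described above: if the node has been
// Ready before, the recorded joinAddr is dropped before restarting.
func restart(conf *runnerConfig, wasReady bool, managers map[string]bool, persisted []string) (string, error) {
	if wasReady {
		conf.joinAddr = ""
	}
	return startNode(conf.joinAddr, managers, persisted)
}

func main() {
	// 10.0.0.1 was just demoted; only the other two are still managers.
	managers := map[string]bool{"10.0.0.2:2377": true, "10.0.0.3:2377": true}
	persisted := []string{"10.0.0.1:2377", "10.0.0.2:2377", "10.0.0.3:2377"}
	conf := &runnerConfig{joinAddr: "10.0.0.1:2377"}

	// Old behaviour: the stale joinAddr is reused and the start fails.
	_, err := restart(conf, false, managers, persisted)
	fmt.Println("joinAddr kept:", err)

	// New behaviour: joinAddr is cleared and a persisted manager is used.
	addr, err := restart(conf, true, managers, persisted)
	fmt.Println("joinAddr cleared:", addr, err)
}
```

The sketch only makes the decision on restart: a node that has never been Ready keeps its joinAddr, because it has no persisted manager list to fall back on yet.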

- How to verify it

I'm unsure where we would stick an integration test, and the implementation thereof would probably be a nightmare.

To verify manually:

  1. Create a cluster with 3 Manager nodes.
  2. Add a worker node. This we will call "The Worker". Note which manager the IP address in the join command points the worker at. This we will call "The Target".
  3. On a node that is not The Target, run the command docker node demote [The Target's node id] && sleep 0.1 && docker node promote [The Worker's node id].
  4. Without this patch, the promotion will fail. The node will get stuck. With this patch, the promotion will succeed.
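Step 3 can also be driven from a small Go program, which may be handy when repeating the check several times. This is only a sketch: raceCommands and the command-line usage are made up for illustration, and it assumes the docker CLI is on PATH and that the program runs on a manager that is not The Target.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"time"
)

// raceCommands returns the two docker invocations from step 3, in order.
// The function name is hypothetical; the docker subcommands are the ones
// used in the manual steps.
func raceCommands(targetID, workerID string) [][]string {
	return [][]string{
		{"docker", "node", "demote", targetID},
		{"docker", "node", "promote", workerID},
	}
}

func main() {
	if len(os.Args) != 3 {
		fmt.Fprintln(os.Stderr, "usage: race <target-node-id> <worker-node-id>")
		os.Exit(2)
	}
	for i, args := range raceCommands(os.Args[1], os.Args[2]) {
		if i > 0 {
			// Mirrors the "sleep 0.1" between demote and promote.
			time.Sleep(100 * time.Millisecond)
		}
		out, err := exec.Command(args[0], args[1:]...).CombinedOutput()
		fmt.Printf("%v: %s\n", args, out)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
	}
}
```

The program only issues the two commands; whether the promotion got stuck still has to be checked by hand, as in step 4.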

- Description for the changelog

* Fixed an issue where rapidly promoting a node after another node was demoted could cause the promoted node to fail its promotion.

Signed-off-by: Drew Erny <derny@mirantis.com>

@thaJeztah thaJeztah left a comment

LGTM

dperny commented May 23, 2024

Closes #37175.

@thaJeztah thaJeztah merged commit 5cd2e6a into moby:master May 23, 2024
@thaJeztah

And it's merged! Can you open cherry-picks for the 23.0, 25.0 and 26.1 branches, @dperny?

Development

Successfully merging this pull request may close these issues.

Docker node promote failing
