Fix issue where node promotion could fail #47854

dperny · 2024-05-23T18:07:32Z

Closes Docker node promote failing #37175

- What I did

Fix a minor race condition that could cause a node promotion to fail if it happened right after another node was demoted.

- How I did it

If a node is promoted right after another node is demoted, a race is possible in which the newly promoted manager attempts to connect to the newly demoted manager for its initial Raft membership. This connection fails, and the whole swarm Node object exits.

At this point, the daemon nodeRunner sees the exit and restarts the Node.

However, if the address of the no-longer-manager is recorded in the nodeRunner's config.joinAddr, the Node again attempts to connect to the no-longer-manager, and crashes again. This repeats. The workaround is to remove the node entirely and rejoin the Swarm as a new node.

This change clears config.joinAddr when the nodeRunner restarts the Node, if the node has previously become Ready. A node becoming Ready indicates that it did successfully join the cluster at some point. Once it has joined the cluster, Swarm has its own persistent record of known manager addresses.

If no joinAddr is provided, then Swarm will choose from its persisted list of managers to join, and will join a functioning manager.
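The restart behavior described above can be modeled with a minimal sketch. This is not the code from the patch: `runnerConfig`, `restartConfig`, `pickJoinTarget`, and the `isManager` callback are hypothetical names made up for illustration, and only `joinAddr` corresponds to a name from the description. The selection from persisted managers is a simplified assumption about what Swarm does internally.

```go
package main

import (
	"errors"
	"fmt"
)

// runnerConfig is a hypothetical stand-in for the nodeRunner's config.
// Only the joinAddr field mirrors a name used in the PR description.
type runnerConfig struct {
	joinAddr string
}

// restartConfig returns the config to use when the Node is restarted.
// If the node has ever become Ready, the recorded joinAddr is dropped so
// the join target comes from the persisted manager list instead.
func restartConfig(conf runnerConfig, wasReady bool) runnerConfig {
	if wasReady {
		conf.joinAddr = ""
	}
	return conf
}

// pickJoinTarget models the choice of a join target (an assumption, not
// Swarm's real logic): an explicit joinAddr is always used, otherwise the
// first persisted address that is still a manager is chosen.
func pickJoinTarget(conf runnerConfig, persisted []string, isManager func(string) bool) (string, error) {
	if conf.joinAddr != "" {
		if !isManager(conf.joinAddr) {
			return "", errors.New("join failed: " + conf.joinAddr + " is not a manager")
		}
		return conf.joinAddr, nil
	}
	for _, addr := range persisted {
		if isManager(addr) {
			return addr, nil
		}
	}
	return "", errors.New("join failed: no functioning manager known")
}

func main() {
	// Example addresses are invented for this sketch.
	demoted := "10.0.0.2:2377"
	managers := map[string]bool{"10.0.0.1:2377": true, "10.0.0.3:2377": true}
	isManager := func(addr string) bool { return managers[addr] }
	persisted := []string{demoted, "10.0.0.1:2377", "10.0.0.3:2377"}

	conf := runnerConfig{joinAddr: demoted}

	// Without the fix, every restart reuses the stale joinAddr and fails.
	_, err := pickJoinTarget(conf, persisted, isManager)
	fmt.Println("before:", err)

	// With the fix, a node that was Ready restarts with joinAddr cleared.
	addr, err := pickJoinTarget(restartConfig(conf, true), persisted, isManager)
	fmt.Println("after:", addr, err)
}
```

A node that has never been Ready keeps its joinAddr in this model, since it has no persisted manager list to fall back on.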

- How to verify it

I'm unsure where we would stick an integration test, and the implementation thereof would probably be a nightmare.

To verify manually:

1. Create a cluster with 3 manager nodes.
2. Add a worker node; call it "The Worker". Note which node the IP address in the join command points the worker at; call it "The Target".
3. On a node that is not The Target, run the command docker node demote [The Target's node id] && sleep 0.1 && docker node promote [The Worker's node id].

Without this patch, the promotion fails and the node gets stuck. With this patch, the promotion succeeds.
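For repeated runs, the demote-then-promote step could be scripted. The following is a rough sketch rather than part of the patch: the helper name `promotionRaceCommands` and the program's usage string are made up for illustration, and the two node IDs must be supplied by whoever runs it. It only wraps the `docker node demote` and `docker node promote` commands from the steps above.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"time"
)

// promotionRaceCommands returns the two docker invocations from the manual
// steps, in order. targetID and workerID are node IDs chosen by the caller.
func promotionRaceCommands(targetID, workerID string) [][]string {
	return [][]string{
		{"docker", "node", "demote", targetID},
		{"docker", "node", "promote", workerID},
	}
}

func main() {
	if len(os.Args) != 3 {
		fmt.Println("usage: promoterace <target-node-id> <worker-node-id>")
		return
	}
	for i, args := range promotionRaceCommands(os.Args[1], os.Args[2]) {
		if i > 0 {
			// Same short pause as the "sleep 0.1" in the manual steps.
			time.Sleep(100 * time.Millisecond)
		}
		cmd := exec.Command(args[0], args[1:]...)
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if err := cmd.Run(); err != nil {
			// Mirror "&&": stop at the first failing command.
			fmt.Fprintln(os.Stderr, "command failed:", err)
			os.Exit(1)
		}
	}
}
```

The exit status of the promote command may not reveal a stuck node, so it is probably worth checking docker node ls afterwards to see whether The Worker actually became a manager.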

- Description for the changelog

* Fixed an issue where rapidly promoting a node after another node was demoted could cause the promoted node to fail its promotion.

Signed-off-by: Drew Erny <derny@mirantis.com>

thaJeztah

LGTM

dperny · 2024-05-23T18:51:30Z

Closes #37175.

thaJeztah · 2024-05-23T19:15:33Z

And it's merged! Can you open cherry-picks for the 23.0, 25.0, and 26.1 branches, @dperny?

thaJeztah added status/2-code-review area/swarm kind/bugfix labels May 23, 2024

thaJeztah added this to the 27.0.0 milestone May 23, 2024

thaJeztah added process/cherry-pick/23.0 process/cherry-pick/25.0 process/cherry-pick/26.1 labels May 23, 2024

neersighted approved these changes May 23, 2024

thaJeztah approved these changes May 23, 2024

thaJeztah merged commit 5cd2e6a into moby:master May 23, 2024

This was referenced May 28, 2024

[23.0 backport] Fix issue where node promotion could fail #47868

Merged

[25.0 backport] Fix issue where node promotion could fail #47869

Merged

[26.1 backport] Fix issue where node promotion could fail #47870

Merged

neersighted removed the process/cherry-pick/23.0 label May 28, 2024

thaJeztah added process/cherry-picked and removed process/cherry-pick/25.0 process/cherry-pick/26.1 labels Jun 3, 2024
