Fix issue where node promotion could fail #47854
Merged
- What I did
Fix a minor race condition that could cause a node promotion to fail if it happened right after another node was demoted.
- How I did it
If a node is promoted right after another node is demoted, a race is possible: the newly promoted manager may attempt to connect to the newly demoted manager for its initial Raft membership. This connection fails, and the whole swarm Node object exits.
At this point, the daemon nodeRunner sees the exit and restarts the Node.
However, if the address of the no-longer-manager is recorded in the nodeRunner's config.joinAddr, the restarted Node again attempts to connect to the no-longer-manager, and crashes again. This repeats indefinitely. The only workaround is to remove the node entirely and rejoin the Swarm as a new node.
This change clears config.joinAddr when the nodeRunner restarts the Node, if the node has previously become Ready. A node that has become Ready has, at some point, successfully joined the cluster. Once it has joined, Swarm has its own persistent record of known manager addresses.
If no joinAddr is provided, then Swarm will choose from its persisted list of managers to join, and will join a functioning manager.
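The restart behavior described above can be sketched roughly as follows. This is a simplified, standalone illustration and not the actual daemon code: `runnerConfig`, `restartConfig`, `pickJoinTarget`, and the addresses are hypothetical names chosen for the example, and picking the first persisted manager stands in for whatever selection Swarm really performs.

```go
package main

import "fmt"

// runnerConfig is a hypothetical stand-in for the nodeRunner's config.
// Only the field relevant to this fix is modeled.
type runnerConfig struct {
	joinAddr string
}

// restartConfig returns the config to use when the Node is restarted.
// If the node has previously become Ready, joinAddr is cleared so that
// the restarted Node does not retry a possibly demoted manager.
func restartConfig(conf runnerConfig, wasReady bool) runnerConfig {
	if wasReady {
		conf.joinAddr = ""
	}
	return conf
}

// pickJoinTarget models the choice of manager to contact (assumption:
// an explicit joinAddr wins, otherwise a persisted manager is used).
func pickJoinTarget(conf runnerConfig, persistedManagers []string) string {
	if conf.joinAddr != "" {
		return conf.joinAddr
	}
	if len(persistedManagers) > 0 {
		return persistedManagers[0]
	}
	return ""
}

func main() {
	// joinAddr still points at the manager that was just demoted.
	conf := runnerConfig{joinAddr: "10.0.0.2:2377"}
	persisted := []string{"10.0.0.3:2377"}

	// Node never became Ready: joinAddr is kept, as before the fix.
	fmt.Println(pickJoinTarget(restartConfig(conf, false), persisted))

	// Node became Ready earlier: joinAddr is dropped on restart and a
	// persisted manager is used instead.
	fmt.Println(pickJoinTarget(restartConfig(conf, true), persisted))
}
```

Passing the config by value keeps the original joinAddr intact for the very first join, which still needs it; only the copy used for a restart is modified.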
- How to verify it
I'm unsure where we would stick an integration test, and the implementation thereof would probably be a nightmare.
To verify manually:
docker node demote [The Target's node id] && sleep 0.1 && docker node promote [The Worker's node id]
- Description for the changelog
* Fixed an issue where rapidly promoting a node after another node was demoted could cause the promoted node to fail its promotion.