[26.1 backport] Fix issue where node promotion could fail #47870
Merged
If a node is promoted right after another node is demoted, there exists the possibility of a race, by which the newly promoted manager attempts to connect to the newly demoted manager for its initial Raft membership. This connection fails, and the whole swarm Node object exits.

At this point, the daemon nodeRunner sees the exit and restarts the Node.

However, if the address of the no-longer-manager is recorded in the nodeRunner's config.joinAddr, the Node again attempts to connect to the no-longer-manager, and crashes again. This repeats. The workaround is to remove the node entirely and rejoin the Swarm as a new node.

This change clears config.joinAddr when the nodeRunner restarts the Node, if the node has previously become Ready. The node becoming Ready indicates that at some point, it did successfully join the cluster, in some fashion. If it has successfully joined the cluster, then Swarm has its own persistent record of known manager addresses.

If no joinAddr is provided, then Swarm will choose from its persisted list of managers to join, and will join a functioning manager.

Signed-off-by: Drew Erny <derny@mirantis.com>
(cherry picked from commit 16e5c41)
Signed-off-by: Drew Erny <derny@mirantis.com>
thaJeztah
approved these changes
May 29, 2024
Member
thaJeztah
left a comment
LGTM
renovate bot
added a commit
to earthly/dind
that referenced
this pull request
Jun 10, 2024
[](https://renovatebot.com) This PR contains the following updates: | Package | Update | Change | |---|---|---| | [docker/docker](https://togithub.com/docker/docker) | patch | `26.1.3` -> `26.1.4` | --- ### Release Notes <details> <summary>docker/docker (docker/docker)</summary> ### [`v26.1.4`](https://togithub.com/moby/moby/releases/tag/v26.1.4) [Compare Source](https://togithub.com/docker/docker/compare/v26.1.3...v26.1.4) #### 26.1.4 For a full list of pull requests and changes in this release, refer to the relevant GitHub milestones: - [docker/cli, 26.1.4 milestone](https://togithub.com/docker/cli/issues?q=is%3Aclosed+milestone%3A26.1.4) - [moby/moby, 26.1.4 milestone](https://togithub.com/moby/moby/issues?q=is%3Aclosed+milestone%3A26.1.4) - Deprecated and removed features, see [Deprecated Features](https://togithub.com/docker/cli/blob/v26.1.4/docs/deprecated.md). - Changes to the Engine API, see [API version history](https://togithub.com/moby/moby/blob/v26.1.4/docs/api/version-history.md). ##### Security This release updates the Go runtime to 1.21.11 which contains security fixes for: - [CVE-2024-24789] - [CVE-2024-24790] - A symlink time of check to time of use race condition during directory removal reported by Addison Crump ([@​addisoncrump](https://togithub.com/addisoncrump)). ##### Bug fixes and enhancements - Fixed an issue where promoting a node immediately after another node was demoted could cause the promotion to fail. [moby/moby#47870](https://togithub.com/moby/moby/pull/47870) - Prevent the daemon log from being spammed with `superfluous response.WriteHeader call ...` messages.. [moby/moby#47843](https://togithub.com/moby/moby/pull/47843) - Don't show empty hints when plugins return an empty hook message. [docker/cli#5083](https://togithub.com/docker/cli/pull/5083) - Added `ContextType: "moby"` to the context list/inspect output to address a compatibility issue with Visual Studio Container Tools. 
[docker/cli#5095](https://togithub.com/docker/cli/pull/5095) - Fix a compatibility issue with Visual Studio Container Tools. [docker/cli#5095](https://togithub.com/docker/cli/pull/5095) ##### Packaging updates - Update containerd (static binaries only) to [v1.7.17](https://togithub.com/containerd/containerd/releases/tag/v1.7.17). [moby/moby#47841](https://togithub.com/moby/moby/pull/47841) - [CVE-2024-24789], [CVE-2024-24790]: Update Go runtime to 1.21.11. [moby/moby#47904](https://togithub.com/moby/moby/pull/47904) - Update Compose to [v2.27.1](https://togithub.com/docker/compose/releases/tag/v2.27.1). [docker/docker-ce-packages#1022](https://togithub.com/docker/docker-ce-packaging/pull/1022) - Update Buildx to [v0.14.1](https://togithub.com/docker/buildx/releases/tag/v0.14.1). [docker/docker-ce-packages#1021](https://togithub.com/docker/docker-ce-packaging/pull/1021) [CVE-2024-24789]: https://togithub.com/golang/go/issues/66869 [CVE-2024-24790]: https://togithub.com/golang/go/issues/67680 </details> --- ### Configuration 📅 **Schedule**: Branch creation - "after 6am on monday" (UTC), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Enabled. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box --- This PR has been generated by [Mend Renovate](https://www.mend.io/free-developer-tools/renovate/). View repository job log [here](https://developer.mend.io/github/earthly/dind). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNy4zOTMuMCIsInVwZGF0ZWRJblZlciI6IjM3LjM5My4wIiwidGFyZ2V0QnJhbmNoIjoibWFpbiIsImxhYmVscyI6WyJyZW5vdmF0ZSJdfQ==--> Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
- What I did
Fix a minor race condition that could cause a node promotion to fail if it happened right after another node was demoted.
- How I did it
If a node is promoted right after another node is demoted, there exists the possibility of a race, by which the newly promoted manager attempts to connect to the newly demoted manager for its initial Raft membership. This connection fails, and the whole swarm Node object exits.
At this point, the daemon nodeRunner sees the exit and restarts the Node.
However, if the address of the no-longer-manager is recorded in the nodeRunner's config.joinAddr, the Node again attempts to connect to the no-longer-manager, and crashes again. This repeats. The workaround is to remove the node entirely and rejoin the Swarm as a new node.
This change clears config.joinAddr when the nodeRunner restarts the Node, if the node has previously become Ready. The node becoming Ready indicates that at some point, it did successfully join the cluster, in some fashion. If it has successfully joined the cluster, then Swarm has its own persistent record of known manager addresses.
If no joinAddr is provided, then Swarm will choose from its persisted list of managers to join, and will join a functioning manager.
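The behaviour described above can be sketched in a small, self-contained Go simulation. This is only an illustration under stated assumptions, not the actual moby or swarmkit code: the types `runnerConfig` and `cluster` and the functions `startNode` and `restartConfig` are hypothetical names invented for this example, and the addresses are made up. Only the `joinAddr` field name comes from the description.

```go
package main

import (
	"errors"
	"fmt"
)

// runnerConfig is a hypothetical stand-in for the nodeRunner's saved config.
type runnerConfig struct {
	joinAddr string
}

// cluster is a hypothetical model of the swarm as seen by the restarting node.
type cluster struct {
	managers  map[string]bool // address -> currently a functioning manager
	persisted []string        // manager addresses the node has persisted
}

// startNode simulates one start attempt. With an explicit joinAddr only that
// address is tried; with an empty joinAddr the persisted list is searched for
// a functioning manager.
func startNode(c cluster, joinAddr string) (string, error) {
	candidates := c.persisted
	if joinAddr != "" {
		candidates = []string{joinAddr}
	}
	for _, addr := range candidates {
		if c.managers[addr] {
			return addr, nil
		}
	}
	return "", errors.New("no functioning manager reachable")
}

// restartConfig returns the config to use for a restart. If the node has ever
// become Ready, joinAddr is cleared so the persisted manager list is used.
func restartConfig(conf runnerConfig, wasReady bool) runnerConfig {
	if wasReady {
		conf.joinAddr = ""
	}
	return conf
}

func main() {
	// 10.0.0.1 was just demoted; 10.0.0.2 is still a manager (made-up addresses).
	c := cluster{
		managers:  map[string]bool{"10.0.0.1:2377": false, "10.0.0.2:2377": true},
		persisted: []string{"10.0.0.1:2377", "10.0.0.2:2377"},
	}
	conf := runnerConfig{joinAddr: "10.0.0.1:2377"}

	// Before the change: every restart reuses the stale joinAddr and fails.
	for attempt := 1; attempt <= 3; attempt++ {
		_, err := startNode(c, conf.joinAddr)
		fmt.Printf("stale joinAddr, attempt %d: err=%v\n", attempt, err)
	}

	// After the change: the node was Ready once, so joinAddr is cleared.
	fixed := restartConfig(conf, true)
	addr, err := startNode(c, fixed.joinAddr)
	fmt.Printf("cleared joinAddr: joined=%q err=%v\n", addr, err)
}
```

In this sketch, a node that has never become Ready keeps its `joinAddr`, since it has no persisted manager list to fall back on.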
- How to verify it
I'm unsure where we would stick an integration test, and the implementation thereof would probably be a nightmare.
To verify manually:
docker node demote [The Target's node id] && sleep 0.1 && docker node promote [The Worker's node id]
- Description for the changelog
* Fixed an issue where rapidly promoting a node after another node was demoted could cause the promoted node to fail its promotion.