@dperny commented May 28, 2024

- What I did

Fix a minor race condition that could cause a node promotion to fail if it happened right after another node was demoted.

- How I did it

If a node is promoted right after another node is demoted, a race is possible in which the newly promoted manager attempts to connect to the newly demoted manager to establish its initial Raft membership. This connection fails, and the whole swarm Node object exits.

At this point, the daemon nodeRunner sees the exit and restarts the Node.

However, if the address of the no-longer-manager is recorded in the nodeRunner's config.joinAddr, the Node again attempts to connect to the no-longer-manager, and crashes again. This repeats. The workaround is to remove the node entirely and rejoin the Swarm as a new node.

This change clears config.joinAddr when the nodeRunner restarts the Node, provided the node has previously become Ready. A node that has become Ready has, at some point, successfully joined the cluster in some fashion, and once it has joined, Swarm keeps its own persistent record of known manager addresses.

If no joinAddr is provided, then Swarm will choose from its persisted list of managers to join, and will join a functioning manager.
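
To make the restart behavior described above concrete, here is a minimal Go sketch. It is not the actual moby code: `restartConfig`, `nextJoinAddr`, and `pickManager` are hypothetical names, the addresses are made up, and picking the first functioning manager from the persisted list is an assumption about how Swarm chooses.

```go
package main

import "fmt"

// restartConfig is a hypothetical, simplified stand-in for the settings the
// daemon's nodeRunner reuses when it restarts the swarm Node.
type restartConfig struct {
	joinAddr string // address recorded from the original join command
}

// nextJoinAddr returns the join address to use for a restart. If the node has
// ever become Ready, the recorded address is dropped so that the persisted
// manager list is used instead.
func nextJoinAddr(conf restartConfig, wasReady bool) string {
	if wasReady {
		return ""
	}
	return conf.joinAddr
}

// pickManager models which manager gets contacted: an explicit join address
// always wins; otherwise the first persisted address that is still a manager
// is used (assumed selection order, for illustration only).
func pickManager(joinAddr string, persisted []string, isManager map[string]bool) (string, bool) {
	if joinAddr != "" {
		return joinAddr, isManager[joinAddr]
	}
	for _, addr := range persisted {
		if isManager[addr] {
			return addr, true
		}
	}
	return "", false
}

func main() {
	conf := restartConfig{joinAddr: "10.0.0.1:2377"}
	persisted := []string{"10.0.0.1:2377", "10.0.0.2:2377", "10.0.0.3:2377"}
	// 10.0.0.1 was just demoted, so only the other two are still managers.
	isManager := map[string]bool{"10.0.0.2:2377": true, "10.0.0.3:2377": true}

	// Without the fix: every restart reuses the stale join address and fails.
	addr, ok := pickManager(conf.joinAddr, persisted, isManager)
	fmt.Println(addr, ok)

	// With the fix: the node was Ready before, so the address is cleared and
	// a functioning manager is chosen from the persisted list.
	addr, ok = pickManager(nextJoinAddr(conf, true), persisted, isManager)
	fmt.Println(addr, ok)
}
```

The point of the sketch is that the recorded join address is only needed for the very first join; after that, reusing it on every restart is what pins the node to a manager that may no longer exist.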

- How to verify it

I'm unsure where we would stick an integration test, and the implementation thereof would probably be a nightmare.

To verify manually:

  1. Create a cluster with 3 Manager nodes.
  2. Add a worker node; call it "The Worker". Note which manager's IP address the join command points the worker at; call that node "The Target".
  3. On a node that is not The Target, run `docker node demote [The Target's node id] && sleep 0.1 && docker node promote [The Worker's node id]`.
  4. Without this patch, the promotion fails and the node gets stuck. With this patch, the promotion succeeds.
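
For repeated runs, step 3 could be scripted. The following is a rough Go sketch, not part of the patch: `demoteThenPromote` and the `runner` type are hypothetical helpers, the node IDs are placeholders, and `main` only prints the commands (a dry run). A real runner would wrap something like `exec.Command(name, args...).Run()` from `os/exec`.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// runner executes one command. It is injected so the sequence can be dry-run
// or tested without a Docker daemon.
type runner func(name string, args ...string) error

// demoteThenPromote mirrors step 3: demote The Target, pause briefly, then
// promote The Worker. Like `&&` in the shell, it stops at the first failure.
func demoteThenPromote(run runner, targetID, workerID string, pause time.Duration) error {
	if err := run("docker", "node", "demote", targetID); err != nil {
		return fmt.Errorf("demote %s: %w", targetID, err)
	}
	time.Sleep(pause)
	if err := run("docker", "node", "promote", workerID); err != nil {
		return fmt.Errorf("promote %s: %w", workerID, err)
	}
	return nil
}

func main() {
	// Dry-run runner: prints each command instead of executing it.
	dryRun := func(name string, args ...string) error {
		fmt.Println(name, strings.Join(args, " "))
		return nil
	}
	// The IDs below are placeholders; substitute real node IDs.
	err := demoteThenPromote(dryRun, "<target-node-id>", "<worker-node-id>", 100*time.Millisecond)
	if err != nil {
		fmt.Println("error:", err)
	}
}
```

The short pause matches the `sleep 0.1` in the manual command; the race depends on the promotion following the demotion closely, so the exact delay may need tuning on a given cluster.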

- Description for the changelog

* Fixed an issue where rapidly promoting a node after another node was demoted could cause the promoted node to fail its promotion.

If a node is promoted right after another node is demoted, there exists
the possibility of a race, by which the newly promoted manager attempts
to connect to the newly demoted manager for its initial Raft membership.
This connection fails, and the whole swarm Node object exits.

At this point, the daemon nodeRunner sees the exit and restarts the
Node.

However, if the address of the no-longer-manager is recorded in the
nodeRunner's config.joinAddr, the Node again attempts to connect to the
no-longer-manager, and crashes again. This repeats. The workaround is to
remove the node entirely and rejoin the Swarm as a new node.

This change erases config.joinAddr from the restart of the nodeRunner,
if the node has previously become Ready. The node becoming Ready
indicates that at some point, it did successfully join the cluster, in
some fashion. If it has successfully joined the cluster, then Swarm has
its own persistent record of known manager addresses. If no joinAddr is
provided, then Swarm will choose from its persisted list of managers to
join, and will join a functioning manager.

Signed-off-by: Drew Erny <derny@mirantis.com>
(cherry picked from commit 16e5c41)
Signed-off-by: Drew Erny <derny@mirantis.com>

@thaJeztah left a comment

LGTM

@thaJeztah merged commit f679e1d into moby:26.1 on May 29, 2024

renovate bot added a commit to earthly/dind that referenced this pull request on Jun 10, 2024


This PR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [docker/docker](https://togithub.com/docker/docker) | patch | `26.1.3` -> `26.1.4` |

---

### Release Notes

<details>
<summary>docker/docker (docker/docker)</summary>

### [`v26.1.4`](https://togithub.com/moby/moby/releases/tag/v26.1.4)

[Compare
Source](https://togithub.com/docker/docker/compare/v26.1.3...v26.1.4)

#### 26.1.4

For a full list of pull requests and changes in this release, refer to
the relevant GitHub milestones:

- [docker/cli, 26.1.4
milestone](https://togithub.com/docker/cli/issues?q=is%3Aclosed+milestone%3A26.1.4)
- [moby/moby, 26.1.4
milestone](https://togithub.com/moby/moby/issues?q=is%3Aclosed+milestone%3A26.1.4)
- Deprecated and removed features, see [Deprecated
Features](https://togithub.com/docker/cli/blob/v26.1.4/docs/deprecated.md).
- Changes to the Engine API, see [API version
history](https://togithub.com/moby/moby/blob/v26.1.4/docs/api/version-history.md).

##### Security

This release updates the Go runtime to 1.21.11, which contains security
fixes for:

- [CVE-2024-24789]
- [CVE-2024-24790]
- A symlink time-of-check to time-of-use (TOCTOU) race condition during
directory removal, reported by Addison Crump
([@addisoncrump](https://togithub.com/addisoncrump)).

##### Bug fixes and enhancements

- Fixed an issue where promoting a node immediately after another node
was demoted could cause the promotion to fail.
[moby/moby#47870](https://togithub.com/moby/moby/pull/47870)
- Prevent the daemon log from being spammed with
`superfluous response.WriteHeader call ...` messages.
[moby/moby#47843](https://togithub.com/moby/moby/pull/47843)
- Don't show empty hints when plugins return an empty hook message.
[docker/cli#5083](https://togithub.com/docker/cli/pull/5083)
- Added `ContextType: "moby"` to the context list/inspect output to
address a compatibility issue with Visual Studio Container Tools.
[docker/cli#5095](https://togithub.com/docker/cli/pull/5095)

##### Packaging updates

- Update containerd (static binaries only) to
[v1.7.17](https://togithub.com/containerd/containerd/releases/tag/v1.7.17).
[moby/moby#47841](https://togithub.com/moby/moby/pull/47841)
- [CVE-2024-24789], [CVE-2024-24790]: Update Go runtime to 1.21.11.
[moby/moby#47904](https://togithub.com/moby/moby/pull/47904)
- Update Compose to
[v2.27.1](https://togithub.com/docker/compose/releases/tag/v2.27.1).
[docker/docker-ce-packaging#1022](https://togithub.com/docker/docker-ce-packaging/pull/1022)
- Update Buildx to
[v0.14.1](https://togithub.com/docker/buildx/releases/tag/v0.14.1).
[docker/docker-ce-packaging#1021](https://togithub.com/docker/docker-ce-packaging/pull/1021)

[CVE-2024-24789]: https://togithub.com/golang/go/issues/66869

[CVE-2024-24790]: https://togithub.com/golang/go/issues/67680

</details>

---

### Configuration

📅 **Schedule**: Branch creation - "after 6am on monday" (UTC), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Enabled.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the
rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update
again.

---


This PR has been generated by [Mend
Renovate](https://www.mend.io/free-developer-tools/renovate/). View
repository job log
[here](https://developer.mend.io/github/earthly/dind).

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>