Skip to content

Don't PublishAllPorts AND specify the same IP bindings, else we get "address already in use" reliably on docker v28#776

Merged
kegsay merged 8 commits intomainfrom
anoa/fix_network_interface_flake
May 8, 2025
Merged

Don't PublishAllPorts AND specify the same IP bindings, else we get "address already in use" reliably on docker v28#776
kegsay merged 8 commits intomainfrom
anoa/fix_network_interface_flake

Conversation

@anoadragon453
Copy link
Member

@anoadragon453 anoadragon453 commented May 7, 2025

Docker v28 refactored the networking code. Complement was driving the networking API incorrectly, by using PublishAllPorts in addition to specifying specific PortBindings. This was because a niche use for Homerunner needed the homeservers to bind to a specific address, see #560 However, Docker was happy with this up until v28 where it would produce "address already in use" errors. This then created unrelated problems because the error occurred during ContainerStart. We never cleaned up the container properly, thus subsequent attempts to create a container with the same name would produce Conflict. The container name "/complement_crypto_dirty_hs1" is already in use by container "ef141595c2188aa4d0ece3aa5f74fdb43a7ba276cf95a00076ad387881d7b24a". You have to remove (or rename) that container to be able to reuse that name


Old theory:

See the explanation in moby/moby#49935.

Also see the Docker v28 release notes:

  • Add a new netlabel com.docker.network.endpoint.ifname to customize the interface name used when connecting a container to a network. It's supported by all built-in network drivers on Linux. moby/moby#49155

    • When a container is created with multiple networks specified, there's no guarantee on the order networks will be connected to the container. So, if a custom interface name uses the same prefix as the auto-generated names, for example eth, the container might fail to start.

    • The recommended practice is to use a different prefix, for example en0, or a numerical suffix high enough to never collide, for example eth100.

    • This label can be specified on docker network connect via the --driver-opt flag, for example:

      docker network connect --driver-opt=com.docker.network.endpoint.ifname=foobar …
    • Or via the long-form --network flag on docker run, for example:

      docker run --network=name=bridge,driver-opt=com.docker.network.endpoint.ifname=foobar …

Closes #775

…ult gateway

These options were introduced in v28. They are ignored by older Docker
client versions.
@anoadragon453 anoadragon453 requested review from a team and kegsay as code owners May 7, 2025 16:42
@anoadragon453
Copy link
Member Author

Next issue to contend with:

CreateDirtyDeployment failed: CreateDirtyDeployment: dirty: failed to list networks. Error response from daemon: client version 1.49 is too new. Maximum supported API version is 1.48

I plan to split the go1.22 -> 1.23 update into a separate PR once CI is passing.

@anoadragon453 anoadragon453 marked this pull request as draft May 7, 2025 17:35
@3nprob 3nprob mentioned this pull request May 7, 2025
1 task
@3nprob
Copy link

3nprob commented May 7, 2025

Next issue to contend with:

CreateDirtyDeployment failed: CreateDirtyDeployment: dirty: failed to list networks. Error response from daemon: client version 1.49 is too new. Maximum supported API version is 1.48

Looks like this is just about matching the minor version:

@kegsay kegsay changed the title Update docker to v28.1.1 and pin network interface name Don't PublishAllPorts AND specify the same IP bindings, else we get "address already in use" reliably on docker v28 May 8, 2025
@kegsay kegsay marked this pull request as ready for review May 8, 2025 08:06
@kegsay kegsay merged commit d2e04c9 into main May 8, 2025
4 checks passed
@kegsay kegsay deleted the anoa/fix_network_interface_flake branch May 8, 2025 08:12
MadLittleMods added a commit that referenced this pull request May 26, 2025
…server `BaseURL`

Regressed in #776
which started using the Docker default (`0.0.0.0`).

This caused downstream issues in some of our out-of-repo Complement tests
which stress some SSO/OIDC login flows and expect to be using the canonical
`public_baseurl` configured in Synapse.
MadLittleMods added a commit that referenced this pull request May 29, 2025
…server `BaseURL` (#781)

Regressed in #776 which started using the Docker default `HostIP` (`0.0.0.0`) because we removed the specific `HSPortBindingIP` port bindings for `8008` and `8448` (see "Root cause" section below).

### Problem

This caused downstream issues in some of our out-of-repo Complement tests (Element internal) which stress some SSO/OIDC login flows with Synapse. 

Before #776, [`SsoRedirectServlet`](https://github.com/element-hq/synapse/blob/33ba8860c43d4770ea119a09a4fcbbf366f3b32e/synapse/rest/client/login.py#L645-L683) received a request like this `http://127.0.0.1:8008/_matrix/client/v3/login/sso/redirect/oidc-test_idp?redirectUrl=http%3A%2F%2Fapp.my-fake-client.io%2Fclient-callback`. But now it's seeing `http://0.0.0.0:8008/_matrix/client/v3/login/sso/redirect/oidc-test_idp?redirectUrl=http%3A%2F%2Fapp.my-fake-client.io%2Fclient-callback` and tries to redirect to the canonical `http://127.0.0.1:8008/` address which would then go to the step we expect in the tests (authorization endpoint).

Basically, the `host` header in the request used to be set to `127.0.0.1:8008` but now it's `0.0.0.0:8008`. Synapse wants to use the canonical URL so the cookies are available so it redirects you to the canonical `public_baseurl` version first. Relevant Synapse code that cares about using the canonical URL to get cookies back -> [`SsoRedirectServlet`](https://github.com/element-hq/synapse/blob/33ba8860c43d4770ea119a09a4fcbbf366f3b32e/synapse/rest/client/login.py#L645-L683)

We could alternatively, update our out-of-repo tests to account for this extra redirect but it seems more appropriate to update Complement to use the the configured `HSPortBindingIP` as expected.


### Root cause

The host name used in the requests during the Complement tests is from the `client.BaseURL` which is sourced from the [Docker container port mapping/bindings](https://github.com/matrix-org/complement/blob/28a09014afd1f9c75a3549b4cd9d8fc480ba1149/internal/docker/deployer.go#L521) in Complement.

In #776, we removed the explicit port bindings for `8008` with `HSPortBindingIP` ([`127.0.0.1`](https://github.com/matrix-org/complement/blob/28a09014afd1f9c75a3549b4cd9d8fc480ba1149/config/config.go#L180)) so now it uses Docker's default `0.0.0.0`.

```patch
diff --git a/internal/docker/deployer.go b/internal/docker/deployer.go
index cdcf51a..a186b30 100644
--- a/internal/docker/deployer.go
+++ b/internal/docker/deployer.go
@@ -382,20 +380,8 @@ func deployImage(
 	}, &container.HostConfig{
 		CapAdd:          []string{"NET_ADMIN"}, // TODO : this should be some sort of option
 		PublishAllPorts: true,
-		PortBindings: nat.PortMap{
-			nat.Port("8008/tcp"): []nat.PortBinding{
-				{
-					HostIP: cfg.HSPortBindingIP,
-				},
-			},
-			nat.Port("8448/tcp"): []nat.PortBinding{
-				{
-					HostIP: cfg.HSPortBindingIP,
-				},
-			},
-		},
```
jevolk pushed a commit to matrix-construct/complement that referenced this pull request Jul 22, 2025
…address already in use" reliably on docker v28 (matrix-org#776)

* Update docker client v26.1.5 -> v28.1.1

* Set a deterministic network interface name, and ensure we're the default gateway

These options were introduced in v28. They are ignored by older Docker
client versions.

* Update go from `1.22` -> `1.23`

This was required as of moby v28: moby/moby#49541

* Update types to new names/locations for docker v28

* Downgrade docker to work with more installations:

* Reapply v28.0.4 since we need priority?

* Remove needless config; this will break homerunner

* Include alias again for fed tests

---------

Co-authored-by: Kegan Dougal <7190048+kegsay@users.noreply.github.com>
jevolk pushed a commit to matrix-construct/complement that referenced this pull request Jul 22, 2025
…server `BaseURL` (matrix-org#781)

Regressed in matrix-org#776 which started using the Docker default `HostIP` (`0.0.0.0`) because we removed the specific `HSPortBindingIP` port bindings for `8008` and `8448` (see "Root cause" section below).

### Problem

This caused downstream issues in some of our out-of-repo Complement tests (Element internal) which stress some SSO/OIDC login flows with Synapse. 

Before matrix-org#776, [`SsoRedirectServlet`](https://github.com/element-hq/synapse/blob/33ba8860c43d4770ea119a09a4fcbbf366f3b32e/synapse/rest/client/login.py#L645-L683) received a request like this `http://127.0.0.1:8008/_matrix/client/v3/login/sso/redirect/oidc-test_idp?redirectUrl=http%3A%2F%2Fapp.my-fake-client.io%2Fclient-callback`. But now it's seeing `http://0.0.0.0:8008/_matrix/client/v3/login/sso/redirect/oidc-test_idp?redirectUrl=http%3A%2F%2Fapp.my-fake-client.io%2Fclient-callback` and tries to redirect to the canonical `http://127.0.0.1:8008/` address which would then go to the step we expect in the tests (authorization endpoint).

Basically, the `host` header in the request used to be set to `127.0.0.1:8008` but now it's `0.0.0.0:8008`. Synapse wants to use the canonical URL so the cookies are available so it redirects you to the canonical `public_baseurl` version first. Relevant Synapse code that cares about using the canonical URL to get cookies back -> [`SsoRedirectServlet`](https://github.com/element-hq/synapse/blob/33ba8860c43d4770ea119a09a4fcbbf366f3b32e/synapse/rest/client/login.py#L645-L683)

We could alternatively, update our out-of-repo tests to account for this extra redirect but it seems more appropriate to update Complement to use the the configured `HSPortBindingIP` as expected.


### Root cause

The host name used in the requests during the Complement tests is from the `client.BaseURL` which is sourced from the [Docker container port mapping/bindings](https://github.com/matrix-org/complement/blob/28a09014afd1f9c75a3549b4cd9d8fc480ba1149/internal/docker/deployer.go#L521) in Complement.

In matrix-org#776, we removed the explicit port bindings for `8008` with `HSPortBindingIP` ([`127.0.0.1`](https://github.com/matrix-org/complement/blob/28a09014afd1f9c75a3549b4cd9d8fc480ba1149/config/config.go#L180)) so now it uses Docker's default `0.0.0.0`.

```patch
diff --git a/internal/docker/deployer.go b/internal/docker/deployer.go
index cdcf51a..a186b30 100644
--- a/internal/docker/deployer.go
+++ b/internal/docker/deployer.go
@@ -382,20 +380,8 @@ func deployImage(
 	}, &container.HostConfig{
 		CapAdd:          []string{"NET_ADMIN"}, // TODO : this should be some sort of option
 		PublishAllPorts: true,
-		PortBindings: nat.PortMap{
-			nat.Port("8008/tcp"): []nat.PortBinding{
-				{
-					HostIP: cfg.HSPortBindingIP,
-				},
-			},
-			nat.Port("8448/tcp"): []nat.PortBinding{
-				{
-					HostIP: cfg.HSPortBindingIP,
-				},
-			},
-		},
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Docker v28 update causes failed to listen on TCP socket: address already in use

3 participants