Description
Related to:
- [BUG] <title>Docker Engine 28.x regression in IP to interface assignment docker/compose#12776
- Allocate IPv6 addresses after detecting IPv6 support #47406
- libnet: add support for custom interface names #49155
- [BUG] Ifname label in driver-opt doesn't work docker/compose#12740
- api: add GwPriority field to EndpointSettings #48936
To summarise some of the discussion on docker/compose#12776 ...
- Before moby 28.0.0, for a given compose config, interface names in containers were always assigned to the same network endpoints.
- In moby 28.0.0:
  - That changed unexpectedly, interface names became unpredictable by default.
  - New option com.docker.network.endpoint.ifname was added, making it possible to explicitly assign an interface name.
  - New option GwPriority was added to make it possible to determine which network provides a container's default gateway.
This issue is for discussion of the problem and next steps.
How the change came about
Multiple network endpoints can be described in the API's container create request. But field EndpointsConfig is an unordered map, not a list. So, no predictable order for interface naming (eth0, eth1, ...) can be implied by a create request.
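As a minimal sketch of why a map cannot imply a naming order, the Go program below hands out eth0, eth1, ... while ranging over a map. The endpointsConfig type and both helper functions are hypothetical stand-ins, not daemon code. The sorted variant is included only for contrast, it is not what the daemon does.

```go
package main

import (
	"fmt"
	"sort"
)

// endpointsConfig is a hypothetical stand-in for the create request's
// EndpointsConfig field: a map keyed by network name.
type endpointsConfig map[string]struct{}

// namesByMapOrder assigns eth0, eth1, ... in Go map iteration order,
// which the language deliberately leaves unspecified.
func namesByMapOrder(cfg endpointsConfig) map[string]string {
	names := make(map[string]string, len(cfg))
	i := 0
	for network := range cfg {
		names[network] = fmt.Sprintf("eth%d", i)
		i++
	}
	return names
}

// namesBySortedOrder assigns names after sorting the keys, so the
// result is the same on every run for the same set of networks.
func namesBySortedOrder(cfg endpointsConfig) map[string]string {
	networks := make([]string, 0, len(cfg))
	for network := range cfg {
		networks = append(networks, network)
	}
	sort.Strings(networks)
	names := make(map[string]string, len(networks))
	for i, network := range networks {
		names[network] = fmt.Sprintf("eth%d", i)
	}
	return names
}

func main() {
	cfg := endpointsConfig{"front": {}, "back": {}, "mid": {}}
	fmt.Println("map order:   ", namesByMapOrder(cfg))
	fmt.Println("sorted order:", namesBySortedOrder(cfg))
}
```

Running it a few times should show the first line changing between runs while the second stays fixed.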
(Compose switched from using separate NetworkConnect to supplying multiple endpoints in the create request. But, that had no effect on network interface naming, because the connect calls were made before the container was started. Endpoints were accumulated internally in the same way.)
Interface names are allocated by libnetwork, during a call to populateNetworkResources. The code has changed a bit but, before and since 28.0.0, the daemon adds network connections to a container during sbJoin, by iterating over a temporary map that's populated from the API's map. There's a call to populateNetworkResources during that map iteration, so there is definitely no predictable ordering there: the endpoints have been through two Go maps.
But, there's another call to populateNetworkResources from SetKey - before and since 28.0.0. That call is made by iterating over libnetwork's list of endpoints sb.Endpoints - which is ordered.
Before 28.0.0, SetKey was a callback from the OCI prestart hook. Now, it's called once container task creation is complete and the container's configuration can be inspected (making it possible to check whether to allocate IPv6 addresses). The call to sbJoin now happens after SetKey.
Before 28.0.0, the SetKey call was made later than the sbJoin and it did the work - sbJoin returned early because there was no OS Sandbox yet.
Since 28.0.0, the sbJoin call is made later than SetKey and it does the work - SetKey takes no action (in this case) because sb.Endpoints has not yet been populated by sbJoin calls.
So, before 28.0.0, populateNetworkResources was called in sb.Endpoints order. Now, it's called in a random order.
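The swap in call order can be illustrated with a toy model. Everything below is hypothetical and heavily simplified: toySandbox, join, setKey and populate only echo the roles of the sandbox, sbJoin, SetKey and populateNetworkResources described above, a single map stands in for the two maps, and a plain string sort stands in for the real sb.Endpoints ordering.

```go
package main

import (
	"fmt"
	"sort"
)

// toySandbox is a hypothetical, simplified model of a libnetwork sandbox.
type toySandbox struct {
	hasOSSandbox bool              // true once setKey has run
	endpoints    []string          // stands in for the ordered sb.Endpoints
	ifnames      map[string]string // network name -> interface name
}

func newToySandbox() *toySandbox {
	return &toySandbox{ifnames: map[string]string{}}
}

// populate stands in for populateNetworkResources: it hands out the next
// free ethN name the first time it sees an endpoint.
func (sb *toySandbox) populate(network string) {
	if _, done := sb.ifnames[network]; done {
		return
	}
	sb.ifnames[network] = fmt.Sprintf("eth%d", len(sb.ifnames))
}

// join records the endpoint in the ordered list. It only names the
// interface if the OS sandbox already exists.
func (sb *toySandbox) join(network string) {
	sb.endpoints = append(sb.endpoints, network)
	sort.Strings(sb.endpoints) // assumption: plain sort, not Endpoint.Less
	if !sb.hasOSSandbox {
		return
	}
	sb.populate(network)
}

// setKey marks the OS sandbox as ready and names every endpoint that is
// already in the ordered list.
func (sb *toySandbox) setKey() {
	sb.hasOSSandbox = true
	for _, network := range sb.endpoints {
		sb.populate(network)
	}
}

// before28: joins run first and return early, then setKey does the work.
func before28(cfg map[string]struct{}) map[string]string {
	sb := newToySandbox()
	for network := range cfg {
		sb.join(network)
	}
	sb.setKey()
	return sb.ifnames
}

// since28: setKey runs first with nothing to do, then joins do the work
// in map iteration order.
func since28(cfg map[string]struct{}) map[string]string {
	sb := newToySandbox()
	sb.setKey()
	for network := range cfg {
		sb.join(network)
	}
	return sb.ifnames
}

func main() {
	cfg := map[string]struct{}{"front": {}, "back": {}, "mid": {}}
	fmt.Println("before 28.0.0:", before28(cfg)) // same assignment on every run
	fmt.Println("since 28.0.0: ", since28(cfg))  // assignment may vary between runs
}
```

In this model the naming logic is identical in both paths. Only the point at which it runs, and therefore the collection it iterates over, differs.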
The old order
But, the sb.Endpoints order has also changed ... the Endpoints are stored in gateway priority order, defined by Endpoint.Less before 28.0.0, and since.
Before 28.0.0, in sb.Endpoints - non-gateway, external, dual-stack networks came first - then IPv4-only, internal and gateway networks. Networks with the same properties were sorted lexicographically.
Since 28.0.0, there's a way to set gateway priority explicitly and that takes priority. Then dual stack networks are preferred over single stack (because we now have IPv6-only networks as well as IPv4-only), and the other rules were unchanged.
Ordering by gateway priority doesn't make sense as a way to determine network interface naming. While it would have made things stable for a fixed set of network connections, adding another network would have re-ordered things, and therefore renamed interfaces, apparently unpredictably - lexicographical ordering would sometimes have helped but not always (depending on network configurations). It wasn't even good for gateway selection, which is why we added a way to be explicit about gateway priority.
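A small sketch of the renaming problem, assuming every network ties on all the other sort criteria so that only the lexicographic tie-break applies. The function nameBySortOrder and the network names are made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// nameBySortOrder assigns eth0, eth1, ... to networks in lexicographic
// order. Assumption: all networks tie on every other sort criterion.
func nameBySortOrder(networks []string) map[string]string {
	sorted := append([]string(nil), networks...) // copy, leave input untouched
	sort.Strings(sorted)
	names := make(map[string]string, len(sorted))
	for i, n := range sorted {
		names[n] = fmt.Sprintf("eth%d", i)
	}
	return names
}

func main() {
	before := nameBySortOrder([]string{"backend", "frontend"})
	after := nameBySortOrder([]string{"backend", "frontend", "admin"})
	fmt.Println(before) // map[backend:eth0 frontend:eth1]
	fmt.Println(after)  // map[admin:eth0 backend:eth1 frontend:eth2]
}
```

Adding the "admin" network shifts both existing interfaces by one, even though nothing about "backend" or "frontend" changed.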
As @akerouanton described here - we shouldn't conflate gateway priority and endpoint naming.
What to do about it?
(As discussed in the networking maintainers call, 6th May.)
The only way to keep interface naming stable for those who were relying on it pre-28.0.0 would be to preserve the original Endpoint.Less ordering indefinitely, even though it's not really fit for purpose - names may change on container restart, and adding a new network to the config may change all of the names.
We now have a way to explicitly name interfaces for users who need predictable names.
In a given configuration, by using the new interface naming option to assign the interface names that would have been assigned before 28.0.0, networks will reliably have the same names before and after 28.0.0. (With no need to supply different compose configuration files, because the interface-naming label will just be ignored by pre-28.0.0 builds. So, backwards compatibility is possible for configurations that need it.)
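To illustrate why explicit names sidestep the ordering problem, here is a sketch that reads the com.docker.network.endpoint.ifname option from each endpoint's driver options. The endpointSettings type is a hypothetical, trimmed-down stand-in for the API's per-endpoint settings, and explicitNames is not a daemon function.

```go
package main

import "fmt"

// ifnameLabel is the driver option named in this issue.
const ifnameLabel = "com.docker.network.endpoint.ifname"

// endpointSettings is a hypothetical stand-in for the API's per-endpoint
// settings; only the driver options are modelled.
type endpointSettings struct {
	DriverOpts map[string]string
}

// explicitNames reads the requested interface name for each endpoint.
// Endpoints without the option are left out; how the daemon names those
// is not modelled here.
func explicitNames(cfg map[string]endpointSettings) map[string]string {
	names := make(map[string]string)
	for network, ep := range cfg { // iteration order does not affect the result
		if name, ok := ep.DriverOpts[ifnameLabel]; ok {
			names[network] = name
		}
	}
	return names
}

func main() {
	cfg := map[string]endpointSettings{
		"front": {DriverOpts: map[string]string{ifnameLabel: "eth0"}},
		"back":  {DriverOpts: map[string]string{ifnameLabel: "eth1"}},
	}
	fmt.Println(explicitNames(cfg)) // map[back:eth1 front:eth0]
}
```

Because each name travels with its endpoint, the result is the same however the map is iterated.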
Release 28.0.0 shipped on 20th Feb, and the issue was reported on 25th April. So, either there is a low number of affected users, or most affected users looked at the release notes and realised they could use the new interface naming to solve the problem.
Because the interface naming order isn't (and wasn't) guaranteed across container restarts and seemingly innocuous configuration changes, it's probably better to be explicitly unpredictable - to encourage use of explicit naming when the naming matters. We can support that indefinitely, more easily and reliably than supporting the old/unintended ordering.
So - there's no plan to restore the old sort order.
We considered going the other way: without explicit names, perhaps interface numbering should be completely random, to make it clear to users that naming has to be configured to be stable. But being able to rely on a container with a single interface having an eth0 is reasonable, so we don't need to change that.
@corhere suggested using alternative network names to give interfaces a second name, without the usual netdev restrictions ... a container can only have a single endpoint in each network, so we'd use the network name (if possible, else not assign an alternative name, to avoid collisions or ambiguity). Then, with no additional config needed, interfaces would have predictable names ... that enhancement seems worth investigating.
Reproduce
n/a
Expected behavior
No response
docker version
28.0.0
docker info
n/a
Additional Info
No response