Skip to content

IPv6 only: changes to bridge driver and gateway Endpoint selection#48323

Merged
thaJeztah merged 4 commits intomoby:masterfrom
robmry:v6only/bridge
Nov 7, 2024
Merged

IPv6 only: changes to bridge driver and gateway Endpoint selection#48323
thaJeztah merged 4 commits intomoby:masterfrom
robmry:v6only/bridge

Conversation

@robmry
Copy link
Contributor

@robmry robmry commented Aug 12, 2024

- What I did

  • Added EnableIPv4 to the bridge driver.
  • Gave libnetwork control over the bridge's use of docker-proxy to proxy from IPv6 host addresses to IPv4-only containers.
    • (See the notes below.)
  • Separated selection of IPv4 and IPv6 gateway Endpoints.
    • When a container's connected to an IPv4-only network and an IPv6-only network, it needs different gateways for each address family.

- How I did it

Notes on gateway endpoints ...

A gateway endpoint gets the default route and, for the bridge driver, port mappings from the host.

Before this change, a dual-stack endpoint was preferred over an IPv4-only endpoint - if there was an IPv6 endpoint, it was always dual-stack. So, there was no way for the IPv4 and IPv6 gateway endpoints to be different.

Now, a container with IPv4-only and IPv6-only endpoints needs two separate gateways. This is further complicated by the bridge driver's use of docker-proxy to map from host IPv6 addresses to an IPv4-only container, when the host address is unspecified in a port mapping.

If an IPv6 Endpoint is added to a Sandbox that only has an IPv4 endpoint, proxying from the host IPv6 address has to stop (so, the IPv4 gateway hasn't changed, but the driver still needs to be reconfigured). And vice-versa, if an IPv6-only endpoint is removed, leaving only IPv4-only endpoints, the host IPv6 proxy needs to be added.

Note that the new IPv6 endpoint may not belong to a bridge network, the bridge driver doesn't have the information it needs to deal with that ... proxying host-IPv6 to container-IPv4 is policy that needs to be handled by libnetwork. So, there's a new (private) driver option netlabel.NoProxy6To4 that can be passed to the driver's ProgramExternalConnectivity. Its value is calculated by libnetwork when an endpoint is added or removed, true if the IPv4 and IPv6 gateways are different. When its value changes, the IPv4 gateway is removed by RevokeExternalConnectivity, then re-added. So, ongoing connections will be interrupted and, if host ports are not specified in a port mapping, they'll change when the mapping is re-created.

Before this change, when adding a dual-stack macvlan/ipvlan endpoint to a Sandbox that only has IPv4 endpoints, the dual-stack endpoint became the gateway. So, if the existing gateway was an IPv4-only bridge endpoint with port mappings from the host, those port mappings would be removed. With this change, that behaviour is preserved - and the same logic applies when an IPv6-only macvlan/ipvlan is added to a Sandbox with a bridge IPv4-only endpoint; port mappings from host-IPv6 are removed.

We could offer better control over the way gateway endpoints are selected, default routes, what happens when networks are added/removed, and host port mappings. We could extend the driver interface to add a way to change gateway settings. But, I've treated all that as out-of-scope here.

- How to verify it

New and updated tests.

- Description for the changelog

@robmry robmry changed the title V6only/bridge IPv4 only: changes to bridge driver and gateway Endpoint selection Aug 12, 2024
@robmry robmry self-assigned this Aug 12, 2024
@robmry robmry added kind/enhancement Enhancements are not bugs or new features but can improve usability or performance. area/networking Networking area/networking/ipv6 Networking area/networking/d/bridge Networking area/networking/portmapping Networking labels Aug 12, 2024
@robmry robmry added this to the 28.0.0 milestone Aug 12, 2024
@robmry robmry changed the title IPv4 only: changes to bridge driver and gateway Endpoint selection IPv6 only: changes to bridge driver and gateway Endpoint selection Aug 12, 2024
@robmry robmry force-pushed the v6only/bridge branch 4 times, most recently from ed1f7ed to 1a3cf05 Compare September 25, 2024 13:48
@robmry robmry marked this pull request as ready for review September 25, 2024 14:37

// NoProxy6To4 disables proxying from an IPv6 host port to an IPv4-only
// container, when the default binding address is 0.0.0.0.
NoProxy6To4 = DriverPrivatePrefix + ".no_proxy_6to4"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we come up with a more solid way of picking up endpoints that should expose ports, we'll probably drop this label. But, DriverPrivatePrefix is a red herring, as there's nothing taking care of discarding 'private' labels from user-supplied options. So, in fact, this new label is now part of the 'public API', and dropping it in a future release would be a breaking change.

We should at least complement this comment to warn that it's not meant to be used by end-users. Or, we can be stricter, and drop all 'private' labels from the set of user-defined options passed to libnetwork.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update the comment to ...

// NoProxy6To4 disables proxying from an IPv6 host port to an IPv4-only
// container, when the default binding address is 0.0.0.0. This label
// is intended for internal use, it may be removed in a future release.

}

// selectGatewayEndpoint is like getGatewayEndpoint, but selects only from
// endpoints. It can be called when the Sandbox mutex is already held.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But nothing is holding sb's mutex when it's called from getGatewayEndpoint. At best, some code paths call joinLeaveStart to protect against races, but that's not universal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I split selectGatewayEndpoint out of getGatewayEndpoint so that it could be called from clearNetworkResources with the mutex already held.

getGatewayEndpoint now calls selectGatewayEndpoint(sb.Endpoints()) ... and sb.Endpoints() acquires the mutex to make a copy of the slice.

Previously, getGatewayEndpoint did this, which is equivalent ...

func (sb *Sandbox) getGatewayEndpoint() *Endpoint {
	for _, ep := range sb.Endpoints() {
		...

So, if there's a path with a race, I don't think this change affects it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, absolutely, I should have been more explicit! In terms of concurrency, your code is equivalent to what we had before.

But I think the comment

It can be called when the Sandbox mutex is already held.

is a bit misleading, because obvisouly that's no what's happening just above.

Copy link
Member

@akerouanton akerouanton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment about selectGatewayEndpoint

if c.Mtu, err = strconv.Atoi(value); err != nil {
return parseErr(label, value, err.Error())
}
case netlabel.EnableIPv4:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm always surprised that we need to parse the same label twice. We should look at unifying fromLabels and parseNetworkOptions at some point.

}

// selectGatewayEndpoint is like getGatewayEndpoint, but selects only from
// endpoints. It can be called when the Sandbox mutex is already held.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, absolutely, I should have been more explicit! In terms of concurrency, your code is equivalent to what we had before.

But I think the comment

It can be called when the Sandbox mutex is already held.

is a bit misleading, because obvisouly that's no what's happening just above.

if err = d.ProgramExternalConnectivity(ctx, n.ID(), ep.ID(), sb.Labels()); err != nil {
labels := sb.Labels()
labels[netlabel.NoProxy6To4] = noProxy6To4After
if err := d.ProgramExternalConnectivity(ctx, n.ID(), ep.ID(), labels); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason not to use the newly introduced ep.programExtrernalConnectivity here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ep.programExternalConnectivity loads the driver and network from the store. It's used by sbLeave when the departing endpoint is a gateway, and some other Endpoint needs to take over the gateway-ness. That other endpoint may have a different driver, and will have a different network.

Here, during sbJoin, the driver and network are ep's - so they're already loaded and the driver call can be made directly.

extN, err := extEp.getNetworkFromStore()
if err != nil {
return fmt.Errorf("failed to get network from store for revoking external connectivity during join: %v", err)
gwepAfter4, gwepAfter6 := sb.getGatewayEndpoint()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heck, not critizing your code, but that level of Single Responsibility violation is super crazy! 🙈

We should really straighten that code up at some point.

if ipamV4Data[0].Gateway != nil {
c.AddressIPv4 = types.GetIPNetCopy(ipamV4Data[0].Gateway)
}
if ipamV4Data[0].Gateway != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can't get an IPAM config with no gateway assigned

moby/libnetwork/network.go

Lines 1622 to 1638 in 367c910

if gws, ok := d.Meta[netlabel.Gateway]; ok {
if d.Gateway, err = types.ParseCIDR(gws); err != nil {
return types.InvalidParameterErrorf("failed to parse gateway address (%v) returned by ipam driver: %v", gws, err)
}
}
// If user requested a specific gateway, libnetwork will allocate it
// irrespective of whether ipam driver returned a gateway already.
// If none of the above is true, libnetwork will allocate one.
if cfg.Gateway != "" || d.Gateway == nil {
gatewayOpts := map[string]string{
ipamapi.RequestAddressType: netlabel.Gateway,
}
if d.Gateway, _, err = ipam.RequestAddress(d.PoolID, net.ParseIP(cfg.Gateway), gatewayOpts); err != nil {
return types.InternalErrorf("failed to allocate gateway (%v): %v", cfg.Gateway, err)
}
}

And falling back to the Pool address would be invalid as it's the 'network address'. Probably something we should clean up in a follow-up.

@robmry
Copy link
Contributor Author

robmry commented Oct 31, 2024

Rebased - no other changes.


var err error
ep.portMapping, err = n.addPortMappings(context.TODO(), ep.addr, ep.addrv6, cfg, n.config.DefaultBindingIP)
ep.portMapping, err = n.addPortMappings(context.TODO(), ep.addr, ep.addrv6, cfg, n.config.DefaultBindingIP, ep.extConnConfig.NoProxy6To4)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to these changes, but I'm curious why this function checks for ep.extConnConfig.ExposedPorts or ep.extConnConfig.PortBindings to be nil if it doesn't use those values 🤔 (also curious why ExposedPorts is needed for Portbindings to be used);

if ep.extConnConfig == nil ||
ep.extConnConfig.ExposedPorts == nil ||
ep.extConnConfig.PortBindings == nil {
return
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Port bindings don't imply that ports should be exposed (I'm not sure why, but I don't think we can change it now). If ports aren't exposed, the binding just doesn't do anything.

So, if either field is missing, there's nothing to do.

Perhaps I could have changed it to check ep.portBindings instead, in 4f09af6 which is the change that removed direct use of those values. I think that'd probably be equivalent now (and would have been before that change) ... maybe something to look at in a follow-up.

Comment on lines +633 to +634
log.G(ctx).Debugf("Revoking %s external connectivity on endpoint %s (%s), NoProxy6To4:%v",
role, ep.Name(), ep.ID(), noProxy6To4Before)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably good to use the same structured logs fields for this one as used below (there's also some existing logs further up in this function that log ep.Name() but not as structured log).

Not blocking though (but probably good to make a pass at doing so); perhaps constructing a logger could work for this (logger := ....WithFields() or ctx := log.WithLogger(....)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok - done, as a separate commit.

Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines +769 to +773
return types.InvalidParameterErrorf("ipv4 pool is empty")
}
if config.EnableIPv6 && (len(ipV6Data) == 0 || ipV6Data[0].Pool.String() == "::/0") {
return types.InvalidParameterErrorf("ipv6 pool is empty")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙈 looks like we're not consistent on casing; first error above uses IPv4 / IPv6, and these both use lowercase 😂

@thaJeztah
Copy link
Member

Doh! I think I caused a merge conflict in my linting PR; happy do do the rebase for you

@thaJeztah
Copy link
Member

(CI is green, so happy to merge without waiting for it to fully run again after the merge conflict is resolved)

Signed-off-by: Rob Murray <rob.murray@docker.com>
When the host has IPv6 support but the container doesn't, the
default bind address is 0.0.0.0, and a port mapping does not
include a host address - the default behaviour (of the bridge
driver) is to run a docker-proxy to map from any IPv6 host
address to the IPv4 address of a container's gateway endpoint.

Driver option com.docker.network.driver.private.no_proxy_6to4
can now be used to disable that behaviour when configuring
a Sandbox's external connectivity.

Signed-off-by: Rob Murray <rob.murray@docker.com>
A dual-stack endpoint still has priority when selecting a gateway
Endpoint for a Sandbox. But, now there are IPv6-only networks, it
is possible to have a Sandbox with only IPv4-only and IPv6-only
endpoints. This change means they are both gateway endpoints.

Tell the network driver it mustn't proxy host-IPv6 to endpoint-IPv4
when there's an IPv6 gateway endpoint (which may belong to a different
net driver). Update that when networks are connected/disconnected.

Signed-off-by: Rob Murray <rob.murray@docker.com>
Signed-off-by: Rob Murray <rob.murray@docker.com>
@robmry
Copy link
Contributor Author

robmry commented Nov 7, 2024

Doh! I think I caused a merge conflict in my linting PR; happy do do the rebase for you
(CI is green, so happy to merge without waiting for it to fully run again after the merge conflict is resolved)

No worries - it's rebased, I'll let CI run because there were quite a few commits (and a conflict in a test with the /etc/hosts change, that git couldn't spot).

@thaJeztah
Copy link
Member

All green again! Merging; thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/networking/d/bridge Networking area/networking/ipv6 Networking area/networking/portmapping Networking area/networking Networking kind/enhancement Enhancements are not bugs or new features but can improve usability or performance.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants