Skip to content

Conversation

@robmry
Copy link
Contributor

@robmry robmry commented May 28, 2024

- What I did

  • Allocate the same port for IPv4 and IPv6.

  • Add bridge driver option com.docker.network.bridge.gateway_mode_ipv6=<nat|routed> (and com.docker.network.bridge.gateway_mode_ipv4=<nat|routed>).

    • When set to routed, no NAT or masquerade iptables rules are set up.
    • The container is still firewalled, with an open port for each -p.
    • In inspect output, a non-NAT'd port mapping is shown with "HostPort": "".
    • So, if the host's network has routing for a docker network, it's possible to communicate directly with the container's address.
    • As with existing -p (HostConfig.PortBindings) options, a --expose (Config.ExposedPorts) for the container's port is still required to make the -p do anything. On its own, --expose doesn't result in the port being opened.
    • Host address/ports in -p for a disabled-NAT network are ignored, the container's port is opened.

- How I did it

  • Replaced the v4/v6 portmappers with (simpler) bridge code.
  • Also moved the bridge-specific per-port iptables code into the bridge driver.

- How to verify it

New/updated tests.

# docker network create --ipv6 --subnet fdee::/64 -o com.docker.network.bridge.gateway_mode_ipv6=routed nnn
# netcat -l -p 8080   # make the first port in the range below unusable
# docker run --rm -ti --network nnn --name c1 -p 8080-8083:80 nginx
# docker inspect c1
[
    {
        ...
        "HostConfig": {
            ...
            "PortBindings": {
                "80/tcp": [
                    {
                        "HostIp": "",
                        "HostPort": "8080-8083"
                    }
                ]
            },
        "Config": {
            ...
            "ExposedPorts": {
                "80/tcp": {}
            },
        },
        "NetworkSettings": {
            ...
            "Ports": {
                "80/tcp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "8081"
                    },
                    {
                        "HostIp": "::",
                        "HostPort": ""
                    }
                ]
            }
}}]

# IPv4 has NAT rules ...
*nat
-A POSTROUTING -s 172.19.0.2/32 -d 172.19.0.2/32 -p tcp -m tcp --dport 80 -j MASQUERADE
-A POSTROUTING -s 172.19.0.0/16 ! -o br-c68a9a829188 -j MASQUERADE
-A DOCKER ! -i br-c68a9a829188 -p tcp -m tcp --dport 8081 -j DNAT --to-destination 172.19.0.2:80
*filter
-A DOCKER -d 172.19.0.2/32 ! -i br-c68a9a829188 -o br-c68a9a829188 -p tcp -m tcp --dport 80 -j ACCEPT

# IPv6 has no NAT rules ...
*filter
-A DOCKER -d fdee::2/128 ! -i br-c68a9a829188 -o br-c68a9a829188 -p tcp -m tcp --dport 80 -j ACCEPT

# docker-proxy for IPv4 only ...
# ps -ef | grep docker-proxy
root     11894 11420  0 15:08 pts/0    00:00:00 /usr/local/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 8081 -container-ip 172.19.0.2 -container-port 80

On a "CentOS 9 (stream)" VM, running firewalld, with addresses 192.168.175.43 and fd48:70ba:de62:0:a00:27ff:fece:c637 ...

  • created a network with no-nat for IPv6 as above (with IPv6 subnet fd12:3456::/64).
  • Ran firewall-cmd --reload, checked that the rules were re-created.
  • From another host, added a static route to the VM and checked that 'http://[fd12:3456::2]' and curl 'http://192.168.175.43:8080' worked as expected.

- Description for the changelog

- In port mappings with no host port specified, and with host port ranges, allocate the same port for IPv4 and IPv6.
- Add bridge driver option `com.docker.network.bridge.gateway_mode_ipv6=<nat|routed>`, when
  set to `routed` no NAT or masquerading rules are configured for port mappings.
  - This enables direct IPv6 access to the container, if the host's network can route packets for the
     container's address to the host. Mapped ports will be opened in the containers firewall.
  - For example, `docker network create --ipv6 -o com.docker.network.bridge.gateway_mode_ipv6=routed mynet`.
  - Option `com.docker.network.bridge.gateway_mode_ipv4=<nat|routed>` is also available.

@robmry robmry force-pushed the portmapper_fixes_and_nonat branch from 0770137 to 68c6534 Compare May 29, 2024 14:22
@robmry robmry changed the title Portmapper fixes and nonat Portmapper improvements, and options to disable NAT May 29, 2024
@robmry robmry force-pushed the portmapper_fixes_and_nonat branch from 68c6534 to d594859 Compare May 29, 2024 16:11
@robmry robmry self-assigned this May 29, 2024
@robmry robmry added this to the 27.0.0 milestone May 29, 2024
@robmry robmry added status/1-design-review status/2-code-review kind/feature Functionality or other elements that the project doesn't currently have. Features are new and shiny area/networking Networking impact/changelog impact/documentation kind/bugfix PR's that fix bugs area/networking/ipv6 Networking area/networking/portmapping Networking labels May 29, 2024
@robmry robmry requested review from akerouanton and corhere May 29, 2024 16:14
@robmry robmry marked this pull request as ready for review May 29, 2024 16:25
@robmry robmry force-pushed the portmapper_fixes_and_nonat branch from d594859 to e066f01 Compare May 30, 2024 18:05
p.mutex.Lock()
defer p.mutex.Unlock()

if ip == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's never called with ip == nil, and removing the check makes the code simpler. But, I didn't do the same for RequestPortInRange, so I can put it back.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's never called with ip == nil

Oh, right! Then, maybe remove that in a follow-up commit / PR.

// on live-restore we don't want to allocate different ports - so, remove the range
// from the operational data.
for _, pm := range ep.portMapping {
pm.HostPortEnd = pm.HostPort
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if this is really the right place to do that. Based on your comment, it seems we mix state and config in a struct meant to store only the end state. Maybe this should be done when ep.portMapping entries are created instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is dealing with old saved-state - it represents running config, the ports can't be a range, but that's how the old code stored it.

The new code will treat it as a range on live-restore - it won't know it's live-restore, it's just been given the struct as config as if it had come from the API.

So the old saved-state has to be fixed up ... where else could it be done?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is dealing with old saved-state

Oh, right 🙈 Maybe we should consider dropping this compatibility code after a few releases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a comment to say it can be removed when upgrade from 26.x is no longer supported.

} else if b.HostIP.To4() != nil {
v = iptables.IPv4
} else {
// The binding is between containerV4 and hostV6, handled by docker-proxy
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since IPv4-mapped IPv6 addrs aren't properly unmapped, I think this condition doesn't hold true.

Overall I think this condition is a bit cursed since each branch doesn't test the same field. I would rewrite it to explicitly early return whenever b.IP and b.HostIp aren't of the same family -- but then net's lack of expressiveness kicks in (ie. you'd have to write a helper function).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, I'll take a look at the handling of IPv4-mapped IPv6.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

if bnd.HostIP.To4() == nil {
portmapper = n.portMapperV6
func (n *bridgeNetwork) reapplyPerPortIptables4() {
n.reapplyPerPortIptables(func(b *portBinding) bool { return b.IP.To4() != nil })
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that's contrived! Better use an arg that clearly states what address family should be restored, and then let it filter port bindings based on that.

We should in a follow-up if we can just restore everything in a single pass and call iptables.OnReloaded only once.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Passing in a filter function is contrived? Better than a bool ipv4 or similar I think. But, can do that.

@robmry robmry force-pushed the portmapper_fixes_and_nonat branch 2 times, most recently from e61b9db to 026c263 Compare June 4, 2024 15:55
Copy link
Contributor

@corhere corhere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All this talk about IPv6-mapped-IPv4 addresses is making me nervous. There is no semantic difference between listen()ing or bind()ing on [::ffff:10.0.0.1]:8080 vs. 10.0.0.1:8080. Only IPv4 packets will be accepted/transmitted over the wire for such a socket. So why does the portmapper need to fuss so much about net.IP slice lengths?

Comment on lines +141 to +135
ipstr := ip.String()
if _, ok := p.ipMap[ipstr]; !ok {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered converting the implementation to use net/netip types internally, or would it be a bit much to tack onto this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it'd be better - but also yes, a bit too much for this PR.

Comment on lines +356 to +410
n.Lock()
defer n.Unlock()
if n.driver == nil {
return ""
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's with the locking and defensive coding? It looks a lot like n.driver is guaranteed to be non-nil and immutable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're right - but every other 'bridgeNetwork method makes the same checks. A separate review/rework of all that could be good. But, for this change, it seems best to stick with tradition.

@robmry robmry force-pushed the portmapper_fixes_and_nonat branch from 026c263 to bc962e6 Compare June 5, 2024 14:45
@robmry
Copy link
Contributor Author

robmry commented Jun 5, 2024

All this talk about IPv6-mapped-IPv4 addresses is making me nervous. There is no semantic difference between listen()ing or bind()ing on [::ffff:10.0.0.1]:8080 vs. 10.0.0.1:8080. Only IPv4 packets will be accepted/transmitted over the wire for such a socket. So why does the portmapper need to fuss so much about net.IP slice lengths?

It needs to put rules into iptables or ip6tables, so it has to know what sort of address it's really dealing with.

@robmry robmry force-pushed the portmapper_fixes_and_nonat branch from bc962e6 to f6cbc15 Compare June 5, 2024 15:21
@robmry robmry force-pushed the portmapper_fixes_and_nonat branch 2 times, most recently from 8ce32a2 to 4c0e873 Compare June 5, 2024 22:25
robmry added 5 commits June 11, 2024 22:33
Add bridge driver options...
  com.docker.network.bridge.gateway_mode_ipv4=<nat|routed>
  com.docker.network.bridge.gateway_mode_ipv6=<nat|routed>

If set to "routed", no NAT or masquerade rules are set up for port
mappings.

When NAT is disabled, the mapping is shown in 'inspect' output with
no host port number. For example, for "-p 80" with NAT disabled for
IPv6 but not IPv4:

    "80/tcp": [
        {
            "HostIp": "0.0.0.0",
            "HostPort": "32768"
        },
        {
            "HostIp": "::",
            "HostPort": ""
        }

Signed-off-by: Rob Murray <rob.murray@docker.com>
Without this change, if a port mapping did not specify a host address
and the network was IPv6-enabled, the same port would be allocated for
mappings from '0.0.0.0' and '::'. But, if the port mapping was specified
with explicit addresses even, for example:
  -p 0.0.0.0:8080-8083:80 -p '[::]:8083-8080:80'

This change looks for port mappings that only differ in the host IP
address, and makes sure it allocates the same port for all of them. If
it can't, it fails with an error.

Signed-off-by: Rob Murray <rob.murray@docker.com>
It was added so that tests could replace it before it was picked
up and used by a new network's PortMapper, so that tests were isolated
from each other. Now the PortMapper is not used by the bridge driver,
neither is driver's portAllocator.

Instead of replacing the driver.portAllocator in tests, reset the
singleton instance using its ReleaseAll().

Un-export portallocator.NewInstance, now the tests aren't using it.

Signed-off-by: Rob Murray <rob.murray@docker.com>
Display a PortBinding in a format that's more like the one
used in the CLI, but includes the container IP if known.

Signed-off-by: Rob Murray <rob.murray@docker.com>
When bridge driver opt com.docker.network.bridge.gatway_mode_ipv[46]
is set to "routed", there is no NAT.

When there's no NAT, there's no meaning to the HostPort field in a
port mapping (all the port mapping does is open the container's port),
and the HostIP field is only used to determine the address family.

So, check port bindings, and raise errors if fields are unexpectedly
set when the mapping only applies to a gateway_mode=routed network.
Zero-addresses are allowed, to say the mapping/open-port should be
IPv4-only or IPv6-only, and host ports are not allowed.

A mapping with no host address, so it applies to IPv4 and IPv6 when
the default binding is 0.0.0.0, may include a host port if either
uses NAT. The port number is ignored for the directly-routed family.

Signed-off-by: Rob Murray <rob.murray@docker.com>
@robmry robmry force-pushed the portmapper_fixes_and_nonat branch from 8c2e4ca to 01eecb6 Compare June 11, 2024 21:34
@robmry robmry merged commit 52333f3 into moby:master Jun 13, 2024
@robmry robmry deleted the portmapper_fixes_and_nonat branch June 13, 2024 08:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/networking/ipv6 Networking area/networking/portmapping Networking area/networking Networking impact/changelog impact/documentation kind/bugfix PR's that fix bugs kind/feature Functionality or other elements that the project doesn't currently have. Features are new and shiny status/1-design-review status/2-code-review

Projects

None yet

3 participants