Skip to content

Don't enforce new validation rules for existing networks#47361

Merged
thaJeztah merged 2 commits intomoby:masterfrom
robmry:47331_swarm_ipam_validation
Feb 16, 2024
Merged

Don't enforce new validation rules for existing networks#47361
thaJeztah merged 2 commits intomoby:masterfrom
robmry:47331_swarm_ipam_validation

Conversation

@robmry
Copy link
Contributor

@robmry robmry commented Feb 8, 2024

- What I did

Non-swarm networks created before validation checks were added on network creation (#45759) continued working in 25.0.x because the checks were not re-run.

But, a swarm network with invalid IPAM config (for example, subnet 10.0.0.1/24) fails to initialise on upgrade to 25.0.x.

Fixes #47331

- How I did it

Swarm creates networks (with agent=true) to ensure they exist on each agent, when needed - ignoring the NetworkNameError that says the network already existed.

By ignoring IPAM validation errors on creation of a network with agent=true, pre-existing swarm networks with IPAM config that would fail the new checks will continue to work too.

New swarm (overlay) networks are still validated, because they are initially created with 'agent=false'.

- How to verify it

As described in #47331 ...

Start with a 24.0.x build:

  • docker network rm ingress
  • docker network create --driver overlay --ingress ingress --subnet=10.0.0.1/1
  • docker service create -p 80:80 nginx
  • Upgrade to docker 25.0.3

The service failed to start, logging errors like this ...

level=error msg="fatal task error" error="invalid network config:\ninvalid subnet 10.0.0.1/16: it should be 10.0.0.0/16" module=node/agent/taskmanager node.id=s6ui3u8t66gml6czf2jwxhpgn service.id=rty5ui8sw8j6a1yl5tl98e6z9 task.id=lffsjphhk89mcwhhvn6heb91x

With this change, the service starts, and log lines like this are generated ("ingress" is configured as above, "oln" is a non-ingress overlay network) ...

WARN[2024-02-08T17:38:29.881000427Z] Continuing with validation errors in agent network IPAM  error="invalid network config:\ninvalid subnet 10.0.0.1/16: it should be 10.0.0.0/16" network=ingress
WARN[2024-02-08T17:38:29.881073302Z] Continuing with validation errors in agent network IPAM  error="invalid network config:\ninvalid subnet 10.1.0.1/16: it should be 10.1.0.0/16" network=oln

It's not possible to create new overlay and bridge networks with the same issue ...

# docker network create --subnet 10.10.0.1/24 nnn
Error response from daemon: invalid network config:
invalid subnet 10.10.0.1/24: it should be 10.10.0.0/24

# docker network create --driver overlay ooo --subnet=10.2.0.1/16
Error response from daemon: invalid network config:
invalid subnet 10.2.0.1/16: it should be 10.2.0.0/16

- Description for the changelog

Do not enforce new validation rules for existing swarm networks

@robmry robmry self-assigned this Feb 8, 2024
@robmry robmry marked this pull request as ready for review February 8, 2024 18:48
@robmry robmry requested review from corhere and thaJeztah February 8, 2024 18:49
Copy link
Contributor

@corhere corhere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I now see how newly-created networks are always validated before passed to Swarmkit: the (*Daemon).createNetwork(agent=false) flow runs as usual, all the way up to the (*Controller).NewNetwork() call. Libnetwork notices that it's a Swarm-scope network and returns libnetwork.ManagerRedirectError, and the caller creates the network in the Swarm state instead.

func (n *networkRouter) postNetworkCreate(ctx context.Context, w http.ResponseWriter, r *http.Request, vars map[string]string) error {
if err := httputils.ParseForm(r); err != nil {
return err
}
var create types.NetworkCreateRequest
if err := httputils.ReadJSON(r, &create); err != nil {
return err
}
if nws, err := n.cluster.GetNetworksByName(create.Name); err == nil && len(nws) > 0 {
return libnetwork.NetworkNameError(create.Name)
}
nw, err := n.backend.CreateNetwork(create)
if err != nil {
if _, ok := err.(libnetwork.ManagerRedirectError); !ok {
return err
}
id, err := n.cluster.CreateNetwork(create)
if err != nil {
return err
}
nw = &types.NetworkCreateResponse{
ID: id,
}
}
return httputils.WriteJSON(w, http.StatusCreated, nw)
}

In light of this I am okay with your fix as-is, under one condition: a comment is added in func (*networkRouter) postNetworkCreate which calls out that the n.backend.CreateNetwork(create) call is necessary for validating the network spec for Swarm-scoped networks, despite not actually creating the network. I'm just a bit worried that we might accidentally regress with an overzealous libnetwork refactoring, given how surprising the control flow is.

Comment on lines +339 to +340
// agent networks here, swarm networks will continue to work too. (New overlay
// networks are initially created as !agent, so they are still validated.)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a swarm-scoped network is created it is run through createNetwork with !agent, but the network is not actually created by the createNetwork call. And this is true for all swarm-scoped networks irrespective of driver, not just overlay. (#32981)

Non-swarm networks created before network-creation-time validation
was added in 25.0.0 continued working, because the checks are not
re-run.

But, swarm creates networks when needed (with 'agent=true'), to
ensure they exist on each agent - ignoring the NetworkNameError
that says the network already existed.

By ignoring validation errors on creation of a network with
agent=true, pre-existing swarm networks with IPAM config that would
fail the new checks will continue to work too.

New swarm (overlay) networks are still validated, because they are
initially created with 'agent=false'.

Signed-off-by: Rob Murray <rob.murray@docker.com>
Signed-off-by: Rob Murray <rob.murray@docker.com>
@robmry robmry force-pushed the 47331_swarm_ipam_validation branch from 73a4606 to a26c953 Compare February 9, 2024 11:57
@robmry
Copy link
Contributor Author

robmry commented Feb 9, 2024

Thank you @corhere, I've made those updates.

Copy link
Member

@akerouanton akerouanton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Existing Docker Swarm Network with invalid subnet config is not upgraded gracefully when upgrading to 25.x

5 participants