Fix 'failed to get network during CreateEndpoint'#41011
Fix 'failed to get network during CreateEndpoint'#41011xinfengliu wants to merge 1 commit intomoby:masterfrom
Conversation
|
Ooops, I didn't notice a modified file is in |
b02c627 to
5531a65
Compare
thaJeztah
left a comment
There was a problem hiding this comment.
Left some thoughts / suggestions
There was a problem hiding this comment.
This looks scary; shouldn't these be removed, or is the code in tryDetachContainerFromClusterNetwork meant to replace this?
There was a problem hiding this comment.
It is the code tryDetachContainerFromClusterNetwork meant to replace this. If removeNetworks is in controller's method, then it is asynchronized with container-start, and will cause test case in docker/for-linux#888 (comment) to fail.
Another way may be adding a delay in controller's Start(), but it is not reliable.
There was a problem hiding this comment.
instead of doing this, what are your thoughts on applying a retry logic below in Start ?
There was a problem hiding this comment.
@arkodg I'm afraid we are not able to do retry logic in Start, nc.adapter.networkAttach(ctx) in Start just sends a message to trigger container starting to continue.
There was a problem hiding this comment.
trying to understand why
moby/daemon/container_operations.go
Line 428 in a23ca16
There was a problem hiding this comment.
@arkodg
See:
moby/daemon/container_operations.go
Lines 748 to 784 in a23ca16
The original problem occurs at L784, the retry logic inside L748 does not help.
daemon/container_operations.go
Outdated
There was a problem hiding this comment.
If we failed to detach the container from the network, should we still try to delete the network? I think this would always fail in that case (as there's still endpoints attached)? 🤔
There was a problem hiding this comment.
Yes, if DetachNetwork fails, then there's no point continuing to delete the network. I will modify it.
There is only one possible error in DetachNetwork: could not find network attachment in c.attachers[aKey], I don't know how this could happen.
There was a problem hiding this comment.
Modified per your suggestion.
There was a problem hiding this comment.
After thinking, I still feel we should run daemon.DeleteManagedNetwork no matter DetachNetwork fails or not, because this is nearly the final stage of container-stop cleanup and network deletion is not run elsewhere when container runs with attachable overlay network.
a46cec4 to
63f5231
Compare
Fix 'failed to get network during CreateEndpoint' during container starting. 2 scenarios: 1. For swarmkit service containers Change the error type to `libnetwork.ErrNoSuchNetwork`, so `Start()` in `daemon/cluster/executor/container/controller.go` will recreate the network. 2. For 'docker run --network <attachable overlay network>' Add remove-network logic in `tryDetachContainerFromClusterNetwork()` in `daemon/container_operations.go`. Also added test case for scenario 2. Signed-off-by: Xinfeng Liu <xinfeng.liu@gmail.com>
63f5231 to
5363fed
Compare
Further optimize fix potential IP overlap with node LB (method 2). - Refactor test codes - Add testServiceInvalidMount - No need to run removeNetworks() when task container startup failure. - Add removeNetworks() in Prepare() of daemon/cluster/executor/container/controller.go. - Bring over moby#41011 to make a integrated fix. Signed-off-by: Xinfeng Liu <xinfeng.liu@gmail.com>
|
@xinfengliu what's the status on this one; do we need both the changes in libnetwork and this one, or only "one of them" ? |
|
@thaJeztah So moby/libnetwork#2554 and this PR is independent from each other. |
| assert.NilError(t, err) | ||
|
|
||
| var timeout time.Duration = 20 * time.Second | ||
| err = client.ContainerRestart(ctx, c1, &timeout) |
There was a problem hiding this comment.
does this work with a smaller restart value , default in compose and cli is 10s
There was a problem hiding this comment.
@arkodg Stopping busybox container need at least 10 seconds since busybox does not have signal handler for SIGTERM. So I feel the timeout here needs to be longer than 10 seconds.
There was a problem hiding this comment.
I'm afraid setting timeout to 11s will cause CI flaky.
|
I don't understand this PR, but i can confirm that it fixes the original issue. |
|
Any chance that someone might pick up this PR? I think several users are waiting desperately for this fix. |
Further optimize fix potential IP overlap with node LB (method 2). - Refactor test codes - Add testServiceInvalidMount - No need to run removeNetworks() when task container startup failure. - Add removeNetworks() in Prepare() of daemon/cluster/executor/container/controller.go. - Bring over moby#41011 to make a integrated fix. Signed-off-by: Xinfeng Liu <xinfeng.liu@gmail.com>
Fixes docker/for-linux#888
This is a race condition issue. Before starting container, the code ensures the network exists on the node, then during creating endpoint (as part of steps in starting container), the network does not exist anymore then the error
failed to get network during CreateEndpointoccurs.So far I have known about 2 scenarios causing this problem:
During above short period of starting container, another container is stopped and at that time it is the last container on the network on the node, so the network is removed.
For a sample scenario, see my analysis at failed to get network during CreateEndpoint docker/for-linux#888 (comment)
Restarting a container that was started via
docker run --net <attachable network>See: failed to get network during CreateEndpoint docker/for-linux#888 (comment)
The reason is that this kind of container has a hiden correspondent swarkit task, after container is stopped, swarmkit will remove this task and call network-deletion to remove the network if it is the last container on this network on the node. The problem is that this network-deletion is asynchronizedly done by a goroutine which is different from container stop/start goroutine. This race condition causes
failed to get network during CreateEndpointduring container starting.Signed-off-by: Xinfeng Liu xinfeng.liu@gmail.com
- What I did
Fix 'failed to get network during CreateEndpoint' during container starting.
- How I did it
For scenario 1 above:
Change the error type to
libnetwork.ErrNoSuchNetwork, soStart()indaemon/cluster/executor/container/controller.gowill recreate the network.For scenario 2 above:
Move network-deletion logic to
tryDetachContainerFromClusterNetwork()indaemon/container_operations.go. So we make sure the network deleting is done before containerStop returns.- How to verify it
The issue is not easily recreated deliberately. I tested many times using
docker stack deployand 'docker service update --force` randomly, and could recreated the issue occasionaly.From the log of my test, I can verify my fix works as expected.
The container
test_web.1.6uo4tek9dcioirebj7iejl105ran into the "network not found" error, and the network got recreated.From
docker service psoutput at the same time, the task6uo4tek9dciodid not fail.Not sure if there's a need to add multiple retries in
createNetworks(). This issue should not have happen if the PR #40997 (comment) is merged.It can be recreated reliably. I will try adding tests in CI.
- Description for the changelog
Fix 'failed to get network during CreateEndpoint' during container starting.
- A picture of a cute animal (not mandatory but encouraged)