
Conversation

@YarekTyshchenko

Related to moby/moby#30321, swarm removes the network as the container
is shutting down, not honouring the `stop_grace_period`.

Signed-off-by: Yarek Tyshchenko <yarek.tyshchenko@awin.com>
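
For context, `stop_grace_period` is the per-service setting in the compose file that bounds how long Docker waits between sending the stop signal and killing the container. A minimal sketch of where it lives (service name and values are illustrative, not taken from this PR):

```yaml
version: "3.3"
services:
  web:
    image: example/web:latest     # hypothetical image
    # How long Docker waits after SIGTERM before sending SIGKILL.
    # The expectation in this PR is that connections stay routable
    # for up to this long while the old task drains.
    stop_grace_period: 30s
    deploy:
      replicas: 2
```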
@codecov-io

Codecov Report

❗ No coverage uploaded for pull request base (master@c592ee4).
The diff coverage is 14.28%.


@@            Coverage Diff            @@
##             master    #2074   +/-   ##
=========================================
  Coverage          ?   40.43%           
=========================================
  Files             ?      138           
  Lines             ?    22198           
  Branches          ?        0           
=========================================
  Hits              ?     8975           
  Misses            ?    11906           
  Partials          ?     1317
Impacted Files    Coverage Δ
sandbox.go        43.78% <0%> (ø)
endpoint.go       54.63% <50%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c592ee4...90c5343.

@abhi
Contributor

abhi commented Feb 9, 2018

@YarekTyshchenko we moved away from this approach to address the problem. Can you explain your use case?

@YarekTyshchenko
Author

@abhi The issue we are seeing is that connections get cut during swarm updates. This is what I think is happening:

  1. Imagine a container serving requests that take around 2 seconds each to complete, arriving continuously at a rate of tens of requests per second. The webserver is threaded/event-based and is able to handle this load.
  2. On a swarm update, a new container is started and swarm redirects new connections to it.
  3. The old container is still processing several requests, which will finish in a few seconds.
  4. Swarm issues a shutdown on the old container and, at the same time, seems to cut all current connections to it.

The handful of connections that were being processed while the switchover happened fail, but what we expect is that connections won't be cut until the old container actually shuts down. The timeout for container shutdown honours `stop_grace_period`, so that gives an upper bound on how long an update can possibly take.

I want to mention that this has nothing to do with what's happening inside the container, because even if we trap the stop signal and do nothing, the connections still get cut off.
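
To illustrate the point, here is a minimal hypothetical sketch (not code from this PR) of a Go HTTP server that traps the stop signal and drains in-flight requests before exiting; the report above is that established connections are cut at the network level even with handling like this inside the container:

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(2 * time.Second) // simulate a request that takes ~2 seconds
		w.Write([]byte("done\n"))
	})

	srv := &http.Server{Addr: ":8080", Handler: mux}

	// Trap the stop signal instead of exiting immediately.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()

	<-stop
	// Drain in-flight requests within the stop grace period.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("shutdown: %v", err)
	}
}
```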

@abhi
Contributor

abhi commented Feb 9, 2018

@YarekTyshchenko This was the original change I had proposed, and we have seen a lot of race conditions with this approach.
In the case of swarm updates, the service is disabled on the old container and it is shut down 2 seconds later as part of the rolling update. Also, disabling the service is done from the top in moby/moby and we would ideally like to keep it that way. Both enabling and disabling of a service will be controlled by the swarm agent (docker daemon) and libnetwork will not be doing it explicitly.

@YarekTyshchenko
Author

@abhi Hmm, I see. It was one of your patches that I took this change from; it fixes the issue, but I understand that this isn't the approach now, so it may be better to close this.

What is your view on how this problem should be addressed? I'm thinking that this is more serious than people realise, as it doesn't show up in benchmarks where requests return instantly, which means that as soon as people deploy this with real workloads they will start seeing failures.

> Both enabling and disabling of a service will be controlled by the swarm agent (docker daemon) and libnetwork will not be doing it explicitly.

As the swarm router is acting as a load balancer, it needs to understand three states for each backend: Enabled, Draining, Disabled.
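
A rough sketch of what that would look like as per-backend state (hypothetical types, not libnetwork code):

```go
// BackendState is the state a load-balancer backend would move through
// during a rolling update under the suggestion above.
type BackendState int

const (
	Enabled  BackendState = iota // receives new and existing connections
	Draining                     // keeps existing connections, gets no new ones
	Disabled                     // removed from the load balancer entirely
)
```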

Perhaps there is a way to get moby to do the idiomatic thing here and set containers to drain during shutdown. We can't be the only ones facing this problem?

Thanks for looking at this.

@abhi
Contributor

abhi commented Feb 12, 2018

@YarekTyshchenko we do recognize this is an issue. With the current design we will probably end up breaking one of the scenarios. Thank you for bringing this up. We will post an update when we have a cleaner solution that covers all three scenarios. Stay tuned.

@fcrisciani

@YarekTyshchenko just to give you an update on this. The issue is in IPVS behavior: when the service is removed, it looks like IPVS stops forwarding packets even for already established connections.
@ctelfer is working on a patch that will guarantee proper use of IPVS: it will down-weight the backend that is going to be disabled so that no new connections land there, and will remove it from the load balancer on the sbLeave. Can we close this PR?
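
For reference, the down-weighting idea corresponds to setting a real server's weight to 0 in IPVS, which stops new connections from being scheduled to it while established ones keep flowing; removal happens only afterwards. A rough illustration with the `ipvsadm` CLI (addresses are made up, and the actual patch would presumably go through libnetwork's IPVS bindings rather than the CLI):

```sh
# Quiesce the backend: weight 0 means no new connections are scheduled to it,
# while existing connections continue to be forwarded.
ipvsadm -e -t 10.0.0.2:80 -r 10.32.0.5:80 -m -w 0

# Later, once the old container has actually shut down, remove it entirely.
ipvsadm -d -t 10.0.0.2:80 -r 10.32.0.5:80
```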

@YarekTyshchenko
Author

@fcrisciani Fantastic! I'm looking forward to testing the patch.
