
Conversation

@YarekTyshchenko

Related to moby/moby#30321, swarm removes the network as the container
is shutting down, not honouring the `stop_grace_period`.

Signed-off-by: Yarek Tyshchenko <yarek.tyshchenko@awin.com>
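
For context, `stop_grace_period` is the per-service setting in the compose file that bounds how long Docker waits between sending the stop signal and killing the container. A minimal sketch of where it lives (service name and values are illustrative, not taken from this PR):

```yaml
version: "3.3"
services:
  web:
    image: example/web:latest     # hypothetical image
    # How long Docker waits after SIGTERM before sending SIGKILL.
    # The expectation in this PR is that connections stay routable
    # for up to this long while the old task drains.
    stop_grace_period: 30s
    deploy:
      replicas: 2
```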
@codecov-io

Codecov Report

❗ No coverage uploaded for pull request base (master@c592ee4).
The diff coverage is 14.28%.


@@            Coverage Diff            @@
##             master    #2074   +/-   ##
=========================================
  Coverage          ?   40.43%           
=========================================
  Files             ?      138           
  Lines             ?    22198           
  Branches          ?        0           
=========================================
  Hits              ?     8975           
  Misses            ?    11906           
  Partials          ?     1317
Impacted Files    Coverage Δ
sandbox.go        43.78% <0%> (ø)
endpoint.go       54.63% <50%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c592ee4...90c5343.

@abhi
Contributor

abhi commented Feb 9, 2018

@YarekTyshchenko we moved away from this approach to address the problem. Can you explain your use case?

@YarekTyshchenko
Author

@abhi The issue we are seeing is that connections get cut during swarm updates. This is what I think is happening:

  1. Imagine a container serving requests that take around 2 seconds each to complete, arriving continuously at a rate of tens of requests per second. The webserver is threaded/event-based and is able to handle this load.
  2. On a swarm update, a new container is started and swarm redirects new connections to it.
  3. The old container is still processing several requests, which will finish in a few seconds.
  4. Swarm issues a shutdown on the old container and, at the same time, seems to cut all current connections to it.

The handful of connections that were being processed while the switchover happened fail, but what we expect is that connections won't be cut until the old container actually shuts down. The timeout for container shutdown honours `stop_grace_period`, so that gives an upper bound on how long an update can possibly take.

I want to mention that this has nothing to do with what's happening inside the container, because even if we trap the stop signal and do nothing, the connections still get cut off.
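
To illustrate the point, here is a minimal hypothetical sketch (not code from this PR) of a Go HTTP server that traps the stop signal and drains in-flight requests before exiting; the report above is that established connections are cut at the network level even with handling like this inside the container:

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(2 * time.Second) // simulate a request that takes ~2 seconds
		w.Write([]byte("done\n"))
	})

	srv := &http.Server{Addr: ":8080", Handler: mux}

	// Trap the stop signal instead of exiting immediately.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()

	<-stop
	// Drain in-flight requests within the stop grace period.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("shutdown: %v", err)
	}
}
```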

@abhi
Contributor

abhi commented Feb 9, 2018

@YarekTyshchenko This was the original change I had proposed, and we have seen a lot of race conditions with this approach.
In the case of swarm updates, the service is disabled on the old container and it is shut down 2 seconds later as part of the rolling update. Also, disabling the service is done from the top in moby/moby and we would ideally like to keep it that way. Both enabling and disabling of a service will be controlled by the swarm agent (docker daemon) and libnetwork will not be doing it explicitly.

@YarekTyshchenko
Author

@abhi Hmm, I see. It was one of your patches that I took this change from; it fixes the issue, but I understand that this isn't the approach now, so it may be better to close this.

What is your view on how this problem should be addressed? I'm thinking that this is more serious than people realise, as it doesn't show up in benchmarks where requests return instantly, which means that as soon as people deploy this with real workloads they will start seeing failures.

> Both enabling and disabling of a service will be controlled by the swarm agent (docker daemon) and libnetwork will not be doing it explicitly.

As the swarm router is acting as a load balancer, it needs to understand three states for each backend: Enabled, Draining, Disabled.
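
A rough sketch of what that would look like as per-backend state (hypothetical types, not libnetwork code):

```go
// BackendState is the state a load-balancer backend would move through
// during a rolling update under the suggestion above.
type BackendState int

const (
	Enabled  BackendState = iota // receives new and existing connections
	Draining                     // keeps existing connections, gets no new ones
	Disabled                     // removed from the load balancer entirely
)
```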

Perhaps there is a way to get moby to do the idiomatic thing here and set containers to drain during shutdown. We can't be the only ones facing this problem?

Thanks for looking at this.

@abhi
Contributor

abhi commented Feb 12, 2018

@YarekTyshchenko we do recognize this is an issue. With the current design we will probably end up breaking one of the scenarios. Thank you for bringing this up. We will post an update when we have a cleaner solution that covers all three scenarios. Stay tuned.

@fcrisciani

@YarekTyshchenko just to give you an update on this. The issue is in IPVS behavior: when the service is removed, it looks like IPVS stops forwarding packets even for already established connections.
@ctelfer is working on a patch that will guarantee proper use of IPVS: it will down-weight the backend that is going to be disabled so that no new connections land there, and will remove it from the load balancer on the sbLeave. Can we close this PR?
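
For reference, the down-weighting idea corresponds to setting a real server's weight to 0 in IPVS, which stops new connections from being scheduled to it while established ones keep flowing; removal happens only afterwards. A rough illustration with the `ipvsadm` CLI (addresses are made up, and the actual patch would presumably go through libnetwork's IPVS bindings rather than the CLI):

```sh
# Quiesce the backend: weight 0 means no new connections are scheduled to it,
# while existing connections continue to be forwarded.
ipvsadm -e -t 10.0.0.2:80 -r 10.32.0.5:80 -m -w 0

# Later, once the old container has actually shut down, remove it entirely.
ipvsadm -d -t 10.0.0.2:80 -r 10.32.0.5:80
```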

@YarekTyshchenko
Author

@fcrisciani Fantastic! I'm looking forward to testing the patch.
