
feat: Gracefully drain listeners before envoy shutdown on pod termination#2633

Merged
arkodg merged 34 commits into envoyproxy:main from davidalger:algerdev/graceful-pod-termination
Feb 27, 2024

Conversation

@davidalger davidalger commented Feb 16, 2024

What this PR does:

This PR is a step in the direction of achieving a hitless upgrade. Some challenges to achieving this have been outlined by @guydc in #1712 (comment) and #2610

This proposed implementation of a PreStop lifecycle hook will allow Envoy proxies to gracefully shut down when pods are terminated by Kubernetes. This is done by:

a) Initiating a graceful drain sequence via the drain_listeners?graceful admin endpoint.

b) Sleeping for a minimum drain duration of 15 seconds, allowing ample time for endpoint deprogramming to occur.

c) Monitoring the total.server_connections statistic and exiting once the total connection count reaches 0.

d) Timing out after 600 seconds, after which Envoy will be sent SIGTERM by Kubernetes regardless of how many open connections remain. This timeout lines up with the default value for --drain-time-s.

e) Configuring terminationGracePeriodSeconds to 900 seconds. This provides ample time for Envoy to cleanly exit after connections have been drained.

Some prior art w.r.t graceful shutdown for Envoy:

Summary of why we need this:

  • Kubernetes uses SIGTERM to signal processes to gracefully shut down; however, Envoy does not support gracefully draining connections during shutdowns triggered by SIGTERM.
  • Endpoint deprogramming is eventually consistent and the time this takes varies somewhat based on the k8s implementation and size of the cluster.

Caveats:

When TCP listeners are used (rather than HTTP listeners), connections will not drain, per the note below from the draining architecture documentation. This means that in the event of long-lived TCP connections, Envoy proxies will simply wait in the PreStop hook for the 600 second timeout to occur, after which these connections will be closed when Envoy receives the SIGTERM from Kubernetes.

Note that although draining is a per-listener concept, it must be supported at the network filter level. Currently the only filters that support graceful draining are Redis, Mongo, Thrift (if the envoy.reloadable_features.thrift_connection_draining runtime feature is enabled), and HTTP connection manager.

Reference: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/operations/draining

@davidalger davidalger requested a review from a team as a code owner February 16, 2024 21:40
@davidalger (Contributor Author):
Testing / Load generator with unaltered EG quickstart.yaml:

run from another pod within the cluster

hey -c 100 -q 10 -z 300s -host www.example.com http://172.18.255.200/
PreStop logs from rolling restart of single replica deployment
$ kubectl logs -n envoy-gateway-system deploy/envoy-default-eg-e41e7b31 -f | grep -v start_time
Found 2 pods, using pod/envoy-default-eg-e41e7b31-85c848c76b-w8p5t
[2024-02-16 21:23:58.508][1][warning][main] [source/server/server.cc:923] There is no configured limit to the number of allowed active downstream connections. Configure a limit in `envoy.resource_monitors.downstream_connections` resource monitor.
[2024-02-16 21:24:21.965][37][info][PreStop] initiating graceful drain with 15 second minimum drain period and 600 second timeout
[2024-02-16 21:24:21.969][37][info][PreStop] total connections: 100
[2024-02-16 21:24:22.974][37][info][PreStop] total connections: 100
[2024-02-16 21:24:23.978][37][info][PreStop] total connections: 97
[2024-02-16 21:24:24.983][37][info][PreStop] total connections: 97
[2024-02-16 21:24:25.988][37][info][PreStop] total connections: 97
[2024-02-16 21:24:26.992][37][info][PreStop] total connections: 97
[2024-02-16 21:24:27.997][37][info][PreStop] total connections: 97
[2024-02-16 21:24:29.001][37][info][PreStop] total connections: 64
[2024-02-16 21:24:30.005][37][info][PreStop] total connections: 64
[2024-02-16 21:24:31.010][37][info][PreStop] total connections: 64
[2024-02-16 21:24:32.014][37][info][PreStop] total connections: 64
[2024-02-16 21:24:33.018][37][info][PreStop] total connections: 64
[2024-02-16 21:24:34.023][37][info][PreStop] total connections: 29
[2024-02-16 21:24:35.027][37][info][PreStop] total connections: 29
[2024-02-16 21:24:36.032][37][info][PreStop] total connections: 29
[2024-02-16 21:24:37.036][37][info][PreStop] total connections: 29
[2024-02-16 21:24:37.037][37][info][PreStop] minimum drain period reached; allowed to exit when total connections reaches zero
[2024-02-16 21:24:38.042][37][info][PreStop] total connections: 29
[2024-02-16 21:24:39.046][37][info][PreStop] total connections: 7
[2024-02-16 21:24:40.050][37][info][PreStop] total connections: 7
[2024-02-16 21:24:41.055][37][info][PreStop] total connections: 7
[2024-02-16 21:24:42.059][37][info][PreStop] total connections: 7
[2024-02-16 21:24:43.064][37][info][PreStop] total connections: 7
[2024-02-16 21:24:44.068][37][info][PreStop] total connections: 1
[2024-02-16 21:24:45.073][37][info][PreStop] total connections: 1
[2024-02-16 21:24:46.077][37][info][PreStop] total connections: 1
[2024-02-16 21:24:47.082][37][info][PreStop] total connections: 1
[2024-02-16 21:24:48.086][37][info][PreStop] total connections: 1
[2024-02-16 21:24:49.091][37][info][PreStop] total connections: 0
[2024-02-16 21:24:49.096][1][warning][main] [source/server/server.cc:895] caught ENVOY_SIGTERM
[2024-02-16 21:24:49.100][1][warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:155] DeltaAggregatedResources gRPC config stream to xds_cluster closed: 13, 

After restarting the single replica deployment, Envoy was scaled up to 5 replicas and restarted twice with the hey load generator continuing to run.

load generator output demonstrating 100% success responses for the test
$ hey -c 100 -q 10 -z 300s -host www.example.com http://172.18.255.200/

Summary:
  Total:	300.0754 secs
  Slowest:	0.4689 secs
  Fastest:	0.0012 secs
  Average:	0.1215 secs
  Requests/sec:	812.2893
  
  Total data:	121874000 bytes
  Size/request:	500 bytes

Response time histogram:
  0.001 [1]	|
  0.048 [3263]	|■
  0.095 [4661]	|■
  0.142 [222708]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.188 [4640]	|■
  0.235 [6894]	|■
  0.282 [990]	|
  0.329 [370]	|
  0.375 [141]	|
  0.422 [66]	|
  0.469 [14]	|


Latency distribution:
  10% in 0.1130 secs
  25% in 0.1168 secs
  50% in 0.1196 secs
  75% in 0.1224 secs
  90% in 0.1264 secs
  95% in 0.1520 secs
  99% in 0.2142 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0000 secs, 0.0012 secs, 0.4689 secs
  DNS-lookup:	0.0000 secs, 0.0000 secs, 0.0000 secs
  req write:	0.0006 secs, 0.0000 secs, 0.1484 secs
  resp wait:	0.0015 secs, 0.0002 secs, 0.2476 secs
  resp read:	0.0594 secs, 0.0000 secs, 0.3291 secs

Status code distribution:
  [200]	243748 responses


@davidalger (Contributor Author):
Testing / Load generator with EG quickstart.yaml plus 1s delay:

run from another pod within the cluster

hey -c 20 -q 10 -z 300s -host www.example.com http://172.18.255.200/
fault-injection-delay simulating 1s response times
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: fault-injection-delay
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: backend
  faultInjection:
    delay:
      fixedDelay: 1s
PreStop logs from rolling restart of single replica deployment
$ kubectl logs -n envoy-gateway-system deploy/envoy-default-eg-e41e7b31 -f | grep -v start_time
Found 2 pods, using pod/envoy-default-eg-e41e7b31-7854bd848-qpd82
[2024-02-16 21:31:03.157][1][warning][main] [source/server/server.cc:923] There is no configured limit to the number of allowed active downstream connections. Configure a limit in `envoy.resource_monitors.downstream_connections` resource monitor.
[2024-02-16 21:33:14.274][37][info][PreStop] initiating graceful drain with 15 second minimum drain period and 600 second timeout
[2024-02-16 21:33:14.278][37][info][PreStop] total connections: 20
[2024-02-16 21:33:15.286][37][info][PreStop] total connections: 20
[2024-02-16 21:33:16.308][37][info][PreStop] total connections: 20
[2024-02-16 21:33:17.327][37][info][PreStop] total connections: 20
[2024-02-16 21:33:18.347][37][info][PreStop] total connections: 19
[2024-02-16 21:33:19.353][37][info][PreStop] total connections: 19
[2024-02-16 21:33:20.374][37][info][PreStop] total connections: 19
[2024-02-16 21:33:21.395][37][info][PreStop] total connections: 19
[2024-02-16 21:33:22.405][37][info][PreStop] total connections: 19
[2024-02-16 21:33:23.426][37][info][PreStop] total connections: 19
[2024-02-16 21:33:24.443][37][info][PreStop] total connections: 19
[2024-02-16 21:33:25.484][37][info][PreStop] total connections: 19
[2024-02-16 21:33:26.498][37][info][PreStop] total connections: 19
[2024-02-16 21:33:27.517][37][info][PreStop] total connections: 19
[2024-02-16 21:33:28.539][37][info][PreStop] total connections: 16
[2024-02-16 21:33:29.568][37][info][PreStop] total connections: 16
[2024-02-16 21:33:30.594][37][info][PreStop] total connections: 16
[2024-02-16 21:33:30.598][37][info][PreStop] minimum drain period reached; allowed to exit when total connections reaches zero
[2024-02-16 21:33:31.608][37][info][PreStop] total connections: 16
[2024-02-16 21:33:32.629][37][info][PreStop] total connections: 16
[2024-02-16 21:33:33.656][37][info][PreStop] total connections: 16
[2024-02-16 21:33:34.681][37][info][PreStop] total connections: 16
[2024-02-16 21:33:35.707][37][info][PreStop] total connections: 16
[2024-02-16 21:33:36.729][37][info][PreStop] total connections: 16
[2024-02-16 21:33:37.848][37][info][PreStop] total connections: 16
[2024-02-16 21:33:38.862][37][info][PreStop] total connections: 14
[2024-02-16 21:33:39.889][37][info][PreStop] total connections: 14
[2024-02-16 21:33:40.900][37][info][PreStop] total connections: 14
[2024-02-16 21:33:41.914][37][info][PreStop] total connections: 14
[2024-02-16 21:33:42.926][37][info][PreStop] total connections: 14
[2024-02-16 21:33:43.943][37][info][PreStop] total connections: 11
[2024-02-16 21:33:44.948][37][info][PreStop] total connections: 11
[2024-02-16 21:33:45.958][37][info][PreStop] total connections: 11
[2024-02-16 21:33:46.969][37][info][PreStop] total connections: 11
[2024-02-16 21:33:47.992][37][info][PreStop] total connections: 11
[2024-02-16 21:33:49.007][37][info][PreStop] total connections: 10
[2024-02-16 21:33:50.026][37][info][PreStop] total connections: 10
[2024-02-16 21:33:51.048][37][info][PreStop] total connections: 10
[2024-02-16 21:33:52.055][37][info][PreStop] total connections: 10
[2024-02-16 21:33:53.068][37][info][PreStop] total connections: 10
[2024-02-16 21:33:54.094][37][info][PreStop] total connections: 8
[2024-02-16 21:33:55.101][37][info][PreStop] total connections: 8
[2024-02-16 21:33:56.109][37][info][PreStop] total connections: 8
[2024-02-16 21:33:57.121][37][info][PreStop] total connections: 8
[2024-02-16 21:33:58.134][37][info][PreStop] total connections: 8
[2024-02-16 21:33:59.147][37][info][PreStop] total connections: 7
[2024-02-16 21:34:00.166][37][info][PreStop] total connections: 7
[2024-02-16 21:34:01.184][37][info][PreStop] total connections: 7
[2024-02-16 21:34:02.206][37][info][PreStop] total connections: 7
[2024-02-16 21:34:03.328][37][info][PreStop] total connections: 7
[2024-02-16 21:34:04.341][37][info][PreStop] total connections: 4
[2024-02-16 21:34:05.462][37][info][PreStop] total connections: 4
[2024-02-16 21:34:06.505][37][info][PreStop] total connections: 4
[2024-02-16 21:34:07.625][37][info][PreStop] total connections: 4
[2024-02-16 21:34:08.652][37][info][PreStop] total connections: 1
[2024-02-16 21:34:09.678][37][info][PreStop] total connections: 1
[2024-02-16 21:34:10.687][37][info][PreStop] total connections: 1
[2024-02-16 21:34:11.710][37][info][PreStop] total connections: 1
[2024-02-16 21:34:12.731][37][info][PreStop] total connections: 1
[2024-02-16 21:34:13.860][37][info][PreStop] total connections: 0
[2024-02-16 21:34:13.880][1][warning][main] [source/server/server.cc:895] caught ENVOY_SIGTERM
[2024-02-16 21:34:13.886][1][warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:155] DeltaAggregatedResources gRPC config stream to xds_cluster closed: 13, 

After restarting the single replica deployment, Envoy was scaled up to 5 replicas and restarted twice with the hey load generator continuing to run.

load generator output demonstrating 100% success responses for the test
hey -c 20 -q 10 -z 300s -host www.example.com http://172.18.255.200/

Summary:
  Total:	300.7588 secs
  Slowest:	1.1698 secs
  Fastest:	0.9969 secs
  Average:	1.0085 secs
  Requests/sec:	19.8165
  
  Total data:	2980000 bytes
  Size/request:	500 bytes

Response time histogram:
  0.997 [1]	|
  1.014 [5073]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  1.031 [798]	|■■■■■■
  1.049 [65]	|■
  1.066 [3]	|
  1.083 [0]	|
  1.101 [0]	|
  1.118 [0]	|
  1.135 [1]	|
  1.152 [9]	|
  1.170 [10]	|


Latency distribution:
  10% in 1.0017 secs
  25% in 1.0033 secs
  50% in 1.0061 secs
  75% in 1.0110 secs
  90% in 1.0168 secs
  95% in 1.0216 secs
  99% in 1.0363 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0003 secs, 0.9969 secs, 1.1698 secs
  DNS-lookup:	0.0000 secs, 0.0000 secs, 0.0000 secs
  req write:	0.0004 secs, 0.0000 secs, 0.0308 secs
  resp wait:	1.0035 secs, 0.9960 secs, 1.0380 secs
  resp read:	0.0020 secs, 0.0000 secs, 0.0292 secs

Status code distribution:
  [200]	5960 responses

@davidalger (Contributor Author):
@arkodg @guydc I would like thoughts from both of you on the proposed implementation and approach demonstrated here, and also on whether these defaults are sufficient or need knobs to adjust:

  • minimum drain duration (currently 15s)
  • graceful drain timeout (currently 600s)
  • terminationGracePeriodSeconds (currently 900s)
  • total.server_connections exit threshold (currently 0)

…tion

Signed-off-by: David Alger <davidmalger@gmail.com>
@davidalger davidalger force-pushed the algerdev/graceful-pod-termination branch from e2c7dc7 to b4baa8f Compare February 16, 2024 22:02
@guydc (Contributor) commented Feb 16, 2024

This lines up with the default value for --parent-shutdown-time-s

Is there any significance to using the same value here, if we're not using the envoy hot reload feature?

codecov bot commented Feb 16, 2024

Codecov Report

Attention: Patch coverage is 93.56725%, with 11 lines in your changes missing coverage. Please review.

Project coverage is 63.23%. Comparing base (72fadb7) to head (afcbf23).

Files Patch % Lines
internal/cmd/envoy.go 86.36% 4 Missing and 2 partials ⚠️
...ternal/infrastructure/kubernetes/proxy/resource.go 96.59% 2 Missing and 1 partial ⚠️
internal/logging/log.go 88.23% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2633      +/-   ##
==========================================
+ Coverage   63.00%   63.23%   +0.22%     
==========================================
  Files         122      123       +1     
  Lines       19770    19922     +152     
==========================================
+ Hits        12457    12598     +141     
- Misses       6506     6513       +7     
- Partials      807      811       +4     


davidalger commented Feb 16, 2024

This lines up with the default value for --parent-shutdown-time-s

Is there any significance to using the same value here, if we're not using the envoy hot reload feature?

@guydc No particular significance to the value being the same in the context of k8s, but it is why I chose 5 minutes greater than the drain timeout, since the docs state the drain time should be less than the parent shutdown time set via the --parent-shutdown-time-s option.

The important bit is that terminationGracePeriodSeconds must be higher than the graceful drain timeout so that k8s will send a SIGTERM instead of a SIGKILL if the drain runs up the clock against the full 600 second drain timeout.
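The relationship between these durations can be stated as a simple invariant. The values below are the defaults discussed in this thread; the constant names are illustrative, not the PR's actual identifiers:

```go
package main

import "fmt"

// Defaults from this PR (names are illustrative, values are the PR's):
const (
	minDrainSeconds            = 15  // minimum drain period
	drainTimeoutSeconds        = 600 // matches Envoy's default --drain-time-s
	terminationGracePeriodSecs = 900 // must exceed drainTimeoutSeconds
)

func main() {
	// If the grace period were <= the drain timeout, kubelet would SIGKILL
	// the pod while the PreStop hook was still waiting on open connections;
	// keeping it higher guarantees a SIGTERM and a clean exit instead.
	fmt.Println("headroom between drain timeout and SIGKILL:",
		terminationGracePeriodSecs-drainTimeoutSeconds, "seconds")
}
```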

return &corev1.Lifecycle{
	PreStop: &corev1.LifecycleHandler{
		Exec: &corev1.ExecAction{
			Command: []string{"/bin/bash", "-c", strings.ReplaceAll(strings.TrimSpace(dedent.Dedent(`
Contributor:
is this portable to distroless images?

@davidalger (Contributor Author) commented Feb 16, 2024
That's actually a really good callout, as I'm pretty sure it won't work in any pre-built distroless image. It probably won't break anything, but it won't do anything either, since the exec will fail and k8s will proceed with an immediate SIGTERM to Envoy. So now I'm having a sudden realization that v0.6.0 started using the distroless image, but the dev builds (which I tested with) don't. 🤦🏻

The options are more limited with distroless images, since they typically don't include extra binaries such as bash, cat, or awk, very likely leaving the only real viable long-term option: building our own Envoy proxy images to include something like a wrapper process or a static executable to manage the termination process.

There are other approaches that could work without building our own proxy image, but some would tend to be fragile or would require exposing the admin port. It's possible a sidecar container along with an HTTP preStop hook could work, but it will definitely take a bit more effort to set up.

@arkodg (Contributor) commented Feb 17, 2024

can we implement a) and b) using the HTTP and Sleep preStop hooks?
https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/#hook-handler-implementations
we may need to convert the preStop.httpGet into a POST using envoy itself :)

@davidalger (Contributor Author) commented Feb 17, 2024

There is unfortunately no way to accomplish this leveraging only the k8s sleep and http lifecycle hooks:

  • HTTP requests cannot be made to a port bound to 127.0.0.1 (I've tried this), so requests to the Envoy admin port would require it to be exposed, which would pose a security risk.
  • Envoy doesn't exit after a graceful drain, since the process was designed for hot reloading. It would always run out the clock on the terminationGracePeriodSeconds even if there were no open connections.
  • Sleep preStop actions are too new and cannot be used even to just introduce a basic X second sleep for deprogramming, as they won't reach beta until k8s v1.30.

Here is what I'm going to propose as a path forward:

  • Implement a sub-command in Envoy Gateway called envoy shutdown which will execute the graceful drain and block until both the min-drain-period has elapsed and open connections have closed (similar to what the bash script does currently)
  • Implement a sub-command in Envoy Gateway called envoy shutdown-manager which exposes an HTTP endpoint on port 19002 and responds to /shutdown/ready, blocking until the graceful drain initiated by the envoy shutdown sub-command has completed
  • Add a shutdown-manager container to the Envoy proxy pod running the envoy shutdown-manager http server
  • An HTTP preStop hook on the envoy container requesting /shutdown/ready on port 19002
  • An exec preStop hook on the shutdown-manager container which executes the envoy shutdown sub-command

When the pod gets a request to shut down, the preStop hook will be triggered on both the envoy and shutdown-manager containers simultaneously. The preStop on the shutdown-manager container will execute the graceful drain, and the preStop on the envoy container will block until the drain is complete. Each of the containers will receive the SIGTERM from k8s as normal when the preStop hook on the container completes.

I've mocked this up as a proof-of-concept to test the setup. Unless there are objections, I will plan on re-implementing the drain process per the above outline, which is very similar to what Contour did previously and appears it will work based on my testing.
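The shutdown-manager side of this proposal can be sketched as follows. Port 19002 and the /shutdown/ready path come from the outline above; the function names and channel-based signaling are an illustrative mock-up, not the PR's actual code:

```go
package main

import (
	"fmt"
	"net/http"
)

// waitReady blocks until the drain initiated by the `envoy shutdown`
// sub-command signals completion, then reports readiness.
func waitReady(drained <-chan struct{}) string {
	<-drained
	return "shutdown ready"
}

// newShutdownManager wires /shutdown/ready so the envoy container's HTTP
// preStop hook blocks until the shutdown-manager's exec preStop finishes
// draining connections.
func newShutdownManager(drained <-chan struct{}) *http.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/shutdown/ready", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, waitReady(drained))
	})
	return &http.Server{Addr: ":19002", Handler: mux}
}

func main() {
	drained := make(chan struct{})
	close(drained) // in the real flow, closed when the drain sequence completes
	srv := newShutdownManager(drained)
	fmt.Println("shutdown-manager would listen on", srv.Addr)
	// srv.ListenAndServe() // omitted so the sketch runs standalone
}
```

Blocking inside the handler is what keeps the envoy container alive: kubelet sends SIGTERM only after the HTTP preStop request returns.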

Contributor:

hey @davidalger, exposing /shutdown on envoy has the same security risk; if it's exposed on 0.0.0.0, it can be called by anyone.

@arkodg (Contributor) commented Feb 20, 2024
@davidalger (Contributor Author):

does preStop.httpGet.host = localhost work ?

Tried this. It fails with a PreStopHook failed warning added to the pod status and the container exits without allowing the drain to occur. I'm fairly certain this is because the request is coming from kubelet, which isn't local to the pod and thus can't actually make a request to the loopback.

exposing /shutdown on envoy has the same security risk; if it's exposed on 0.0.0.0, it can be called by anyone.

This is mitigated by having the /shutdown/ready endpoint largely be a dumb wait that does nothing more than block envoy container shutdown until the drain logic in the exec preStop hook on the shutdown-manager container finishes.

There really are two reasons why the actual drain logic is in an exec to a sub-command on the eg binary:

a) The security aspect, since HTTP preStop hooks must be exposed to the rest of the cluster.

b) The sub-command will be useful for a local (i.e. any non-k8s) EG deployment, if such is implemented in the future.

I've pushed up my work in progress for those that may like to take a look. There are still some things to do to finish it up, but it's functional. Here are a couple of example log sets from restarts under varying load, with no failed requests:

example 1 logs from shutdown-manager container
2024-02-21T14:47:39.037Z    INFO    shutdown-manager    envoy/shutdown_manager.go:72    starting shutdown manager
2024-02-21T14:47:55.365Z    INFO    shutdown-manager    envoy/shutdown_manager.go:86    received shutdown ready request
2024-02-21T14:47:55.398Z    INFO    shutdown-manager    envoy/shutdown_manager.go:115    initiating graceful drain with 15 second minimum drain period and 600 second timeout
2024-02-21T14:47:55.399Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 10
2024-02-21T14:47:56.401Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 10
2024-02-21T14:47:57.402Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 10
2024-02-21T14:47:58.404Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 10
2024-02-21T14:47:59.408Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 9
2024-02-21T14:48:00.412Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 9
2024-02-21T14:48:01.414Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 9
2024-02-21T14:48:02.421Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 9
2024-02-21T14:48:03.427Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 9
2024-02-21T14:48:04.428Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 5
2024-02-21T14:48:05.438Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 5
2024-02-21T14:48:06.441Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 5
2024-02-21T14:48:07.443Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 5
2024-02-21T14:48:08.454Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 5
2024-02-21T14:48:09.461Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 0
2024-02-21T14:48:10.464Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 0
2024-02-21T14:48:10.464Z    INFO    shutdown-manager    envoy/shutdown_manager.go:140    minimum drain period reached; will exit when total connections is 0
2024-02-21T14:48:10.464Z    INFO    shutdown-manager    envoy/shutdown_manager.go:148    graceful drain sequence completed
2024-02-21T14:48:10.473Z    INFO    shutdown-manager    envoy/shutdown_manager.go:62    received SIGTERM
2024-02-21T14:48:11.386Z    INFO    shutdown-manager    envoy/shutdown_manager.go:97    shutdown readiness detected
Stream closed EOF for envoy-gateway-system/envoy-default-eg-e41e7b31-65bb79b45-mkgdt (shutdown-manager)
example 2 logs from shutdown-manager container
2024-02-21T14:47:55.887Z    INFO    shutdown-manager    envoy/shutdown_manager.go:72    starting shutdown manager
2024-02-21T14:48:56.058Z    INFO    shutdown-manager    envoy/shutdown_manager.go:86    received shutdown ready request
2024-02-21T14:48:56.087Z    INFO    shutdown-manager    envoy/shutdown_manager.go:115    initiating graceful drain with 15 second minimum drain period and 600 second timeout
2024-02-21T14:48:56.089Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 100
2024-02-21T14:48:57.089Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 100
2024-02-21T14:48:58.090Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 100
2024-02-21T14:48:59.091Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 100
2024-02-21T14:49:00.091Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 100
2024-02-21T14:49:01.093Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 74
2024-02-21T14:49:02.094Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 74
2024-02-21T14:49:03.094Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 74
2024-02-21T14:49:04.096Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 74
2024-02-21T14:49:05.096Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 74
2024-02-21T14:49:06.097Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 33
2024-02-21T14:49:07.099Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 33
2024-02-21T14:49:08.100Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 33
2024-02-21T14:49:09.101Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 33
2024-02-21T14:49:10.102Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 33
2024-02-21T14:49:11.103Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 11
2024-02-21T14:49:11.103Z    INFO    shutdown-manager    envoy/shutdown_manager.go:140    minimum drain period reached; will exit when total connections is 0
2024-02-21T14:49:12.105Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 11
2024-02-21T14:49:13.106Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 11
2024-02-21T14:49:14.106Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 11
2024-02-21T14:49:15.107Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 11
2024-02-21T14:49:16.108Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 1
2024-02-21T14:49:17.109Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 1
2024-02-21T14:49:18.111Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 1
2024-02-21T14:49:19.111Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 1
2024-02-21T14:49:20.112Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 1
2024-02-21T14:49:21.113Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 1
2024-02-21T14:49:22.114Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 1
2024-02-21T14:49:23.115Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 1
2024-02-21T14:49:24.116Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 1
2024-02-21T14:49:25.117Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 1
2024-02-21T14:49:26.119Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 1
2024-02-21T14:49:27.120Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 1
2024-02-21T14:49:28.121Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 1
2024-02-21T14:49:29.122Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 1
2024-02-21T14:49:30.124Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 1
2024-02-21T14:49:31.133Z    INFO    shutdown-manager    envoy/shutdown_manager.go:196    total connections: 0
2024-02-21T14:49:31.133Z    INFO    shutdown-manager    envoy/shutdown_manager.go:148    graceful drain sequence completed
2024-02-21T14:49:31.148Z    INFO    shutdown-manager    envoy/shutdown_manager.go:62    received SIGTERM
2024-02-21T14:49:32.080Z    INFO    shutdown-manager    envoy/shutdown_manager.go:97    shutdown readiness detected
Stream closed EOF for envoy-gateway-system/envoy-default-eg-e41e7b31-65bb79b45-crxbp (shutdown-manager)

@davidalger davidalger marked this pull request as draft February 21, 2024 15:12
@davidalger (Contributor Author):

/retest

},
},
{
Name: "shutdown-manager",
Contributor:

should this be a sidecar container?

  • this will ensure that preStop for shutdown-manager runs before envoy
  • building off this:
    • preStop exec shutdown performs a touch on the tmp file and exits
    • preStop exec /shutdown/ready performs the envoy shutdown procedure and blocks until complete
    • /shutdown/ready can return a non-200 error code immediately if the tmp file doesn't exist

@davidalger (Contributor Author) commented Feb 21, 2024

Not sure of the shutdown sequencing, as I don't see much about it in the announcement blog post or on the docs page other than indications that pod termination is not blocked by sidecar containers.

Update: Found it under pod lifecycle docs:

Beginning with Kubernetes 1.29, if your Pod includes one or more sidecar containers (init containers with an Always restart policy), the kubelet will delay sending the TERM signal to these sidecar containers until the last main container has fully terminated. The sidecar containers will be terminated in the reverse order they are defined in the Pod spec. This ensures that sidecar containers continue serving the other containers in the Pod until they are no longer needed.

I could look into this if we are OK with dropping support for 1.28, where sidecar containers are feature-gated, only being enabled by default in 1.29. Not sure I'd recommend this, though, with 1.29 not yet available as stable/GA on AKS or GKE.

Contributor

ah good point, we can't rely on order yet, since sidecars just GA'd

then I propose /shutdown/ready sleeps for 1s (ensuring that, in case of termination, the other preStop has had the chance to create the tmp file) before checking whether the tmp file exists.
If it doesn't, return 404; else run the shutdown steps and return 200.
This eliminates the need for the `for` loop within the HTTP handler.

Contributor

at a high level, big +1 with this approach

Contributor Author

then I propose /shutdown/ready sleeps for 1s (ensuring that, in case of termination, the other preStop has had the chance to create the tmp file) before checking whether the tmp file exists.
If it doesn't, return 404; else run the shutdown steps and return 200.

Not sure I follow. As soon as the /shutdown/ready request returns (no matter the response code), kubelet will immediately send SIGTERM to PID 1 (envoy).

Was the concern with the `for` loop that it could never return? I've modified this in 4a3b90f to ensure requests don't endlessly poll after the ready-timeout, since http.TimeoutHandler of course wouldn't interrupt the processing, only break the connection.
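That last point about http.TimeoutHandler can be demonstrated in isolation. This sketch (my own illustration, not code from the PR) shows the client receiving a 503 while the wrapped handler's goroutine continues to completion:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// demoTimeout shows that http.TimeoutHandler only cuts off the response:
// the wrapped handler keeps running after the client has been sent 503.
func demoTimeout() (status int, handlerFinished bool) {
	done := make(chan struct{})
	slow := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(200 * time.Millisecond) // stand-in for a long drain poll
		close(done)
	})
	srv := httptest.NewServer(http.TimeoutHandler(slow, 50*time.Millisecond, "timed out"))
	defer srv.Close()

	resp, err := http.Get(srv.URL)
	if err != nil {
		panic(err)
	}
	io.Copy(io.Discard, resp.Body)
	resp.Body.Close()

	<-done // the slow handler still ran to completion
	return resp.StatusCode, true
}

func main() {
	status, finished := demoTimeout()
	fmt.Println(status, finished) // 503 true
}
```

This is why the handler itself has to observe the ready-timeout: cutting the connection alone would not stop the polling work.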

Contributor

the concerns I was trying to mitigate were

  • using more CPU with the `for` loop
  • allowing non-kubelet clients to hoard the shutdown handler's CPU

Since we cannot establish order between the envoy and shutdown-manager containers, I was proposing we use a delay (1s / 2s) to ensure that, for the preStop case, we can be confident the other preStop has signaled to us (using a file) that it's time to shut down.

For the non-preStop case, it would just be a delay + return 404

Contributor Author

Updated in b3dcffd

Contributor

@guydc guydc left a comment

Great work!


// Reconfigure logger to write to stdout of main process if running in Kubernetes
if _, k8s := os.LookupEnv("KUBERNETES_SERVICE_HOST"); k8s && os.Getpid() != 1 {
logger = logging.FileLogger("/proc/1/fd/1", "shutdown-manager", v1alpha1.LogLevelInfo)
Contributor

There's a discussion here about this option possibly not working when running the process as non-root.
Not sure if this concern is applicable in this case...

Contributor Author

writing directly to /proc/1/fd/1 will be problematic if the process drops root privileges but PID 1 is still running as root

I don't think it will apply in this case, as neither PID 1 nor the exec'd process runs as root, since the container sets the uid:gid in the build (see here)

// Setup HTTP handler
handler := http.NewServeMux()
handler.HandleFunc("/shutdown/ready", func(w http.ResponseWriter, r *http.Request) {
shutdownReadyHandler(w, r, ShutdownReadyFile)
Contributor

Can this time out before the handler actually detects shutdown readiness, due to the server timeouts and/or kubelet timeouts? If so, what would be the behavior?

Contributor Author

It has a configurable timeout (defaulting to 10s longer than the drain timeout) as a safeguard, but it won't time out prior to this per documented and observed behavior.

If this did return before the exec preStop completed the drain, Envoy would be sent a SIGTERM and exit, after which you may observe errors such as this one logged until the drain timeout elapses and the shutdown command exits:

shutdown-manager 2024-02-22T20:48:54.259Z    ERROR    shutdown-manager    envoy/shutdown_manager.go:145    error getting total connections    {"error": "Get \"http://127.0.0.1:19000//stats?filter=^server\\\\.total_connections$&format=json\": dial tcp 127.0.0.1:19000: connect: connection refused"}

This should never occur with properly configured (staggered) timeouts, so I would hesitate to code around it for an initial implementation, preferring to let the errors be logged to get eyes on the potential problem.
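For reference, the stats query in that error log returns JSON in Envoy's standard `/stats?format=json` shape, which the connection-count poll has to parse. A minimal sketch of that parsing (the function name is my own; the JSON shape is Envoy's documented stats format):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// statsResponse mirrors the shape of Envoy's /stats?format=json output.
type statsResponse struct {
	Stats []struct {
		Name  string `json:"name"`
		Value int    `json:"value"`
	} `json:"stats"`
}

// parseTotalConnections extracts server.total_connections from the body
// of /stats?filter=^server\.total_connections$&format=json.
func parseTotalConnections(body []byte) (int, error) {
	var sr statsResponse
	if err := json.Unmarshal(body, &sr); err != nil {
		return -1, err
	}
	for _, s := range sr.Stats {
		if s.Name == "server.total_connections" {
			return s.Value, nil
		}
	}
	return -1, fmt.Errorf("server.total_connections not found in response")
}

func main() {
	// Sample body as the Envoy admin API would return it.
	body := []byte(`{"stats":[{"name":"server.total_connections","value":3}]}`)
	n, err := parseTotalConnections(body)
	if err != nil {
		panic(err)
	}
	fmt.Println("total connections:", n) // total connections: 3
}
```

The drain loop exits once this value reaches 0, matching the `total connections: 0` log line shown earlier.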

@davidalger
Contributor Author

/retest

return deployment, nil
}

func expectedTerminationGracePeriodSeconds(cfg *egv1a1.ShutdownConfig) *int64 {
Contributor

shouldn't this be equal to DrainTimeout?

Contributor Author

terminationGracePeriodSeconds needs to be higher than the drain timeout so the pod has time to handle the SIGTERM before being sent a SIGKILL. It doesn't need to be much higher, but it does need to be marginally higher; assuming the other timeouts are functioning, it should never be reached under normal circumstances.

logger.Error(err, "error getting total connections")
}

if elapsedTime > minDrainDuration && !allowedToExit {
Contributor

nit: would it make sense to move minDrainDuration outside the loop and sleep for that duration before performing checks and calls?

Contributor Author

Probably a matter of personal preference, but I opted for this method so that the drain logs make it obvious what is happening: one gets visibility on connection counts changing immediately after starting the graceful drain, rather than waiting X seconds silently before the drain process starts logging.

minDrainDuration.Seconds(), drainTimeout.Seconds()))

// Send request to Envoy admin API to initiate the graceful drain sequence
if resp, err := http.Post(fmt.Sprintf("http://%s:%d/drain_listeners?graceful&skip_exit",
Contributor

will this cause the health check defined in

"@type": type.googleapis.com/envoy.extensions.filters.http.health_check.v3.HealthCheck

to fail?

also curious why skip_exit is needed

Contributor Author

The healthcheck endpoint will not fail. It is unimportant, though, as the ready status of the endpoint will immediately become false when the pod enters the terminating state, triggering anything that uses the endpoint to begin deprogramming.

skip_exit is useful to ensure new connections are not unexpectedly blocked if a drain takes longer than drain-time-s on the Envoy proxy, so it becomes important where long drain timeouts are configured. It's a knob recently added to the drain API to support Istio's graceful drain process as they transition to native sidecars (envoyproxy/envoy#30003), and I thought it made sense to include it so we don't also have to configure drain-time-s on Envoy where drain periods longer than 10m are needed.

Contributor

@arkodg arkodg Feb 26, 2024

I think it's important to fail the healthcheck endpoint, so downstream LBs using an active health check mechanism can deprogram the endpoint faster

Contributor

referring to calling this endpoint explicitly as the first step of the shutdown process
https://www.envoyproxy.io/docs/envoy/latest/operations/admin#post--healthcheck-fail
cc @guydc

Contributor Author

You're right, fast failure is also a good thing to maintain.

Updated in f249d73 to first fail the healthcheck and then proceed with a graceful drain. In order to do this, I've configured the MODIFY_ONLY drain type on the HTTP listener so the call to /healthcheck/fail will not change the drain strategy used.

Contributor

I think the LB concern is related to a specific deployment type (1 proxy per node with ExternalTrafficPolicy: Local). This is already handled as of K8s 1.26: https://kubernetes.io/blog/2022/12/30/advancements-in-kubernetes-traffic-engineering/#traffic-loss-from-load-balancers-during-rolling-updates

arkodg
arkodg previously approved these changes Feb 26, 2024
Contributor

@arkodg arkodg left a comment

thanks for building this out @davidalger, all users will benefit from this work and won't have to deal with custom images or scripts!
have a few non-blocking questions, overall LGTM!

@davidalger davidalger force-pushed the algerdev/graceful-pod-termination branch from 532d9e0 to eafa8a5 Compare February 26, 2024 20:15
@davidalger davidalger requested a review from arkodg February 26, 2024 22:25
Contributor

@guydc guydc left a comment

LGTM
