feat: Gracefully drain listeners before envoy shutdown on pod termination #2633
Testing / Load generator with unaltered EG, run from another pod within the cluster:

`hey -c 100 -q 10 -z 300s -host www.example.com http://172.18.255.200/`

Testing / Load generator with EG, run from another pod within the cluster
@arkodg @guydc Would like thoughts from both of you on the proposed implementation and approach demonstrated here, and also on whether these defaults are sufficient or need knobs to adjust.
Is there any significance to using the same value here, if we're not using the envoy hot reload feature?
Codecov Report — Attention: Patch coverage is …

Additional details and impacted files:

    @@ Coverage Diff @@
    ##             main    #2633      +/-   ##
    ==========================================
    + Coverage   63.00%   63.23%   +0.22%
    ==========================================
      Files         122      123       +1
      Lines       19770    19922     +152
    ==========================================
    + Hits        12457    12598     +141
    - Misses       6506     6513       +7
    - Partials      807      811       +4

View full report in Codecov by Sentry.
@guydc No particular significance to the value being the same in the context of k8s, but it is why I chose 5 minutes greater than the drain timeout, since the docs state "The important bit is that …"
return &corev1.Lifecycle{
	PreStop: &corev1.LifecycleHandler{
		Exec: &corev1.ExecAction{
			Command: []string{"/bin/bash", "-c", strings.ReplaceAll(strings.TrimSpace(dedent.Dedent(`
is this portable to distroless images?
That's actually a really good call out, as I'm pretty sure it won't work in any pre-built distroless image. Probably won't break anything, but won't do anything either since the exec will fail and k8s will proceed with an immediate SIGTERM to Envoy. So now I'm having a sudden realization that v0.6.0 started using the distroless image, but the dev builds (which I tested with) don't. 🤦🏻
The options are more limited with distroless images since they typically don't include extra binaries such as bash, cat, or awk, very likely leaving us with only one viable long-term option: building our own Envoy proxy images to include something like a wrapper process or a static executable to manage the termination process.
There are other approaches that could work without building our own proxy image, but some would tend to be fragile or require exposing the admin port. It's possible a sidecar container along with an http preStop hook could work, but it will definitely take a bit more effort to set up.
can we implement a) and b) using the HTTP and Sleep preStop hooks ?
https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/#hook-handler-implementations
we may need to convert the preStop.httpGet into a POST using envoy itself :)
There is unfortunately no way to accomplish this only leveraging k8s sleep and http lifecycle hooks:

- HTTP requests cannot be made to a port bound to `127.0.0.1` (I've tried this), so requests to the Envoy admin port would require it be exposed, which would pose a security risk.
- Envoy doesn't exit after a graceful drain since the process was designed for hot reloading. It would always run out the clock on the `terminationGracePeriodSeconds` even if there were no open connections.
- Sleep preStop actions are too new and cannot be used even to just introduce a basic X second sleep for deprogramming, as they won't reach beta until k8s v1.30.
Here is what I'm going to propose as a path forward:

- Implement a sub-command in Envoy Gateway called `envoy shutdown` which will execute the graceful drain and block until both the min-drain-period has elapsed and open connections have closed (similar to what the bash script does currently).
- Implement a sub-command in Envoy Gateway called `envoy shutdown-manager` which exposes an HTTP endpoint on port 19002 and responds to `/shutdown/ready`, blocking until the graceful drain initiated by the `envoy shutdown` sub-command has completed.
- Add a `shutdown-manager` container to the Envoy proxy pod running the `envoy shutdown-manager` http server.
- HTTP `preStop` hook on the `envoy` container requesting `/shutdown/ready` on port 19002.
- Exec `preStop` hook on the `shutdown-manager` container which executes the `envoy shutdown` sub-command.
When the pod gets a request to shut down, the preStop hook will be triggered on both the envoy and shutdown-manager containers simultaneously. The preStop on the shutdown-manager container will execute the graceful drain, and the preStop on the envoy container will block until the drain is complete. Each of the containers will receive the SIGTERM from k8s as normal once the preStop hook on that container completes.
I've mocked this up as a proof-of-concept to test the setup. Unless there are objections I will plan on re-implementing the drain process per the above outline which is very similar to what Contour did previously and appears it will work based on my testing.
hey @davidalger, exposing /shutdown on envoy has the same security risk: if it's exposed on 0.0.0.0, it can be called by anyone.
@davidalger does preStop.httpGet.host = localhost work ?
https://github.com/kubernetes/kubernetes/blob/b069ef6e13a2f44ce844acd9ff88a6957fc6d418/pkg/apis/core/types.go#L2201
> does preStop.httpGet.host = localhost work ?
Tried this. It fails with a PreStopHook failed warning added to the pod status and the container exits without allowing the drain to occur. I'm fairly certain this is because the request is coming from kubelet, which isn't local to the pod and thus can't actually make a request to the loopback.
> exposing /shutdown on envoy has the same security risk, if its exposed on 0.0.0.0, it can be called by anyone .

This is mitigated by having the /shutdown/ready endpoint largely be a dumb wait that does nothing more than block envoy container shutdown until the drain logic in the exec preStop hook on the shutdown-manager container finishes.
There really are two reasons why the actual drain logic is in an exec to a sub-command on the eg binary:

a) The security aspect, since http preStop hooks must be exposed to the rest of the cluster.
b) The sub-command will be useful for a local (any non-k8s) EG deployment if such is implemented in the future.
I've pushed up my work in progress for those who may like to take a look. There are still some things to do to finish it up, but it's functional. Here are a couple of example log sets from restarts under varying load, with no failed requests:
example 1 logs from shutdown-manager container
2024-02-21T14:47:39.037Z INFO shutdown-manager envoy/shutdown_manager.go:72 starting shutdown manager
2024-02-21T14:47:55.365Z INFO shutdown-manager envoy/shutdown_manager.go:86 received shutdown ready request
2024-02-21T14:47:55.398Z INFO shutdown-manager envoy/shutdown_manager.go:115 initiating graceful drain with 15 second minimum drain period and 600 second timeout
2024-02-21T14:47:55.399Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 10
2024-02-21T14:47:56.401Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 10
2024-02-21T14:47:57.402Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 10
2024-02-21T14:47:58.404Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 10
2024-02-21T14:47:59.408Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 9
2024-02-21T14:48:00.412Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 9
2024-02-21T14:48:01.414Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 9
2024-02-21T14:48:02.421Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 9
2024-02-21T14:48:03.427Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 9
2024-02-21T14:48:04.428Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 5
2024-02-21T14:48:05.438Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 5
2024-02-21T14:48:06.441Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 5
2024-02-21T14:48:07.443Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 5
2024-02-21T14:48:08.454Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 5
2024-02-21T14:48:09.461Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 0
2024-02-21T14:48:10.464Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 0
2024-02-21T14:48:10.464Z INFO shutdown-manager envoy/shutdown_manager.go:140 minimum drain period reached; will exit when total connections is 0
2024-02-21T14:48:10.464Z INFO shutdown-manager envoy/shutdown_manager.go:148 graceful drain sequence completed
2024-02-21T14:48:10.473Z INFO shutdown-manager envoy/shutdown_manager.go:62 received SIGTERM
2024-02-21T14:48:11.386Z INFO shutdown-manager envoy/shutdown_manager.go:97 shutdown readiness detected
Stream closed EOF for envoy-gateway-system/envoy-default-eg-e41e7b31-65bb79b45-mkgdt (shutdown-manager)
example 2 logs from shutdown-manager container
2024-02-21T14:47:55.887Z INFO shutdown-manager envoy/shutdown_manager.go:72 starting shutdown manager
2024-02-21T14:48:56.058Z INFO shutdown-manager envoy/shutdown_manager.go:86 received shutdown ready request
2024-02-21T14:48:56.087Z INFO shutdown-manager envoy/shutdown_manager.go:115 initiating graceful drain with 15 second minimum drain period and 600 second timeout
2024-02-21T14:48:56.089Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 100
2024-02-21T14:48:57.089Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 100
2024-02-21T14:48:58.090Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 100
2024-02-21T14:48:59.091Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 100
2024-02-21T14:49:00.091Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 100
2024-02-21T14:49:01.093Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 74
2024-02-21T14:49:02.094Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 74
2024-02-21T14:49:03.094Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 74
2024-02-21T14:49:04.096Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 74
2024-02-21T14:49:05.096Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 74
2024-02-21T14:49:06.097Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 33
2024-02-21T14:49:07.099Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 33
2024-02-21T14:49:08.100Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 33
2024-02-21T14:49:09.101Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 33
2024-02-21T14:49:10.102Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 33
2024-02-21T14:49:11.103Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 11
2024-02-21T14:49:11.103Z INFO shutdown-manager envoy/shutdown_manager.go:140 minimum drain period reached; will exit when total connections is 0
2024-02-21T14:49:12.105Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 11
2024-02-21T14:49:13.106Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 11
2024-02-21T14:49:14.106Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 11
2024-02-21T14:49:15.107Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 11
2024-02-21T14:49:16.108Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 1
2024-02-21T14:49:17.109Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 1
2024-02-21T14:49:18.111Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 1
2024-02-21T14:49:19.111Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 1
2024-02-21T14:49:20.112Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 1
2024-02-21T14:49:21.113Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 1
2024-02-21T14:49:22.114Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 1
2024-02-21T14:49:23.115Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 1
2024-02-21T14:49:24.116Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 1
2024-02-21T14:49:25.117Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 1
2024-02-21T14:49:26.119Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 1
2024-02-21T14:49:27.120Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 1
2024-02-21T14:49:28.121Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 1
2024-02-21T14:49:29.122Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 1
2024-02-21T14:49:30.124Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 1
2024-02-21T14:49:31.133Z INFO shutdown-manager envoy/shutdown_manager.go:196 total connections: 0
2024-02-21T14:49:31.133Z INFO shutdown-manager envoy/shutdown_manager.go:148 graceful drain sequence completed
2024-02-21T14:49:31.148Z INFO shutdown-manager envoy/shutdown_manager.go:62 received SIGTERM
2024-02-21T14:49:32.080Z INFO shutdown-manager envoy/shutdown_manager.go:97 shutdown readiness detected
Stream closed EOF for envoy-gateway-system/envoy-default-eg-e41e7b31-65bb79b45-crxbp (shutdown-manager)
Signed-off-by: David Alger <davidmalger@gmail.com>
/retest
	},
},
{
	Name: "shutdown-manager",
should this be a sidecar container ?

- this will ensure that `preStop` for `shutdown-manager` runs before `envoy`
- building off this:
  - preStop exec `shutdown` performs a `touch` on the tmp file and exits
  - preStop `/shutdown/ready` performs the envoy shutdown procedure and blocks until complete
  - `/shutdown/ready` can return a non-200 error code immediately if the tmp file doesn't exist
Not sure of the shutdown sequencing as I don't see much about it in the announcement blog post or on the docs page other than indications that pod termination is not blocked by sidecar containers.
Update: Found it under pod lifecycle docs:
Beginning with Kubernetes 1.29, if your Pod includes one or more sidecar containers (init containers with an Always restart policy), the kubelet will delay sending the TERM signal to these sidecar containers until the last main container has fully terminated. The sidecar containers will be terminated in the reverse order they are defined in the Pod spec. This ensures that sidecar containers continue serving the other containers in the Pod until they are no longer needed.
I could look into this if we are OK with dropping support for 1.28, where sidecar containers are feature-gated, only being enabled by default in 1.29. Not sure I'd recommend this though, with 1.29 not yet being available as stable/GA on AKS or GKE.
ah good point, we can't rely on order yet, since sidecars just GA'd
then I propose /shutdown/ready sleeps for 1s (ensuring in case of termination, the other preStop had the chance to create the tmp file) before checking whether the tmp file exists or not.
If it doesn't, return 404, else run the shutdown steps and return 200.
This eliminates the need for the `for` loop within the http handler.
at a high level, big +1 with this approach
> then I propose /shutdown/ready sleeps for 1s (ensuring in case of termination, the other preStop had the chance to create the tmp file) before checking whether the tmp file exists or not.
> If it doesnt, return 404, else run the shutdown steps and return 200.
Not sure I follow. As soon as the /shutdown/ready request returns (no matter the response code), kubelet will immediately send SIGTERM to PID 1 (envoy).
Was the concern with the for loop that it could never return? I've modified this to ensure requests don't endlessly poll after the ready-timeout: 4a3b90f, since `http.TimeoutHandler` of course wouldn't interrupt the processing, only break the connection.
the concerns I was trying to mitigate were:

- using more CPU with the `for` loop
- allowing non-kubelet clients to hoard the shutdown handler CPU

Since we cannot establish order between the envoy and shutdown-manager containers, I was proposing we use a delay (1s / 2s) to ensure that for the preStop case, we can be confident that the other preStop has signaled to us (using a file) that it's time to shutdown.
For the non-preStop case, it would just be a delay + return 404.
// Reconfigure logger to write to stdout of main process if running in Kubernetes
if _, k8s := os.LookupEnv("KUBERNETES_SERVICE_HOST"); k8s && os.Getpid() != 1 {
	logger = logging.FileLogger("/proc/1/fd/1", "shutdown-manager", v1alpha1.LogLevelInfo)
There's a discussion here about this option maybe not working if running the process as non-root.
Not sure if this concern is applicable in this case...
> writing directly to /proc/1/fd/1 will be problematic if the process drops root privileges but PID 1 is still running as root
I don't think it will apply in this case, as neither PID 1 nor the exec'd process are running as root since the container sets the uid:gid in the build (see here)
// Setup HTTP handler
handler := http.NewServeMux()
handler.HandleFunc("/shutdown/ready", func(w http.ResponseWriter, r *http.Request) {
	shutdownReadyHandler(w, r, ShutdownReadyFile)
Can this timeout before the handler actually detects shutdown readiness due to the server timeouts and/or kubelet timeouts? If so, what would be the behavior?
It has a configurable timeout (defaulting to 10s longer than the drain timeout) as a safeguard, but it won't time out prior to this per documented and observed behavior.
If this did return before the exec preStop completed the drain, envoy would be sent a SIGTERM and exit, after which you may observe errors such as this one logged until the drain timeout elapses and the shutdown command exits:
shutdown-manager 2024-02-22T20:48:54.259Z ERROR shutdown-manager envoy/shutdown_manager.go:145 error getting total connections {"error": "Get \"http://127.0.0.1:19000//stats?filter=^server\\\\.total_connections$&format=json\": dial tcp 127.0.0.1:19000: connect: connection refused"}
This should never occur with properly configured (staggered) timeouts, so I would hesitate to code around it for an initial implementation, preferring to let the errors be logged to draw eyes to the potential problem.
Signed-off-by: David Alger <davidmalger@gmail.com>
/retest
	return deployment, nil
}

func expectedTerminationGracePeriodSeconds(cfg *egv1a1.ShutdownConfig) *int64 {
shouldn't this be equal to DrainTimeout ?
terminationGracePeriodSeconds needs to be higher than the drain timeout so the pod has time to handle the SIGTERM before being sent a SIGKILL. It doesn't have to be much, but it does need to be marginally higher, and assuming other timeouts are functioning it should never be reached under normal circumstances.
	logger.Error(err, "error getting total connections")
}

if elapsedTime > minDrainDuration && !allowedToExit {
nit: would it make sense to move minDrainDuration outside the loop and sleep for that duration before performing checks and calls ?
Probably a matter of personal preference, but I opted for this method so that the drain logs make it obvious what is happening, and one gets visibility on connection counts changing immediately after starting the graceful drain, vs waiting X seconds silently before the drain process starts logging.
	minDrainDuration.Seconds(), drainTimeout.Seconds()))

// Send request to Envoy admin API to initiate the graceful drain sequence
if resp, err := http.Post(fmt.Sprintf("http://%s:%d/drain_listeners?graceful&skip_exit",
will this cause the health check defined in
to fail ?

also curious why is skip_exit needed ?
The healthcheck endpoint will not fail. It is unimportant though, as the ready status of the endpoint will immediately become false when the pod enters the terminating status, triggering anything that uses the endpoint to begin deprogramming.
skip_exit is useful to ensure new connections are not unexpectedly blocked if a drain takes longer than drain-time-s on the Envoy proxy, so it becomes important where long drain timeouts are configured. It's a knob recently added to the drain API to support Istio's graceful drain process as they transition to native sidecars (envoyproxy/envoy#30003), and I thought it made sense to include it so we don't have to also configure drain-time-s on Envoy where drain periods longer than 10m are needed.
I think it's important to fail the healthcheck endpoint, so downstream LBs that are using an active health check mechanism can deprogram the endpoint faster.
referring to calling this endpoint explicitly as the first step of the shutdown process
https://www.envoyproxy.io/docs/envoy/latest/operations/admin#post--healthcheck-fail
cc @guydc
you're right, fast failure is also a good thing to maintain
updated in f249d73 to first fail the healthcheck and then proceed with a graceful drain. In order to do this, I've configured the MODIFY_ONLY drain type on the HTTP listener so the call to /healthcheck/fail will not change the drain strategy used.
I think that the LB concern is related to a specific deployment type (1 proxy per Node with ExternalTrafficPolicy: Local). This is already handled in K8s 1.26 by: https://kubernetes.io/blog/2022/12/30/advancements-in-kubernetes-traffic-engineering/#traffic-loss-from-load-balancers-during-rolling-updates
Signed-off-by: David Alger <davidmalger@gmail.com>
arkodg left a comment:
thanks for building this out @davidalger, all users will benefit from this work, and not have to deal with custom images or scripts !
have a few non blocking questions, overall LGTM !
Signed-off-by: David Alger <davidmalger@gmail.com>
Signed-off-by: David Alger <davidmalger@gmail.com>
… termination (envoyproxy#2633)" This reverts commit 329aafc.
What this PR does:
This PR is a step in the direction of achieving a hitless upgrade. Some challenges to achieving this have been outlined by @guydc in #1712 (comment) and #2610
This proposed implementation of a PreStop lifecycle hook will allow Envoy proxies to gracefully shutdown when pods are terminated by Kubernetes. This is done by:
a) Initiating a graceful drain sequence via the `drain_listeners?graceful` admin endpoint.
b) Sleeping for a minimum drain duration of 15 seconds, allowing ample time for endpoint deprogramming to occur.
c) Monitoring the `server.total_connections` statistic and exiting once the total connection count reaches `0`.
d) Timing out after 600 seconds, after which Envoy will be sent SIGTERM by Kubernetes regardless of how many open connections remain. This timeout lines up with the default value for `--drain-time-s`.
e) Configuring `terminationGracePeriodSeconds` to 900 seconds. This provides ample time for Envoy to cleanly exit after connections have been drained.

Some prior art w.r.t. graceful shutdown for Envoy:
Summary of why we need this:
Caveats:
When TCP listeners are used (rather than HTTP listeners), connections will not drain per the below note from draining architecture documentation. This means in the event of long-living TCP connections, Envoy proxies will simply wait in the PreStop hook for the 600 second timeout to occur after which these connections will be closed when Envoy receives the SIGTERM from Kubernetes.
Reference: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/operations/draining