Title: Hot restart across hot restart versions with SO_REUSEPORT
Description:
I'd like to collect some feedback on a way to do hot restarting across hot restart versions.
Problem
There is no standard way to upgrade when the hot restart version changes (or if there is, I am not aware of it). The current recommendation is for operations to "cope with this and do a full restart".
Barring somehow making the hot restart data structures backwards compatible forever, the only way to do this would be something like a prolonged cluster drain for upgrades. This takes considerable effort and time, and may be impractical where you do not have elastic infrastructure.
Proposed solution
For systems that support it, we can use SO_REUSEPORT on the listener sockets for a "less hot restart". We'd have a second Envoy process start up using a different shared memory region (base-id). We'd lose stats, but this would be no different than the current solution of doing a full restart.
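As a sketch of the mechanism (Python for brevity; `make_reuseport_listener` is a hypothetical helper, not Envoy code): setting SO_REUSEPORT before bind() lets a second process bind the same address while the old one is still serving, and the kernel distributes incoming connections between the two listeners:

```python
import socket

def make_reuseport_listener(host: str, port: int) -> socket.socket:
    """Create a TCP listener that can share its port with another process.

    With SO_REUSEPORT set before bind(), a second process (e.g. a new
    Envoy instance with a different base-id) can bind the same
    (host, port) while the old process is still accepting connections.
    """
    # SO_REUSEPORT is not available on every platform; callers should
    # fall back to a full restart when it is missing.
    if not hasattr(socket, "SO_REUSEPORT"):
        raise OSError("SO_REUSEPORT not supported on this platform")
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen(128)
    return sock
```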
We would also need some way to shut down the parent process, which is done via shared memory/RPC right now. This could be done either with a wrapper coordinating the shutdown (e.g. having the hot restart wrapper take on a more active role in the restart process) or by telling the new process to shut down the old process.
In the latter case, this can't be done the way it is today, since that relies on the RPC mechanism. One option would be to have a simplified core for RPC that never changes that enables this. Another option would be to pass the PID in and have Envoy send a signal when it is ready to shut down the parent process.
In either case, we'd have to enable/disable this behavior depending on the availability of SO_REUSEPORT.
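The signal-based variant of the handoff could be as small as the following sketch (the name `shut_down_parent` is made up for illustration, and SIGTERM stands in for whatever graceful-drain signal the old process actually handles). The appeal is that the only contract between the two hot restart versions is a PID and a signal number, not a versioned shared-memory/RPC layout:

```python
import os
import signal

def shut_down_parent(parent_pid: int) -> None:
    """Ask the old process to drain and exit.

    Called by the new process once it has bound its listeners and is
    ready to serve. The parent PID would be passed in on the command
    line (e.g. by the hot restart wrapper).
    """
    os.kill(parent_pid, signal.SIGTERM)
```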
Problem with SO_REUSEPORT
(This is mostly a rehash of issues discussed here: https://www.haproxy.com/blog/truly-seamless-reloads-with-haproxy-no-more-hacks/)
There is a problem with SO_REUSEPORT, which is that there are race conditions that may cause traffic to be dropped. Specifically, there is a race where a connection is placed in the accept queue of the old process just before it calls close, at which point the queued connection is discarded.
We can either accept the fact that these cases are infrequent and that when these reloads happen there may be traffic dropped, or we could implement one of a few different mitigating solutions. These include (but are not limited to):
- qdisc dance in supervisor process (e.g. hot-reloader) a la https://engineeringblog.yelp.com/2015/04/true-zero-downtime-haproxy-reloads.html
- add a proxy and use unix domain sockets with atomic move semantics a la https://engineeringblog.yelp.com/2017/05/taking-zero-downtime-load-balancing-even-further.html
- use eBPF filters to redirect SYNs to the new process’s listen queue. This was tried and failed in this thread, but maybe someone smart can come along and do this correctly?
- reworking the SO_REUSEPORT_LISTEN_OFF kernel patch from the previously linked thread and attempting to get this merged
Note: This problem apparently is Linux specific. I believe systems like OSX will send new connections to the last-bound socket, which will be the newest Envoy instance. So the fact that most of these solutions are Linux specific is probably not an issue.
Alternative solutions
There are alternatives. One possibility, as I alluded to above, is to freeze the API for socket fd passing forever. We would have some simplified IPC mechanism just for FD passing that never changes and doesn't depend on things like stats area size. I'm not sure of the feasibility of this, but it definitely feels like it would be the simplest in implementation.
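A frozen FD-passing mechanism could plausibly be as small as a Unix domain socket exchanging SCM_RIGHTS ancillary data, which is a kernel-level contract rather than an Envoy-versioned one. As an illustrative sketch (the helper names are hypothetical; `socket.send_fds`/`socket.recv_fds` are real Python 3.9+ wrappers around sendmsg/recvmsg with SCM_RIGHTS):

```python
import socket

def send_listener_fds(conn: socket.socket, fds: list[int]) -> None:
    """Old process: hand open listener fds to the new process.

    The one-byte payload is just a placeholder message; the fds travel
    as SCM_RIGHTS ancillary data and arrive as fresh duplicates.
    """
    socket.send_fds(conn, [b"x"], fds)

def recv_listener_fds(conn: socket.socket, maxfds: int) -> list[int]:
    """New process: receive duplicated listener fds."""
    _msg, fds, _flags, _addr = socket.recv_fds(conn, 1, maxfds)
    return list(fds)
```

Because SCM_RIGHTS itself never changes, the two sides only need to agree on this tiny framing, not on stats area size or any other shared-memory layout.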
Another possibility is to use a "socket server" as described in this post. The downside of this is that it is another external process to coordinate. The upside is that it provides a level of separation of concerns.
So what do people think? Are there any solutions that aren't discussed here?