
RFC: Hot restart across hot restart versions with SO_REUSEPORT #3804

@bplotnick

Description

I'd like to collect some feedback on a way to do hot restarting across hot restart versions.

Problem

There is no standard way to upgrade when the hot restart version changes (or if there is, I am not aware of it). The current recommendation is for operations to "cope with this and do a full restart".

Barring somehow making the hot restart data structures backward compatible forever, the only way to do this would be something like a prolonged cluster drain for upgrades. This takes considerable effort and time, and may be impractical in cases where you do not have elastic infrastructure.

Proposed solution

For systems that support it, we can use SO_REUSEPORT on the listener sockets for a "less hot restart". We'd have a second Envoy process start up using a different shared memory region (base-id). We'd lose stats, but this would be no different than the current solution of doing a full restart.

We would also need some way to shut down the parent process, which is done via shared memory/RPC right now. This could be done either with a wrapper coordinating the shutdown (e.g. having the hot restart wrapper take on a more active role in the restart process) or by having the new process shut down the old one.

In the latter case, this can't be done the way it is today, since that relies on the RPC mechanism. One option would be a simplified core RPC protocol that never changes and enables just this. Another option would be to pass the parent PID in and have the new Envoy send a signal when it is ready for the parent process to shut down.

In either case, we'd have to enable/disable this behavior depending on the availability of SO_REUSEPORT.

Problem with SO_REUSEPORT

(This is mostly a rehash of issues discussed here: https://www.haproxy.com/blog/truly-seamless-reloads-with-haproxy-no-more-hacks/)

There is a problem with SO_REUSEPORT: race conditions exist that may cause traffic to be dropped. Specifically, there is a race where a connection is placed in the accept queue of the old process just before it calls close(); any connections still queued but not yet accepted at that point are dropped.

We can either accept that these cases are infrequent, and tolerate some dropped traffic during reloads, or implement one of several mitigating solutions, such as those discussed in the HAProxy post linked above.

Note: this problem is apparently Linux-specific. I believe systems like OS X deliver new connections to the last-bound socket, which would be the newest Envoy instance, so the fact that most of these mitigations are Linux-specific is probably not an issue.

Alternative solutions

There are alternatives. One possibility, as I alluded to above, is to freeze the API for socket fd passing forever. We would have some simplified IPC mechanism just for fd passing that never changes and doesn't depend on things like the stats area size. I'm not sure of the feasibility of this, but it definitely feels like it would be the simplest to implement.

Another possibility is to use a "socket server" as described in this post. The downside is that it is another external process to coordinate; the upside is that it provides a level of separation of concerns.


So what do people think? Are there any solutions that aren't discussed here?
