Skip to content

Make server shutdown wait on any Detached handler futures#702

Merged
jgallagher merged 1 commit into
mainfrom
detached-graceful-shutdown
Jun 15, 2023
Merged

Make server shutdown wait on any Detached handler futures#702
jgallagher merged 1 commit into
mainfrom
detached-graceful-shutdown

Conversation

@jgallagher

Copy link
Copy Markdown
Contributor

I ended up using a WaitGroup to track outstanding handler futures instead of trying to accumulate them into something like a FuturesUnordered or JoinSet, because we don't actually care about their results: we just want to know when they're all done.

Testing this was a little weird: we have to force a situation where we have a detached handler future still running after server shutdown has been requested. It uses a similar pattern to the tests added #701, where the handler will wait for a message on a channel from the test driver before returning, allowing us to construct a client request and then cancel it, leaving the detached handler still running.

.await
.is_ok()
{
panic!("server shutdown returned while handler running");

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without the addition of

handler_waitgroup.wait().await;

when constructing join_handle, the test fails here.

release_endpoint_tx.send(()).await.unwrap();

// Now we can finish waiting for server shutdown.
teardown_fut.await;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without the explict

mem::drop(self.app_state);

in close(), this test hangs forever here, because the wait group is still waiting for the last worker (the one held in DropshotState) to be dropped.

@smklein smklein self-assigned this Jun 14, 2023

@smklein smklein left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the solid test - always quirky to test these ordering things, but I appreciate that you put in the effort to make this more robust.

Comment thread dropshot/src/server.rs
use super::ProbeRegistration;

use async_stream::stream;
use debug_ignore::DebugIgnore;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL, this crate seems useful!

@smklein smklein removed their assignment Jun 14, 2023

@davepacheco davepacheco left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, now the websocket case makes me wonder whether it's a problem that we don't cancel these things now? Do you happen to know what used to happen before #701 if you shut down a server with request handlers running? What if there was a websocket stream attached?

Comment thread dropshot/Cargo.toml
slog-term = "2.9.0"
tokio-rustls = "0.24.0"
toml = "0.7.4"
waitgroup = "0.1.2"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine. I see this pulls in two deps. It's a shame if there's nothing in tokio or futures that can already do this. 🤷

Comment thread dropshot/src/server.rs
api,
private,
log,
handler_waitgroup.worker(),

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably not something to worry about right now, but: I wonder how debuggable this is, either in situ or post mortem. Like if you walk up to a server that's shutting down and stuck waiting on one of these, would you have any way to know which request it was waiting on? I imagine eventually we'll want to elevate this to an API but that's probably way down the road.

Just to show what I mean, in the past I built something like this:
https://github.com/TritonDataCenter/node-vasync#barrier-coordinate-multiple-concurrent-operations
In that thing, each outstanding operation has a distinct name. You can take the whole object and expose that over a debug HTTP API to ask the server "what are the operations you're waiting on". This proved incredibly useful. But that was more for stuff like our own RSS, where you've got complicated long-running things that could get stuck somewhere. This particular case seems less likely to be a problem. It would just be neat to have a thing like waitgroup but where it was easy to ask it what it was waiting on.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably not something to worry about right now, but: I wonder how debuggable this is, either in situ or post mortem. Like if you walk up to a server that's shutting down and stuck waiting on one of these, would you have any way to know which request it was waiting on?

Probably not! Internally the waitgroup is just a wrapper around an AtomicWaker, which itself is just a glorified AtomicUsize; I think you could pretty quickly find the count, but assuming it's something like 1, I don't know how you'd find the guilty party.

@jgallagher

Copy link
Copy Markdown
Contributor Author

Ugh, now the websocket case makes me wonder whether it's a problem that we don't cancel these things now? Do you happen to know what used to happen before #701 if you shut down a server with request handlers running? What if there was a websocket stream attached?

I think we did the right thing, at least as far as we can - the web socket upgrade handler happens within the normal handler's future, right? And graceful shutdown waits for all handler futures to complete already.

If in practice most websocket handlers do their own tokio::spawn() and move the upgraded websocket off to a background task, we wouldn't (and still can't) do anything to wait on that.

@davepacheco

Copy link
Copy Markdown
Collaborator

Ugh, now the websocket case makes me wonder whether it's a problem that we don't cancel these things now? Do you happen to know what used to happen before #701 if you shut down a server with request handlers running? What if there was a websocket stream attached?

I think we did the right thing, at least as far as we can - the web socket upgrade handler happens within the normal handler's future, right? And graceful shutdown waits for all handler futures to complete already.

If in practice most websocket handlers do their own tokio::spawn() and move the upgraded websocket off to a background task, we wouldn't (and still can't) do anything to wait on that.

The thing I'm (low-level) worried about is that someone starts up a WebSocket for something like a serial console (just as an example of something that's a best-effort background thing) and now the server cannot be shut down. It seems like the consumer would probably want to cancel that rather than wait for the client to determine when they can shut down. The same applies to ordinary HTTP requests but those are usually short-lived and should usually have aggressive timers on them so they seem less likely to cause a problem.

If we previously blocked on WebSockets closing, there's no behavior change here, and this is definitely fine.

Base automatically changed from option-to-spawn-handlers to main June 15, 2023 20:46
@jgallagher jgallagher force-pushed the detached-graceful-shutdown branch from d0d8b37 to 9816cb2 Compare June 15, 2023 20:50
@jgallagher jgallagher merged commit fc0ffd5 into main Jun 15, 2023
@jgallagher jgallagher deleted the detached-graceful-shutdown branch June 15, 2023 21:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants