cargo-shuttle: fix address in use error when service panicked in a previous run (#805)
Conversation
oddgrd
left a comment
Thanks @iulianbarbu! I'm wondering if this test failure in `runtime` is spurious or due to the change in `proto`; did you see it fail locally too?
I am curious if they will fail again now. I am running them locally as well.

It seems like the … However, on my machine, the … The …
@oddgrd @chesedo, it seems like the previously failing tests pass now, but this time …
```rust
if signal_received {
    let stop_request = StopRequest {};
    trace!(
        ?stop_request,
        "stopping service because it received a signal"
    );
    let _response =
        rt.1.stop(tonic::Request::new(stop_request))
            .or_else(|err| async {
                provisioner_server.abort();
                rt.0.kill().await?;
                Err(err)
            })
            .await?
            .into_inner();
    continue;
}
```
Have we not already handled this case above? I.e., if `signal_received` was false above, then it will still be false here.
The previous `signal_received` handling refers to receiving a signal during the runtimes' initialization. If `signal_received` is true, it should stop all the existing runtimes and return.

Otherwise, it moves on to waiting on each runtime in a for loop, which is the case we're discussing here. While waiting on these runtimes we must handle signals as well, hence we check `signal_received` before actually waiting on a runtime in a new iteration. If it is true, then we received a signal in a previous iteration and we must stop all the runtimes we haven't waited on yet instead of continuing to wait on them.
Thanks for figuring this out Iulian, I understand why they have a warning about this being tricky in the `tokio::signal` docs 😄 I left a few comments, mostly nits, and a note about us considering doing this for Windows too in the future.
We might also want to test this on Windows to see that it doesn't cause any regressions; IIRC, importing something that's only available on Unix may be an issue. But I can check this quickly tomorrow. Edit: It looks like all uses of Unix signals are inside `cfg(unix)`, so this should be fine.
```rust
if !response.success {
    error!(error = response.message, "failed to load your service");
    exit(1);
// If prior signal received is set to true we must stop all the existing runtimes and
```
Doesn't this loop stop the services (unless the stop request fails), and then the runtimes are stopped on return (drop)? I may be misreading this 😅
Hmm, yes, in a way; only that the runtimes being dropped doesn't necessarily mean the underlying processes are killed (https://docs.rs/tokio/latest/tokio/process/struct.Child.html#caveats). What I wanted to say is that if a signal is received during the previous loop that initializes the runtimes, then we should stop the existing runtimes (gracefully, through the `StopRequest`, or kill them only if that fails), and then we can return from `local_run` as an abrupt termination.
An addendum to the above comment: I had to adjust this logic a bit. Now, if a service panics during initialization, then all the workspace runtimes must be killed and cargo-shuttle exits with code 1. Because we forcefully exit, the destructors for the runtimes are not called, so we must forcefully kill the runtime besides asking it to stop gracefully. The rest still applies.
cargo-shuttle/src/lib.rs
```rust
.stop(tonic::Request::new(stop_request))
.or_else(|err| async {
    runtime.kill().await?;
    error!(status = ?err, "killed the runtime by force because stopping it errored out");
```
What if we do `trace!` here? Then the user won't see an error, which won't be relevant to them in the event of Ctrl-C anyway, since as far as they are concerned it was a success, not an error, right?
…evious run (#805)

* runtime & proto: teardown idle runtime in case of service panics.
* cargo-shuttle: abort all runtimes when one fails during a local run.
* cargo-shuttle: added sigint and sigterm handlers
* proto: remove kill_on_drop flag
* cargo-shuttle: fix comments
* runtime: remove leftovers
* cargo-shuttle: extract stop_runtime logic in a function
* proto: add kill_on_drop flag again
* cargo-shuttle: fix again one service panic case
* runtime(tests): fix bind_panic
* cargo-shuttle: handle stop requests error gracefully
* cargo-shuttle: handle stopping error gracefully for service that panics
* cargo-shuttle: clarified logging in case runtime stopping errors
* cargo-shuttle: change log level
Description of change
Running user services locally through `cargo shuttle run` will run a runtime that loads/starts the service as a tokio task. If the task panics, the shuttle-runtime will live on because of tokio behavior: tokio-rs/tokio#2002.

This is not the case anymore. I haven't seen this making any difference for our use cases. Removing it to have the `bind_panic` test succeed.

Extra: set the `kill_on_drop` flag to true for started shuttle-runtimes to make sure we either `.wait()` or `.kill()` on them, or, if dropped, the kill is automatically issued: https://docs.rs/tokio/latest/tokio/process/struct.Command.html#method.kill_on_drop

How Has This Been Tested (if applicable)?
Run the `hello-world-rocket-app` example locally with a panic in it. Consecutive local runs without the fix hang with the `Address in use` error. The fix ensures the runtime and provisioner are killed in case the runtime response is not successful.

Also, if the `cargo-shuttle` process receives `SIGINT` or `SIGTERM`, it successfully stops the existing runtimes.