[support bundle] Monitor for cancellation without accidental task-blocking#9268
Conversation
hawkw
left a comment
There was a problem hiding this comment.
Overall, this fix looks correct and I'm happy to merge it as-is. I had a few small notes, hopefully some of them are useful!
| // Returns if the bundle has been cancelled explicitly, or if we | ||
| // cannot successfully check the bundle state. |
There was a problem hiding this comment.
hmm, this makes me wonder if we actually want to have the cancellation checking task bail out if it encounters a DB error. might we want to just try again in another yield_interval seconds, in that case?
i suppose we really would want to handle different error types there, so that it bails out on e.g. NotFounds.
| // Returns if the bundle has been cancelled explicitly, or if we | ||
| // cannot successfully check the bundle state. | ||
| why = &mut check_for_cancellation => { | ||
| let why = why.expect("Should not cancel the bundle-checking task without returning"); |
There was a problem hiding this comment.
turbo nit: can we wrap this string at the 80th column?
There was a problem hiding this comment.
Done (actually removed, but, done)
| }, | ||
| // Otherwise, keep making progress on the collection itself. | ||
| report = &mut collection => { | ||
| check_for_cancellation.abort(); |
There was a problem hiding this comment.
nit, take it or leave it: might we consider wrapping check_for_cancellation in an AbortOnDropHandle, rather than doing this explicitly?
There was a problem hiding this comment.
So, I ended up taking a slightly different path here. Looking at this again, I kinda asked myself:
Why is this work in w tokio task, but the collection work isn't?
I restructured this code slightly:
- Both cancellation and collection are "just futures" now
- When we select on them, we don't select on "&mut future" anymore (which is good - if we take one branch, we'll cancel the other immediately)
- Now there's no need to "abort" the other task, nor pin either future
| return why; | ||
| }, | ||
| // Otherwise, keep making progress on the collection itself. | ||
| report = &mut collection => { |
There was a problem hiding this comment.
Nit: since there's now only one select!, rather than a select! in a loop, we don't need the &mut here (though we do need the &mut check_for_cancellation in the other select arm, so that we can cancel it here, but using AbortOnDropHandle would obviate that...)
| report = &mut collection => { | |
| report = collection => { |
There was a problem hiding this comment.
Done - for both futures, now!
| // We run this task to "check for cancellation" as a whole tokio task | ||
| // for a critical, but subtle reason: After the tick timer yields, | ||
| // we may then try to "await" a database function. | ||
| // | ||
| // This, at a surface-level glance seems innocent enough. However, there | ||
| // is something potentially insidious here: if calling a datastore | ||
| // function - such as "support_bundle_get" - blocks on acquiring access | ||
| // to a connection from the connection pool, while creating the | ||
| // collection ALSO potentially blocks on acquiring access to the | ||
| // connection pool, it is possible for: | ||
| // | ||
| // 1. The "&mut collection" arm to have created a future, currently | ||
| // yielded, which wants access to this underlying resource. | ||
| // 2. The current task of execution, in "support_bundle_get", to be | ||
| // blocked "await-ing" for this same underlying resource. | ||
| // | ||
| // In this specific case, the connection pool would be attempting to | ||
| // yield to the "&mut collection" arm, which cannot run, if we were | ||
| // blocking on the body of a different async select arm. This would | ||
| // result in a deadlock. | ||
| // | ||
| // In the future, we may attempt to make access to the connection pool | ||
| // safer from concurrent asynchronous access - it is unsettling that | ||
| // multiple concurrent ".claim()" functions can cause this behavior - | ||
| // but in the meantime, we spawn this cancellation check in an entirely | ||
| // new tokio task. Because of this separation, each task (the one | ||
| // checking for cancellation, and the main thread attempting to collect | ||
| // the bundle) do not risk preventing the other from being polled | ||
| // indefinitely. |
There was a problem hiding this comment.
Thanks for writing this up, this is excellent. I have a couple very small nits:
- I know we often colloquially refer to async tasks that are waiting for something as 'blocking" on that thing, but I might avoid that here to make sure the reader doesn't confuse this with the notion of actually blocking the worker thread
- I changed a use of the word "task", which I think refers to the general concept of "a thing we are doing", to "operation", to make it clear that we are not referring to a Tokio task there
- It might be worth including a link to Nexus node timing out on API requests #9259?
| // We run this task to "check for cancellation" as a whole tokio task | |
| // for a critical, but subtle reason: After the tick timer yields, | |
| // we may then try to "await" a database function. | |
| // | |
| // This, at a surface-level glance seems innocent enough. However, there | |
| // is something potentially insidious here: if calling a datastore | |
| // function - such as "support_bundle_get" - blocks on acquiring access | |
| // to a connection from the connection pool, while creating the | |
| // collection ALSO potentially blocks on acquiring access to the | |
| // connection pool, it is possible for: | |
| // | |
| // 1. The "&mut collection" arm to have created a future, currently | |
| // yielded, which wants access to this underlying resource. | |
| // 2. The current task of execution, in "support_bundle_get", to be | |
| // blocked "await-ing" for this same underlying resource. | |
| // | |
| // In this specific case, the connection pool would be attempting to | |
| // yield to the "&mut collection" arm, which cannot run, if we were | |
| // blocking on the body of a different async select arm. This would | |
| // result in a deadlock. | |
| // | |
| // In the future, we may attempt to make access to the connection pool | |
| // safer from concurrent asynchronous access - it is unsettling that | |
| // multiple concurrent ".claim()" functions can cause this behavior - | |
| // but in the meantime, we spawn this cancellation check in an entirely | |
| // new tokio task. Because of this separation, each task (the one | |
| // checking for cancellation, and the main thread attempting to collect | |
| // the bundle) do not risk preventing the other from being polled | |
| // indefinitely. | |
| // We run this task to "check for cancellation" as a whole tokio task | |
| // for a critical, but subtle reason: After the tick timer yields, | |
| // we may then try to `await` a database function. | |
| // | |
| // This, at a surface-level glance seems innocent enough. However, there | |
| // is something potentially insidious here: if calling a datastore | |
| // function - such as "support_bundle_get" - awaits acquiring access | |
| // to a connection from the connection pool, while creating the | |
| // collection ALSO potentially awaits acquiring access to the | |
| // connection pool, it is possible for: | |
| // | |
| // 1. The `&mut collection` arm to have created a future, currently | |
| // yielded, which wants access to this underlying resource. | |
| // 2. The current operation executing in `support_bundle_get` to | |
| // be awaiting access to this same underlying resource. | |
| // | |
| // In this specific case, the connection pool would be attempting to | |
| // yield to the `&mut collection` arm, which cannot run, if we were | |
| // awaiting in the body of a different async select arm. This would | |
| // result in a deadlock. | |
| // | |
| // In the future, we may attempt to make access to the connection pool | |
| // safer from concurrent asynchronous access - it is unsettling that | |
| // multiple concurrent `.claim()` functions can cause this behavior - | |
| // but in the meantime, we spawn this cancellation check in an entirely | |
| // new tokio task. Because of this separation, each task (the one | |
| // checking for cancellation, and the main thread attempting to collect | |
| // the bundle) do not risk preventing the other from being polled | |
| // indefinitely. | |
| // | |
| // For more details, see: | |
| // https://github.com/oxidecomputer/omicron/issues/9259 |
There was a problem hiding this comment.
Sounds good, i took a revised version of this, revised because we are not spawning tasks.
| }, | ||
| } | ||
| } | ||
| }; |
There was a problem hiding this comment.
Totally optional style nit - this could maybe move to its own method, which lets this method be entirely focused on "we're selecting on two futures and this is why it's written this way"; e.g.
let collection = self.collect_bundle_as_file(&dir);
let check_for_cancellation = self.wait_for_cancellation();
// ... many comments ...
tokio::select! { .. }| tokio::select! { | ||
| // Returns if the bundle should no longer be collected. | ||
| why = self.check_for_cancellation() => { | ||
| warn!( | ||
| &self.log, | ||
| "Support Bundle cancelled - stopping collection"; | ||
| "bundle" => %self.bundle.id, | ||
| "state" => ?self.bundle.state | ||
| ); | ||
| return Err(why); | ||
| }, | ||
| // Otherwise, keep making progress on the collection itself. | ||
| report = self.collect_bundle_as_file(&dir) => { | ||
| info!( | ||
| &self.log, | ||
| "Bundle Collection completed"; | ||
| "bundle" => %self.bundle.id | ||
| ); | ||
| return report; | ||
| }, |
There was a problem hiding this comment.
interesting, I agree that the new way of doing this here (without spawning a task for check_for_cancellation) should also avoid the deadlock, because both futures are just moved by value into the select! and we're not looping around the select!.
| Err(err) | ||
| if matches!(err, Error::ObjectNotFound { .. }) | ||
| || matches!(err, Error::NotFound { .. }) => |
There was a problem hiding this comment.
isn't this just
| Err(err) | |
| if matches!(err, Error::ObjectNotFound { .. }) | |
| || matches!(err, Error::NotFound { .. }) => | |
| Err(err @ Error::ObjectNotFound { .. } | err @ Error::NotFound { .. } ) => |
| Ok(bundle) => { | ||
| if !matches!(bundle.state, SupportBundleState::Collecting) { | ||
| return anyhow::anyhow!("Support Bundle Cancelled"); | ||
| } | ||
|
|
||
| // Bundle still collecting; continue... | ||
| continue; | ||
| } |
There was a problem hiding this comment.
take it or leave it: this could be:
| Ok(bundle) => { | |
| if !matches!(bundle.state, SupportBundleState::Collecting) { | |
| return anyhow::anyhow!("Support Bundle Cancelled"); | |
| } | |
| // Bundle still collecting; continue... | |
| continue; | |
| } | |
| Ok(SupportBundle { state: SupportBundleState::Collecting }) => { | |
| // Bundle still collecting; continue... | |
| continue; | |
| }, | |
| Ok(_) => return anyhow::anyhow!("Support Bundle Cancelled"), |
Fixes #9259