Spawn a task to run services_ensure#3140
Conversation
dropshot/hyper will cancel an endpoint's task if the call times out or is otherwise cancelled, and this leads to a specific issue where booting all the service zones takes longer than the progenitor client default timeout. The request times out, `services_ensure` gets cancelled, and this leaves zones partially configured but not added to the list of `existing_zones`. When the next `PUT /services` call is made, `services_ensure` will eventually try to bring up the same zone it was interrupted at, leading to configuration issues. Fixes oxidecomputer#3098
| // Spawn a task to run `services_ensure`: dropshot/hyper will cancel an | ||
| // endpoint's task if the call times out or is otherwise cancelled, and this | ||
| // leads to a specific issue where booting all the service zones takes | ||
| // longer than the progenitor client default timeout. The request times out, | ||
| // `services_ensure` gets cancelled, and this leaves zones partially | ||
| // configured but not added to the list of `existing_zones`. When the next | ||
| // `PUT /services` call is made, `services_ensure` will eventually try to | ||
| // bring up the same zone it was interrupted at, leading to configuration | ||
| // issues. See: oxidecomputer/omicron#3098. |
There was a problem hiding this comment.
I'd suggest summarizing this, maybe something like:
| // Spawn a task to run `services_ensure`: dropshot/hyper will cancel an | |
| // endpoint's task if the call times out or is otherwise cancelled, and this | |
| // leads to a specific issue where booting all the service zones takes | |
| // longer than the progenitor client default timeout. The request times out, | |
| // `services_ensure` gets cancelled, and this leaves zones partially | |
| // configured but not added to the list of `existing_zones`. When the next | |
| // `PUT /services` call is made, `services_ensure` will eventually try to | |
| // bring up the same zone it was interrupted at, leading to configuration | |
| // issues. See: oxidecomputer/omicron#3098. | |
| // Spawn a task to run `services_ensure` so that it runs to completion even if | |
| // this future is cancelled (as might happen if the client abandons the request). | |
| // Otherwise, unexpected cancellation of this future could result in leaving the in- | |
| // memory state of this zone invalid. |
There's a lot of other detail there that's pretty specific to the bug you saw.
Tangentially: what I understood happened (which could well be wrong!) is that the client timed out its request, the socket got closed, the corresponding Future got dropped, and so it became cancelled. As far as I know, neither hyper nor dropshot has a policy that would cause a request to be timed out.
There was a problem hiding this comment.
I trimmed it in fb7b9f7, let me know what you think
| // bring up the same zone it was interrupted at, leading to configuration | ||
| // issues. See: oxidecomputer/omicron#3098. | ||
|
|
||
| match tokio::spawn(async move { sa.services_ensure(body_args).await }).await |
There was a problem hiding this comment.
Ideally it would seem good to have a limit to how many of these tasks can be outstanding. I'm not sure this is worth prioritizing now though.
There was a problem hiding this comment.
I thought about changing where services_ensure grabs the lock from a lock to a try_lock, bubbling up the TryLockError as maybe a 429. This means that each client will now have to catch this error and retry on timeout and this new 429. That's not a terrible amount of work but it does mean that all current and future callers will have to do this.
| Ok(result) => result.map_err(|e| Error::from(e))?, | ||
|
|
||
| Err(e) => { | ||
| return Err(HttpError::for_internal_error(e.to_string())); |
There was a problem hiding this comment.
| return Err(HttpError::for_internal_error(e.to_string())); | |
| return Err(HttpError::for_internal_error(&format!("failed to spawn tokio task for services_ensure: {:#}",e)); |
There was a problem hiding this comment.
I don't think the suggested message is right. I don't think tokio::spawn() itself is fallible: if it returns an error, it means the task that was spawned either panicked or was cancelled. I usually .unwrap() tokio::spawn(..).await calls that cannot be canceled (other than by the future calling them itself being canceled, which means you wouldn't get to the unwrap), since it can only panic if another panic has already happened.
There was a problem hiding this comment.
Good call. I didn't read that closely enough. In that case I'd suggest:
return Err(HttpError::for_internal_error(&format!("unexpected failure awaiting \"services_ensure\" task: {:#}",e));I'm not attached to the message. I just don't like assuming that the message provided by the callee is going to be sufficient for someone looking through the logs to figure out what Nexus was trying to do when it ran into this problem.
add context to 500 after the join handle await
dropshot/hyper will cancel an endpoint's task if the call times out or is otherwise cancelled, and this leads to a specific issue where booting all the service zones takes longer than the progenitor client default timeout. The request times out,
services_ensuregets cancelled, and this leaves zones partially configured but not added to the list ofexisting_zones. When the nextPUT /servicescall is made,services_ensurewill eventually try to bring up the same zone it was interrupted at, leading to configuration issues.Fixes #3098