Spawn a task to run services_ensure by jmpesp · Pull Request #3140 · oxidecomputer/omicron

jmpesp · 2023-05-17T21:00:42Z

dropshot/hyper will cancel an endpoint's task if the call times out or is otherwise cancelled. This causes a specific problem when booting all the service zones takes longer than the progenitor client's default timeout: the request times out, `services_ensure` gets cancelled, and zones are left partially configured but not added to the list of `existing_zones`. When the next `PUT /services` call is made, `services_ensure` will eventually try to bring up the same zone it was interrupted at, leading to configuration issues.

Fixes #3098


davepacheco · 2023-05-17T21:36:11Z

+    // Spawn a task to run `services_ensure`: dropshot/hyper will cancel an
+    // endpoint's task if the call times out or is otherwise cancelled, and this
+    // leads to a specific issue where booting all the service zones takes
+    // longer than the progenitor client default timeout. The request times out,
+    // `services_ensure` gets cancelled, and this leaves zones partially
+    // configured but not added to the list of `existing_zones`. When the next
+    // `PUT /services` call is made, `services_ensure` will eventually try to
+    // bring up the same zone it was interrupted at, leading to configuration
+    // issues. See: oxidecomputer/omicron#3098.


I'd suggest summarizing this, maybe something like:

Suggested change

-    // Spawn a task to run `services_ensure`: dropshot/hyper will cancel an
-    // endpoint's task if the call times out or is otherwise cancelled, and this
-    // leads to a specific issue where booting all the service zones takes
-    // longer than the progenitor client default timeout. The request times out,
-    // `services_ensure` gets cancelled, and this leaves zones partially
-    // configured but not added to the list of `existing_zones`. When the next
-    // `PUT /services` call is made, `services_ensure` will eventually try to
-    // bring up the same zone it was interrupted at, leading to configuration
-    // issues. See: oxidecomputer/omicron#3098.
+    // Spawn a task to run `services_ensure` so that it runs to completion even if
+    // this future is cancelled (as might happen if the client abandons the request).
+    // Otherwise, unexpected cancellation of this future could result in leaving the in-
+    // memory state of this zone invalid.

There's a lot of other detail there that's pretty specific to the bug you saw.

Tangentially: what I understood happened (which could well be wrong!) is that the client timed out its request, the socket got closed, the corresponding Future got dropped, and so it became cancelled. As far as I know, neither hyper nor dropshot has a policy that would cause a request to be timed out.
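
The "future got dropped, so it became cancelled" behaviour can be reproduced without hyper or dropshot. What follows is a minimal sketch using only the standard library. `ensure_zone`, `YieldOnce`, `NoopWake`, and `poll_once_then_drop` are hypothetical names invented for this illustration and are not taken from omicron. The sketch polls a future once and then drops it, which leaves a zone "configured" but never recorded as existing.

```rust
use std::cell::RefCell;
use std::future::Future;
use std::pin::{pin, Pin};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Waker that does nothing; sufficient because the future is polled by hand.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Future that returns `Pending` on its first poll, standing in for a slow
/// `.await` point such as booting a zone.
struct YieldOnce(bool);

impl Future for YieldOnce {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            Poll::Pending
        }
    }
}

/// Hypothetical stand-in for `services_ensure`: configure a zone, wait, then
/// record it as existing.
async fn ensure_zone(configured: &RefCell<Vec<String>>, existing: &RefCell<Vec<String>>) {
    configured.borrow_mut().push("zone-a".to_string());
    YieldOnce(false).await; // the future may be dropped while suspended here
    existing.borrow_mut().push("zone-a".to_string());
}

/// Polls the future once and then drops it, as happens when a request is
/// abandoned. Returns (configured count, existing count).
pub fn poll_once_then_drop() -> (usize, usize) {
    let configured = RefCell::new(Vec::new());
    let existing = RefCell::new(Vec::new());
    {
        let waker = Waker::from(Arc::new(NoopWake));
        let mut cx = Context::from_waker(&waker);
        let mut fut = pin!(ensure_zone(&configured, &existing));
        assert!(fut.as_mut().poll(&mut cx).is_pending());
        // `fut` is dropped at the end of this scope without being polled again.
    }
    let counts = (configured.borrow().len(), existing.borrow().len());
    counts
}
```

Under these assumptions `poll_once_then_drop()` returns `(1, 0)`: the code after the `.await` never runs. Moving the work into a spawned task, as this PR does, means that dropping the caller's future no longer drops the work itself.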

jmpesp · 2023-05-18

I trimmed it in fb7b9f7. Let me know what you think.

davepacheco · 2023-05-17T21:37:23Z

+    // bring up the same zone it was interrupted at, leading to configuration
+    // issues. See: oxidecomputer/omicron#3098.
+
+    match tokio::spawn(async move { sa.services_ensure(body_args).await }).await


Ideally it would seem good to have a limit to how many of these tasks can be outstanding. I'm not sure this is worth prioritizing now though.

jmpesp · 2023-05-18

I thought about changing where `services_ensure` grabs the lock from a `lock` to a `try_lock`, bubbling up the `TryLockError` as maybe a 429. Each client would then have to catch this error and retry on both a timeout and the new 429. That's not a terrible amount of work, but it does mean that all current and future callers would have to do this.
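
As a rough illustration of the `try_lock` idea, the sketch below uses `std::sync::Mutex` rather than whatever lock `services_ensure` actually takes; that substitution is an assumption made to keep the example self-contained. `Zones`, `try_ensure`, and the bare `u16` status codes are hypothetical, and a real endpoint would return an `HttpError` instead.

```rust
use std::sync::{Mutex, TryLockError};

/// Hypothetical stand-in for the state guarded by the lock.
pub struct Zones {
    pub existing: Vec<String>,
}

/// Tries to record a zone without waiting for the lock.
/// On failure, returns an HTTP-style status code (an assumed mapping).
pub fn try_ensure(zones: &Mutex<Zones>, name: &str) -> Result<(), u16> {
    match zones.try_lock() {
        Ok(mut guard) => {
            // Idempotent: only add the zone if it is not already recorded.
            if !guard.existing.iter().any(|z| z == name) {
                guard.existing.push(name.to_string());
            }
            Ok(())
        }
        // Another request holds the lock: ask the caller to retry later.
        Err(TryLockError::WouldBlock) => Err(429),
        // A previous holder panicked while holding the lock.
        Err(TryLockError::Poisoned(_)) => Err(500),
    }
}
```

The cost raised in the comment shows up at every call site: each caller has to treat 429 as "retry later" in addition to handling timeouts.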

davepacheco · 2023-05-17T21:38:33Z

+        Ok(result) => result.map_err(|e| Error::from(e))?,
+
+        Err(e) => {
+            return Err(HttpError::for_internal_error(e.to_string()));


Suggested change

-            return Err(HttpError::for_internal_error(e.to_string()));
+            return Err(HttpError::for_internal_error(format!("failed to spawn tokio task for services_ensure: {:#}", e)));

jgallagher · 2023-05-17

I don't think the suggested message is right. I don't think `tokio::spawn()` itself is fallible: if awaiting it returns an error, it means the task that was spawned either panicked or was cancelled. I usually `.unwrap()` `tokio::spawn(..).await` calls that cannot be cancelled (other than by the future calling them itself being cancelled, which means you wouldn't get to the unwrap), since it can only panic if another panic has already happened.

davepacheco · 2023-05-17

Good call. I didn't read that closely enough. In that case I'd suggest:

return Err(HttpError::for_internal_error(format!("unexpected failure awaiting \"services_ensure\" task: {:#}", e)));

I'm not attached to the message. I just don't like assuming that the message provided by the callee is going to be sufficient for someone looking through the logs to figure out what Nexus was trying to do when it ran into this problem.
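
The same reasoning can be sketched with standard-library threads, used here only as a stand-in for tokio tasks: `JoinHandle::join` returns `Err` only when the spawned thread panicked, so the error message should say what was being awaited rather than claim the spawn failed. `run_with_context` is a hypothetical helper, and returning a plain `String` instead of an `HttpError` is an assumption made to keep the example self-contained.

```rust
use std::thread;

/// Runs `work` on another thread and, if joining fails, wraps the panic
/// message with context about what the caller was waiting for.
pub fn run_with_context<F>(what: &str, work: F) -> Result<u32, String>
where
    F: FnOnce() -> u32 + Send + 'static,
{
    thread::spawn(work).join().map_err(|payload| {
        // Panic payloads are usually a `&str` or a `String`.
        let detail = payload
            .downcast_ref::<&str>()
            .copied()
            .or_else(|| payload.downcast_ref::<String>().map(String::as_str))
            .unwrap_or("unknown panic");
        format!("unexpected failure awaiting \"{what}\" task: {detail}")
    })
}
```

With this shape, a log reader sees both the operation name and the underlying failure, which is the point of the suggested message.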

jmpesp · 2023-05-18

Added context in fb7b9f7.


jmpesp requested a review from davepacheco May 17, 2023 21:00

davepacheco reviewed May 17, 2023


jmpesp added 2 commits May 18, 2023 16:29

shorten comment above tokio::spawn

fb7b9f7

add context to 500 after the join handle await

fmt

6afcb7e

davepacheco approved these changes May 19, 2023


jmpesp merged commit 5280a10 into oxidecomputer:main May 19, 2023

jmpesp deleted the spawn_task_for_services_ensure branch May 19, 2023 13:53

jgallagher mentioned this pull request Jun 9, 2023

Add server-side configuration to tokio::spawn endpoint futures oxidecomputer/dropshot#695

Closed

leftwo mentioned this pull request Jun 13, 2023

Audit crucible for holding mutex across await and other cancelation complications oxidecomputer/crucible#798

Open

jmpesp merged 3 commits into oxidecomputer:main from jmpesp:spawn_task_for_services_ensure
