Skip to content

Spawn a task to run services_ensure#3140

Merged
jmpesp merged 3 commits into
oxidecomputer:mainfrom
jmpesp:spawn_task_for_services_ensure
May 19, 2023
Merged

Spawn a task to run services_ensure#3140
jmpesp merged 3 commits into
oxidecomputer:mainfrom
jmpesp:spawn_task_for_services_ensure

Conversation

@jmpesp

@jmpesp jmpesp commented May 17, 2023

Copy link
Copy Markdown
Contributor

dropshot/hyper will cancel an endpoint's task if the call times out or is otherwise cancelled, and this leads to a specific issue where booting all the service zones takes longer than the progenitor client default timeout. The request times out, services_ensure gets cancelled, and this leaves zones partially configured but not added to the list of existing_zones. When the next PUT /services call is made, services_ensure will eventually try to bring up the same zone it was interrupted at, leading to configuration issues.

Fixes #3098

dropshot/hyper will cancel an endpoint's task if the call times out or
is otherwise cancelled, and this leads to a specific issue where booting
all the service zones takes longer than the progenitor client default
timeout. The request times out, `services_ensure` gets cancelled, and
this leaves zones partially configured but not added to the list of
`existing_zones`. When the next `PUT /services` call is made,
`services_ensure` will eventually try to bring up the same zone it was
interrupted at, leading to configuration issues.

Fixes oxidecomputer#3098
@jmpesp jmpesp requested a review from davepacheco May 17, 2023 21:00
Comment thread sled-agent/src/http_entrypoints.rs Outdated
Comment on lines +70 to +78
// Spawn a task to run `services_ensure`: dropshot/hyper will cancel an
// endpoint's task if the call times out or is otherwise cancelled, and this
// leads to a specific issue where booting all the service zones takes
// longer than the progenitor client default timeout. The request times out,
// `services_ensure` gets cancelled, and this leaves zones partially
// configured but not added to the list of `existing_zones`. When the next
// `PUT /services` call is made, `services_ensure` will eventually try to
// bring up the same zone it was interrupted at, leading to configuration
// issues. See: oxidecomputer/omicron#3098.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest summarizing this, maybe something like:

Suggested change
// Spawn a task to run `services_ensure`: dropshot/hyper will cancel an
// endpoint's task if the call times out or is otherwise cancelled, and this
// leads to a specific issue where booting all the service zones takes
// longer than the progenitor client default timeout. The request times out,
// `services_ensure` gets cancelled, and this leaves zones partially
// configured but not added to the list of `existing_zones`. When the next
// `PUT /services` call is made, `services_ensure` will eventually try to
// bring up the same zone it was interrupted at, leading to configuration
// issues. See: oxidecomputer/omicron#3098.
// Spawn a task to run `services_ensure` so that it runs to completion even if
// this future is cancelled (as might happen if the client abandons the request).
// Otherwise, unexpected cancellation of this future could result in leaving the in-
// memory state of this zone invalid.

There's a lot of other detail there that's pretty specific to the bug you saw.

Tangentially: what I understood happened (which could well be wrong!) is that the client timed out its request, the socket got closed, the corresponding Future got dropped, and so it became cancelled. As far as I know, neither hyper nor dropshot has a policy that would cause a request to be timed out.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I trimmed it in fb7b9f7, let me know what you think

// bring up the same zone it was interrupted at, leading to configuration
// issues. See: oxidecomputer/omicron#3098.

match tokio::spawn(async move { sa.services_ensure(body_args).await }).await

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally it would seem good to have a limit to how many of these tasks can be outstanding. I'm not sure this is worth prioritizing now though.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about changing where services_ensure grabs the lock from a lock to a try_lock, bubbling up the TryLockError as maybe a 429. This means that each client will now have to catch this error and retry on timeout and this new 429. That's not a terrible amount of work but it does mean that all current and future callers will have to do this.

Comment thread sled-agent/src/http_entrypoints.rs Outdated
Ok(result) => result.map_err(|e| Error::from(e))?,

Err(e) => {
return Err(HttpError::for_internal_error(e.to_string()));

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return Err(HttpError::for_internal_error(e.to_string()));
return Err(HttpError::for_internal_error(&format!("failed to spawn tokio task for services_ensure: {:#}",e));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the suggested message is right. I don't think tokio::spawn() itself is fallible: if it returns an error, it means the task that was spawned either panicked or was cancelled. I usually .unwrap() tokio::spawn(..).await calls that cannot be canceled (other than by the future calling them itself being canceled, which means you wouldn't get to the unwrap), since it can only panic if another panic has already happened.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. I didn't read that closely enough. In that case I'd suggest:

            return Err(HttpError::for_internal_error(&format!("unexpected failure awaiting \"services_ensure\" task: {:#}",e));

I'm not attached to the message. I just don't like assuming that the message provided by the callee is going to be sufficient for someone looking through the logs to figure out what Nexus was trying to do when it ran into this problem.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added context in fb7b9f7

jmpesp added 2 commits May 18, 2023 16:29
add context to 500 after the join handle await
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

crucible pantry zone got the wrong control vnic

3 participants