Sled Agent service initialization cleanup#3712
Conversation
smklein
left a comment
There was a problem hiding this comment.
Thanks for this cleanup, I appreciate this a ton. Hopefully this eliminates most developer-local issues, and makes future ones more obvious.
| _ => { | ||
| // Mismatch between SA's view and reality, let's try to | ||
| // clean up any remanants and try initialize it again | ||
| warn!( | ||
| log, | ||
| "expected to find existing zone in running state"; | ||
| "zone" => &name, | ||
| ); | ||
| if let Err(e) = | ||
| existing_zones.remove(&name).unwrap().stop().await | ||
| { | ||
| error!( | ||
| log, | ||
| "Failed to stop zone"; | ||
| "zone" => &name, | ||
| "error" => %e, | ||
| ); | ||
| } |
There was a problem hiding this comment.
Good call to do this cleanup here
I think so. I'm not sure it ever makes sense to have a request timeout when using TCP keep-alive. We removed a similar one in #3503. |
We're inevitably going to timeout this request with the current 60s timeout and subsequent requests won't make progress anyways until the task spawned from earlier ones finishes. Per @davepacheco's [comment](#3712 (comment)) let's just remove the timeout here.
Roughly 3 changes:
zoneadm)service_putwould return on the server-side.Lack of space was causing service initialization to fail on single-machine deployments using the
create_virtual_hardwarescript. (3) should fix this.Because of
ENOSPCerrors, we were running into some future cancellation issues which #3707 addressed. But actually determining the underlying cause was a bit difficult asRSSwas timing out (*) on theservices_putcall and so never got back the error. (2) should hopefully make this easier to catch in the future, e.g.:(*) speaking of which, do we want to increase that timeout from just 60s? it's not like any subsequent requests will work considering they'll be blocked on a lock from the initial request. and in failure cases like this the subsequent requests will timeout as well, leaving a bunch of tasks in sled-agent all waiting on the same lock to try initializing. might be worth switching to the single processing task model rack/sled initialization use.
Finally, (1) is to address the case where we try to initialize a zone that's already running. That's sometimes fine as we'll eventually realize the zone already exists. But in the case of a zone with an OPTE port, we'll run into errors like so:
This happens because the check for "is zone running" comes after we do all the work necessary to create the zone (e.g. creating an opte port). (1) updates
initialize_services_lockedto cross-reference sled-agent and the system's view of what zones are running so that we don't try to do the unnecessary work.