I'm continuing to peel the onion on sled agent failures caused by instance API stress and have now gotten to a class of issues like the following:
23:31:42.747Z WARN SledAgent (InstanceManager): Halting and removing zone: oxz_propolis-server_d58859db-b632-4d1a-9479-1a44728a0bf9
instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650
23:31:43.255Z INFO SledAgent (InstanceManager): ensuring instance is registered
instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650
propolis_id = d58859db-b632-4d1a-9479-1a44728a0bf9
23:31:46.012Z INFO SledAgent (InstanceManager): halt_and_remove_logged: Previous zone state: Running
instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650
23:31:46.031Z INFO SledAgent (InstanceManager): State monitoring task complete
instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650
23:31:46.031Z INFO SledAgent (InstanceManager): Publishing instance state update to Nexus
instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650
state = InstanceRuntimeState { run_state: Starting, sled_id: bb07afed-b435-4260-97b3-0e4e32b499b6, propolis_id: d58859db-b632-4d1a-9479-1a44728a0bf9, dst_propolis_id: None, propolis_addr: Some([fd00:1122:3344:101::4e]:12400), migration_id: None, propolis_gen: Generation(2), ncpus: InstanceCpuCount(1), memory: ByteCount(1073741824), hostname: "inst1", gen: Generation(8), time_updated: 2023-06-08T23:31:46.031673699Z }
23:31:46.089Z INFO SledAgent (InstanceManager): Configuring new Omicron zone: oxz_propolis-server_d58859db-b632-4d1a-9479-1a44728a0bf9
instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650
23:31:46.110Z INFO SledAgent (InstanceManager): Installing Omicron zone: oxz_propolis-server_d58859db-b632-4d1a-9479-1a44728a0bf9
instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650
23:31:46.869Z INFO SledAgent (InstanceManager): ensuring instance is registered
instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650
propolis_id = d58859db-b632-4d1a-9479-1a44728a0bf9
23:31:46.869Z INFO SledAgent (InstanceManager): registering new instance
instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650
23:31:46.903Z INFO SledAgent (InstanceManager): Publishing instance state update to Nexus
instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650
state = InstanceRuntimeState { run_state: Starting, sled_id: bb07afed-b435-4260-97b3-0e4e32b499b6, propolis_id: d58859db-b632-4d1a-9479-1a44728a0bf9, dst_propolis_id: None, propolis_addr: Some([fd00:1122:3344:101::4e]:12400), migration_id: None, propolis_gen: Generation(2), ncpus: InstanceCpuCount(1), memory: ByteCount(1073741824), hostname: "inst1", gen: Generation(9), time_updated: 2023-06-08T23:31:46.903342690Z }
23:31:46.960Z INFO SledAgent (InstanceManager): install_omicron_zone: Found zone: oxz_propolis-server_d58859db-b632-4d1a-9479-1a44728a0bf9 in state Incomplete
instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650
23:31:46.960Z INFO SledAgent (InstanceManager): Invalid state; uninstalling and deleting zone oxz_propolis-server_d58859db-b632-4d1a-9479-1a44728a0bf9
instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650
23:31:49.662Z INFO SledAgent (InstanceManager): Adding service
instance_id = 058db778-8513-40ea-af9f-6309f6af8826
smf_name = svc:/system/illumos/propolis-server:vm-f7b82f0f-4356-4431-aa06-1742eb5c90fb
23:31:49.682Z INFO SledAgent (InstanceManager): Adding service property group 'config'
instance_id = 058db778-8513-40ea-af9f-6309f6af8826
23:31:50.949Z INFO SledAgent (InstanceManager): Zone booting
instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650
zone = oxz_propolis-server_d58859db-b632-4d1a-9479-1a44728a0bf9
23:31:50.962Z ERRO SledAgent (InstanceManager): instance setup failed: Err(ZoneBoot(Booting(AdmError { op: Boot, zone: "oxz_propolis-server_d58859db-b632-4d1a-9479-1a44728a0bf9", err: CommandOutput(CommandOutputError("exit code 1\nstdout:\n\nstderr:\nzoneadm: zone 'oxz_propolis-server_d58859db-b632-4d1a-9479-1a44728a0bf9': must be installed before boot.")) })))
In the best case, this leaves users with Failed instances. In the worst case, sled agent might unwrap a zoneadm command it expects to succeed (see e.g. #3264) and knock out an entire sled agent.
It looks to me like there are a couple of races in sled agent that can lead to this sort of problem. One of them is as follows:
1. [...] Instance::terminate, which takes the instance lock and then calls InstanceInner::terminate.
2. [...] Instance that's being torn down), clones a reference to that instance, drops the instance manager lock, and then waits to take the instance lock.
3. InstanceInner::terminate calls instance_ticket.terminate(), which removes the instance from the instance manager.
4. [...] Instance::put_state and takes the instance lock for its generation of the instance. It then calls propolis_ensure, sees that there is no running_zone (it was consumed by T1 in the previous step), and so starts trying to install a new Propolis zone.
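
To make the interleaving easier to follow, here is a minimal, single-threaded replay of it. This is only a sketch: the types, fields, and the `replay_interleaving` and `lookup` helpers are hypothetical stand-ins, not the real sled agent definitions, and "T2" is just a label I'm using here for the second caller that holds the stale reference. It assumes the instance manager is a locked map of reference-counted, individually locked instances.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Hypothetical stand-in for the per-instance state; the real struct has
// many more fields.
struct InstanceInner {
    running_zone: Option<String>,
}

type InstanceHandle = Arc<Mutex<InstanceInner>>;

// Hypothetical stand-in for the instance manager: a locked map of handles.
struct InstanceManager {
    instances: Mutex<HashMap<u32, InstanceHandle>>,
}

impl InstanceManager {
    // Takes the manager lock, clones the handle, and releases the lock.
    fn lookup(&self, id: u32) -> Option<InstanceHandle> {
        self.instances.lock().unwrap().get(&id).cloned()
    }
}

#[derive(Debug, PartialEq)]
enum EnsureOutcome {
    ReusedZone,
    InstalledNewZone,
}

// Simplified model of the check described above: with no running zone,
// the caller goes on to install a new one.
fn propolis_ensure(inner: &mut InstanceInner) -> EnsureOutcome {
    if inner.running_zone.is_some() {
        EnsureOutcome::ReusedZone
    } else {
        inner.running_zone = Some("new zone".to_string());
        EnsureOutcome::InstalledNewZone
    }
}

// Replays the four steps in order on one thread. Returns what T2's ensure
// did and whether the instance is still registered with the manager.
fn replay_interleaving() -> (EnsureOutcome, bool) {
    let id = 1;
    let mut map = HashMap::new();
    let zone = Some("old zone".to_string());
    map.insert(id, Arc::new(Mutex::new(InstanceInner { running_zone: zone })));
    let manager = InstanceManager { instances: Mutex::new(map) };

    // Step 1 (T1): take the instance lock to begin termination.
    let t1_handle = manager.lookup(id).expect("instance is registered");
    let mut t1_guard = t1_handle.lock().unwrap();

    // Step 2 (T2): clone a reference to the same instance. In the real
    // race T2 would now block on the instance lock.
    let t2_handle = manager.lookup(id).expect("instance is registered");

    // Step 3 (T1): consume the zone, deregister, release the instance lock.
    t1_guard.running_zone = None;
    manager.instances.lock().unwrap().remove(&id);
    drop(t1_guard);

    // Step 4 (T2): the stale handle still works, but the zone is gone.
    let mut t2_guard = t2_handle.lock().unwrap();
    let outcome = propolis_ensure(&mut t2_guard);
    drop(t2_guard);

    (outcome, manager.lookup(id).is_some())
}
```

In this model, `replay_interleaving()` returns `(EnsureOutcome::InstalledNewZone, false)`: T2 starts installing a zone for an instance that the manager no longer tracks, because nothing invalidated the handle it cloned before T1 deregistered the instance.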