instance ticket teardown seems racy & might be causing unexpected concurrent operations on Propolis zones #3325

I'm continuing to peel the onion on sled agent failures caused by instance API stress and have now gotten to a class of issues like the following:

```
23:31:42.747Z WARN SledAgent (InstanceManager): Halting and removing zone: oxz_propolis-server_d58859db-b632-4d1a-9479-1a44728a0bf9
    instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650
23:31:43.255Z INFO SledAgent (InstanceManager): ensuring instance is registered
    instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650
    propolis_id = d58859db-b632-4d1a-9479-1a44728a0bf9
23:31:46.012Z INFO SledAgent (InstanceManager): halt_and_remove_logged: Previous zone state: Running
    instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650
23:31:46.031Z INFO SledAgent (InstanceManager): State monitoring task complete
    instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650
23:31:46.031Z INFO SledAgent (InstanceManager): Publishing instance state update to Nexus
    instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650
    state = InstanceRuntimeState { run_state: Starting, sled_id: bb07afed-b435-4260-97b3-0e4e32b499b6, propolis_id: d58859db-b632-4d1a-9479-1a44728a0bf9, dst_propolis_id: None, propolis_addr: Some([fd00:1122:3344:101::4e]:12400), migration_id: None, propolis_gen: Generation(2), ncpus: InstanceCpuCount(1), memory: ByteCount(1073741824), hostname: "inst1", gen: Generation(8), time_updated: 2023-06-08T23:31:46.031673699Z }
23:31:46.089Z INFO SledAgent (InstanceManager): Configuring new Omicron zone: oxz_propolis-server_d58859db-b632-4d1a-9479-1a44728a0bf9
    instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650
23:31:46.110Z INFO SledAgent (InstanceManager): Installing Omicron zone: oxz_propolis-server_d58859db-b632-4d1a-9479-1a44728a0bf9
    instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650
23:31:46.869Z INFO SledAgent (InstanceManager): ensuring instance is registered
    instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650
    propolis_id = d58859db-b632-4d1a-9479-1a44728a0bf9
23:31:46.869Z INFO SledAgent (InstanceManager): registering new instance
    instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650
23:31:46.903Z INFO SledAgent (InstanceManager): Publishing instance state update to Nexus
    instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650
    state = InstanceRuntimeState { run_state: Starting, sled_id: bb07afed-b435-4260-97b3-0e4e32b499b6, propolis_id: d58859db-b632-4d1a-9479-1a44728a0bf9, dst_propolis_id: None, propolis_addr: Some([fd00:1122:3344:101::4e]:12400), migration_id: None, propolis_gen: Generation(2), ncpus: InstanceCpuCount(1), memory: ByteCount(1073741824), hostname: "inst1", gen: Generation(9), time_updated: 2023-06-08T23:31:46.903342690Z }
23:31:46.960Z INFO SledAgent (InstanceManager): install_omicron_zone: Found zone: oxz_propolis-server_d58859db-b632-4d1a-9479-1a44728a0bf9 in state Incomplete
    instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650
23:31:46.960Z INFO SledAgent (InstanceManager): Invalid state; uninstalling and deleting zone oxz_propolis-server_d58859db-b632-4d1a-9479-1a44728a0bf9
    instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650
23:31:49.662Z INFO SledAgent (InstanceManager): Adding service
    instance_id = 058db778-8513-40ea-af9f-6309f6af8826
    smf_name = svc:/system/illumos/propolis-server:vm-f7b82f0f-4356-4431-aa06-1742eb5c90fb
23:31:49.682Z INFO SledAgent (InstanceManager): Adding service property group 'config'
    instance_id = 058db778-8513-40ea-af9f-6309f6af8826
23:31:50.949Z INFO SledAgent (InstanceManager): Zone booting
    instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650
    zone = oxz_propolis-server_d58859db-b632-4d1a-9479-1a44728a0bf9
23:31:50.962Z ERRO SledAgent (InstanceManager): instance setup failed: Err(ZoneBoot(Booting(AdmError { op: Boot, zone: "oxz_propolis-server_d58859db-b632-4d1a-9479-1a44728a0bf9", err: CommandOutput(CommandOutputError("exit code 1\nstdout:\n\nstderr:\nzoneadm: zone 'oxz_propolis-server_d58859db-b632-4d1a-9479-1a44728a0bf9': must be installed before boot.")) })))
```

In the best case, this leaves users with Failed instances. In the worst case, sled agent might `unwrap` the result of a `zoneadm` command it expects to succeed (see e.g. #3264), panic, and take down the entire sled agent.

It looks to me like there are a couple of races in sled agent that can lead to this sort of problem. One of them is as follows:

- An instance starts and then stops.
- T1: The instance's Propolis VMM shuts down; the state monitor calls `Instance::terminate`, which takes the instance lock and then calls `InstanceInner::terminate`.
- T2: In parallel, a call to ensure the instance is running arrives. This call takes the instance manager lock, finds that the instance is registered on the target sled (in the guise of the `Instance` that's being torn down), clones a reference to that instance, drops the instance manager lock, and then waits to take the instance lock.
- T1: `InstanceInner::terminate` calls `instance_ticket.terminate()`, which removes the instance from the instance manager.
- T3: A second parallel call to ensure the instance is running also arrives. This does *not* find the instance in the instance manager and so creates a brand new generation of it.
- T1 finishes cleaning up the unused instance, drops the instance lock, and exits.
- T2 enters `Instance::put_state` and takes the instance lock for its generation of the instance. It then calls `propolis_ensure`, sees that there is no `running_zone` (it was consumed by T1 in the previous step), and so starts trying to install a new Propolis zone.
- T3 does the same thing for its generation of the instance. Note that this is not excluded by the lock that T2 holds, because T2 and T3 have distinct `Instance` structures!
- One of T2 or T3 gets upset because the zone already exists/is in the wrong state.
- Hilarity ensues.
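
To make the interleaving easier to reason about, here is a minimal, single-threaded replay of the steps above. It is a sketch under stated assumptions, not sled agent's actual code: `Instance`, `InstanceInner`, `Manager`, `Zones`, `ensure_registered`, `install_zone`, and `replay_race` are all hypothetical stand-ins, the zone name is made up, and the host's zone table is modeled as a plain set of names.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::{Arc, Mutex};

// Hypothetical stand-in for the per-instance state behind the instance lock.
struct InstanceInner {
    running_zone: Option<String>,
}

// Hypothetical stand-in for `Instance`: a cloneable handle around a lock.
#[derive(Clone)]
struct Instance {
    inner: Arc<Mutex<InstanceInner>>,
}

impl Instance {
    fn new() -> Instance {
        Instance { inner: Arc::new(Mutex::new(InstanceInner { running_zone: None })) }
    }
}

// Instance manager model: instance ID -> registered `Instance`.
type Manager = Mutex<HashMap<u32, Instance>>;
// Host zone table model, shared by every generation of an instance.
type Zones = Mutex<HashSet<String>>;

// "Ensure registered": reuse the registered instance, or create a new generation.
fn ensure_registered(mgr: &Manager, id: u32) -> Instance {
    let mut map = mgr.lock().unwrap();
    map.entry(id).or_insert_with(Instance::new).clone()
}

// Zone install step: only this generation's own lock guards the attempt.
fn install_zone(zones: &Zones, inst: &Instance, name: &str) -> Result<(), String> {
    let mut inner = inst.inner.lock().unwrap();
    if inner.running_zone.is_some() {
        return Ok(()); // this generation already has a zone
    }
    let mut table = zones.lock().unwrap();
    if !table.insert(name.to_string()) {
        return Err(format!("zone {} already exists", name));
    }
    inner.running_zone = Some(name.to_string());
    Ok(())
}

// Replays the interleaving; returns whether T2 and T3 share an instance,
// followed by each one's install result.
fn replay_race() -> (bool, Result<(), String>, Result<(), String>) {
    let mgr: Manager = Mutex::new(HashMap::new());
    let zones: Zones = Mutex::new(HashSet::new());
    let (id, zone) = (1, "oxz_propolis-server_example");

    // The instance starts: it is registered and has a running zone.
    let old = ensure_registered(&mgr, id);
    install_zone(&zones, &old, zone).unwrap();

    // T1: teardown begins and takes the instance lock.
    let mut t1_guard = old.inner.lock().unwrap();
    // T2: finds the old instance in the manager and clones a handle to it.
    let t2 = ensure_registered(&mgr, id);
    // T1: the ticket teardown removes the instance from the manager.
    mgr.lock().unwrap().remove(&id);
    // T3: finds nothing registered, so it creates a new generation.
    let t3 = ensure_registered(&mgr, id);
    // T1: consumes the running zone, removes it, and drops the instance lock.
    if let Some(name) = t1_guard.running_zone.take() {
        zones.lock().unwrap().remove(&name);
    }
    drop(t1_guard);

    // T2 and T3 each lock their own generation, so neither excludes the other.
    let same_instance = Arc::ptr_eq(&t2.inner, &t3.inner);
    let r2 = install_zone(&zones, &t2, zone);
    let r3 = install_zone(&zones, &t3, zone);
    (same_instance, r2, r3)
}

fn main() {
    let (same_instance, r2, r3) = replay_race();
    println!("T2 and T3 share an Instance: {}", same_instance);
    println!("T2 install: {:?}", r2);
    println!("T3 install: {:?}", r3);
}
```

In this model T2 and T3 end up holding different `Instance` values, both see no `running_zone`, and whichever installs second hits a zone that already exists. The real failure mode in the log is messier (the zone is found `Incomplete` and deleted out from under the other caller), but the missing mutual exclusion is the same.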
