instance ticket teardown seems racy & might be causing unexpected concurrent operations on Propolis zones #3325

@gjcolombo

I'm continuing to peel the onion on sled agent failures caused by instance API stress and have now gotten to a class of issues like the following:

23:31:42.747Z WARN SledAgent (InstanceManager): Halting and removing zone: oxz_propolis-server_d58859db-b632-4d1a-9479-1a44728a0bf9                                                                                                                                                                                          
    instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650                                                                                                                                                                                                                                                                       
23:31:43.255Z INFO SledAgent (InstanceManager): ensuring instance is registered                                                                                                                                                                                                                                              
    instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650                                                                                                                                                                                                                                                                       
    propolis_id = d58859db-b632-4d1a-9479-1a44728a0bf9                                                                                                                                                                                                                                                                       
23:31:46.012Z INFO SledAgent (InstanceManager): halt_and_remove_logged: Previous zone state: Running                                                                                                                                                                                                                         
    instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650                                                                                                                                                                                                                                                                       
23:31:46.031Z INFO SledAgent (InstanceManager): State monitoring task complete                                                                                                                                                                                                                                               
    instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650                                                                                                                                                                                                                                                                       
23:31:46.031Z INFO SledAgent (InstanceManager): Publishing instance state update to Nexus                                                                                                                                                                                                                                    
    instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650                                                                                                                                                                                                                                                                       
    state = InstanceRuntimeState { run_state: Starting, sled_id: bb07afed-b435-4260-97b3-0e4e32b499b6, propolis_id: d58859db-b632-4d1a-9479-1a44728a0bf9, dst_propolis_id: None, propolis_addr: Some([fd00:1122:3344:101::4e]:12400), migration_id: None, propolis_gen: Generation(2), ncpus: InstanceCpuCount(1), memory: ByteCount(1073741824), hostname: "inst1", gen: Generation(8), time_updated: 2023-06-08T23:31:46.031673699Z }
23:31:46.089Z INFO SledAgent (InstanceManager): Configuring new Omicron zone: oxz_propolis-server_d58859db-b632-4d1a-9479-1a44728a0bf9                                                                                                                                                                                       
    instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650                                                                                                                                                                                                                                                                       
23:31:46.110Z INFO SledAgent (InstanceManager): Installing Omicron zone: oxz_propolis-server_d58859db-b632-4d1a-9479-1a44728a0bf9                                                                                                                                                                                            
    instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650                                                                                                                                                                                                                                                                       
23:31:46.869Z INFO SledAgent (InstanceManager): ensuring instance is registered
    instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650
    propolis_id = d58859db-b632-4d1a-9479-1a44728a0bf9
23:31:46.869Z INFO SledAgent (InstanceManager): registering new instance
    instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650
23:31:46.903Z INFO SledAgent (InstanceManager): Publishing instance state update to Nexus
    instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650
    state = InstanceRuntimeState { run_state: Starting, sled_id: bb07afed-b435-4260-97b3-0e4e32b499b6, propolis_id: d58859db-b632-4d1a-9479-1a44728a0bf9, dst_propolis_id: None, propolis_addr: Some([fd00:1122:3344:101::4e]:12400), migration_id: None, propolis_gen: Generation(2), ncpus: InstanceCpuCount(1), memory: ByteCount(1073741824), hostname: "inst1", gen: Generation(9), time_updated: 2023-06-08T23:31:46.903342690Z }
23:31:46.960Z INFO SledAgent (InstanceManager): install_omicron_zone: Found zone: oxz_propolis-server_d58859db-b632-4d1a-9479-1a44728a0bf9 in state Incomplete
    instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650
23:31:46.960Z INFO SledAgent (InstanceManager): Invalid state; uninstalling and deleting zone oxz_propolis-server_d58859db-b632-4d1a-9479-1a44728a0bf9
    instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650
23:31:49.662Z INFO SledAgent (InstanceManager): Adding service
    instance_id = 058db778-8513-40ea-af9f-6309f6af8826
    smf_name = svc:/system/illumos/propolis-server:vm-f7b82f0f-4356-4431-aa06-1742eb5c90fb
23:31:49.682Z INFO SledAgent (InstanceManager): Adding service property group 'config'
    instance_id = 058db778-8513-40ea-af9f-6309f6af8826
23:31:50.949Z INFO SledAgent (InstanceManager): Zone booting
    instance_id = f8a059c0-f6f9-499d-b452-8358c4d72650
    zone = oxz_propolis-server_d58859db-b632-4d1a-9479-1a44728a0bf9
23:31:50.962Z ERRO SledAgent (InstanceManager): instance setup failed: Err(ZoneBoot(Booting(AdmError { op: Boot, zone: "oxz_propolis-server_d58859db-b632-4d1a-9479-1a44728a0bf9", err: CommandOutput(CommandOutputError("exit code 1\nstdout:\n\nstderr:\nzoneadm: zone 'oxz_propolis-server_d58859db-b632-4d1a-9479-1a44728a0bf9': must be installed before boot.")) })))

In the best case, this leaves users with Failed instances. In the worst case, sled agent might unwrap the result of a zoneadm command it expects to succeed (see e.g. #3264) and panic, knocking out the entire sled agent.

It looks to me like there are a couple of races in sled agent that can lead to this sort of problem. One of them is as follows:

  • An instance starts and then stops.
  • Thread T1: The instance's Propolis VMM shuts down; state monitor calls Instance::terminate, which takes the instance lock and then calls InstanceInner::terminate.
  • T2: In parallel a call to ensure the instance is running arrives. This call takes the instance manager lock, finds that the instance is registered on the target sled (in the guise of the Instance that's being torn down), clones a reference to that instance, drops the instance manager lock, and then waits to take the instance lock.
  • T1: InstanceInner::terminate calls instance_ticket.terminate(), which removes the instance from the instance manager.
  • T3: A second parallel call to ensure the instance is running also arrives. This does not find the instance in the instance manager and so creates a brand new generation of it.
  • T1 finishes cleaning up the unused instance, drops the instance lock, and exits.
  • T2 enters Instance::put_state and takes the instance lock for its generation of the instance. It then calls propolis_ensure, sees that there is no running_zone (it was consumed by T1 in the previous step), and so starts trying to install a new Propolis zone.
  • T3 does the same thing for its generation of the instance. Note that this is not excluded by the lock that T2 holds because they have distinct Instance structures!
  • One of T2 or T3 gets upset because the zone already exists/is in the wrong state.
  • Hilarity ensues.
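
The interleaving above can be replayed deterministically in a small model. This is only a sketch: `Manager`, `Instance`, `InstanceState`, `ensure_registered`, `remove`, and `replay_race` are hypothetical stand-ins invented for illustration, not the real sled agent types, and the three "threads" are simulated by running their steps in the order listed rather than by spawning anything.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Hypothetical stand-ins for illustration; not the real sled agent types.
type InstanceId = u32;

#[derive(Default)]
struct InstanceState {
    running_zone: Option<String>,
}

struct Instance {
    // The per-instance lock.
    inner: Mutex<InstanceState>,
}

#[derive(Default)]
struct Manager {
    // The instance manager lock guards only this map.
    instances: Mutex<HashMap<InstanceId, Arc<Instance>>>,
}

impl Manager {
    /// Returns the registered instance, or registers a fresh one.
    /// The bool reports whether a new generation was created.
    fn ensure_registered(&self, id: InstanceId) -> (Arc<Instance>, bool) {
        let mut map = self.instances.lock().unwrap();
        if let Some(existing) = map.get(&id) {
            return (Arc::clone(existing), false);
        }
        let fresh = Arc::new(Instance {
            inner: Mutex::new(InstanceState::default()),
        });
        map.insert(id, Arc::clone(&fresh));
        (fresh, true)
    }

    /// Models the ticket teardown: drops the manager's entry for `id`.
    fn remove(&self, id: InstanceId) {
        self.instances.lock().unwrap().remove(&id);
    }
}

/// Replays the T1/T2/T3 steps in order and returns
/// (T2 created?, T3 created?, same Instance?, zone installs attempted).
fn replay_race() -> (bool, bool, bool, usize) {
    let mgr = Manager::default();
    let id = 1;

    // The instance starts and gets a running zone.
    let (gen1, _) = mgr.ensure_registered(id);
    gen1.inner.lock().unwrap().running_zone = Some("zone".to_string());

    // T1: takes the instance lock to terminate.
    let mut t1_guard = gen1.inner.lock().unwrap();
    // T2: finds the old instance in the manager and clones a reference
    // (in the real race it would now block on the instance lock).
    let (t2_instance, t2_created) = mgr.ensure_registered(id);
    // T1: ticket teardown removes the instance from the manager.
    mgr.remove(id);
    // T3: finds nothing, so it creates a new generation.
    let (t3_instance, t3_created) = mgr.ensure_registered(id);
    // T1: consumes the running zone and releases the instance lock.
    t1_guard.running_zone.take();
    drop(t1_guard);

    // T2 and T3 hold their instance locks at the same time, because the
    // locks belong to two distinct Instance structures.
    let t2_guard = t2_instance.inner.lock().unwrap();
    let t3_guard = t3_instance.inner.lock().unwrap();
    let mut installs = 0;
    if t2_guard.running_zone.is_none() {
        installs += 1; // T2 would try to install the zone.
    }
    if t3_guard.running_zone.is_none() {
        installs += 1; // T3 would also try to install the same zone.
    }
    (t2_created, t3_created, Arc::ptr_eq(&t2_instance, &t3_instance), installs)
}
```

In this model, `replay_race()` returns `(false, true, false, 2)`: T2 reuses the instance being torn down, T3 creates a new one, the two references point at different structures, and both see no running zone, so both would attempt an install under the same zone name.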

Labels: Sled Agent (Related to the Per-Sled Configuration and Management)
