Panic in the Upstairs leaves an instance in a zombie coma #1652

@leftwo

While testing core files for this issue, I deliberately made the crucible upstairs panic. After the panic, I did get a core file, and I can see that the propolis-server service restarted:

Aug 29 17:15:26.270 INFO accepted connection, remote_addr: [fd00:1122:3344:101::1]:63263, local_addr: [fd00:1122:3344:101::c]:12400
Aug 29 17:15:26.271 INFO request completed, response_code: 101, uri: /instance/serial, method: GET, req_id: a759e184-bbb3-4243-b7d9-0626bac14639, remote_addr: [fd00:1122:3344:101::1]:63263, local_addr: [fd00:1122:3344:101::c]:12400
Aug 29 17:15:55.387 INFO rdmsr, msr: 3221291675, sync_task: vcpu-1, component: vmm
Aug 29 17:15:55.387 INFO rdmsr, msr: 3221291673, sync_task: vcpu-1, component: vmm
Scrub at offset 335616/4194304 sp:335616
thread 'tokio-runtime-worker' panicked at 'We are going to panic now!', /home/alan/ws/crucible/upstairs/src/volume.rs:402:21
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[ Aug 29 10:17:39 Stopping because all processes in service exited. ]
[ Aug 29 10:17:39 Executing stop method (:kill). ]
[ Aug 29 10:17:39 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/propolis-server/bin/propolis-server run /var/svc/manifest/site/propolis-server/config.toml [fd00:1122:3344:101::c]:12400 --metric-addr [fd00:1122:3344:101::3]:12221 &"). ]
[ Aug 29 10:17:39 Method "start" exited with status 0. ]
Aug 29 17:17:39.784 ERRO could not query reservoir Os { code: 1, kind: PermissionDenied, message: "Not owner" }
Aug 29 17:17:39.784 INFO Metrics server will use InstanceMetricsConfig { propolis_addr: [fd00:1122:3344:101::c]:12400, metric_addr: [fd00:1122:3344:101::3]:12221 }
Aug 29 17:17:39.784 INFO Starting server...
Aug 29 17:17:39.785 INFO listening, local_addr: [fd00:1122:3344:101::c]:12400

However, my instance did not come back. The console and API think it's running:

alan@atrium:omicron-files$ oxide instance view -o myorg -p myproj debian
 time_run_state_updated | 3 minutes ago                        
 time_modified          | 3 minutes ago                        
 time_created           | 3 minutes ago                        
 run_state              | running                              
 project_id             | 637caf31-5b5b-41d7-ab16-6175dd1b98a5 
 ncpus                  | 2                                    
 memory                 | 1073741824                           
 hostname               | debian                               
 description            | debian                               
 name                   | debian                               
 id                     | f71d0d33-7b1c-47dd-888f-dcf3aa5b5b85 

I attempted to stop it, but the stop request failed:

alan@atrium:omicron-files$ oxide instance stop -o myorg -p myproj debian
Type debian to confirm stop:: debian
✘ Oxide API internal error: Internal Server Error

The propolis log reports:

Aug 29 17:20:25.096 INFO request completed, error_message_external: Internal Server Error, error_message_internal: Server not initialized (no instance), response_code: 500, uri: /instance/state, method: PUT, req_id: 05b58734-2f88-4864-b279-123d33ffc83f, remote_addr: [fd00:1122:3344:101::1]:53026, local_addr: [fd00:1122:3344:101::c]:12400
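
To make the mismatch concrete, here is a minimal sketch of what the log line above suggests: the restarted server process has no instance, so it rejects the state change, while the control plane still records the instance as running. The `ServerState` enum, the `handle_stop` function, and the 204 success code are hypothetical names and values chosen for illustration. They are not the real propolis-server types or API. Only the 500 code and the error message are taken from the log.

```rust
/// Hypothetical model of what the server process knows about its VM.
/// These are illustrative names, not the real propolis-server types.
#[derive(Debug, PartialEq)]
enum ServerState {
    /// Freshly (re)started process: no instance has been created in it.
    Uninitialized,
    /// An instance exists and is running.
    Running,
    /// An instance exists and has been stopped.
    Stopped,
}

/// Hypothetical handler for a "stop" request.
/// Returns an HTTP-like status code on success (204 is an assumption),
/// or a status code plus an internal message on failure.
fn handle_stop(state: &mut ServerState) -> Result<u16, (u16, &'static str)> {
    if *state == ServerState::Uninitialized {
        // Matches the response_code and error_message_internal in the log.
        return Err((500, "Server not initialized (no instance)"));
    }
    *state = ServerState::Stopped;
    Ok(204)
}
```

In this model, calling `handle_stop` on a `ServerState::Uninitialized` value returns the 500 error and leaves the state untouched, so a caller that still believes the instance is running has no way to move it to stopped.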

A second attempt to stop it left the command hanging (I gave up after 10 minutes):

alan@atrium:omicron-files$ oxide instance stop --confirm -o myorg -p myproj debian
⠁ Waiting for instance status to be `stopped`
