Sled Agent crashed in response to failed instance request #3454

@smklein

The downstairs were wiped out when the sled-agent crashed and restarted:
09:00:01.266Z INFO SledAgent (PortManager): Mapping virtual NIC to physical host
    mapping = SetVirtualNetworkInterfaceHost { virtual_ip: 172.30.0.5, virtual_mac: MacAddr(MacAddr6([168, 64, 37, 243, 152, 219])), physical_host_ip: fd00:1122
:3344:10a::1, vni: Vni(10225803) }
09:00:01.267Z INFO SledAgent (dropshot (SledAgent)): request completed
    local_addr = [fd00:1122:3344:107::1]:12345
    method = PUT
    remote_addr = [fd00:1122:3344:102::4]:43840
    req_id = bcc4fef9-789c-4e73-a1f7-2f249ecde50c
    response_code = 204
    uri = /v2p/b7e955a4-36e3-4d74-ae3b-ad053ae8097b
09:00:01.816Z INFO SledAgent (InstanceManager): Adding address: Static(V6(Ipv6Network { addr: fd00:1122:3344:107::2a, prefix: 64 }))
    instance_id = f1e6ed32-cb42-4b71-a7ac-893ac46467f1
    zone = oxz_propolis-server_d00f74ec-80ea-4419-80e9-ec9b3bbf83f4
09:00:02.006Z ERRO SledAgent (InstanceManager): instance setup failed: Err(ZoneEnsureAddress(EnsureAddressError(EnsureAddressError { zone: "oxz_propolis-server_
d00f74ec-80ea-4419-80e9-ec9b3bbf83f4", request: Static(V6(Ipv6Network { addr: fd00:1122:3344:107::2a, prefix: 64 })), name: AddrObject { interface: "oxControlIn
stance8", name: "omicron6" }, err: Zone execution error: Command [/usr/sbin/zlogin oxz_propolis-server_d00f74ec-80ea-4419-80e9-ec9b3bbf83f4 /usr/sbin/ipadm crea
te-addr -t -T addrconf oxControlInstance8/ll] executed and failed with status: exit status: 1  stdout: 
      stderr: ipadm: Could not create address: Addrconf already in progress
    
    Caused by:
        Command [/usr/sbin/zlogin oxz_propolis-server_d00f74ec-80ea-4419-80e9-ec9b3bbf83f4 /usr/sbin/ipadm create-addr -t -T addrconf oxControlInstance8/ll] exe
cuted and failed with status: exit status: 1  stdout: 
          stderr: ipadm: Could not create address: Addrconf already in progress })))
    instance_id = f1e6ed32-cb42-4b71-a7ac-893ac46467f1
09:00:02.007Z INFO SledAgent (InstanceManager): Publishing instance state update to Nexus
    instance_id = f1e6ed32-cb42-4b71-a7ac-893ac46467f1
    state = InstanceRuntimeState { run_state: Failed, sled_id: 7230a95e-44ac-42ef-8dbd-1183d39193c7, propolis_id: d00f74ec-80ea-4419-80e9-ec9b3bbf83f4, dst_prop
olis_id: None, propolis_addr: Some([fd00:1122:3344:107::2a]:12400), migration_id: None, propolis_gen: Generation(1), ncpus: InstanceCpuCount(4), memory: ByteCou
nt(2147483648), hostname: "web-instance-2", gen: Generation(3), time_updated: 2023-06-29T09:00:02.006892393Z }
09:00:02.052Z INFO SledAgent (dropshot (SledAgent)): request completed
    error_message_external = Internal Server Error
    error_message_internal = Failed to create address Static(V6(Ipv6Network { addr: fd00:1122:3344:107::2a, prefix: 64 })) with name oxControlInstance8/omicron6
 in oxz_propolis-server_d00f74ec-80ea-4419-80e9-ec9b3bbf83f4: Zone execution error: Command [/usr/sbin/zlogin oxz_propolis-server_d00f74ec-80ea-4419-80e9-ec9b3b
bf83f4 /usr/sbin/ipadm create-addr -t -T addrconf oxControlInstance8/ll] executed and failed with status: exit status: 1  stdout: \n  stderr: ipadm: Could not c
reate address: Addrconf already in progress
    local_addr = [fd00:1122:3344:107::1]:12345
    method = PUT
    remote_addr = [fd00:1122:3344:102::4]:38064
    req_id = 6206c886-bea7-4e1f-8126-54c12ea873e0
    response_code = 500
    uri = /instances/f1e6ed32-cb42-4b71-a7ac-893ac46467f1/state
09:00:02.111Z INFO SledAgent (dropshot (SledAgent)): accepted connection
    local_addr = [fd00:1122:3344:107::1]:12345
    remote_addr = [fd00:1122:3344:102::4]:41840
09:00:02.111Z WARN SledAgent (InstanceManager): Halting and removing zone: oxz_propolis-server_d00f74ec-80ea-4419-80e9-ec9b3bbf83f4
    instance_id = f1e6ed32-cb42-4b71-a7ac-893ac46467f1
thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: AdmError { op: Uninstall, zone: "oxz_propolis-server_d00f74ec-80ea-4419-
80e9-ec9b3bbf83f4", err: CommandOutput(CommandOutputError("exit code 1\nstdout:\n\nstderr:\nzoneadm: zone 'oxz_propolis-server_d00f74ec-80ea-4419-80e9-ec9b3bbf8
3f4': uninstall operation is invalid for shutting_down zones.")) }', sled-agent/src/instance.rs:535:64
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[ Jun 29 09:00:10 Stopping because all processes in service exited. ]
[ Jun 29 09:00:10 Executing stop method (:kill). ]
[ Jun 29 09:00:10 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/sled-agent/sled-agent run /opt/oxide/sled-agent/pkg/config.toml &"). ]
[ Jun 29 09:00:10 Method "start" exited with status 0. ]
note: configured to log to "/dev/stdout"
09:00:12.732Z INFO SledAgent: Starting mg-ddm service
09:00:12.798Z INFO SledAgent: Importing mg-ddm service
    path = /opt/oxide/mg-ddm/pkg/ddm/manifest.xml
09:00:13.023Z INFO SledAgent: Setting mg-ddm interfaces
    interfaces = ("cxgbe0/ll" "cxgbe1/ll")
09:00:13.044Z INFO SledAgent: Enabling mg-ddm service
09:00:13.070Z INFO SledAgent: setting up bootstrap agent server
09:00:13.166Z INFO SledAgent: Ensuring zfs key directory exists
    path = /var/run/oxide/
09:00:13.582Z INFO SledAgent: Sending prefix to ddmd for advertisement
    DdmAdminClient = [::1]:8000
    prefix = Ipv6Prefix { addr: fdb0:a840:2504:3d5::, len: 64 }
09:00:13.688Z WARN SledAgent: Deleting existing zone
    zone_name = oxz_ntp
09:00:13.703Z WARN SledAgent: Deleting existing zone
    zone_name = oxz_crucible_oxp_bd5d7d9f-58ca-4350-9083-6a92a6155a65
09:00:13.714Z WARN SledAgent: Deleting existing zone
    zone_name = oxz_crucible_oxp_47d274ce-f4cb-4bc8-990a-b1460bd918c6
09:00:13.741Z WARN SledAgent: Deleting existing zone
    zone_name = oxz_crucible_oxp_0cf8b90b-1143-4119-9012-1188c92036f2
09:00:13.756Z WARN SledAgent: Deleting existing zone
    zone_name = oxz_crucible_oxp_7c1992a0-3f17-4672-b141-61ccab131c16
09:00:13.773Z WARN SledAgent: Deleting existing zone
    zone_name = oxz_crucible_oxp_b155e4f4-facd-4a7b-a464-b965fc8e8cf5
09:00:13.787Z WARN SledAgent: Deleting existing zone
...

Aside from the delete-all-zones behavior (which is already being worked on), we probably also need to deal with the sled-agent crashing in the face of an incompatible-state error: "uninstall operation is invalid for shutting_down zones". The panic comes from a `Result::unwrap()` on the uninstall error at sled-agent/src/instance.rs:535:64. The error handling there can be less catastrophic.

Originally posted by @askfongjojo in #3451 (comment)
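
To make "less catastrophic" concrete, here is a minimal sketch of one possible shape for the fix: wait for the zone to leave `shutting_down` before uninstalling, and return an error to the caller instead of unwrapping. This is not the actual sled-agent code. `ZoneState`, `RemoveError`, the `ZoneAdm` trait, `remove_zone`, and `FakeAdm` are all hypothetical names invented for illustration, and the real fix may look quite different.

```rust
use std::fmt;

/// Hypothetical zone states; only the two relevant to this issue are modeled.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ZoneState {
    ShuttingDown,
    Installed,
}

/// Hypothetical error type standing in for the `AdmError` in the panic message.
#[derive(Debug, PartialEq, Eq)]
pub enum RemoveError {
    /// The zone never left `ShuttingDown` within the allowed number of polls.
    StillShuttingDown { zone: String, attempts: u32 },
    /// The uninstall command itself failed.
    Uninstall { zone: String, message: String },
}

impl fmt::Display for RemoveError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            RemoveError::StillShuttingDown { zone, attempts } => {
                write!(f, "zone {zone} still shutting down after {attempts} polls")
            }
            RemoveError::Uninstall { zone, message } => {
                write!(f, "failed to uninstall zone {zone}: {message}")
            }
        }
    }
}

impl std::error::Error for RemoveError {}

/// Assumed minimal interface to the zone tooling (hypothetical).
pub trait ZoneAdm {
    fn state(&mut self, zone: &str) -> ZoneState;
    fn uninstall(&mut self, zone: &str) -> Result<(), String>;
}

/// Uninstall `zone` once it is no longer shutting down. Every failure is
/// returned to the caller; nothing here panics.
pub fn remove_zone<A: ZoneAdm>(
    adm: &mut A,
    zone: &str,
    max_attempts: u32,
) -> Result<(), RemoveError> {
    for _ in 0..max_attempts {
        if adm.state(zone) != ZoneState::ShuttingDown {
            return adm.uninstall(zone).map_err(|message| RemoveError::Uninstall {
                zone: zone.to_string(),
                message,
            });
        }
        // A real implementation would sleep (or await a timer) between polls.
    }
    Err(RemoveError::StillShuttingDown {
        zone: zone.to_string(),
        attempts: max_attempts,
    })
}

/// Test double: reports `ShuttingDown` for a fixed number of polls, then `Installed`.
pub struct FakeAdm {
    pub polls_until_halted: u32,
    pub uninstalled: bool,
}

impl ZoneAdm for FakeAdm {
    fn state(&mut self, _zone: &str) -> ZoneState {
        if self.polls_until_halted > 0 {
            self.polls_until_halted -= 1;
            ZoneState::ShuttingDown
        } else {
            ZoneState::Installed
        }
    }

    fn uninstall(&mut self, _zone: &str) -> Result<(), String> {
        self.uninstalled = true;
        Ok(())
    }
}
```

With something along these lines, the caller could log the `RemoveError` and leave the instance marked as failed, rather than taking down the whole sled-agent process.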

Labels: Sled Agent (Related to the Per-Sled Configuration and Management)