"Could not create address: addrconf already in progress" while starting instance caused sled agent to crash #3264

@gjcolombo

Repro steps:

  • Stand up a development cluster, assign it some external IPs, and create a project named myproj
  • In a directory containing the Oxide CLI, create the following instance_json.json:
{
  "description": "description",
  "disks": [],
  "external_ips": [],
  "hostname": "hostname",
  "memory": 1073741824,
  "name": "myinst",
  "ncpus": 2,
  "network_interfaces": {
    "type": "none"
  },
  "start": true,
  "user_data": ""
}
  • Alongside it create the following script:
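#!/bin/bash
# Usage: ./script.sh <instance-name>
# Repeatedly creates, starts, stops, and deletes the named instance until a
# file named "stopfile" appears in the current directory.
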
while :;
do
    if [[ -e stopfile ]];
    then
        break
    fi

    ./oxide instance create --project myproj --json-body ./instance_json.json

    while :;
    do
        STATE=$(./oxide api "/v1/instances/${1}?project=myproj"| jq -r .run_state)
        case "${STATE}" in
            "running")
                break
                ;;

            *)
                sleep 0.1
                ;;
        esac
    done

    ./oxide instance stop --project myproj --instance "${1}"

    while :;
    do
        STATE=$(./oxide api "/v1/instances/${1}?project=myproj"| jq -r .run_state)
        case "${STATE}" in
            "stopped")
                break
                ;;

            *)
                sleep 0.1
                ;;
        esac
    done

    ./oxide instance delete --project myproj --instance "${1}"

    sleep 0.1
done
  • Log into the cluster from the environment that will run instances of the script
  • Run two parallel instances of the script, passing the instance name from the JSON body as the argument: script.sh myinst

Expected: The instance is repeatedly created, started, stopped, and deleted without incident.
Observed: After a few hours, instance creation fails and causes sled agent to crash:

  10:27:07.278Z INFO SledAgent (InstanceManager): Publishing instance state update to Nexus
      instance_id = 6c35c9c3-6a3b-4b3e-a1d9-6cc454fb60e5
      state = InstanceRuntimeState { run_state: Starting, sled_id: 7935aaf5-9466-493b-865e-195124db7a70, propolis_id: 3dd4ffe7-2949-4e17-9301-4fdbd036249e, dst_propolis_id: None, propolis_addr: Some([fd00:1122:3344:101::942]:12400), migration_id: None, propolis_gen: Generation(1), ncpus: InstanceCpuCount(2), memory: ByteCount(1073741824), hostname: "hostname", gen: Generation(2), time_updated: 2023-05-31T10:27:07.278015636Z }
  10:27:07.342Z INFO SledAgent (InstanceManager): Configuring new Omicron zone: oxz_propolis-server_3dd4ffe7-2949-4e17-9301-4fdbd036249e
      instance_id = 6c35c9c3-6a3b-4b3e-a1d9-6cc454fb60e5
  10:27:07.371Z INFO SledAgent (InstanceManager): Installing Omicron zone: oxz_propolis-server_3dd4ffe7-2949-4e17-9301-4fdbd036249e
      instance_id = 6c35c9c3-6a3b-4b3e-a1d9-6cc454fb60e5
  10:27:09.238Z INFO SledAgent (InstanceManager): halt_and_remove_logged: Previous zone state: Running
      instance_id = 49ea6a9a-3003-4e08-9af2-2dd9bfe92586
  10:27:09.260Z INFO SledAgent (InstanceManager): State monitoring task complete
      instance_id = 49ea6a9a-3003-4e08-9af2-2dd9bfe92586
  10:27:13.241Z INFO SledAgent (InstanceManager): Zone booting
      instance_id = 6c35c9c3-6a3b-4b3e-a1d9-6cc454fb60e5
      zone = oxz_propolis-server_3dd4ffe7-2949-4e17-9301-4fdbd036249e
  10:27:20.944Z INFO SledAgent (InstanceManager): Adding address: Static(V6(Ipv6Network { addr: fd00:1122:3344:101::942, prefix: 64 }))
      instance_id = 6c35c9c3-6a3b-4b3e-a1d9-6cc454fb60e5
      zone = oxz_propolis-server_3dd4ffe7-2949-4e17-9301-4fdbd036249e
  10:27:21.061Z INFO SledAgent (InstanceManager): Publishing instance state update to Nexus
      instance_id = 6c35c9c3-6a3b-4b3e-a1d9-6cc454fb60e5
      state = InstanceRuntimeState { run_state: Failed, sled_id: 7935aaf5-9466-493b-865e-195124db7a70, propolis_id: 3dd4ffe7-2949-4e17-9301-4fdbd036249e, dst_propolis_id: None, propolis_addr: Some([fd00:1122:3344:101::942]:12400), migration_id: None, propolis_gen: Generation(1), ncpus: InstanceCpuCount(2), memory: ByteCount(1073741824), hostname: "hostname", gen: Generation(3), time_updated: 2023-05-31T10:27:21.060951893Z }
  10:27:21.137Z INFO SledAgent (dropshot (SledAgent)): request completed
      error_message_external = Internal Server Error
      error_message_internal = Error managing instances: Instance error: Failed to create address Static(V6(Ipv6Network { addr: fd00:1122:3344:101::942, prefix: 64 })) with name oxControlInstance1333/omicron6 in oxz_propolis-server_3dd4ffe7-2949-4e17-9301-4fdbd036249e: Zone execution error: Command [/usr/sbin/zlogin oxz_propolis-server_3dd4ffe7-2949-4e17-9301-4fdbd036249e /usr/sbin/ipadm create-addr -t -T addrconf oxControlInstance1333/ll] executed and failed with status: exit status: 1  stdout: \n  stderr: ipadm: Could not create address: Addrconf already in progress
      local_addr = [fd00:1122:3344:101::1]:12345
      method = PUT
      remote_addr = [fd00:1122:3344:101::4]:39086
      req_id = 509270c7-ae17-4485-92fb-063b70905ba5
      response_code = 500
      uri = /instances/6c35c9c3-6a3b-4b3e-a1d9-6cc454fb60e5/state
  10:27:21.207Z INFO SledAgent (dropshot (SledAgent)): accepted connection
      local_addr = [fd00:1122:3344:101::1]:12345
      remote_addr = [fd00:1122:3344:101::4]:45688
  10:27:21.207Z WARN SledAgent (InstanceManager): Halting and removing zone: oxz_propolis-server_3dd4ffe7-2949-4e17-9301-4fdbd036249e
      instance_id = 6c35c9c3-6a3b-4b3e-a1d9-6cc454fb60e5
  thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: AdmError { op: Uninstall, zone: "oxz_propolis-server_3dd4ffe7-2949-4e17-9301-4fdbd036249e", err: CommandOutput(CommandOutputError("exit code 1\nstdout:\n\nstderr:\nzoneadm: zone 'oxz_propolis-server_3dd4ffe7-2949-4e17-9301-4fdbd036249e': uninstall operation is invalid for shutting_down zones.")) }', sled-agent/src/instance.rs:528:64
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
  [ May 31 10:27:21 Stopping because all processes in service exited. ]
  [ May 31 10:27:21 Executing stop method (:kill). ]
  [ May 31 10:27:21 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/sled-agent/sled-agent run /opt/oxide/sled-agent/pkg/config.toml &"). ]

Seems like there are two problems here:

  • A "Could not create address: addrconf already in progress" error while setting up the Propolis zone
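  • Once the zone fails to start, sled agent marks the instance as Failed and tries to tear down the zone, but zoneadm rejects the uninstall because the zone is still shutting down. Sled agent unwraps that error (sled-agent/src/instance.rs:528), which panics and takes down the entire sled agent.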
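
For the second problem, one possible direction is to propagate the uninstall failure to the caller instead of unwrapping it. The following is only a minimal sketch and not the actual sled agent code: `AdmError` here is a simplified stand-in for the type named in the panic message, and `uninstall_zone`, `halt_and_remove`, and the `shutting_down` flag are hypothetical names that simulate the zoneadm failure seen in the log.

```rust
use std::fmt;

// Simplified stand-in for the error type in the panic message; the real
// type in the sled agent's zone-management code may look different.
#[derive(Debug)]
struct AdmError {
    op: &'static str,
    zone: String,
    err: String,
}

impl fmt::Display for AdmError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "zoneadm {} failed for zone {}: {}", self.op, self.zone, self.err)
    }
}

impl std::error::Error for AdmError {}

// Hypothetical uninstall step. `shutting_down` simulates the zone state from
// the log, in which zoneadm rejects the uninstall operation.
fn uninstall_zone(zone: &str, shutting_down: bool) -> Result<(), AdmError> {
    if shutting_down {
        return Err(AdmError {
            op: "uninstall",
            zone: zone.to_string(),
            err: "uninstall operation is invalid for shutting_down zones".to_string(),
        });
    }
    Ok(())
}

// Hypothetical teardown that returns the failure to its caller with `?`
// rather than calling `unwrap()` on it.
fn halt_and_remove(zone: &str, shutting_down: bool) -> Result<(), AdmError> {
    uninstall_zone(zone, shutting_down)?;
    Ok(())
}

fn main() {
    // The caller decides what to do with the error. Here it is only logged,
    // so one failed zone teardown does not abort the whole process.
    match halt_and_remove("oxz_propolis-server_example", true) {
        Ok(()) => println!("zone removed"),
        Err(e) => eprintln!("WARN: zone teardown failed: {e}"),
    }
}
```

Whether logging is enough, or whether the teardown should be retried once the zone has finished shutting down, is a separate design question that this sketch does not answer.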

Labels: Sled Agent (Related to the Per-Sled Configuration and Management)