Repro steps:
- Stand up a development cluster, assign it some external IPs, and create a project named myproj
- In a directory containing the Oxide CLI, create the following instance_json.json:
{
  "description": "description",
  "disks": [],
  "external_ips": [],
  "hostname": "hostname",
  "memory": 1073741824,
  "name": "myinst",
  "ncpus": 2,
  "network_interfaces": {
    "type": "none"
  },
  "start": true,
  "user_data": ""
}
- Alongside it, create the following script, script.sh:
#!/bin/bash
while :;
do
    if [[ -e stopfile ]];
    then
        break
    fi
    ./oxide instance create --project myproj --json-body ./instance_json.json
    while :;
    do
        STATE=$(./oxide api "/v1/instances/${1}?project=myproj"| jq -r .run_state)
        case "${STATE}" in
            "running")
                break
                ;;
            *)
                sleep 0.1
                ;;
        esac
    done
    ./oxide instance stop --project myproj --instance "${1}"
    while :;
    do
        STATE=$(./oxide api "/v1/instances/${1}?project=myproj"| jq -r .run_state)
        case "${STATE}" in
            "stopped")
                break
                ;;
            *)
                sleep 0.1
                ;;
        esac
    done
    ./oxide instance delete --project myproj --instance "${1}"
    sleep 0.1
done
- Log into the cluster from the environment that will run instances of the script
- Run two parallel instances of the script with the name of the instance from the JSON body:
script.sh myinst
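One way to launch both copies from a single shell is sketched below. The `run_parallel` helper is a hypothetical name introduced for illustration and is not part of the original report; it starts N background copies of a command and waits for all of them.

```shell
#!/bin/bash
# run_parallel COUNT CMD...
# Hypothetical helper: start COUNT background copies of CMD, wait for all
# of them, and return the last non-zero exit status seen (0 if none failed).
run_parallel() {
    local count="$1"
    shift
    local pids=""
    local i=0
    while [ "$i" -lt "$count" ]; do
        "$@" &
        pids="$pids $!"
        i=$((i + 1))
    done
    local status=0
    local pid
    for pid in $pids; do
        wait "$pid" || status=$?
    done
    return "$status"
}
```

With this in place, `run_parallel 2 ./script.sh myinst` starts the two copies. Creating the file `stopfile` in the same directory (for example with `touch stopfile` from another shell) makes each copy exit at the top of its next iteration, as the script already checks for it.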
Expected: The instance is repeatedly created, started, stopped, and deleted without incident.
Observed: After a few hours, instance creation fails and causes sled agent to crash:
10:27:07.278Z INFO SledAgent (InstanceManager): Publishing instance state update to Nexus
instance_id = 6c35c9c3-6a3b-4b3e-a1d9-6cc454fb60e5
state = InstanceRuntimeState { run_state: Starting, sled_id: 7935aaf5-9466-493b-865e-195124db7a70, propolis_id: 3dd4ffe7-2949-4e17-9301-4fdbd036249e, dst_propolis_id: None, propolis_addr: Some([fd00:1122:3344:101::942]:12400), migration_id: None, propolis_gen: Generation(1), ncpus: InstanceCpuCount(2), memory: ByteCount(1073741824), hostname: "hostname", gen: Generation(2), time_updated: 2023-05-31T10:27:07.278015636Z }
10:27:07.342Z INFO SledAgent (InstanceManager): Configuring new Omicron zone: oxz_propolis-server_3dd4ffe7-2949-4e17-9301-4fdbd036249e
instance_id = 6c35c9c3-6a3b-4b3e-a1d9-6cc454fb60e5
10:27:07.371Z INFO SledAgent (InstanceManager): Installing Omicron zone: oxz_propolis-server_3dd4ffe7-2949-4e17-9301-4fdbd036249e
instance_id = 6c35c9c3-6a3b-4b3e-a1d9-6cc454fb60e5
10:27:09.238Z INFO SledAgent (InstanceManager): halt_and_remove_logged: Previous zone state: Running
instance_id = 49ea6a9a-3003-4e08-9af2-2dd9bfe92586
10:27:09.260Z INFO SledAgent (InstanceManager): State monitoring task complete
instance_id = 49ea6a9a-3003-4e08-9af2-2dd9bfe92586
10:27:13.241Z INFO SledAgent (InstanceManager): Zone booting
instance_id = 6c35c9c3-6a3b-4b3e-a1d9-6cc454fb60e5
zone = oxz_propolis-server_3dd4ffe7-2949-4e17-9301-4fdbd036249e
10:27:20.944Z INFO SledAgent (InstanceManager): Adding address: Static(V6(Ipv6Network { addr: fd00:1122:3344:101::942, prefix: 64 }))
instance_id = 6c35c9c3-6a3b-4b3e-a1d9-6cc454fb60e5
zone = oxz_propolis-server_3dd4ffe7-2949-4e17-9301-4fdbd036249e
10:27:21.061Z INFO SledAgent (InstanceManager): Publishing instance state update to Nexus
instance_id = 6c35c9c3-6a3b-4b3e-a1d9-6cc454fb60e5
state = InstanceRuntimeState { run_state: Failed, sled_id: 7935aaf5-9466-493b-865e-195124db7a70, propolis_id: 3dd4ffe7-2949-4e17-9301-4fdbd036249e, dst_propolis_id: None, propolis_addr: Some([fd00:1122:3344:101::942]:12400), migration_id: None, propolis_gen: Generation(1), ncpus: InstanceCpuCount(2), memory: ByteCount(1073741824), hostname: "hostname", gen: Generation(3), time_updated: 2023-05-31T10:27:21.060951893Z }
10:27:21.137Z INFO SledAgent (dropshot (SledAgent)): request completed
error_message_external = Internal Server Error
error_message_internal = Error managing instances: Instance error: Failed to create address Static(V6(Ipv6Network { addr: fd00:1122:3344:101::942, prefix: 64 })) with name oxControlInstance1333/omicron6 in oxz_propolis-server_3dd4ffe7-2949-4e17-9301-4fdbd036249e: Zone execution error: Command [/usr/sbin/zlogin oxz_propolis-server_3dd4ffe7-2949-4e17-9301-4fdbd036249e /usr/sbin/ipadm create-addr -t -T addrconf oxControlInstance1333/ll] executed and failed with status: exit status: 1 stdout: \n stderr: ipadm: Could not create address: Addrconf already in progress
local_addr = [fd00:1122:3344:101::1]:12345
method = PUT
remote_addr = [fd00:1122:3344:101::4]:39086
req_id = 509270c7-ae17-4485-92fb-063b70905ba5
response_code = 500
uri = /instances/6c35c9c3-6a3b-4b3e-a1d9-6cc454fb60e5/state
10:27:21.207Z INFO SledAgent (dropshot (SledAgent)): accepted connection
local_addr = [fd00:1122:3344:101::1]:12345
remote_addr = [fd00:1122:3344:101::4]:45688
10:27:21.207Z WARN SledAgent (InstanceManager): Halting and removing zone: oxz_propolis-server_3dd4ffe7-2949-4e17-9301-4fdbd036249e
instance_id = 6c35c9c3-6a3b-4b3e-a1d9-6cc454fb60e5
thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: AdmError { op: Uninstall, zone: "oxz_propolis-server_3dd4ffe7-2949-4e17-9301-4fdbd036249e", err: CommandOutput(CommandOutputError("exit code 1\nstdout:\n\nstderr:\nzoneadm: zone 'oxz_propolis-server_3dd4ffe7-2949-4e17-9301-4fdbd036249e': uninstall operation is invalid for shutting_down zones.")) }', sled-agent/src/instance.rs:528:64
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[ May 31 10:27:21 Stopping because all processes in service exited. ]
[ May 31 10:27:21 Executing stop method (:kill). ]
[ May 31 10:27:21 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/sled-agent/sled-agent run /opt/oxide/sled-agent/pkg/config.toml &"). ]
Seems like there are two problems here:
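- ipadm fails with "Could not create address: Addrconf already in progress" while sled agent is setting up the Propolis zone.
- Once the zone fails to start, sled agent marks the instance as Failed and tries to tear down the zone. The zoneadm uninstall call fails because the zone is still shutting_down, and sled agent unwraps that error (sled-agent/src/instance.rs:528), so the resulting panic takes down the entire sled agent.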
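A side note on the repro script: once the instance is marked Failed, the script's first polling loop never sees "running" and spins indefinitely, so the script itself will not report the failure. A minimal sketch of a bounded polling helper follows. It assumes the API reports the failed state as the string "failed" in run_state, which the report does not confirm. `wait_for_state` and `get_run_state` are hypothetical names; `get_run_state` only wraps the query the original script already issues.

```shell
#!/bin/bash
# wait_for_state TARGET MAX_TRIES CMD...
# Hypothetical helper: run CMD repeatedly until it prints TARGET.
# Returns 0 when TARGET is seen, 1 if CMD prints "failed" (assumed to be
# how a failed instance is reported), and 2 after MAX_TRIES polls.
wait_for_state() {
    local target="$1"
    local max_tries="$2"
    shift 2
    local state=""
    local tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        state="$("$@")"
        case "$state" in
            "$target") return 0 ;;
            failed) return 1 ;;
        esac
        tries=$((tries + 1))
        sleep 0.1
    done
    return 2
}

# Same query as the original script, wrapped so it can be passed as CMD.
get_run_state() {
    ./oxide api "/v1/instances/${1}?project=myproj" | jq -r .run_state
}
```

In the script, each inner loop could then become a call such as `wait_for_state running 600 get_run_state "${1}" || exit 1`, where the 600-poll limit (roughly one minute at 0.1 s per poll, plus command time) is an arbitrary illustrative value.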