In #1880, the Sled Agent went through this sequence (logs copied from that ticket -- see the ticket for more context):
03:44:30.195Z INFO SledAgent/RSS: creating new filesystem: DatasetEnsureBody { id: 4d08fc19-3d5f-4f6b-9c48-925f8eac7255, zpool_id: d462a7f7-b628-40fe-80ff-4e4189e2d62b, dataset_kind: CockroachDb { all_addresses: [[fd00:1122:3344:101::2]:32221] }, address: [fd00:1122:3344:101::2]:32221 }
03:44:30.204Z INFO SledAgent/StorageManager: add_dataset: NewFilesystemRequest { zpool_id: d462a7f7-b628-40fe-80ff-4e4189e2d62b, dataset_kind: CockroachDb { all_addresses: [[fd00:1122:3344:101::2]:32221] }, address: [fd00:1122:3344:101::2]:32221, responder: Sender { inner: Some(Inner { state: State { is_complete: false, is_closed: false, is_rx_task_set: true, is_tx_task_set: false } }) } }
03:44:30.207Z INFO SledAgent/StorageManager: Ensuring dataset oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b/cockroachdb exists
03:44:30.307Z INFO SledAgent/StorageManager: Ensuring zone for oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b/cockroachdb is running
03:44:30.345Z INFO SledAgent/StorageManager: Zone for oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b/cockroachdb was not found
03:44:30.419Z INFO SledAgent/StorageManager: Configuring new Omicron zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
03:44:30.490Z INFO SledAgent/StorageManager: Installing Omicron zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
03:44:44.692Z INFO SledAgent/StorageManager: Zone booting
zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
03:44:53.389Z INFO SledAgent/StorageManager: Adding address: Static(V6(Ipv6Network { addr: fd00:1122:3344:101::2, prefix: 64 }))
zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
03:44:54.336Z INFO SledAgent/StorageManager: start_zone: Loading CRDB manifest
03:44:55.060Z INFO SledAgent/StorageManager: start_zone: setting CRDB's config/listen_addr: [fd00:1122:3344:101::2]:32221
03:44:55.170Z INFO SledAgent/StorageManager: start_zone: setting CRDB's config/store
03:44:55.275Z INFO SledAgent/StorageManager: start_zone: setting CRDB's config/join_addrs
03:44:55.384Z INFO SledAgent/StorageManager: start_zone: refreshing manifest
03:44:55.487Z INFO SledAgent/StorageManager: start_zone: enabling CRDB service
03:44:55.577Z INFO SledAgent/StorageManager: start_zone: awaiting liveness of CRDB
03:44:55.599Z WARN SledAgent/StorageManager: cockroachdb not yet alive
03:44:55.955Z WARN SledAgent/StorageManager: cockroachdb not yet alive
03:44:56.596Z WARN SledAgent/StorageManager: cockroachdb not yet alive
03:44:57.223Z WARN SledAgent/StorageManager: cockroachdb not yet alive
03:44:59.661Z WARN SledAgent/StorageManager: cockroachdb not yet alive
03:45:04.015Z INFO SledAgent/StorageManager: CRDB is online
03:45:04.030Z INFO SledAgent/StorageManager: Formatting CRDB
03:45:09.661Z INFO SledAgent/StorageManager: halt_and_remove_logged: Previous zone state: Running
zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
03:45:09.674Z INFO SledAgent/StorageManager: Stopped and uninstalled zone
zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
03:45:09.704Z INFO SledAgent/dropshot (SledAgent): request completed (req_id=b12c1590-f9c9-43e8-a9d0-761ae0b93e0a, uri=/filesystem, method=PUT, remote_addr=[fd00:1122:3344:101::1]:50697, local_addr=[fd00:1122:3344:101::1]:12345, error_message_external="Internal Server Error", response_code=500)
error_message_internal: Error managing storage: Error running command in zone 'oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b': Command [/usr/sbin/zlogin oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b /opt/oxide/cockroachdb/bin/cockroach sql --insecure --host [fd00:1122:3344:101::2]:32221 --file /opt/oxide/cockroachdb/sql/dbwipe.sql] executed and failed with status: exit status: 1 stdout: CREATE DATABASE
CREATE ROLE
stderr: ERROR: job-row-insert: duplicate key value violates unique constraint "primary"
SQLSTATE: 23505
DETAIL: Key (id)=(32769) already exists.
CONSTRAINT: primary
ERROR: job-row-insert: duplicate key value violates unique constraint "primary"
SQLSTATE: 23505
DETAIL: Key (id)=(32769) already exists.
CONSTRAINT: primary
Failed running "sql"
03:45:09.727Z WARN SledAgent/RSS: failed to create filesystem
error: Error Response: status: 500 Internal Server Error; headers: {"content-type": "application/json", "x-request-id": "b12c1590-f9c9-43e8-a9d0-761ae0b93e0a", "content-length": "124", "date": "Sun, 02 Jan 2000 03:45:09 GMT"}; value: Error { error_code: Some("Internal"), message: "Internal Server Error", request_id: "b12c1590-f9c9-43e8-a9d0-761ae0b93e0a" }
Quoting that ticket:
Here, RSS decides to create a dataset for CockroachDB. Sled Agent receives the request and decides to create a zone for it. The zone boots, we wait for CockroachDB to be up, and then we send the "wipe" SQL and get this error.
We go through the whole sequence again starting at 03:45:10.070Z, then 03:45:49.558Z, then 03:46:29.764Z, then 03:47:16.045Z, and so on. I think we're in an exponential backoff. Each time, RSS decides to create a dataset, Sled Agent creates the zone, we get this error from CockroachDB, and then Sled Agent tears down the zone.
I don't think we should retry this error. The software knows it's a bug and not a transient error -- that's why it returned a 500 and not a 503. I think RSS should probably come to rest once this happens with a big red flag (somehow) for the operator saying "we cannot proceed to initialize the rack due to a bug; please contact support" and a message for support that says "we failed to set up CockroachDB because the database reported X".
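To make the proposed policy concrete, here is a minimal sketch of how a caller like RSS might separate transient failures from permanent ones. Everything in it is hypothetical and not taken from the Omicron code: the `RetryDecision` type, the `classify` function, the 250 ms base delay, and the 60 s cap are assumptions for illustration only. The idea is that only a 503 is retried with exponential backoff, while a 500 stops the loop and carries a message for the operator and for support.

```rust
use std::time::Duration;

/// Hypothetical outcome of inspecting a failed request.
#[derive(Debug, PartialEq)]
enum RetryDecision {
    /// Transient failure: try again after the given delay.
    RetryAfter(Duration),
    /// Permanent failure: stop retrying and surface this message.
    GiveUp(String),
}

/// Hypothetical classifier. `attempt` is the number of attempts already made
/// (0 for the first failure); `detail` is the internal error text.
fn classify(status: u16, attempt: u32, detail: &str) -> RetryDecision {
    match status {
        // 503 means "try again later", so back off exponentially.
        503 => {
            let base = Duration::from_millis(250); // assumed base delay
            let cap = Duration::from_secs(60); // assumed upper bound
            let factor = 2u32.saturating_pow(attempt);
            RetryDecision::RetryAfter(base.saturating_mul(factor).min(cap))
        }
        // Anything else, including 500, is treated as a bug or a permanent
        // failure: retrying would only repeat the same error.
        _ => RetryDecision::GiveUp(format!(
            "cannot proceed due to a bug; please contact support \
             (status {status}): {detail}"
        )),
    }
}
```

With this shape, the 500 from `PUT /filesystem` above would yield `GiveUp` on the first attempt, with the CockroachDB error text in the message, instead of another round of creating and tearing down the zone.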