failure to set up CockroachDB should not be retried #1886

@davepacheco

In #1880, the Sled Agent went through this sequence (logs copied from that ticket -- see the ticket for more context):

03:44:30.195Z  INFO SledAgent/RSS: creating new filesystem: DatasetEnsureBody { id: 4d08fc19-3d5f-4f6b-9c48-925f8eac7255, zpool_id: d462a7f7-b628-40fe-80ff-4e4189e2d62b, dataset_kind: CockroachDb { all_addresses: [[fd00:1122:3344:101::2]:32221] }, address: [fd00:1122:3344:101::2]:32221 }
03:44:30.204Z  INFO SledAgent/StorageManager: add_dataset: NewFilesystemRequest { zpool_id: d462a7f7-b628-40fe-80ff-4e4189e2d62b, dataset_kind: CockroachDb { all_addresses: [[fd00:1122:3344:101::2]:32221] }, address: [fd00:1122:3344:101::2]:32221, responder: Sender { inner: Some(Inner { state: State { is_complete: false, is_closed: false, is_rx_task_set: true, is_tx_task_set: false } }) } }
03:44:30.207Z  INFO SledAgent/StorageManager: Ensuring dataset oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b/cockroachdb exists
03:44:30.307Z  INFO SledAgent/StorageManager: Ensuring zone for oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b/cockroachdb is running
03:44:30.345Z  INFO SledAgent/StorageManager: Zone for oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b/cockroachdb was not found
03:44:30.419Z  INFO SledAgent/StorageManager: Configuring new Omicron zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
03:44:30.490Z  INFO SledAgent/StorageManager: Installing Omicron zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
03:44:44.692Z  INFO SledAgent/StorageManager: Zone booting
    zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
03:44:53.389Z  INFO SledAgent/StorageManager: Adding address: Static(V6(Ipv6Network { addr: fd00:1122:3344:101::2, prefix: 64 }))
    zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
03:44:54.336Z  INFO SledAgent/StorageManager: start_zone: Loading CRDB manifest
03:44:55.060Z  INFO SledAgent/StorageManager: start_zone: setting CRDB's config/listen_addr: [fd00:1122:3344:101::2]:32221
03:44:55.170Z  INFO SledAgent/StorageManager: start_zone: setting CRDB's config/store
03:44:55.275Z  INFO SledAgent/StorageManager: start_zone: setting CRDB's config/join_addrs
03:44:55.384Z  INFO SledAgent/StorageManager: start_zone: refreshing manifest
03:44:55.487Z  INFO SledAgent/StorageManager: start_zone: enabling CRDB service
03:44:55.577Z  INFO SledAgent/StorageManager: start_zone: awaiting liveness of CRDB
03:44:55.599Z  WARN SledAgent/StorageManager: cockroachdb not yet alive
03:44:55.955Z  WARN SledAgent/StorageManager: cockroachdb not yet alive
03:44:56.596Z  WARN SledAgent/StorageManager: cockroachdb not yet alive
03:44:57.223Z  WARN SledAgent/StorageManager: cockroachdb not yet alive
03:44:59.661Z  WARN SledAgent/StorageManager: cockroachdb not yet alive
03:45:04.015Z  INFO SledAgent/StorageManager: CRDB is online
03:45:04.030Z  INFO SledAgent/StorageManager: Formatting CRDB
03:45:09.661Z  INFO SledAgent/StorageManager: halt_and_remove_logged: Previous zone state: Running
    zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
03:45:09.674Z  INFO SledAgent/StorageManager: Stopped and uninstalled zone
    zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
03:45:09.704Z  INFO SledAgent/dropshot (SledAgent): request completed (req_id=b12c1590-f9c9-43e8-a9d0-761ae0b93e0a, uri=/filesystem, method=PUT, remote_addr=[fd00:1122:3344:101::1]:50697, local_addr=[fd00:1122:3344:101::1]:12345, error_message_external="Internal Server Error", response_code=500)
    error_message_internal: Error managing storage: Error running command in zone 'oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b': Command [/usr/sbin/zlogin oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b /opt/oxide/cockroachdb/bin/cockroach sql --insecure --host [fd00:1122:3344:101::2]:32221 --file /opt/oxide/cockroachdb/sql/dbwipe.sql] executed and failed with status: exit status: 1  stdout: CREATE DATABASE
    CREATE ROLE
      stderr: ERROR: job-row-insert: duplicate key value violates unique constraint "primary"
    SQLSTATE: 23505
    DETAIL: Key (id)=(32769) already exists.
    CONSTRAINT: primary
    ERROR: job-row-insert: duplicate key value violates unique constraint "primary"
    SQLSTATE: 23505
    DETAIL: Key (id)=(32769) already exists.
    CONSTRAINT: primary
    Failed running "sql"
    
03:45:09.727Z  WARN SledAgent/RSS: failed to create filesystem
    error: Error Response: status: 500 Internal Server Error; headers: {"content-type": "application/json", "x-request-id": "b12c1590-f9c9-43e8-a9d0-761ae0b93e0a", "content-length": "124", "date": "Sun, 02 Jan 2000 03:45:09 GMT"}; value: Error { error_code: Some("Internal"), message: "Internal Server Error", request_id: "b12c1590-f9c9-43e8-a9d0-761ae0b93e0a" }

Quoting that ticket:

Here, RSS decides to create a dataset for CockroachDB. Sled Agent receives the request and decides to create a zone for it. The zone boots, we wait for CockroachDB to be up, and then we send the "wipe" SQL and get this error.

We go through the whole sequence again starting at 03:45:10.070Z, then 03:45:49.558Z, then 03:46:29.764Z, then 03:47:16.045Z, and so on. I think we're in an exponential backoff. Each time, RSS decides to create a dataset, Sled Agent creates the zone, we get this error from CockroachDB, and then Sled Agent tears down the zone.

I don't think we should retry this error. The software knows it's a bug and not a transient error -- that's why it returned a 500 and not a 503. I think RSS should probably come to rest once this happens with a big red flag (somehow) for the operator saying "we cannot proceed to initialize the rack due to a bug; please contact support" and a message for support that says "we failed to set up CockroachDB because the database reported X".
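To make the 500-versus-503 distinction concrete, here is a minimal sketch of a retry loop that backs off only on errors the server marked as transient and comes to rest on anything else. This is not the actual RSS or Sled Agent code: `is_transient`, `Outcome`, `run_with_retry`, the 250 ms starting delay, and the choice to treat only 503 as retryable are all assumptions made for illustration.

```rust
use std::time::Duration;

/// Hypothetical result of driving one setup step to completion.
#[derive(Debug, PartialEq)]
enum Outcome<T> {
    /// The operation succeeded.
    Done(T),
    /// Only transient errors were seen, but the attempt budget ran out.
    GaveUp { status: u16, attempts: u32 },
    /// A non-transient error: stop and surface it to the operator/support.
    Halted { status: u16, message: String },
}

/// Assumption for this sketch: only 503 means "try again later".
/// A 500 means the server hit a bug, so retrying cannot help.
fn is_transient(status: u16) -> bool {
    status == 503
}

/// Runs `op` at least once. Retries with a doubling delay only while the
/// error is transient; returns immediately on any other error.
/// `sleep` is injected so callers (and tests) control how waiting happens.
fn run_with_retry<T>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, (u16, String)>,
    mut sleep: impl FnMut(Duration),
) -> Outcome<T> {
    let mut delay = Duration::from_millis(250);
    let mut attempts = 0;
    loop {
        attempts += 1;
        match op() {
            Ok(value) => return Outcome::Done(value),
            Err((status, message)) => {
                if !is_transient(status) {
                    // Come to rest: keep the message for the support report.
                    return Outcome::Halted { status, message };
                }
                if attempts >= max_attempts {
                    return Outcome::GaveUp { status, attempts };
                }
                sleep(delay);
                delay *= 2;
            }
        }
    }
}
```

With something like this, the sequence in the logs above would end after the first 500 with a `Halted` value carrying the CockroachDB error text, instead of rebuilding and tearing down the zone on every retry. In real code the `sleep` argument would be `std::thread::sleep` or an async equivalent.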

Labels: Sled Agent (Related to the Per-Sled Configuration and Management), bug (Something that isn't working)
