bad time installing Omicron over Omicron on Gimlets #1880

This ticket describes several issues that a bunch of us (@smklein, @leftwo, @jmpesp, @bnaecker, @Nieuwejaar, and I) debugged today.  We'll file separate tickets for the bugs and enhancements that come out of it, and then I expect we'll close this one.  I just wanted a place to record the whole activity.  **I'm sure I've got some of the details here wrong -- please correct me!**  I didn't take good enough notes.  On the plus side, there's a video recording for most of this process.

**Summary:** From initial conditions of a working Omicron on gimlet-sn05, @Nieuwejaar reinstalled Omicron.  Omicron failed to come up because it failed to wipe the database.  The proximate cause looks to us like a CockroachDB bug, which [I've filed](https://github.com/cockroachdb/cockroach/issues/90803).  There were a few follow-on bugs that happened after this.  We also identified some enhancements to make this easier to debug in the future.

## Initial symptoms

Nils reported that after having Omicron running on gimlet-sn05, he tried to install Omicron again and it failed to come up.  Sled Agent logged:

```json
{"msg":"request completed","v":0,"name":"SledAgent","level":30,"time":"2000-01-02T03:10:13.792417073Z","hostname":"gimlet-sn05","pid":133494,"uri":"/filesystem","method":"PUT","req_id":"20dcbaaf-bf96-4e8c-9e83-5d759e2268e8","remote_addr":"[fd00:1122:3344:101::1]:53770","local_addr":"[fd00:1122:3344:101::1]:12345","component":"dropshot (SledAgent)","error_message_external":"Internal Server Error","error_message_internal":"Error managing storage: Error running command in zone 'oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b': Command [/usr/sbin/zlogin oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b /opt/oxide/cockroachdb/bin/cockroach sql --insecure --host [fd00:1122:3344:101::2]:32221 --file /opt/oxide/cockroachdb/sql/dbwipe.sql] executed and failed with status: exit status: 1  stdout: CREATE DATABASE\nCREATE ROLE\n  stderr: ERROR: job-row-insert: duplicate key value violates unique constraint \"primary\"\nSQLSTATE: 23505\nDETAIL: Key (id)=(32769) already exists.\nCONSTRAINT: primary\nERROR: job-row-insert: duplicate key value violates unique constraint \"primary\"\nSQLSTATE: 23505\nDETAIL: Key (id)=(32769) already exists.\nCONSTRAINT: primary\nFailed running \"sql\"\n","response_code":"500"}
```

formatted:

```
[2000-01-02T03:10:13.792417073Z]  INFO: SledAgent/dropshot (SledAgent)/133494 on gimlet-sn05: request completed (req_id=20dcbaaf-bf96-4e8c-9e83-5d759e2268e8, uri=/filesystem, method=PUT, remote_addr=[fd00:1122:3344:101::1]:53770, local_addr=[fd00:1122:3344:101::1]:12345, error_message_external="Internal Server Error", response_code=500)
    error_message_internal: Error managing storage: Error running command in zone 'oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b': Command [/usr/sbin/zlogin oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b /opt/oxide/cockroachdb/bin/cockroach sql --insecure --host [fd00:1122:3344:101::2]:32221 --file /opt/oxide/cockroachdb/sql/dbwipe.sql] executed and failed with status: exit status: 1  stdout: CREATE DATABASE
    CREATE ROLE
      stderr: ERROR: job-row-insert: duplicate key value violates unique constraint "primary"
    SQLSTATE: 23505
    DETAIL: Key (id)=(32769) already exists.
    CONSTRAINT: primary
    ERROR: job-row-insert: duplicate key value violates unique constraint "primary"
    SQLSTATE: 23505
    DETAIL: Key (id)=(32769) already exists.
    CONSTRAINT: primary
    Failed running "sql"
```
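For anyone without `bunyan` handy, here is a minimal Python sketch of pulling the interesting fields out of one of these JSON records.  The `summarize` helper and the sample record are illustrative assumptions: the sample reuses field values from the entry above but keeps only a trimmed fragment of `error_message_internal`, and the output only roughly imitates the formatted view.

```python
import json

# Standard Bunyan numeric log levels.
BUNYAN_LEVELS = {10: "TRACE", 20: "DEBUG", 30: "INFO", 40: "WARN", 50: "ERROR", 60: "FATAL"}


def summarize(line: str) -> str:
    """Render one Bunyan JSON record as a header plus an indented internal error."""
    rec = json.loads(line)
    level = BUNYAN_LEVELS.get(rec["level"], str(rec["level"]))
    header = f'[{rec["time"]}] {level}: {rec["name"]} on {rec["hostname"]}: {rec["msg"]}'
    detail = rec.get("error_message_internal")
    if detail is None:
        return header
    # The internal message embeds newlines; indent each line under the header.
    indented = "\n".join("    " + part for part in detail.split("\n"))
    return header + "\n" + indented


# Trimmed sample record (illustration only; not the full log entry).
sample = json.dumps({
    "msg": "request completed",
    "name": "SledAgent",
    "level": 30,
    "time": "2000-01-02T03:10:13.792417073Z",
    "hostname": "gimlet-sn05",
    "error_message_internal": (
        'stderr: ERROR: job-row-insert: duplicate key value violates unique constraint "primary"\n'
        "SQLSTATE: 23505\n"
        "DETAIL: Key (id)=(32769) already exists."
    ),
})

print(summarize(sample))
```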

## Initial investigation: RSS/Sled Agent stuck in a loop

This command is part of the [dbwipe.sql](https://github.com/oxidecomputer/omicron/blob/b4366f1561f710653e4759685632ead91311a8e7/common/src/sql/dbwipe.sql) script that we run while setting up a zone.  It's surprising that this would ever fail, as it's intended to be run against an arbitrary database that may or may not have been set up by Omicron already.  I believe we even have tests that it's idempotent regardless of whether the database has been set up already.  I used `omicron-dev db-run` and `omicron-dev db-wipe` to set up a transient CockroachDB cluster and verify that I could wipe it several times (without repopulating it).

It's not clear to me how the error message above has anything to do with the SQL we were trying to run.  It appears to be complaining about some internal id, not something that was in the query.

Upon logging into gimlet-sn05, we found that there was no CockroachDB zone present.  This is where my summary here diverges from the sequence we actually took.  It will be clearer to present a summary that we put together from the chat log and sled agent log.  I've filtered the raw sled agent log with:

```
grep -v 'failed to notify nexus' sled-agent.log | grep -v 'contacting server nexus' | grep -v 'failed to notify nexus about datasets'
```

to eliminate a bunch of messages irrelevant to this issue.  My excerpts below are formatted with `bunyan -o short`.

Nils reported the problem around 8:10am PT, which I calculate to be about 03:27Z on the test system.  Then he did an "uninstall" and "install" around 8:27 PT, or 03:44Z on this system.  We see the latter in the sled agent log.  The former must have been rotated to a previous log file.  That's okay because we really only care about what happened after this second event.  Sure enough, we have:

```
[ Jan  2 03:42:33 Enabled. ]
[ Jan  2 03:42:33 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/sled-agent/sled-agent run /opt/oxide/sled-agent/pkg/config.toml &"). ]
[ Jan  2 03:42:33 Method "start" exited with status 0. ]
```

As Omicron starts up, it eventually goes to set up CockroachDB:

```
03:44:30.195Z  INFO SledAgent/RSS: creating new filesystem: DatasetEnsureBody { id: 4d08fc19-3d5f-4f6b-9c48-925f8eac7255, zpool_id: d462a7f7-b628-40fe-80ff-4e4189e2d62b, dataset_kind: CockroachDb { all_addresses: [[fd00:1122:3344:101::2]:32221] }, address: [fd00:1122:3344:101::2]:32221 }
03:44:30.204Z  INFO SledAgent/StorageManager: add_dataset: NewFilesystemRequest { zpool_id: d462a7f7-b628-40fe-80ff-4e4189e2d62b, dataset_kind: CockroachDb { all_addresses: [[fd00:1122:3344:101::2]:32221] }, address: [fd00:1122:3344:101::2]:32221, responder: Sender { inner: Some(Inner { state: State { is_complete: false, is_closed: false, is_rx_task_set: true, is_tx_task_set: false } }) } }
03:44:30.207Z  INFO SledAgent/StorageManager: Ensuring dataset oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b/cockroachdb exists
03:44:30.307Z  INFO SledAgent/StorageManager: Ensuring zone for oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b/cockroachdb is running
03:44:30.345Z  INFO SledAgent/StorageManager: Zone for oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b/cockroachdb was not found
03:44:30.419Z  INFO SledAgent/StorageManager: Configuring new Omicron zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
03:44:30.490Z  INFO SledAgent/StorageManager: Installing Omicron zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
03:44:44.692Z  INFO SledAgent/StorageManager: Zone booting
    zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
03:44:53.389Z  INFO SledAgent/StorageManager: Adding address: Static(V6(Ipv6Network { addr: fd00:1122:3344:101::2, prefix: 64 }))
    zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
03:44:54.336Z  INFO SledAgent/StorageManager: start_zone: Loading CRDB manifest
03:44:55.060Z  INFO SledAgent/StorageManager: start_zone: setting CRDB's config/listen_addr: [fd00:1122:3344:101::2]:32221
03:44:55.170Z  INFO SledAgent/StorageManager: start_zone: setting CRDB's config/store
03:44:55.275Z  INFO SledAgent/StorageManager: start_zone: setting CRDB's config/join_addrs
03:44:55.384Z  INFO SledAgent/StorageManager: start_zone: refreshing manifest
03:44:55.487Z  INFO SledAgent/StorageManager: start_zone: enabling CRDB service
03:44:55.577Z  INFO SledAgent/StorageManager: start_zone: awaiting liveness of CRDB
03:44:55.599Z  WARN SledAgent/StorageManager: cockroachdb not yet alive
03:44:55.955Z  WARN SledAgent/StorageManager: cockroachdb not yet alive
03:44:56.596Z  WARN SledAgent/StorageManager: cockroachdb not yet alive
03:44:57.223Z  WARN SledAgent/StorageManager: cockroachdb not yet alive
03:44:59.661Z  WARN SledAgent/StorageManager: cockroachdb not yet alive
03:45:04.015Z  INFO SledAgent/StorageManager: CRDB is online
03:45:04.030Z  INFO SledAgent/StorageManager: Formatting CRDB
03:45:09.661Z  INFO SledAgent/StorageManager: halt_and_remove_logged: Previous zone state: Running
    zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
03:45:09.674Z  INFO SledAgent/StorageManager: Stopped and uninstalled zone
    zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
03:45:09.704Z  INFO SledAgent/dropshot (SledAgent): request completed (req_id=b12c1590-f9c9-43e8-a9d0-761ae0b93e0a, uri=/filesystem, method=PUT, remote_addr=[fd00:1122:3344:101::1]:50697, local_addr=[fd00:1122:3344:101::1]:12345, error_message_external="Internal Server Error", response_code=500)
    error_message_internal: Error managing storage: Error running command in zone 'oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b': Command [/usr/sbin/zlogin oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b /opt/oxide/cockroachdb/bin/cockroach sql --insecure --host [fd00:1122:3344:101::2]:32221 --file /opt/oxide/cockroachdb/sql/dbwipe.sql] executed and failed with status: exit status: 1  stdout: CREATE DATABASE
    CREATE ROLE
      stderr: ERROR: job-row-insert: duplicate key value violates unique constraint "primary"
    SQLSTATE: 23505
    DETAIL: Key (id)=(32769) already exists.
    CONSTRAINT: primary
    ERROR: job-row-insert: duplicate key value violates unique constraint "primary"
    SQLSTATE: 23505
    DETAIL: Key (id)=(32769) already exists.
    CONSTRAINT: primary
    Failed running "sql"
    
03:45:09.727Z  WARN SledAgent/RSS: failed to create filesystem
    error: Error Response: status: 500 Internal Server Error; headers: {"content-type": "application/json", "x-request-id": "b12c1590-f9c9-43e8-a9d0-761ae0b93e0a", "content-length": "124", "date": "Sun, 02 Jan 2000 03:45:09 GMT"}; value: Error { error_code: Some("Internal"), message: "Internal Server Error", request_id: "b12c1590-f9c9-43e8-a9d0-761ae0b93e0a" }
```

Here, RSS decides to create a dataset for CockroachDB.  Sled Agent receives the request and decides to create a zone for it.  The zone boots, we wait for CockroachDB to be up, and then we send the "wipe" SQL and get this error.  

We go through the whole sequence again starting at 03:45:10.070Z, then 03:45:49.558Z, then 03:46:29.764Z, then 03:47:16.045Z, and so on.  I think we're in an exponential backoff.  Each time, RSS decides to create a dataset, Sled Agent creates the zone, we get this error from CockroachDB, _and then Sled Agent tears down the zone_.  This explains why when we got there, there was no zone present.
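As a rough check on the backoff guess, the gaps between attempt start times can be computed directly.  This is a minimal sketch: the first timestamp is the RSS "creating new filesystem" line from the excerpt above and the rest are the start times just listed.  Each gap also includes the time the attempt itself takes (about 39 seconds in the excerpt, from 03:44:30.195Z to 03:45:09.727Z), so the gaps alone don't isolate the retry delay.

```python
from datetime import datetime

# Start times of successive attempts, taken from the sled agent log (UTC).
starts = [
    "03:44:30.195",
    "03:45:10.070",
    "03:45:49.558",
    "03:46:29.764",
    "03:47:16.045",
]


def gaps_seconds(timestamps):
    """Return the gap in seconds between each consecutive pair of timestamps."""
    parsed = [datetime.strptime(t, "%H:%M:%S.%f") for t in timestamps]
    return [(b - a).total_seconds() for a, b in zip(parsed, parsed[1:])]


print(gaps_seconds(starts))  # [39.875, 39.488, 40.206, 46.281]
```

The first three gaps are close to the duration of a single attempt; only the last one is noticeably longer.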

## Attempting to reproduce the problem elsewhere

This felt like the sort of bug where we really wanted to interact with a live, broken database to investigate its state.

As mentioned above, we had already tried to reproduce this problem from a clean slate via a sequence of "populate" and "wipe" operations against the database.  This didn't work.  We thought maybe there was some broken on-disk state.  @jmpesp did a "zfs send" of the underlying ZFS dataset over to atrium, where he spun up an instance of CockroachDB to see if we could reproduce the problem.  We did not reproduce it, though he did see some other strange behavior that I'll let him comment about because I don't have the details.

Much later, I tried the same thing.  Starting from a point where we had a CockroachDB zone running on gimlet-sn21 in a state that could fairly reliably generate this error, I shut down CockroachDB.  I brought it up and verified that I still saw the error.  Then I shut it down again and took a snapshot.  I sent this over to my test machine, ivanova, and used `cargo run --bin=omicron-dev db-run --store-dir /rpool/crdb-90803-repro/ --no-populate` to start up a CockroachDB instance on it.  I was not able to reproduce the error, even running the exact same SQL that _did_ reproduce the problem over on gimlet-sn21.  This seems like a significant data point but I don't yet know what it means.

## Attempting to reproduce the problem back on gimlet-sn05

I don't remember exactly why, but we decided to restart sled-agent on gimlet-sn05 because we believed there was nothing more we could learn from the current state and we thought this would trigger the same code path again.  Then if we could pause things at the point where Sled Agent tried to halt the busted zone, we could log in and investigate.  (I can't remember if I have the timeline right here -- but if this is when it was, then there _was_ no CockroachDB zone running at this point and I'm not sure we realized this retry process was in exponential backoff.  I think maybe I thought it had given up.)  From this SMF log entry:

```
[ Jan  2 04:53:06 Method "start" exited with status 0. ]
```

I think we did this around 04:53Z.  When we did this, we ran into a different problem: the StorageManager inside Sled Agent would find the CockroachDB ZFS dataset and attempt to find information about it inside /var/oxide, but there was none there:

```
04:54:04.634Z  INFO SledAgent/StorageManager: StorageWorker loading fs cockroachdb on zpool oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
04:54:04.681Z  INFO SledAgent/StorageManager: Loading Dataset from /var/oxide/d462a7f7-b628-40fe-80ff-4e4189e2d62b/d6b41137-7fb5-43a1-939a-08302ddfc95e.toml
04:54:04.696Z  WARN SledAgent/StorageManager: StorageWorker Failed to load dataset: Failed to perform I/O: read config for pool oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b, dataset DatasetName { pool_name: "oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b", dataset_name: "cockroachdb" } from "/var/oxide/d462a7f7-b628-40fe-80ff-4e4189e2d62b/d6b41137-7fb5-43a1-939a-08302ddfc95e.toml": No such file or directory (os error 2)
```

We'll file a separate bug with more details on this.  The net result was that Sled Agent was not able to set up the CockroachDB zone.  We'd expect at this point that RSS would have been telling it to create that zone.  Having found it, it would have tried to use the one it already had.  In that path, it would _not_ run the dbwipe.sql again.  So this didn't trigger the code path we wanted.

## A bigger hammer

Again, at this point we're still trying to catch a live system in a state where we can poke around and reproduce the SQL error.  At this point, Nils observed that rerunning `omicron-package install` seemed to trigger the problem somewhat reliably, so we decided to do that again.  This would blow away a lot more state than what we'd been doing so far, but what we'd been doing so far wasn't working.

I used this DTrace enabling to try to stop every "zoneadm" command on the zone in question:

```
dtrace -w -n 'exec-success/execname == "zoneadm" && strstr(curpsinfo->pr_psargs, "oxz_cockroach") != 0/{ trace(pid); stop(); system("pargs %d", pid); exit(0); }'
```

Some notes:

* This took a bunch of trial and error to get right.
* This traces all successful process execs and filters on processes named "zoneadm" that have "oxz_cockroach" (a substring of the zone name) in the first 80 characters of their command-line arguments.  When this event happens, we print the pid, stop the process, dump its full arguments, and exit from the D script.
* I nearly always use `exit(0)` with `stop()` because otherwise if I mess up the predicate, I could wind up stopping a zillion other processes (possibly including `dtrace`).
* I wasn't able to filter on the "halt" subcommand because `curpsinfo->pr_psargs` only has the first 80 characters and it wasn't in the first 80 characters.  Instead, whenever this fired, I'd look at the printed command-line arguments.  If it _wasn't_ `halt`, I'd use `prun $pid` to run the process again.  If it was, we're in business: it means we've successfully stopped the process with the broken zone still running and we can log in and investigate.
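To illustrate the 80-character limitation from the last note, here is a minimal Python sketch.  The command line in it is hypothetical: I'm assuming something of the form `/usr/sbin/zoneadm -z <zone> halt`, which is not necessarily the exact argument list Sled Agent uses.  With a long zone name, the zone-name substring lands inside the truncated window but the subcommand does not.

```python
PSARGS_LIMIT = 80  # the "first 80 characters" limit described in the notes above


def visible_in_psargs(cmdline: str, needle: str, limit: int = PSARGS_LIMIT) -> bool:
    """Return True if needle appears within the first `limit` characters of cmdline."""
    return needle in cmdline[:limit]


ZONE = "oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b"

# Hypothetical command line, for illustration only.
cmdline = f"/usr/sbin/zoneadm -z {ZONE} halt"

print(visible_in_psargs(cmdline, "oxz_cockroach"))  # True
print(visible_in_psargs(cmdline, "halt"))           # False
```

This is why the predicate can only match on the zone name, and the subcommand has to be checked by hand from the `pargs` output.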

We got this far and were able to dig in a little deeper.  We wanted to get Nils unblocked, though, so we used the same `omicron-package install` step to reproduce this on gimlet-sn21 and released gimlet-sn05 back to Nils.

The behavior we found at this point seemed like it couldn't be correct CockroachDB behavior so we filed [cockroachdb/cockroach#90803](https://github.com/cockroachdb/cockroach/issues/90803), which summarizes what we saw.  This is when I tried the steps mentioned above trying to reproduce on ivanova.

We didn't dig too much further.  We had the idea that maybe what's going on is unrelated to our query itself: the query just triggers an internal CockroachDB job, and something about that mechanism is causing it to pick a bogus id.  We poked a bit with "show jobs" and the "jobs" table in the "system" database.  I wanted to explore a bit more with the web console but we ran into cockroachdb/cockroach#79657.

## Issues found

Bugs (possibly debatable):

* #1884
* #1886
* #1879
* #1885

And debugging enhancements:

* oxidecomputer/helios#37
* oxidecomputer/helios#38, oxidecomputer/looker#1, oxidecomputer/looker#2, oxidecomputer/looker#3, oxidecomputer/looker#4, oxidecomputer/looker#5
* #1881
* #1882
* #1598

