This ticket describes several issues that a bunch of us (@smklein, @leftwo, @jmpesp, @bnaecker, @Nieuwejaar, and I) debugged today. We'll file separate tickets for bugs and enhancements that comes out of it and then I expect we'll close this one. I just wanted a place to record the whole activity. I'm sure I've got some of the details here wrong -- please correct me! I didn't take good enough notes. On the plus side, there's a video recording for most of this process.
Summary: From initial conditions of a working Omicron on gimlet-sn05, @Nieuwejaar reinstalled Omicron. Omicron failed to come up because it failed to wipe the database. The proximate cause appears to us like a Cockroach bug, which I've filed. There were a few follow-on bugs that happened after this. We also identified some enhancements to make this easier to debug in the future.
Initial symptoms
Nils reported that after having Omicron running on gimlet-sn05, he tried to install Omicron again and it failed to come up. Sled Agent logged:
{"msg":"request completed","v":0,"name":"SledAgent","level":30,"time":"2000-01-02T03:10:13.792417073Z","hostname":"gimlet-sn05","pid":133494,"uri":"/filesystem","method":"PUT","req_id":"20dcbaaf-bf96-4e8c-9e83-5d759e2268e8","remote_addr":"[fd00:1122:3344:101::1]:53770","local_addr":"[fd00:1122:3344:101::1]:12345","component":"dropshot (SledAgent)","error_message_external":"Internal Server Error","error_message_internal":"Error managing storage: Error running command in zone 'oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b': Command [/usr/sbin/zlogin oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b /opt/oxide/cockroachdb/bin/cockroach sql --insecure --host [fd00:1122:3344:101::2]:32221 --file /opt/oxide/cockroachdb/sql/dbwipe.sql] executed and failed with status: exit status: 1 stdout: CREATE DATABASE\nCREATE ROLE\n stderr: ERROR: job-row-insert: duplicate key value violates unique constraint \"primary\"\nSQLSTATE: 23505\nDETAIL: Key (id)=(32769) already exists.\nCONSTRAINT: primary\nERROR: job-row-insert: duplicate key value violates unique constraint \"primary\"\nSQLSTATE: 23505\nDETAIL: Key (id)=(32769) already exists.\nCONSTRAINT: primary\nFailed running \"sql\"\n","response_code":"500"}
formatted:
[2000-01-02T03:10:13.792417073Z] INFO: SledAgent/dropshot (SledAgent)/133494 on gimlet-sn05: request completed (req_id=20dcbaaf-bf96-4e8c-9e83-5d759e2268e8, uri=/filesystem, method=PUT, remote_addr=[fd00:1122:3344:101::1]:53770, local_addr=[fd00:1122:3344:101::1]:12345, error_message_external="Internal Server Error", response_code=500)
error_message_internal: Error managing storage: Error running command in zone 'oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b': Command [/usr/sbin/zlogin oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b /opt/oxide/cockroachdb/bin/cockroach sql --insecure --host [fd00:1122:3344:101::2]:32221 --file /opt/oxide/cockroachdb/sql/dbwipe.sql] executed and failed with status: exit status: 1 stdout: CREATE DATABASE
CREATE ROLE
stderr: ERROR: job-row-insert: duplicate key value violates unique constraint "primary"
SQLSTATE: 23505
DETAIL: Key (id)=(32769) already exists.
CONSTRAINT: primary
ERROR: job-row-insert: duplicate key value violates unique constraint "primary"
SQLSTATE: 23505
DETAIL: Key (id)=(32769) already exists.
CONSTRAINT: primary
Failed running "sql"
Initial investigation: RSS/Sled Agent stuck in a loop
This command is part of the dbwipe.sql script that we run while setting up a zone. It's surprising that this would ever fail, as it's intended to be run against an arbitrary database that may or may not have been set up by Omicron already. I believe we even have tests that it's idempotent regardless of whether the database has been set up already. I used omicron-dev db-run and omicron-dev db-wipe to set up a transient CockroachDB cluster and verify that I could wipe it several times (without repopulating it).
It's not clear to me how the error message above has anything to do with the SQL we were trying to run. It appears to be complaining about some internal id, not something that was in the query.
Upon logging into gimlet-sn05, we found that there was no CockroachDB zone present. This is where my summary here diverges from the sequence we actually took. It will be clearer to present a summary that we put together from the chat log and sled agent log. I've filtered the raw sled agent log with:
grep -v 'failed to notify nexus' sled-agent.log | grep -v 'contacting server nexus' | grep -v 'failed to notify nexus about datasets'
to eliminate a bunch of messages irrelevant to this issue. My excerpts below are formatted with bunyan -o short.
Nils reported the problem around 8:10am PT, which I calculate to be about 03:27Z on the test system. Then he did an "uninstall" and "install" around 8:27 PT, or 03:44Z on this system. We see the latter in the sled agent log. The former must have been rotated to a previous log file. That's okay because we really only care about what happened after this second event. Sure enough, we have:
[ Jan 2 03:42:33 Enabled. ]
[ Jan 2 03:42:33 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/sled-agent/sled-agent run /opt/oxide/sled-agent/pkg/config.toml &"). ]
[ Jan 2 03:42:33 Method "start" exited with status 0. ]
As Omicron starts up, it eventually goes to set up CockroachDB:
03:44:30.195Z INFO SledAgent/RSS: creating new filesystem: DatasetEnsureBody { id: 4d08fc19-3d5f-4f6b-9c48-925f8eac7255, zpool_id: d462a7f7-b628-40fe-80ff-4e4189e2d62b, dataset_kind: CockroachDb { all_addresses: [[fd00:1122:3344:101::2]:32221] }, address: [fd00:1122:3344:101::2]:32221 }
03:44:30.204Z INFO SledAgent/StorageManager: add_dataset: NewFilesystemRequest { zpool_id: d462a7f7-b628-40fe-80ff-4e4189e2d62b, dataset_kind: CockroachDb { all_addresses: [[fd00:1122:3344:101::2]:32221] }, address: [fd00:1122:3344:101::2]:32221, responder: Sender { inner: Some(Inner { state: State { is_complete: false, is_closed: false, is_rx_task_set: true, is_tx_task_set: false } }) } }
03:44:30.207Z INFO SledAgent/StorageManager: Ensuring dataset oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b/cockroachdb exists
03:44:30.307Z INFO SledAgent/StorageManager: Ensuring zone for oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b/cockroachdb is running
03:44:30.345Z INFO SledAgent/StorageManager: Zone for oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b/cockroachdb was not found
03:44:30.419Z INFO SledAgent/StorageManager: Configuring new Omicron zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
03:44:30.490Z INFO SledAgent/StorageManager: Installing Omicron zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
03:44:44.692Z INFO SledAgent/StorageManager: Zone booting
zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
03:44:53.389Z INFO SledAgent/StorageManager: Adding address: Static(V6(Ipv6Network { addr: fd00:1122:3344:101::2, prefix: 64 }))
zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
03:44:54.336Z INFO SledAgent/StorageManager: start_zone: Loading CRDB manifest
03:44:55.060Z INFO SledAgent/StorageManager: start_zone: setting CRDB's config/listen_addr: [fd00:1122:3344:101::2]:32221
03:44:55.170Z INFO SledAgent/StorageManager: start_zone: setting CRDB's config/store
03:44:55.275Z INFO SledAgent/StorageManager: start_zone: setting CRDB's config/join_addrs
03:44:55.384Z INFO SledAgent/StorageManager: start_zone: refreshing manifest
03:44:55.487Z INFO SledAgent/StorageManager: start_zone: enabling CRDB service
03:44:55.577Z INFO SledAgent/StorageManager: start_zone: awaiting liveness of CRDB
03:44:55.599Z WARN SledAgent/StorageManager: cockroachdb not yet alive
03:44:55.955Z WARN SledAgent/StorageManager: cockroachdb not yet alive
03:44:56.596Z WARN SledAgent/StorageManager: cockroachdb not yet alive
03:44:57.223Z WARN SledAgent/StorageManager: cockroachdb not yet alive
03:44:59.661Z WARN SledAgent/StorageManager: cockroachdb not yet alive
03:45:04.015Z INFO SledAgent/StorageManager: CRDB is online
03:45:04.030Z INFO SledAgent/StorageManager: Formatting CRDB
03:45:09.661Z INFO SledAgent/StorageManager: halt_and_remove_logged: Previous zone state: Running
zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
03:45:09.674Z INFO SledAgent/StorageManager: Stopped and uninstalled zone
zone: oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
03:45:09.704Z INFO SledAgent/dropshot (SledAgent): request completed (req_id=b12c1590-f9c9-43e8-a9d0-761ae0b93e0a, uri=/filesystem, method=PUT, remote_addr=[fd00:1122:3344:101::1]:50697, local_addr=[fd00:1122:3344:101::1]:12345, error_message_external="Internal Server Error", response_code=500)
error_message_internal: Error managing storage: Error running command in zone 'oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b': Command [/usr/sbin/zlogin oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b /opt/oxide/cockroachdb/bin/cockroach sql --insecure --host [fd00:1122:3344:101::2]:32221 --file /opt/oxide/cockroachdb/sql/dbwipe.sql] executed and failed with status: exit status: 1 stdout: CREATE DATABASE
CREATE ROLE
stderr: ERROR: job-row-insert: duplicate key value violates unique constraint "primary"
SQLSTATE: 23505
DETAIL: Key (id)=(32769) already exists.
CONSTRAINT: primary
ERROR: job-row-insert: duplicate key value violates unique constraint "primary"
SQLSTATE: 23505
DETAIL: Key (id)=(32769) already exists.
CONSTRAINT: primary
Failed running "sql"
03:45:09.727Z WARN SledAgent/RSS: failed to create filesystem
error: Error Response: status: 500 Internal Server Error; headers: {"content-type": "application/json", "x-request-id": "b12c1590-f9c9-43e8-a9d0-761ae0b93e0a", "content-length": "124", "date": "Sun, 02 Jan 2000 03:45:09 GMT"}; value: Error { error_code: Some("Internal"), message: "Internal Server Error", request_id: "b12c1590-f9c9-43e8-a9d0-761ae0b93e0a" }
Here, RSS decides to create a dataset for CockroachDB. Sled Agent receives the request and decides to create a zone for it. The zone boots, we wait for CockroachDB to be up, and then we send the "wipe" SQL and get this error.
We go through the whole sequence again starting at 03:45:10.070Z, then 03:45:49.558Z, then 03:46:29.764Z, then 03:47:16.045Z, and so on. I think we're in an exponential backoff. Each time, RSS decides to create a dataset, Sled Agent creates the zone, we get this error from CockroachDB, and then Sled Agent tears down the zone. This explains why when we got there, there was no zone present.
Attempting to reproduce the problem elsewhere
This felt like the sort of bug where we really wanted to interactive with a live, broken database to investigate its state.
As mentioned above, we had already tried to reproduce this problem from a clean slate via a sequence of "populate" and "wipe" operations against the database. This didn't work. We thought maybe there was some broken on-disk state. @jmpesp did a "zfs send" of the underlying ZFS dataset over to atrium, where he spun up an instance of CockroachDB to see if we could reproduce the problem. We did not reproduce it, though he did see some other strange behavior that I'll let him comment about because I don't have the details.
Much later, I tried the same thing. Starting from a point where we had a CockroachDB zone running on gimlet-sn21 in a state that could fairly reliably generate this error, I shut down CockroachDB. I brought it up and verified that I still saw the error. Then I shut it down again and took a snapshot. I sent this over to my test machine, ivanova, and used cargo run --bin=omicron-dev db-run --store-dir /rpool/crdb-90803-repro/ --no-populate to start up a CockroachDB instance on it. I was not able to reproduce the error, even running the exact same SQL that did reproduce the problem over on gimlet-sn21. This seems like a significant data point but I don't yet know what it means.
Attempting to reproduce the problem back on gimlet-sn05
I don't remember exactly why, but we decided to restart sled-agent on gimlet-sn05 because we believed there was nothing more we could learn from the current state and we thought this would trigger the same code path again. Then if we could pause things at the point where Sled Agent tried to halt the busted zone, we could log in and investigate. (I can't remember if I have the timeline right here -- but if this is when it was, then there was no CockroachDB zone running at this point and I'm not sure we realized this retry process was in exponential backoff. I think maybe I thought it had given up.) From this SMF log entry:
[ Jan 2 04:53:06 Method "start" exited with status 0. ]
I think we did this around 04:53Z. When we did this, we ran into a different problem: the StorageManager inside Sled Agent would find the CockroachDB ZFS dataset and attempt to find information about it inside /var/oxide, but there was none there:
04:54:04.634Z INFO SledAgent/StorageManager: StorageWorker loading fs cockroachdb on zpool oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
04:54:04.681Z INFO SledAgent/StorageManager: Loading Dataset from /var/oxide/d462a7f7-b628-40fe-80ff-4e4189e2d62b/d6b41137-7fb5-43a1-939a-08302ddfc95e.toml
04:54:04.696Z WARN SledAgent/StorageManager: StorageWorker Failed to load dataset: Failed to perform I/O: read config for pool oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b, dataset DatasetName { pool_name: "oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b", dataset_name: "cockroachdb" } from "/var/oxide/d462a7f7-b628-40fe-80ff-4e4189e2d62b/d6b41137-7fb5-43a1-939a-08302ddfc95e.toml": No such file or directory (os error 2)
We'll file a separate bug with more details on this. The net result was that Sled Agent was not able to set up the CockroachDB zone. We'd expect at this point that RSS would have been telling it to create that zone. Having found it, it would have tried to use the one it already had. In that path, it would not run the dbwipe.sql again. So this didn't trigger the code path we wanted.
A bigger hammer
Again, at this point we're still trying to catch a live system in a state where we can poke around and reproduce the SQL error. At this point, Nils observed that rerunning omicron-package install seemed to trigger the problem somewhat reliably, so we decided to do that again. This would blow away a lot more state than what we'd been doing so far, but what we'd been doing so far wasn't working.
I used this DTrace enabling to try stop every "zoneadm" command on the zone in question:
dtrace -w -n 'exec-success/execname == "zoneadm" && strstr(curpsinfo->pr_psargs, "oxz_cockroach") != 0/{ trace(pid); stop(); system("pargs %d", pid); exit(0); }'
Some notes:
- This took a bunch of trial and error to get right.
- This traces all successful process execs, filters on the command "zoneadm" with "oxz_cockroach" (a substring of the zonename) in the first 80 characters of the command-line arguments. When this event happens, we print the pid, stop the process, dump its full arguments, and exit from the D script.
- I nearly always use
exit(0) with stop() because otherwise if I mess up the predicate, I could wind up stopping a zillion other processes (possibly including dtrace).
- I wasn't able to filter on the "halt" subcommand because
curpsinfo->pr_psargs only has the first 80 characters and it wasn't in the first 80 characters. Instead, whenever this fired, I'd look at the printed command-line arguments. If it wasn't halt, I'd use prun $pid to run the process again. If it was, we're in business: it means we've successfully stopped the process with the broken zone still running and we can log in and investigate.
We got this far and were able to dig in a little deeper. We wanted to get Nils unblocked, though, so we used the same omicron-package install step to reproduce this on gimlet-sn21 and released gimlet-sn05 back to Nils.
The behavior we found at this point seemed like it couldn't be correct CockroachDB behavior so we filed cockroachdb/cockroach#90803, which summarizes what we saw. This is when I tried the steps mentioned above trying to reproduce on ivanova.
We didn't dig too much further. We had the idea that maybe what's going on is unrelated to our query, but just that this query triggers an internal CockroachDB job and something about that mechanism is causing it to big a bogus id. We poked a bit with "show jobs" and the "jobs" table in the "system" database. I wanted to explore a bit more with the web console but we ran into cockroachdb/cockroach#79657.
Issues found
Bugs (possibly debatable):
And debugging enhancements:
This ticket describes several issues that a bunch of us (@smklein, @leftwo, @jmpesp, @bnaecker, @Nieuwejaar, and I) debugged today. We'll file separate tickets for bugs and enhancements that comes out of it and then I expect we'll close this one. I just wanted a place to record the whole activity. I'm sure I've got some of the details here wrong -- please correct me! I didn't take good enough notes. On the plus side, there's a video recording for most of this process.
Summary: From initial conditions of a working Omicron on gimlet-sn05, @Nieuwejaar reinstalled Omicron. Omicron failed to come up because it failed to wipe the database. The proximate cause appears to us like a Cockroach bug, which I've filed. There were a few follow-on bugs that happened after this. We also identified some enhancements to make this easier to debug in the future.
Initial symptoms
Nils reported that after having Omicron running on gimlet-sn05, he tried to install Omicron again and it failed to come up. Sled Agent logged:
{"msg":"request completed","v":0,"name":"SledAgent","level":30,"time":"2000-01-02T03:10:13.792417073Z","hostname":"gimlet-sn05","pid":133494,"uri":"/filesystem","method":"PUT","req_id":"20dcbaaf-bf96-4e8c-9e83-5d759e2268e8","remote_addr":"[fd00:1122:3344:101::1]:53770","local_addr":"[fd00:1122:3344:101::1]:12345","component":"dropshot (SledAgent)","error_message_external":"Internal Server Error","error_message_internal":"Error managing storage: Error running command in zone 'oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b': Command [/usr/sbin/zlogin oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b /opt/oxide/cockroachdb/bin/cockroach sql --insecure --host [fd00:1122:3344:101::2]:32221 --file /opt/oxide/cockroachdb/sql/dbwipe.sql] executed and failed with status: exit status: 1 stdout: CREATE DATABASE\nCREATE ROLE\n stderr: ERROR: job-row-insert: duplicate key value violates unique constraint \"primary\"\nSQLSTATE: 23505\nDETAIL: Key (id)=(32769) already exists.\nCONSTRAINT: primary\nERROR: job-row-insert: duplicate key value violates unique constraint \"primary\"\nSQLSTATE: 23505\nDETAIL: Key (id)=(32769) already exists.\nCONSTRAINT: primary\nFailed running \"sql\"\n","response_code":"500"}formatted:
Initial investigation: RSS/Sled Agent stuck in a loop
This command is part of the dbwipe.sql script that we run while setting up a zone. It's surprising that this would ever fail, as it's intended to be run against an arbitrary database that may or may not have been set up by Omicron already. I believe we even have tests that it's idempotent regardless of whether the database has been set up already. I used
omicron-dev db-runandomicron-dev db-wipeto set up a transient CockroachDB cluster and verify that I could wipe it several times (without repopulating it).It's not clear to me how the error message above has anything to do with the SQL we were trying to run. It appears to be complaining about some internal id, not something that was in the query.
Upon logging into gimlet-sn05, we found that there was no CockroachDB zone present. This is where my summary here diverges from the sequence we actually took. It will be clearer to present a summary that we put together from the chat log and sled agent log. I've filtered the raw sled agent log with:
to eliminate a bunch of messages irrelevant to this issue. My excerpts below are formatted with
bunyan -o short.Nils reported the problem around 8:10am PT, which I calculate to be about 03:27Z on the test system. Then he did an "uninstall" and "install" around 8:27 PT, or 03:44Z on this system. We see the latter in the sled agent log. The former must have been rotated to a previous log file. That's okay because we really only care about what happened after this second event. Sure enough, we have:
As Omicron starts up, it eventually goes to set up CockroachDB:
Here, RSS decides to create a dataset for CockroachDB. Sled Agent receives the request and decides to create a zone for it. The zone boots, we wait for CockroachDB to be up, and then we send the "wipe" SQL and get this error.
We go through the whole sequence again starting at 03:45:10.070Z, then 03:45:49.558Z, then 03:46:29.764Z, then 03:47:16.045Z, and so on. I think we're in an exponential backoff. Each time, RSS decides to create a dataset, Sled Agent creates the zone, we get this error from CockroachDB, and then Sled Agent tears down the zone. This explains why when we got there, there was no zone present.
Attempting to reproduce the problem elsewhere
This felt like the sort of bug where we really wanted to interactive with a live, broken database to investigate its state.
As mentioned above, we had already tried to reproduce this problem from a clean slate via a sequence of "populate" and "wipe" operations against the database. This didn't work. We thought maybe there was some broken on-disk state. @jmpesp did a "zfs send" of the underlying ZFS dataset over to atrium, where he spun up an instance of CockroachDB to see if we could reproduce the problem. We did not reproduce it, though he did see some other strange behavior that I'll let him comment about because I don't have the details.
Much later, I tried the same thing. Starting from a point where we had a CockroachDB zone running on gimlet-sn21 in a state that could fairly reliably generate this error, I shut down CockroachDB. I brought it up and verified that I still saw the error. Then I shut it down again and took a snapshot. I sent this over to my test machine, ivanova, and used
cargo run --bin=omicron-dev db-run --store-dir /rpool/crdb-90803-repro/ --no-populateto start up a CockroachDB instance on it. I was not able to reproduce the error, even running the exact same SQL that did reproduce the problem over on gimlet-sn21. This seems like a significant data point but I don't yet know what it means.Attempting to reproduce the problem back on gimlet-sn05
I don't remember exactly why, but we decided to restart sled-agent on gimlet-sn05 because we believed there was nothing more we could learn from the current state and we thought this would trigger the same code path again. Then if we could pause things at the point where Sled Agent tried to halt the busted zone, we could log in and investigate. (I can't remember if I have the timeline right here -- but if this is when it was, then there was no CockroachDB zone running at this point and I'm not sure we realized this retry process was in exponential backoff. I think maybe I thought it had given up.) From this SMF log entry:
I think we did this around 04:53Z. When we did this, we ran into a different problem: the StorageManager inside Sled Agent would find the CockroachDB ZFS dataset and attempt to find information about it inside /var/oxide, but there was none there:
We'll file a separate bug with more details on this. The net result was that Sled Agent was not able to set up the CockroachDB zone. We'd expect at this point that RSS would have been telling it to create that zone. Having found it, it would have tried to use the one it already had. In that path, it would not run the dbwipe.sql again. So this didn't trigger the code path we wanted.
A bigger hammer
Again, at this point we're still trying to catch a live system in a state where we can poke around and reproduce the SQL error. At this point, Nils observed that rerunning
omicron-package installseemed to trigger the problem somewhat reliably, so we decided to do that again. This would blow away a lot more state than what we'd been doing so far, but what we'd been doing so far wasn't working.I used this DTrace enabling to try stop every "zoneadm" command on the zone in question:
Some notes:
exit(0)withstop()because otherwise if I mess up the predicate, I could wind up stopping a zillion other processes (possibly includingdtrace).curpsinfo->pr_psargsonly has the first 80 characters and it wasn't in the first 80 characters. Instead, whenever this fired, I'd look at the printed command-line arguments. If it wasn'thalt, I'd useprun $pidto run the process again. If it was, we're in business: it means we've successfully stopped the process with the broken zone still running and we can log in and investigate.We got this far and were able to dig in a little deeper. We wanted to get Nils unblocked, though, so we used the same
omicron-package installstep to reproduce this on gimlet-sn21 and released gimlet-sn05 back to Nils.The behavior we found at this point seemed like it couldn't be correct CockroachDB behavior so we filed cockroachdb/cockroach#90803, which summarizes what we saw. This is when I tried the steps mentioned above trying to reproduce on ivanova.
We didn't dig too much further. We had the idea that maybe what's going on is unrelated to our query, but just that this query triggers an internal CockroachDB job and something about that mechanism is causing it to big a bogus id. We poked a bit with "show jobs" and the "jobs" table in the "system" database. I wanted to explore a bit more with the web console but we ran into cockroachdb/cockroach#79657.
Issues found
Bugs (possibly debatable):
And debugging enhancements:
jqorjsonhelios#37lookerorbunyanhelios#38, could use help output (even if there are no options) looker#1, looker ignores files on the command line and doesn't say anything looker#2, looker support for merging multiple files looker#3, want "long" mode output looker#4, want filtering by level looker#5