RSS stuck after cockroachdb init took too long

@internet-diglett was deploying Omicron (a branch off "main" but I think that's not relevant here) and ran into a case where RSS was looping forever trying to retry database initialization that had already completed.  There were two sleds: "workstation-1" (running RSS) and "test-station-1". 

On workstation-1, we have a bunch of log messages like this one:

```
[2023-07-05T17:10:03.798208372Z]  INFO: SledAgent/dropshot (SledAgent)/2525 on test-station-1: request completed (req_id=f8557412-0bea-4465-ad3f-68218bb4ba50, uri=/cockroachdb, method=POST, remote_addr=[fd00:1122:3344:102::1]:62144, local_addr=[fd00:1122:3344:101::1]:12345, error_message_external="Internal Server Error", response_code=500)
    error_message_internal: Failed to get address: Failed to get address for name oxControlService3/omicron6 in oxz_cockroachdb_oxp_f3393d6c-25e2-4435-9602-b34bc8f42d43: Zone execution error: Command [/usr/sbin/zlogin oxz_cockroachdb_oxp_f3393d6c-25e2-4435-9602-b34bc8f42d43 /usr/sbin/ipadm show-addr -p -o ADDR oxControlService3/omicron6] executed and failed with status: exit status: 1  stdout:  
      stderr: ipadm: Address object not found
```

followed by exactly one message like this:

```
[2023-07-05T17:12:48.114002609Z]  INFO: SledAgent/dropshot (SledAgent)/2525 on test-station-1: request completed (req_id=6ab3e28e-9af9-460c-980a-5890fe4f8b4e, uri=/cockroachdb, method=POST, remote_addr=[fd00:1122:3344:102::1]:62144, local_addr=[fd00:1122:3344:101::1]:12345, response_code=204)
```

followed by a bunch of messages like this:

```
[2023-07-05T17:12:48.304971089Z]  INFO: SledAgent/dropshot (SledAgent)/2525 on test-station-1: request completed (req_id=310d8023-bda5-420c-bcd4-69116f18bea3, uri=/cockroachdb, method=POST, remote_addr=[fd00:1122:3344:102::1]:41882, local_addr=[fd00:1122:3344:101::1]:12345, error_message_external="Internal Server Error", response_code=500)
    error_message_internal: Failed to initialize CockroachDb: Error running command in zone 'oxz_cockroachdb_oxp_f3393d6c-25e2-4435-9602-b34bc8f42d43': Command [/opt/oxide/cockroachdb/bin/cockroach init --insecure --host [fd00:1122:3344:101::6]:32221] executed and failed with status: exit status: 1  stdout:   stderr: ERROR: cluster has already been initialized
    Failed running "init"
```

All of these correspond to requests from RSS for `POST /cockroachdb`, which initialize the CockroachDB cluster (runs `cockroach init` and loads a few SQL files) using any of the CockroachDB zones on that sled.  To summarize what's above: initially, these requests fail apparently because the zone is not yet set up.  At some point, one request completes successfully.  Then Sled Agent _keeps_ getting those requests and those fail because this step has already been done.

Over on the RSS side, the only log entries for this I found were:

```
$ grep /cockroachdb sled-agent-1-RSS.log | bunyan
[2023-07-05T17:14:21.843265387Z]  WARN: SledAgent/RSS/8449 on workstation-1: Failed to initialize CockroachDB
    error: Communication Error: error sending request for url (http://[fd00:1122:3344:101::1]:12345/cockroachdb): operation timed out
[2023-07-05T17:15:24.605878005Z]  WARN: SledAgent/RSS/8449 on workstation-1: Failed to initialize CockroachDB
    error: Communication Error: error sending request for url (http://[fd00:1122:3344:101::1]:12345/cockroachdb): operation timed out
```

We discovered at this point that these hosts' clocks are not sync'd up: workstation-1 is about 197 seconds ahead of test-station-2.  Thus, those timeouts happened around 2023-07-05T17:11:04.843Z and 2023-07-05T17:12:07.605Z on test-station-1's clock.

So what happened here?  The "operation timed out" sounds like either an explicit request timeout or else the TCP connection failed because one side didn't ACK within some expected time window.  @internet-diglett found that we are using a 60-second request timeout here:
https://github.com/oxidecomputer/omicron/blob/c3048a1b43b046c284432eba34d0bc1933de4d56/sled-agent/src/rack_setup/service.rs#L932

So here's a plausible sequence of events:

```
17:10:03.798Z on test-station-1, a `POST /cockroachdb` request (call it R1) fails because networking is not yet setup in the zone (direct evidence: the log entry above)
17:10:04.483Z on workstation-1, RSS tries another `POST /cockroachdb` request (call it R2) (indirect evidence: this event is inferred based on the next event + knowing there's a 60-second request timeout -- also consistent with the failure just miliseconds earlier)
17:11:04.483Z on workstation-1, RSS times out request R2 (direct evidence: log entry above, with timestamp adjusted for the clock difference between the two servers).  Critically, _request R2 would still be running_.
17:11:07.605Z on workstation-1, RSS tries another `POST /cockroachdb` request (call it R3) (indirect evidence: similar to above).  This gets stuck behind `SledAgent.inner.zones.lock().await` because this lock is held by R2.
17:12:07.605Z on workstation-1, RSS times out request R3 (direct evidence: similar to above)
17:12:48.114Z on test-station-1, request R2 completes successfully, having taken about 2m43s

Subsequently, all requests by RSS to `POST /cockroachdb` fail because the database is already initialized (by R2).
```

All of this is what we'd expect if the operations done during `POST /cockroachdb` took more than 60s.  So did that happen?  We can't tell for sure, but @internet-diglett ran the two `cockroach sql` steps that are part of db initialization by hand.  Together they took over 2m24s.  So there's every reason to believe this is what happened.

I think there's room for several improvements here:

- Most immediately, I think we should not time out this request on the RSS side.  A connect timeout is fine.  But I don't think any good can come from abandoning the in-flight request that we haven't gotten a response for if we have no reason to think there's a networking problem in the way.  Update: see #3503.
- `POST /cockroachdb` is not idempotent.  Update: I've filed #3499 for this.
- I think it'd be worthwhile for RSS to log all requests that it makes, either at "debug" or "trace" level.  This would have made it easier (and higher confidence) to construct the above timeline.  Update: I'm not sure how best to do this.  Filed #3502 for it.
- The "request completed" log entries could include the request latency.  That would have helped confirm that the `POST /cockroachdb` that succeded _did_ take longer than 60s.  Update: this is oxidecomputer/dropshot#692.
- The warning log message on the SledAgent side could include how long before it's going to retry.  This information is available at the time we log this message.  This would make it easier (and higher confidence) to construct the above timeline.  Update: see #3503.

I'm going to try to roll most of those into a PR shortly and file separate issues for the rest.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RSS stuck after cockroachdb init took too long #3498

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

RSS stuck after cockroachdb init took too long #3498

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions