You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
@internet-diglett was deploying Omicron (a branch off "main" but I think that's not relevant here) and ran into a case where RSS was looping forever trying to retry database initialization that had already completed. There were two sleds: "workstation-1" (running RSS) and "test-station-1".
On workstation-1, we have a bunch of log messages like this one:
[2023-07-05T17:10:03.798208372Z] INFO: SledAgent/dropshot (SledAgent)/2525 on test-station-1: request completed (req_id=f8557412-0bea-4465-ad3f-68218bb4ba50, uri=/cockroachdb, method=POST, remote_addr=[fd00:1122:3344:102::1]:62144, local_addr=[fd00:1122:3344:101::1]:12345, error_message_external="Internal Server Error", response_code=500)
error_message_internal: Failed to get address: Failed to get address for name oxControlService3/omicron6 in oxz_cockroachdb_oxp_f3393d6c-25e2-4435-9602-b34bc8f42d43: Zone execution error: Command [/usr/sbin/zlogin oxz_cockroachdb_oxp_f3393d6c-25e2-4435-9602-b34bc8f42d43 /usr/sbin/ipadm show-addr -p -o ADDR oxControlService3/omicron6] executed and failed with status: exit status: 1 stdout:
stderr: ipadm: Address object not found
[2023-07-05T17:12:48.304971089Z] INFO: SledAgent/dropshot (SledAgent)/2525 on test-station-1: request completed (req_id=310d8023-bda5-420c-bcd4-69116f18bea3, uri=/cockroachdb, method=POST, remote_addr=[fd00:1122:3344:102::1]:41882, local_addr=[fd00:1122:3344:101::1]:12345, error_message_external="Internal Server Error", response_code=500)
error_message_internal: Failed to initialize CockroachDb: Error running command in zone 'oxz_cockroachdb_oxp_f3393d6c-25e2-4435-9602-b34bc8f42d43': Command [/opt/oxide/cockroachdb/bin/cockroach init --insecure --host [fd00:1122:3344:101::6]:32221] executed and failed with status: exit status: 1 stdout: stderr: ERROR: cluster has already been initialized
Failed running "init"
All of these correspond to requests from RSS for POST /cockroachdb, which initialize the CockroachDB cluster (runs cockroach init and loads a few SQL files) using any of the CockroachDB zones on that sled. To summarize what's above: initially, these requests fail apparently because the zone is not yet set up. At some point, one request completes successfully. Then Sled Agent keeps getting those requests and those fail because this step has already been done.
Over on the RSS side, the only log entries for this I found were:
$ grep /cockroachdb sled-agent-1-RSS.log | bunyan
[2023-07-05T17:14:21.843265387Z] WARN: SledAgent/RSS/8449 on workstation-1: Failed to initialize CockroachDB
error: Communication Error: error sending request for url (http://[fd00:1122:3344:101::1]:12345/cockroachdb): operation timed out
[2023-07-05T17:15:24.605878005Z] WARN: SledAgent/RSS/8449 on workstation-1: Failed to initialize CockroachDB
error: Communication Error: error sending request for url (http://[fd00:1122:3344:101::1]:12345/cockroachdb): operation timed out
We discovered at this point that these hosts' clocks are not sync'd up: workstation-1 is about 197 seconds ahead of test-station-2. Thus, those timeouts happened around 2023-07-05T17:11:04.843Z and 2023-07-05T17:12:07.605Z on test-station-1's clock.
So what happened here? The "operation timed out" sounds like either an explicit request timeout or else the TCP connection failed because one side didn't ACK within some expected time window. @internet-diglett found that we are using a 60-second request timeout here:
17:10:03.798Z on test-station-1, a `POST /cockroachdb` request (call it R1) fails because networking is not yet setup in the zone (direct evidence: the log entry above)
17:10:04.483Z on workstation-1, RSS tries another `POST /cockroachdb` request (call it R2) (indirect evidence: this event is inferred based on the next event + knowing there's a 60-second request timeout -- also consistent with the failure just miliseconds earlier)
17:11:04.483Z on workstation-1, RSS times out request R2 (direct evidence: log entry above, with timestamp adjusted for the clock difference between the two servers). Critically, _request R2 would still be running_.
17:11:07.605Z on workstation-1, RSS tries another `POST /cockroachdb` request (call it R3) (indirect evidence: similar to above). This gets stuck behind `SledAgent.inner.zones.lock().await` because this lock is held by R2.
17:12:07.605Z on workstation-1, RSS times out request R3 (direct evidence: similar to above)
17:12:48.114Z on test-station-1, request R2 completes successfully, having taken about 2m43s
Subsequently, all requests by RSS to `POST /cockroachdb` fail because the database is already initialized (by R2).
All of this is what we'd expect if the operations done during POST /cockroachdb took more than 60s. So did that happen? We can't tell for sure, but @internet-diglett ran the two cockroach sql steps that are part of db initialization by hand. Together they took over 2m24s. So there's every reason to believe this is what happened.
I think there's room for several improvements here:
Most immediately, I think we should not time out this request on the RSS side. A connect timeout is fine. But I don't think any good can come from abandoning the in-flight request that we haven't gotten a response for if we have no reason to think there's a networking problem in the way. Update: see do not time out database init requests #3503.
I think it'd be worthwhile for RSS to log all requests that it makes, either at "debug" or "trace" level. This would have made it easier (and higher confidence) to construct the above timeline. Update: I'm not sure how best to do this. Filed log all RSS HTTP requests by default #3502 for it.
The warning log message on the SledAgent side could include how long before it's going to retry. This information is available at the time we log this message. This would make it easier (and higher confidence) to construct the above timeline. Update: see do not time out database init requests #3503.
I'm going to try to roll most of those into a PR shortly and file separate issues for the rest.
@internet-diglett was deploying Omicron (a branch off "main" but I think that's not relevant here) and ran into a case where RSS was looping forever trying to retry database initialization that had already completed. There were two sleds: "workstation-1" (running RSS) and "test-station-1".
On workstation-1, we have a bunch of log messages like this one:
followed by exactly one message like this:
followed by a bunch of messages like this:
All of these correspond to requests from RSS for
POST /cockroachdb, which initialize the CockroachDB cluster (runscockroach initand loads a few SQL files) using any of the CockroachDB zones on that sled. To summarize what's above: initially, these requests fail apparently because the zone is not yet set up. At some point, one request completes successfully. Then Sled Agent keeps getting those requests and those fail because this step has already been done.Over on the RSS side, the only log entries for this I found were:
We discovered at this point that these hosts' clocks are not sync'd up: workstation-1 is about 197 seconds ahead of test-station-2. Thus, those timeouts happened around 2023-07-05T17:11:04.843Z and 2023-07-05T17:12:07.605Z on test-station-1's clock.
So what happened here? The "operation timed out" sounds like either an explicit request timeout or else the TCP connection failed because one side didn't ACK within some expected time window. @internet-diglett found that we are using a 60-second request timeout here:
omicron/sled-agent/src/rack_setup/service.rs
Line 932 in c3048a1
So here's a plausible sequence of events:
All of this is what we'd expect if the operations done during
POST /cockroachdbtook more than 60s. So did that happen? We can't tell for sure, but @internet-diglett ran the twocockroach sqlsteps that are part of db initialization by hand. Together they took over 2m24s. So there's every reason to believe this is what happened.I think there's room for several improvements here:
POST /cockroachdbis not idempotent. Update: I've filed Sled AgentPOST /cockroachdbis not idempotent #3499 for this.POST /cockroachdbthat succeded did take longer than 60s. Update: this is "request completed" log message should include response latency dropshot#692.I'm going to try to roll most of those into a PR shortly and file separate issues for the rest.