Here's the CI failure:
https://github.com/oxidecomputer/omicron/pull/5962/checks?check_run_id=26826135569
This PR is non-trivial, and it's conceivable that my changes introduced this, but I don't yet see how. The helios-deploy job failed in the usual way (timing out after 600s while trying to log in):
2024-06-28 22:19:52.051692811 UTC: login failed: logging in: error sending request for url (https://recovery.sys.oxide.test/v1/login/recovery/local): error trying to connect: error:0A000419:SSL routines:ssl3_read_bytes:tlsv1 alert access denied:ssl/record/rec_layer_s3.c:1605:SSL alert number 49
Error: logging in
Caused by:
timed out after 600.598279588s
That TLS error can indicate that there are no certificates, which would point to a problem with the "external endpoints" task. I went to the Nexus logs to look for errors and found that one Nexus failed some of the first-time-setup steps:
2024-06-28T22:10:18.588Z ERRO nexus (DataLoader): gave up trying to populate built-in PopulateBuiltinVpcs
error_message = InternalError { internal_message: "Unknown diesel error creating VpcRouter called \"system\": Record not found" }
file = nexus/src/populate.rs:126
2024-06-28T22:10:18.588Z ERRO nexus: populate failed
file = nexus/src/app/mod.rs:528
2024-06-28T22:10:18.588Z ERRO nexus: saga request channel closed!
file = nexus/src/app/mod.rs:544
One possibility here is that before my PR, we would still have run the background tasks that eventually configure TLS, whereas now we won't. But at best the old code would have been papering over this problem.
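For anyone triaging similar login failures: the "SSL alert number 49" at the end of the error above is the standard TLS alert code for access_denied. Here is a minimal sketch for pulling the alert number out of an OpenSSL-style error string and naming it. The helper names `alert_number` and `alert_name` are hypothetical (nothing like this exists in omicron as far as I know), and the table covers only a handful of codes from RFC 8446. It only decodes the alert; it says nothing about why the server sent it.

```rust
/// Extract the numeric code from an OpenSSL-style "SSL alert number N" suffix.
/// Hypothetical helper for illustration; not part of omicron.
fn alert_number(msg: &str) -> Option<u8> {
    const MARKER: &str = "SSL alert number ";
    // Use the last occurrence, since the marker appears at the end of the chain.
    let start = msg.rfind(MARKER)? + MARKER.len();
    let digits: String = msg[start..]
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    // An empty or out-of-range digit string yields None.
    digits.parse().ok()
}

/// Map a few TLS alert codes to their names (RFC 8446, section 6).
/// Deliberately incomplete: only codes likely to show up in handshake failures.
fn alert_name(code: u8) -> &'static str {
    match code {
        40 => "handshake_failure",
        42 => "bad_certificate",
        46 => "certificate_unknown",
        48 => "unknown_ca",
        49 => "access_denied",
        112 => "unrecognized_name",
        _ => "other (not listed in this sketch)",
    }
}
```

Calling `alert_number` on the error text from the helios-deploy log and passing the result to `alert_name` gives `access_denied`, which matches the "tlsv1 alert access denied" text that OpenSSL already prints.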