
Nexus busted after its initial startup raced with another Nexus populating "system" VpcRouter #5980

@davepacheco


Here's the CI failure:
https://github.com/oxidecomputer/omicron/pull/5962/checks?check_run_id=26826135569

This PR is non-trivial, and it's conceivable that my changes introduced this, but I don't yet see how. The helios-deploy job failed in the usual way (timing out after 600s trying to log in):

2024-06-28 22:19:52.051692811 UTC: login failed: logging in: error sending request for url (https://recovery.sys.oxide.test/v1/login/recovery/local): error trying to connect: error:0A000419:SSL routines:ssl3_read_bytes:tlsv1 alert access denied:ssl/record/rec_layer_s3.c:1605:SSL alert number 49
Error: logging in

Caused by:
    timed out after 600.598279588s

That TLS error can mean that Nexus has no certificates to serve, which would point to a problem with the "external endpoints" task. I went to the Nexus logs to look for errors and found that one Nexus failed some of the first-time-setup steps:

2024-06-28T22:10:18.588Z	ERRO	nexus (DataLoader): gave up trying to populate built-in PopulateBuiltinVpcs
error_message = InternalError { internal_message: "Unknown diesel error creating VpcRouter called \"system\": Record not found" }
file = nexus/src/populate.rs:126
2024-06-28T22:10:18.588Z	ERRO	nexus: populate failed
file = nexus/src/app/mod.rs:528
2024-06-28T22:10:18.588Z	ERRO	nexus: saga request channel closed!
file = nexus/src/app/mod.rs:544

One possibility here is that before my PR, we would still have run the background tasks that eventually configure TLS, whereas now we won't. But at best the old code would have been papering over this problem.
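
To make the suspected race concrete, here is a minimal sketch of a populate step that tolerates a peer having already created the row. It is not omicron's code: `RouterTable`, `ensure_router`, `Ensured`, and `race_two_populators` are hypothetical names, and an in-memory map behind a mutex stands in for the database. It only illustrates the general "create if absent, otherwise accept the existing record" approach, assuming the failure really does come from two Nexus instances populating the "system" VpcRouter at the same time.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for the database table of VPC routers.
#[derive(Default)]
struct RouterTable {
    rows: Mutex<HashMap<String, u64>>,
}

/// Outcome of an idempotent "ensure" call. Both variants are successes.
#[derive(Debug, PartialEq)]
enum Ensured {
    Created(u64),
    AlreadyExists(u64),
}

impl RouterTable {
    /// Insert the router if it is absent; otherwise return the existing row.
    /// The check and the insert happen under one lock, so a racing caller
    /// sees either "absent" or "present", never a half-finished state.
    fn ensure_router(&self, name: &str, candidate_id: u64) -> Ensured {
        let mut rows = self.rows.lock().unwrap();
        match rows.get(name) {
            Some(&id) => Ensured::AlreadyExists(id),
            None => {
                rows.insert(name.to_string(), candidate_id);
                Ensured::Created(candidate_id)
            }
        }
    }
}

/// Simulate two instances populating the same built-in router concurrently.
fn race_two_populators() -> Vec<Ensured> {
    let table = Arc::new(RouterTable::default());
    let handles: Vec<_> = (1..=2u64)
        .map(|id| {
            let table = Arc::clone(&table);
            thread::spawn(move || table.ensure_router("system", id))
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    // Which thread wins is not deterministic; both outcomes are successes.
    for outcome in race_two_populators() {
        println!("{:?}", outcome);
    }
}
```

In this sketch the loser of the race gets `AlreadyExists` with the winner's id rather than an error, so neither caller would give up on first-time setup.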

Labels: Test Flake