[nexus] Resilience to arbitrary boot order#3365
Conversation
| let address = loop { | ||
| match resolver | ||
| .lookup_socket_v6(ServiceName::Cockroach) | ||
| .await | ||
| { | ||
| Ok(address) => break address, | ||
| Err(e) => { | ||
| warn!( | ||
| log, | ||
| "Failed to lookup cockroach address: {e}" | ||
| ); | ||
| tokio::time::sleep(std::time::Duration::from_secs( | ||
| 1, | ||
| )) | ||
| .await; | ||
| } | ||
| } | ||
| }; |
There was a problem hiding this comment.
I'd kinda like to land somewhere that:
- Whenever we want to access a service, we could look up the address again through DNS
- Whenever we want to access DNS, we could access any of the DNS server
- If we time-out or error accessing anything, we can customize a back-off policy
... which sorta seems like a Rust version of cueball. Maybe that's a natural follow-up from here?
There was a problem hiding this comment.
Yes, I think so. Related to that: we want to proactively resolve stuff before we need it so that DNS resolution latency/availability does not affect latency/availability of the stuff that needs it so much.
Maybe file an issue and add a reference both here and in the Dendrite loop (so that people don't, like, keep copying and pasting this pattern)?
| Err(e) => { | ||
| warn!(log, "Failed to access Dendrite address: {e}"); | ||
| tokio::time::sleep(std::time::Duration::from_secs(1)) | ||
| .await; | ||
| } |
There was a problem hiding this comment.
This seems like something we probably want to revisit. (Nexus can't come up at all without Dendrite? It seems like it'd be better to just construct a lazily-resolving DPD client and let the rest of Nexus come up.)
But this PR is a clear improvement from what's there now.
| let address = loop { | ||
| match resolver | ||
| .lookup_socket_v6(ServiceName::Cockroach) | ||
| .await | ||
| { | ||
| Ok(address) => break address, | ||
| Err(e) => { | ||
| warn!( | ||
| log, | ||
| "Failed to lookup cockroach address: {e}" | ||
| ); | ||
| tokio::time::sleep(std::time::Duration::from_secs( | ||
| 1, | ||
| )) | ||
| .await; | ||
| } | ||
| } | ||
| }; |
There was a problem hiding this comment.
Yes, I think so. Related to that: we want to proactively resolve stuff before we need it so that DNS resolution latency/availability does not affect latency/availability of the stuff that needs it so much.
Maybe file an issue and add a reference both here and in the Dendrite loop (so that people don't, like, keep copying and pasting this pattern)?
…3364) This change is necessary for #3365 , which tests arbitrary boot order for Nexus. ### What changes do we need? - To test the behavior we want, it must be possible to launch Nexus before launching CockroachDB/Dendrite, to validate that "nothing bad happens, Nexus just waits". - Furthermore, we must be able to run a test where Nexus looks up CockroachDB's address from the internal DNS system (it hasn't launched beforehand, so the address is not known). This has a couple implications... - First, the CockroachDB address must be parsed after it is launched, and passed into the internal DNS server as a record. - Second, Nexus-under-test must be able to access an internal DNS server that actually runs, and it must be able to parse records from that server. This means, unlike Nexus-in-production, we shouldn't "infer" the address of the DNS server from a subnet - we need to actually launch and manage per-test DNS servers. ### Why are these changes useful? - Customization of boot order is a useful tool generally for testing - Accessing services through the DNS subsystem allows the system-under-test to more accurately align with the system-under-production. Setting up a DNS server for each test requires some additional metadata tracking, but avoids having distinct "access the address under test" vs "access the resolver under prod" pathways. ### Previously - `test_setup_with_config`, from `nexus/test-utils/src/lib.rs`, was used as the "one-stop-shop" for setting up all state before running tests. This includes launching CockroachDB, then launching Nexus. For most tests, this is useful, but for our explicitly-testing-weird-order initialization test, this function is too opinionated to call directly. ### Now - We introduce a new structure called `ControlPlaneTestContextBuilder`. This adds a "builder pattern" to the construction of ControlPlaneTestContext, using a bunch of Options to store intermediate state. - We add an `InMemoryServer` to the dns-server crate, which the tests can use for a dynamic DNS server. - We allow Nexus to refer to the DNS servers "by subnet" (previous default) or "by a specific address", which is useful for systems-under-test where the DNS server addresses may be dependent on test-specific ports. - We populate an InMemoryServer DNS server with values created by the ControlPlaneTestContextBuilder. - We cleanup the simulated sled agent a bit, by removing the DNS server from the "sim::server::Server".
Fixes #2614