Skip to content

[nexus] Resilience to arbitrary boot order#3365

Merged
smklein merged 25 commits into
mainfrom
resilient-nexus-boot-order
Jun 28, 2023
Merged

[nexus] Resilience to arbitrary boot order#3365
smklein merged 25 commits into
mainfrom
resilient-nexus-boot-order

Conversation

@smklein

@smklein smklein commented Jun 16, 2023

Copy link
Copy Markdown
Collaborator

Fixes #2614

@smklein smklein marked this pull request as ready for review June 16, 2023 18:59
Comment thread nexus/src/context.rs
Comment on lines +161 to +178
let address = loop {
match resolver
.lookup_socket_v6(ServiceName::Cockroach)
.await
{
Ok(address) => break address,
Err(e) => {
warn!(
log,
"Failed to lookup cockroach address: {e}"
);
tokio::time::sleep(std::time::Duration::from_secs(
1,
))
.await;
}
}
};

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd kinda like to land somewhere that:

  • Whenever we want to access a service, we could look up the address again through DNS
  • Whenever we want to access DNS, we could access any of the DNS server
  • If we time-out or error accessing anything, we can customize a back-off policy

... which sorta seems like a Rust version of cueball. Maybe that's a natural follow-up from here?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think so. Related to that: we want to proactively resolve stuff before we need it so that DNS resolution latency/availability does not affect latency/availability of the stuff that needs it so much.

Maybe file an issue and add a reference both here and in the Dendrite loop (so that people don't, like, keep copying and pasting this pattern)?

@david-crespo david-crespo left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment thread nexus/src/app/mod.rs
Comment on lines +190 to +194
Err(e) => {
warn!(log, "Failed to access Dendrite address: {e}");
tokio::time::sleep(std::time::Duration::from_secs(1))
.await;
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like something we probably want to revisit. (Nexus can't come up at all without Dendrite? It seems like it'd be better to just construct a lazily-resolving DPD client and let the rest of Nexus come up.)

But this PR is a clear improvement from what's there now.

Comment thread nexus/src/context.rs
Comment on lines +161 to +178
let address = loop {
match resolver
.lookup_socket_v6(ServiceName::Cockroach)
.await
{
Ok(address) => break address,
Err(e) => {
warn!(
log,
"Failed to lookup cockroach address: {e}"
);
tokio::time::sleep(std::time::Duration::from_secs(
1,
))
.await;
}
}
};

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think so. Related to that: we want to proactively resolve stuff before we need it so that DNS resolution latency/availability does not affect latency/availability of the stuff that needs it so much.

Maybe file an issue and add a reference both here and in the Dendrite loop (so that people don't, like, keep copying and pasting this pattern)?

Base automatically changed from resilient-nexus-builder to main June 28, 2023 03:38
smklein added a commit that referenced this pull request Jun 28, 2023
…3364)

This change is necessary for
#3365 , which tests
arbitrary boot order for Nexus.

### What changes do we need?

- To test the behavior we want, it must be possible to launch Nexus
before launching CockroachDB/Dendrite, to validate that "nothing bad
happens, Nexus just waits".
- Furthermore, we must be able to run a test where Nexus looks up
CockroachDB's address from the internal DNS system (it hasn't launched
beforehand, so the address is not known). This has a couple
implications...
- First, the CockroachDB address must be parsed after it is launched,
and passed into the internal DNS server as a record.
- Second, Nexus-under-test must be able to access an internal DNS server
that actually runs, and it must be able to parse records from that
server. This means, unlike Nexus-in-production, we shouldn't "infer" the
address of the DNS server from a subnet - we need to actually launch and
manage per-test DNS servers.

### Why are these changes useful?

- Customization of boot order is a useful tool generally for testing
- Accessing services through the DNS subsystem allows the
system-under-test to more accurately align with the
system-under-production. Setting up a DNS server for each test requires
some additional metadata tracking, but avoids having distinct "access
the address under test" vs "access the resolver under prod" pathways.

### Previously

- `test_setup_with_config`, from `nexus/test-utils/src/lib.rs`, was used
as the "one-stop-shop" for setting up all state before running tests.
This includes launching CockroachDB, then launching Nexus. For most
tests, this is useful, but for our explicitly-testing-weird-order
initialization test, this function is too opinionated to call directly.

### Now

- We introduce a new structure called `ControlPlaneTestContextBuilder`.
This adds a "builder pattern" to the construction of
ControlPlaneTestContext, using a bunch of Options to store intermediate
state.
- We add an `InMemoryServer` to the dns-server crate, which the tests
can use for a dynamic DNS server.
- We allow Nexus to refer to the DNS servers "by subnet" (previous
default) or "by a specific address", which is useful for
systems-under-test where the DNS server addresses may be dependent on
test-specific ports.
- We populate an InMemoryServer DNS server with values created by the
ControlPlaneTestContextBuilder.
- We cleanup the simulated sled agent a bit, by removing the DNS server
from the "sim::server::Server".
@smklein smklein enabled auto-merge (squash) June 28, 2023 03:41
@smklein smklein merged commit d4fa465 into main Jun 28, 2023
@smklein smklein deleted the resilient-nexus-boot-order branch June 28, 2023 05:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

nexus crashes when it can't reach internal DNS on startup

3 participants