Skip to content

Test setup as builder, pull DNS server out of simulated sled agent#3364

Merged
smklein merged 15 commits into
mainfrom
resilient-nexus-builder
Jun 28, 2023
Merged

Test setup as builder, pull DNS server out of simulated sled agent#3364
smklein merged 15 commits into
mainfrom
resilient-nexus-builder

Conversation

@smklein

@smklein smklein commented Jun 16, 2023

Copy link
Copy Markdown
Collaborator

This change is necessary for #3365 , which tests arbitrary boot order for Nexus.

What changes do we need?

  • To test the behavior we want, it must be possible to launch Nexus before launching CockroachDB/Dendrite, to validate that "nothing bad happens, Nexus just waits".
  • Furthermore, we must be able to run a test where Nexus looks up CockroachDB's address from the internal DNS system (it hasn't launched beforehand, so the address is not known). This has a couple implications...
    • First, the CockroachDB address must be parsed after it is launched, and passed into the internal DNS server as a record.
    • Second, Nexus-under-test must be able to access an internal DNS server that actually runs, and it must be able to parse records from that server. This means, unlike Nexus-in-production, we shouldn't "infer" the address of the DNS server from a subnet - we need to actually launch and manage per-test DNS servers.

Why are these changes useful?

  • Customization of boot order is a useful tool generally for testing
  • Accessing services through the DNS subsystem allows the system-under-test to more accurately align with the system-under-production. Setting up a DNS server for each test requires some additional metadata tracking, but avoids having distinct "access the address under test" vs "access the resolver under prod" pathways.

Previously

  • test_setup_with_config, from nexus/test-utils/src/lib.rs, was used as the "one-stop-shop" for setting up all state before running tests. This includes launching CockroachDB, then launching Nexus. For most tests, this is useful, but for our explicitly-testing-weird-order initialization test, this function is too opinionated to call directly.

Now

  • We introduce a new structure called ControlPlaneTestContextBuilder. This adds a "builder pattern" to the construction of ControlPlaneTestContext, using a bunch of Options to store intermediate state.
  • We add an InMemoryServer to the dns-server crate, which the tests can use for a dynamic DNS server.
  • We allow Nexus to refer to the DNS servers "by subnet" (previous default) or "by a specific address", which is useful for systems-under-test where the DNS server addresses may be dependent on test-specific ports.
  • We populate an InMemoryServer DNS server with values created by the ControlPlaneTestContextBuilder.
  • We cleanup the simulated sled agent a bit, by removing the DNS server from the "sim::server::Server".

Comment thread common/src/nexus_config.rs
Comment thread dns-server/src/lib.rs Outdated
Comment thread nexus/src/context.rs
Comment thread nexus/test-utils/src/lib.rs
Comment thread sled-agent/src/sim/server.rs
Comment thread sled-agent/src/sim/server.rs
Comment thread test-utils/src/dev/db.rs
Comment thread nexus/test-utils/src/lib.rs
Comment thread dns-server/src/lib.rs Outdated
Comment thread nexus/examples/config.toml Outdated
Comment on lines +54 to +56
[deployment.internal_dns]
type = "from_subnet"
subnet.net = "fd00:1122:3344:0100::/56"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is probably not what we want in the example config. The example config is used in the how-to-simulated instructions for people running the components by hand. It seems like we need to update those in this PR to have you start an internal DNS server (probably on a known port on localhost) and then have this config file point at that. That's what we already do for the other components in those steps.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update the documentation - I believe if the how-to-run-simulated instructions are followed, and internal DNS server is already being spun up, and this value is overwritten: 65e0473

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you have updated the "quick start" pieces in that commit (thanks!) but if somebody follows the "Running the pieces by hand" section, won't they be missing the internal DNS server?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Starting the simulated sled agent still starts a crucible pantry and an internal DNS server, as it did before this PR. It certainly isn't getting populated with all the records we want, but I think that's the same as before this PR -- it only adds a record for the Crucible Pantry.

I think it would certainly be an improvement to start the Internal DNS server separately from the simulated sled agent, and have more of the configs point to it (rather than using hard-coded addresses, as they currently do!), but I don't think that would be a small change.

(Regardless, thanks for pointing this out - it prompted me to run through all the "simulated sled agent" manually, where I patched an unrelated bug.)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this configuration file is also used directly in the how-to-run-simulated instructions. It still points Nexus at a specific IP address which is not right for the user's configuration. Right? It seems like there's a missing step in these instructions which is to customize this file somehow? (and the next question is whether there's any way we could avoid that -- e.g., maybe running the simulated sled agent by hand should use a fixed IP and port for the DNS server so that we can put that directly in this file? That's what we do for the other components there.)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to tackle this in e745c76 - that lets callers customize the internal DNS, DNS (not http) address which can then be propagated to Nexus.

This means that the value in the Nexus config file will be respected, as long as the same address is passed to the simulated sled agent to start a DNS server listening on it too.

Comment thread nexus/src/context.rs Outdated
Comment thread nexus/src/context.rs
Comment thread nexus/src/context.rs
Comment thread nexus/src/lib.rs
Comment thread nexus/tests/config.test.toml
Comment thread sled-agent/src/sim/server.rs Outdated
Comment thread nexus/test-utils/src/lib.rs
Comment thread dns-server/src/lib.rs Outdated
Comment thread nexus/examples/config.toml Outdated
Comment on lines +54 to +56
[deployment.internal_dns]
type = "from_subnet"
subnet.net = "fd00:1122:3344:0100::/56"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you have updated the "quick start" pieces in that commit (thanks!) but if somebody follows the "Running the pieces by hand" section, won't they be missing the internal DNS server?

Comment thread dns-server/src/lib.rs Outdated

@davepacheco davepacheco left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! It'd be good to know that the updated "how-to-run" instructions work as-is.

@smklein

smklein commented Jun 28, 2023

Copy link
Copy Markdown
Collaborator Author

Thanks! It'd be good to know that the updated "how-to-run" instructions work as-is.

Just manually ran it, seems to work: Nexus comes up (and can respond to requests) using the provided docs.

@smklein smklein merged commit acd7631 into main Jun 28, 2023
@smklein smklein deleted the resilient-nexus-builder branch June 28, 2023 03:38
smklein added a commit that referenced this pull request Jun 28, 2023
- Ensures that Nexus can boot safely, even if CockroachDB, Dendrite, or
Internal DNS boot lazily
- Adds tests for this behavior (relying on
#3364)

Fixes #2614
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants