[1/n] create a nexus-lockstep service#8983
Conversation
jgallagher
left a comment
There was a problem hiding this comment.
I focused mostly on the OmicronZoneType (and related) changes and skimmed the rest - looks good to me. I'm always a bit nervous changing these bits - probably worth testing a before -> mupdate -> after to make sure we didn't miss anything w.r.t. serialized ledgers? (Also fine to defer that testing to one of the follow PRs that build on this, if we want to stack a few and land them all together.)
| bind_address = "10.1.2.3:4568" | ||
| default_request_body_max_bytes = 1024 | ||
| [deployment.dropshot_debug] | ||
| bind_address = "10.1.2.3:4569" |
There was a problem hiding this comment.
This allows a full socketaddr, but the DNS helper function only allows supplying a debug port. Is the expectation that the IP for this must match dropshot_internal's? (Do we check that at runtime? Still reading the PR...)
There was a problem hiding this comment.
This object is a dropshot::ConfigDropshot, so yeah unfortunately that’s the way it is. I think I check that they have the same IP at runtime but I should double-check all the calls.
There was a problem hiding this comment.
OK, Nexus now should fail to start if the IPs don't match.
| use sled_hardware_types::{Baseboard, SledCpuFamily}; | ||
|
|
||
| /// Identity and basic status information about this sled agent | ||
| #[derive(Deserialize, Serialize, JsonSchema)] |
There was a problem hiding this comment.
Oooof on this whole file. Thanks for going through this tedium.
| assert_eq!(new_srv.target, new_zone_host.fqdn()); | ||
| assert_eq!(changed.len(), 2); | ||
| assert_eq!(changed[0].0, ServiceName::NexusDebug.dns_name()); | ||
| assert_eq!(changed[1].0, ServiceName::Nexus.dns_name()); |
There was a problem hiding this comment.
Is the order of changed guaranteed?
There was a problem hiding this comment.
Yep, the underlying map is a BTreeMap. (As I wrote this my sense for flakes was tingling as well so I checked!)
| /// Dropshot configuration for lockstep API server. | ||
| #[schemars(skip)] // TODO we're protected against dropshot changes | ||
| pub dropshot_lockstep: ConfigDropshot, |
There was a problem hiding this comment.
I'm afraid this might be a flag day between Nexus and Sled Agent (because it writes out this config file). But I guess it's okay because Sled Agent is always updated before Nexus, so it will just start supplying this.
There was a problem hiding this comment.
Yeah, it would be. I think if we're not planning on running Nexus-driven update on dogfood before this lands, we're fine, but otherwise I could make this an Option and bubble that all the way through? Would be obnoxious but theoretically fine.
There was a problem hiding this comment.
I think it's basically fine as-is because we always update sled agent before Nexus and the change is backwards-compatible.
| internal_address: SocketAddrV6, | ||
| /// The port at which the internal lockstep server is reachable. This | ||
| /// shares the same IP address with `internal_address`. | ||
| #[serde(default = "default_nexus_lockstep_port")] |
There was a problem hiding this comment.
I think we've usually avoided this sort of thing. I guess to avoid it you'd need to define a separate struct here with a From impl, which may not be worth it. But can we at least figure out when it would be safe to remove this (if ever?), document that, and file an issue?
There was a problem hiding this comment.
We similarly used #[serde(default)] to add the image_source field to OmicronZoneConfig as a non-breaking change.
It would be nice to clean both up but I'm unsure when it's safe. Presumably after an upgrade to R17 and setting the target version, all sled-agent ledgers will gain the field?
There was a problem hiding this comment.
I'm not positive. Would you mind filing an issue and raising it on the update call today? (Hopefully a brief discussion.)
There was a problem hiding this comment.
What's this file, and is it a problem that it's changed? I'm guessing it's okay because there's a default?
There was a problem hiding this comment.
I don’t really have any idea what this file is.
There was a problem hiding this comment.
This is the JSON schema for the on-disk, ledgered zones config from before we did the work to combine zones+datasets+disks into OmicronSledConfig. I think this change is fine:
a) it has a default so we can still deserialize old versions missing this field
b) I think we've updated all systems past the point where we might have any of these left behind anyway?
The only use of this is in a test for the config-reconciler's "convert from legacy ledgers" stuff.
First step of #8902. It's enough work to get Nexus to stand up another HTTP service that this is worth its own PR ahead of moving APIs out of nexus-internal and into nexus-lockstep.
First step of #8902. It's enough work to get Nexus to stand up another HTTP service that this is worth its own PR ahead of moving APIs out of nexus-internal and into nexus-lockstep.