Skip to content

Zone network configuration service#4677

Merged
karencfv merged 21 commits into
oxidecomputer:mainfrom
karencfv:zone-networking-service
Jan 15, 2024
Merged

Zone network configuration service#4677
karencfv merged 21 commits into
oxidecomputer:mainfrom
karencfv:zone-networking-service

Conversation

@karencfv

@karencfv karencfv commented Dec 12, 2023

Copy link
Copy Markdown
Contributor

As part of the work for self assembling zones, it was suggested to break the network configuration out into a separate service.

Implementation

This PR introduces a new SMF service oxide/zone-network-setup, which sets up the common initial zone networking configuration for each self assembled zone.

Each of the "self assembled zone" services will now depend on this new service to run, and all properties relating to zone network configuration have been removed from these services.

The executable which does the actual zone networking setup, is built as a tiny CLI. It takes advantage of clap's parsing validation to make sure we have all of the properties present, and in the format they are intended to be.

Caveats

There are two remaining self assembled zones that don't depend on this new service yet (crucible and crucible-pantry). As these two zones need coordinated PRs with the crucible repo, I'd like to implement these in a follow up PR once this one is approved and merged.

Comment thread smf/cockroachdb/method_script.sh Outdated
@@ -18,15 +18,17 @@ if [[ $DATALINK == unknown ]] || [[ $GATEWAY == unknown ]]; then
fi

# TODO remove when https://github.com/oxidecomputer/stlouis/issues/435 is addressed

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmpesp I see this issue has been closed, are we OK to remove this part now?

@jclulow

jclulow commented Dec 14, 2023

Copy link
Copy Markdown
Collaborator

I have had a quick squiz at the PR, and I think adding a Rust program to do the work is indeed much better than a shell script! Thanks for digging into this. Before looking too closely at the code, I have some architectural thoughts.

Separate SMF service

I think we should do the networking setup in a wholly separate SMF service. In the PR as it stands, we've replaced some direct ipadm invocations in the method script for the Cockroach DB SMF service -- but the moving parts are still essentially a part of the Cockroach DB service.

What I would sort of expect is to add a new SMF service, with just a default instance; e.g., svc:/oxide/network-setup:default. This would be enabled by default. It should be of the transient duration (see svc.startd(8)), exiting successfully when it completes setup. It should be idempotent, so that if it is restarted (whether after interruption or not) it does the right thing. It probably does not need to tear things down in a stop method, and it may not presently need to provide a functional refresh method at this time.

To the extent that we need to communicate any information to this service from outside (e.g., addresses or subnets or interface names) we would do so via properties in the default service instance. The name, value, type, and behaviour of each of those properties (and indeed the FMRI of the whole service instance) would essentially become a cross-version stable interface; if sled agent needs to populate them, we should commit to them so that the version of sled agent and the version of the zone image aren't locked together.

I suspect this new service would depend on the existing svc:/network/physical service that ships with the operating system. We would probably then make the OS milestone service, svc:/milestone/network, depend on our new service here. Then we would redo any other services, like Cockroach DB, that probably don't want to start until the network is configured, such that they depend on the network milestone. That's roughly what we do with a similar piece of software, the illumos metadata agent, except in a different context -- the metadata agent is a replacement for cloud-init, and runs in the global zone of virtualised guests on cloud platforms. In this case, the network setup service is running inside a non-global zone on top of an Omicron-managed system.

No shell at all

I don't think we need to use a shell script as the start/stop method for the new service. I would restructure the new Rust program to be invoked as the start method directly, rather than invoked several times from a new shell script. I also wouldn't do any argument construction or parsing, except for perhaps passing the method name (see %m in Method Tokens of smf_method(7)), but instead just use the smf crate (which uses libscf(3LIB) directly) to access any properties that we need.


There is, I suspect, a lot to digest here. I've tried to link some manual pages that contain some of the details. When reading and writing SMF manifests, it may also be worth looking at the service_bundle(5) XML DTD, which is shipped in the OS and also in the source base; it contains a lot of comments about the structure of those files.

Definitely happy to answer any questions or talk more about this whenever it would be helpful! Let me know!

@karencfv

Copy link
Copy Markdown
Contributor Author

Hey @jclulow thanks a bunch for the detailed feedback! There's definitely a lot of Illumos knowledge that I'm missing here. I'm going to read the links you left me here, and I'll definitely take you up on that offer to catch up as I'll most likely have a lot of questions 😄

@karencfv

Copy link
Copy Markdown
Contributor Author

I think we should do the networking setup in a wholly separate SMF service. In the PR as it stands, we've replaced some direct ipadm invocations in the method script for the Cockroach DB SMF service -- but the moving parts are still essentially a part of the Cockroach DB service.

@jclulow, as I understand it, as it stands today, a new SMF service would have to be created by RSS. This would mean that we cannot have a new service running on the current racks until zones are provisioned by Nexus instead of RSS 😞. Hopefully I'm wrong about this, WDYT?

@jclulow

jclulow commented Dec 14, 2023

Copy link
Copy Markdown
Collaborator

I don't know much about RSS, but an SMF service is defined by an XML file that can just be shipped in the correct place in the zone image. If you put it in /lib/svc/manifest/site it ought, I expect, to be imported early enough to impose itself as a dependency of the network milestone. I believe manifests in that directory are imported by the early-manifest-import service at zone boot.

@jgallagher

Copy link
Copy Markdown
Contributor

@jclulow, as I understand it, as it stands today, a new SMF service would have to be created by RSS. This would mean that we cannot have a new service running on the current racks until zones are provisioned by Nexus instead of RSS 😞. Hopefully I'm wrong about this, WDYT?

There's a bit of term overloading here - we can't currently start new control plane services / service zones on deployed racks because only RSS sets them up today, but SMF services are a lower level thing. Sled agent already configures and starts various SMF services within many of the zones it starts (e.g., all the services within the switch zone, where the list of switch zone SMF services has grown over time without having to deal with RSS or control plane service management). I think the hard line is probably "does sled-agent have all the information it needs to configure this {SMF,control plane} service", which is usually/always true for "SMF service" and false for "control plane service".

@karencfv

Copy link
Copy Markdown
Contributor Author

Ah, perfect! Thanks for the clarification @jclulow @jgallagher 🙇‍♀️

@karencfv

karencfv commented Dec 21, 2023

Copy link
Copy Markdown
Contributor Author

@jclulow thanks a bunch for your help! I've updated the PR to reflect the feedback you gave me.

If this architecture makes more sense, I'll update the rest of the self assembled zones to follow this pattern and move the PR out of draft mode. Update: It wasn't much so I've updated everything and the PR is ready

Let me know what you think :)

@karencfv karencfv marked this pull request as ready for review December 21, 2023 23:37
@karencfv karencfv requested a review from citrus-it December 21, 2023 23:38
Comment thread smf/zone-network-setup/manifest.xml Outdated

@citrus-it citrus-it left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this, it's going to make things a lot cleaner around self-assembly zones.
I have a few comments, some of which are just suggestions (Github doesn't make it easy to differentiate). If anything isn't clear or you don't agree, please let me know.

(and apologies for you now having to re-base this on top of my disable-sshd PR that just merged!)

Comment thread illumos-utils/src/route.rs
Comment thread illumos-utils/src/route.rs Outdated
Comment thread sled-agent/src/services.rs Outdated
Comment thread smf/zone-network-setup/manifest.xml Outdated
Comment thread smf/zone-network-setup/method_script.sh Outdated
@karencfv

karencfv commented Jan 9, 2024

Copy link
Copy Markdown
Contributor Author

Thanks for the review @citrus-it ! I think I've addressed all of your comments, let me know what you think :)

@karencfv karencfv requested a review from citrus-it January 9, 2024 06:25
Comment thread package-manifest.toml Outdated
@citrus-it

Copy link
Copy Markdown
Contributor

Thanks for the review @citrus-it ! I think I've addressed all of your comments, let me know what you think :)

This looks good, thanks. Just added a comment about the package-manifest.toml.

I think it's fine to just have a single static address for now, since there's no consumer for multiple addresses.

@karencfv karencfv requested a review from citrus-it January 9, 2024 20:57

@citrus-it citrus-it left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates, this is looking good to me.

@karencfv

Copy link
Copy Markdown
Contributor Author

Thanks for the review @citrus-it!

Tiny ping for @jclulow , @davepacheco @jgallagher or @smklein to review on the rust side of things, and I'll merge this :)

@jclulow

jclulow commented Jan 10, 2024

Copy link
Copy Markdown
Collaborator

I will take a look soon!

@karencfv

Copy link
Copy Markdown
Contributor Author

heyo! 👋

Sorry for the pressure here I know there's so much going on! Would it be possible to get a review on the rust side of things? @citrus-it kindly went over the Illumos side of this PR, so there's less to review!

The thing is, I need this approved so I can move on with the rest of the work for the self assembled zones.

@smklein smklein left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

Comment thread illumos-utils/src/route.rs Outdated
// If the entry is not found in the table,
// the exit status of the command will be 3 (ESRCH).
// When that is the case, we'll add the route.
Some(3) => {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a nitpick -- errno values aren't necessarily the same between systems. I think this value of 3 "happens to be" right, but could we use the constant from https://docs.rs/libc/0.2.152/libc/constant.ESRCH.html instead?

Comment on lines +1422 to +1439
fn zone_network_setup_install(
info: &SledAgentInfo,
zone: &InstalledZone,
static_addr: &String,
) -> Result<ServiceBuilder, Error> {
let datalink = zone.get_control_vnic_name();
let gateway = &info.underlay_address.to_string();

let mut config_builder = PropertyGroupBuilder::new("config");
config_builder = config_builder
.add_property("datalink", "astring", datalink)
.add_property("gateway", "astring", gateway)
.add_property("static_addr", "astring", static_addr);

Ok(ServiceBuilder::new("oxide/zone-network-setup")
.add_property_group(config_builder)
.add_instance(ServiceInstanceBuilder::new("default")))
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rad, this looks great (totally makes sense to have this be a static config in zones before they start)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙇‍♀️

@karencfv karencfv merged commit 3b4b935 into oxidecomputer:main Jan 15, 2024
@karencfv karencfv deleted the zone-networking-service branch January 15, 2024 23:18
@karencfv

Copy link
Copy Markdown
Contributor Author

Thanks everyone!

karencfv added a commit to oxidecomputer/crucible that referenced this pull request Feb 20, 2024
oxidecomputer/omicron#4677 implements a new zone network configuration setup service so control plane services don't have to set this up themselves. This PR updates crucible to use said services.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants