I have an at-home test rack of three machines that are in an Omicron cluster together. The cluster is running my pending live migration changes with some minor modifications to rack service planning to guarantee that Nexus always ends up on a specific machine in the cluster. This branch is currently based off of commit f83c7c6.
The machines in question here are:
- augustine: hosts Nexus and internal DNS; connected to the rest of the home network; has extra NICs that are linked directly to the other two machines; has its `sled_mode` set to "scrimlet" in `config.toml`
- albert/catherine: generic compute hosts; networked only to augustine; `config.toml` has `sled_mode` set to "gimlet"
Their Omicron packages were configured with `--switch=stub` (i.e. without SoftNPU) since I've previously tested these machines with a stub switch and haven't adapted their configuration to SoftNPU yet.
I can reach Nexus consistently from my workstation and create instances there. When those instances land on albert or catherine, their sled agents occasionally fail DNS resolution in places where it's clear they're trying to reach Nexus. Errors include the following:
- `{"msg":"State monitoring task failed: Error resolving DNS name: request timed out" ...` from a Propolis state monitoring task
- ``thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: Resolve(ResolveError { kind: Timeout })', sled-agent/src/instance.rs:810:66`` (source link)
The resolution failures aren't consistent, but they happen often enough that I can probably hit one reliably just by starting an instance and playing around with it a bit (trying to migrate it, etc.).
Other observations/notes:
- albert and catherine can both ping augustine's Nexus, although catherine occasionally throws up an `ICMPv6 Address Unreachable from gateway fd00:1122:3344:103::1` a couple of times first before deciding Nexus is alive
- `cargo run -p dns-server --bin dnsadm -- -a '[fd00:1122:3344:1::1]:5353' list-records` runs to completion on all three machines and shows the expected DNS entries
- this is the first time I've run these machines in this configuration since rebasing from b4b4b1f to f83c7c6
- I have previously run augustine (the Nexus/DNS box) as a single-sled machine with a package built with `--switch=softnpu`; I haven't rebooted this machine between those runs and this one; possibly I misconfigured something or held something incorrectly while trying to get it set up for the multi-machine test
I can hold these machines in this cluster for a short while and try to gather any logs etc. that would be interesting for debugging, though I'll probably need to reclaim them in a day or two.