Intermittent DNS resolution failures from sleds in an at-home multi-machine Omicron cluster

I have an at-home test rack of three machines running together in an Omicron cluster. The cluster is running my [pending live migration changes](https://github.com/oxidecomputer/omicron/compare/main...gjcolombo:omicron:gjcolombo/lets-migrate/8-migration-end) with some minor modifications to rack service planning to guarantee that Nexus always ends up on a specific machine in the cluster. This branch is currently based on commit f83c7c65a8e54f03713ea226b86984f2d3096cfb.

The machines in question are:

- augustine: hosts Nexus and internal DNS; connected to the rest of the home network; has extra NICs that are linked directly to the other two machines; has its `sled_mode` set to "scrimlet" in config.toml
- albert/catherine: generic compute hosts; networked only to augustine; config.toml has `sled_mode` set to "gimlet"

Their Omicron packages were configured with `--switch=stub` (i.e. without SoftNPU) since I've previously tested these machines with a stub switch and haven't adapted their configuration to SoftNPU yet.

I can reach Nexus consistently from my workstation and create instances there. When those instances land on albert or catherine, their sled agents occasionally fail DNS resolution in places where it's clear they're trying to reach Nexus. Errors include the following:

- `{"msg":"State monitoring task failed: Error resolving DNS name: request timed out" ...` from a Propolis state monitoring task
- ```thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: Resolve(ResolveError { kind: Timeout })', sled-agent/src/instance.rs:810:66``` ([source link](https://github.com/gjcolombo/omicron/blob/2a4dbd548f905ab087a4fd3969ece546fff5b591/sled-agent/src/instance.rs#L810))
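
The panic above comes from unwrapping a resolution result, so a single timed-out lookup takes down the task. As a rough illustration of the alternative (not the sled agent's actual code), the sketch below retries a lookup on timeout with a doubling delay and returns the error to the caller instead of panicking. `LookupError` and `resolve_with_retry` are hypothetical names made up for this example, and the code is synchronous for brevity, whereas the real call site is async.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical stand-in for the resolver's error type.
#[derive(Debug, Clone, PartialEq)]
pub enum LookupError {
    Timeout,
    NotFound,
}

/// Calls `lookup` at least once and at most `max_attempts` times.
/// Only timeouts are retried; the delay doubles after each retry.
/// Any other error, or a timeout on the final attempt, is returned.
pub fn resolve_with_retry<T, F>(
    mut lookup: F,
    max_attempts: u32,
    initial_delay: Duration,
) -> Result<T, LookupError>
where
    F: FnMut() -> Result<T, LookupError>,
{
    let mut delay = initial_delay;
    let mut attempt = 1;
    loop {
        match lookup() {
            Ok(value) => return Ok(value),
            Err(LookupError::Timeout) if attempt < max_attempts => {
                sleep(delay);
                delay = delay.saturating_mul(2);
                attempt += 1;
            }
            // Non-timeout errors and exhausted retries go back to the caller.
            Err(e) => return Err(e),
        }
    }
}
```

A caller would pass a closure that performs the actual lookup and then handle the final `Err` (for example by logging it and failing the one operation) rather than unwrapping it. Retrying only hides the symptom; it doesn't explain why the lookups time out in the first place.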

The resolution failures aren't consistent, but they happen often enough that I can probably hit one reliably just by starting an instance and exercising it for a bit (trying to migrate it, etc.).

Other observations/notes:

- albert and catherine can both ping augustine's Nexus, although catherine occasionally reports `ICMPv6 Address Unreachable from gateway fd00:1122:3344:103::1` a couple of times before deciding Nexus is alive
- `cargo run -p dns-server --bin dnsadm -- -a '[fd00:1122:3344:1::1]:5353' list-records` runs to completion on all three machines and shows the expected DNS entries
- this is the first time I've run these machines in this configuration since rebasing from b4b4b1f to f83c7c6
- I have previously run augustine (the Nexus/DNS box) as a single-sled machine with a package built with `--switch=softnpu`, and I haven't rebooted it between those runs and this one. It's possible I misconfigured something or left some stale state behind while setting it up for the multi-machine test
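
Since the failures are intermittent, it might help to measure how often a lookup against the internal DNS server actually times out from each sled. The following is a minimal sketch using only the Rust standard library: it sends a series of hand-encoded DNS queries over UDP and counts how many get a matching reply within the timeout. `build_query` and `probe` are hypothetical helpers written for this example, not part of Omicron, and the sketch assumes the server answers plain UDP queries on the address passed to `dnsadm` above.

```rust
use std::io::ErrorKind;
use std::net::UdpSocket;
use std::time::Duration;

/// Encodes a minimal DNS query with one question of class IN for `name`.
pub fn build_query(id: u16, name: &str, qtype: u16) -> Vec<u8> {
    let mut msg = Vec::new();
    msg.extend_from_slice(&id.to_be_bytes());
    msg.extend_from_slice(&[0x01, 0x00]); // flags: standard query, RD set
    msg.extend_from_slice(&[0, 1, 0, 0, 0, 0, 0, 0]); // QDCOUNT = 1, rest 0
    for label in name.split('.').filter(|l| !l.is_empty()) {
        msg.push(label.len() as u8); // assumes each label is under 64 bytes
        msg.extend_from_slice(label.as_bytes());
    }
    msg.push(0); // root label terminates the name
    msg.extend_from_slice(&qtype.to_be_bytes());
    msg.extend_from_slice(&[0, 1]); // QCLASS = IN
    msg
}

/// Sends `count` queries of type `qtype` to `server` and returns
/// `(answered, failed)`. A query counts as failed if no reply arrives
/// within `timeout` or if the reply's ID doesn't match the query's.
pub fn probe(
    server: &str,
    name: &str,
    qtype: u16,
    count: u16,
    timeout: Duration,
) -> std::io::Result<(u32, u32)> {
    let socket = UdpSocket::bind("[::]:0")?;
    socket.set_read_timeout(Some(timeout))?;
    let (mut answered, mut failed) = (0u32, 0u32);
    let mut buf = [0u8; 512];
    for id in 0..count {
        socket.send_to(&build_query(id, name, qtype), server)?;
        match socket.recv_from(&mut buf) {
            Ok((n, _)) if n >= 2 && u16::from_be_bytes([buf[0], buf[1]]) == id => {
                answered += 1
            }
            Ok(_) => failed += 1, // stale or malformed reply
            Err(e) if matches!(e.kind(), ErrorKind::WouldBlock | ErrorKind::TimedOut) => {
                failed += 1
            }
            Err(e) => return Err(e),
        }
    }
    Ok((answered, failed))
}
```

For example, `probe("[fd00:1122:3344:1::1]:5353", name, 33, 100, Duration::from_secs(1))` would send 100 SRV (type 33) queries, where `name` is one of the record names shown by `list-records`. Comparing the failure counts on albert, catherine, and augustine could indicate whether the timeouts come from the network path between the sleds or from the DNS server itself.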

I can hold these machines in this cluster for a short while and try to gather any logs etc. that would be interesting for debugging, though I'll probably need to reclaim them in a day or two.
