Intermittent DNS resolution failures from sleds in an at-home multi-machine Omicron cluster #2726

@gjcolombo

I have an at-home test rack of three machines running together in an Omicron cluster. The cluster is running my pending live migration changes, with some minor modifications to rack service planning to guarantee that Nexus always ends up on a specific machine. This branch is currently based on commit f83c7c6.

The machines in question are:

  • augustine: hosts Nexus and internal DNS; connected to the rest of the home network; has extra NICs that are linked directly to the other two machines; has its sled_mode set to "scrimlet" in config.toml
  • albert/catherine: generic compute hosts; networked only to augustine; config.toml has sled_mode set to "gimlet"

Their Omicron packages were configured with --switch=stub (i.e. without SoftNPU), since I've previously tested these machines with a stub switch and haven't adapted their configuration to SoftNPU yet.

I can reach Nexus consistently from my workstation and create instances there. When those instances land on albert or catherine, their sled agents occasionally fail DNS resolution in places where it's clear they're trying to reach Nexus. Errors include the following:

  • {"msg":"State monitoring task failed: Error resolving DNS name: request timed out" ... from a Propolis state monitoring task
  • thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: Resolve(ResolveError { kind: Timeout })', sled-agent/src/instance.rs:810:66 (source link)
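
The panic above comes from calling `unwrap()` on a resolution result that can time out. As a rough illustration of the alternative, here is a minimal sketch of retrying a lookup with exponential backoff instead of unwrapping. Everything in it is hypothetical: `LookupError` is a stand-in for the resolver's error type, `resolve_with_retry` is not an Omicron function, and the real sled agent code is async, whereas this sketch is synchronous so it stays self-contained.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical stand-in for the resolver error seen in the panic message.
#[derive(Debug, Clone, PartialEq)]
pub enum LookupError {
    Timeout,
    NotFound,
}

/// Calls `lookup` at least once and at most `max_attempts` times.
/// After each timeout it sleeps for `delay`, then doubles the delay.
/// Any non-timeout error, or a timeout on the final attempt, is returned
/// to the caller instead of being unwrapped.
pub fn resolve_with_retry<T, F>(
    mut lookup: F,
    max_attempts: u32,
    initial_delay: Duration,
) -> Result<T, LookupError>
where
    F: FnMut() -> Result<T, LookupError>,
{
    let mut delay = initial_delay;
    let mut attempt = 1;
    loop {
        match lookup() {
            Ok(value) => return Ok(value),
            // Only timeouts are treated as transient.
            Err(LookupError::Timeout) if attempt < max_attempts => {
                sleep(delay);
                delay = delay.saturating_mul(2);
                attempt += 1;
            }
            Err(err) => return Err(err),
        }
    }
}
```

A caller would wrap its existing lookup in a closure and handle the final `Err` explicitly. Retrying would only mask the intermittent failures described here, so it does not replace finding out why the lookups time out.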

The resolution failures aren't consistent, but they happen often enough that I can probably hit one reliably just by starting an instance and exercising it for a while (e.g. trying to migrate it).

Other observations/notes:

  • albert and catherine can both ping augustine's Nexus, although catherine occasionally gets an ICMPv6 Address Unreachable from gateway fd00:1122:3344:103::1 a couple of times before deciding Nexus is alive
  • cargo run -p dns-server --bin dnsadm -- -a '[fd00:1122:3344:1::1]:5353' list-records runs to completion on all three machines and shows the expected DNS entries
  • this is the first time I've run these machines in this configuration since rebasing from b4b4b1f to f83c7c6
  • I have previously run augustine (the Nexus/DNS box) as a single-sled machine with a package built with --switch=softnpu; I haven't rebooted this machine between those runs and this one; possibly I misconfigured something or held something incorrectly while trying to get it set up for the multi-machine test
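
One way to put a number on "intermittent" from albert or catherine would be a small probe that sends repeated DNS queries over UDP and counts how many go unanswered. The sketch below is an assumption-laden illustration, not an Omicron tool: `encode_query`, `probe`, and `ProbeStats` are hypothetical names, and the caller must supply the address and port where the internal DNS server answers DNS-protocol queries and a name that should resolve, neither of which is stated in this report.

```rust
use std::io::{self, ErrorKind};
use std::net::{SocketAddr, UdpSocket};
use std::time::{Duration, Instant};

/// DNS record type code for AAAA (IPv6 address) records.
pub const QTYPE_AAAA: u16 = 28;

/// Counters collected by `probe`.
#[derive(Debug, Default)]
pub struct ProbeStats {
    pub answered: u32,
    pub timed_out: u32,
    pub mismatched: u32,
    pub slowest: Duration,
}

/// Builds a minimal single-question DNS query (recursion desired, class IN).
/// Assumes every label in `name` is a valid DNS label (at most 63 bytes).
pub fn encode_query(id: u16, name: &str, qtype: u16) -> Vec<u8> {
    let mut pkt = Vec::with_capacity(18 + name.len());
    pkt.extend_from_slice(&id.to_be_bytes());
    pkt.extend_from_slice(&[0x01, 0x00]); // flags: standard query, RD set
    pkt.extend_from_slice(&[0, 1, 0, 0, 0, 0, 0, 0]); // QDCOUNT = 1, rest 0
    for label in name.split('.').filter(|l| !l.is_empty()) {
        pkt.push(label.len() as u8);
        pkt.extend_from_slice(label.as_bytes());
    }
    pkt.push(0); // root label terminates the name
    pkt.extend_from_slice(&qtype.to_be_bytes());
    pkt.extend_from_slice(&[0, 1]); // QCLASS = IN
    pkt
}

/// Sends `attempts` AAAA queries for `name` to `server`, waiting up to
/// `timeout` (which must be non-zero) for each reply. A reply whose ID does
/// not match the current query (for example a late answer to an earlier
/// one) is counted as mismatched rather than answered.
pub fn probe(
    server: SocketAddr,
    name: &str,
    attempts: u16,
    timeout: Duration,
) -> io::Result<ProbeStats> {
    let bind_addr = if server.is_ipv6() { "[::]:0" } else { "0.0.0.0:0" };
    let socket = UdpSocket::bind(bind_addr)?;
    socket.set_read_timeout(Some(timeout))?;

    let mut stats = ProbeStats::default();
    let mut buf = [0u8; 512];
    for id in 0..attempts {
        let query = encode_query(id, name, QTYPE_AAAA);
        let started = Instant::now();
        socket.send_to(&query, server)?;
        match socket.recv_from(&mut buf) {
            Ok((len, _)) if len >= 2 && u16::from_be_bytes([buf[0], buf[1]]) == id => {
                stats.answered += 1;
                stats.slowest = stats.slowest.max(started.elapsed());
            }
            Ok(_) => stats.mismatched += 1,
            // Read timeouts surface as WouldBlock or TimedOut depending on platform.
            Err(e) if matches!(e.kind(), ErrorKind::WouldBlock | ErrorKind::TimedOut) => {
                stats.timed_out += 1;
            }
            Err(e) => return Err(e),
        }
    }
    Ok(stats)
}
```

Running `probe` from each sled with the same name and timeout and comparing `timed_out` across albert, catherine, and augustine would indicate whether the failures are specific to the directly linked machines. The sketch only checks that a reply with a matching ID arrived; it does not parse the answer section.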

I can hold these machines in this cluster for a short while and try to gather any logs etc. that would be interesting for debugging, though I'll probably need to reclaim them in a day or two.

Labels: development (Bugs, paper cuts, feature requests, or other thoughts on making omicron development better)
