Skip to content

[update] add missing sleds to contact_support checks#10476

Merged
karencfv merged 5 commits into
oxidecomputer:mainfrom
karencfv:find-missing-sleds
May 28, 2026
Merged

[update] add missing sleds to contact_support checks#10476
karencfv merged 5 commits into
oxidecomputer:mainfrom
karencfv:find-missing-sleds

Conversation

@karencfv

Copy link
Copy Markdown
Contributor

Follow up to #10271

Closes #4745

@karencfv karencfv marked this pull request as draft May 21, 2026 09:15
Comment thread nexus/src/app/update.rs
/// Build a map of version strings to the number of components on that
/// version
async fn component_version_counts(
async fn get_internal_update_status(

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the work was already done here, I just extracted the bit that was creating internal_views::UpdateStatus and used that to determine missing sleds.

@karencfv karencfv marked this pull request as ready for review May 21, 2026 09:41

@sunshowers sunshowers left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall! Just a few questions and comments.

Comment thread nexus/src/app/update.rs Outdated
.iter()
.filter(|sled| {
// `unknown()` returns the represenation of the update status
// for a given sled ID that isn't present in inventory.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this case also hit when a sled is present in inventory but there's no last reconciliation status?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep

let Some(inv) = sa.last_reconciliation.as_ref() else {
return Some(SledAgentUpdateStatus::unknown(sa.sled_id));
};

I can update the comment to explain this

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be great -- thanks!

Comment thread nexus/src/app/update.rs
Comment on lines +131 to +135
**sled
== internal_views::SledAgentUpdateStatus::unknown(
sled.sled_id,
)
})

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, looking at SledAgentUpdateStatus, the unknown state isn't recorded as an enum variant. This is pre-existing but I'm a little worried about the fragility of things like string equality here:

host_phase_2: HostPhase2Status {
boot_disk: Err("unknown".to_string()),
slot_a_version: TufRepoVersion::Unknown,
slot_b_version: TufRepoVersion::Unknown,
},

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fear that this fix may be a bit more involved than it seems at a first glance #10271 (comment)

Do you mind if I open up a follow up issue for this work?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread nexus/src/app/update.rs
Comment thread nexus/src/app/update.rs
Comment thread nexus/src/app/update.rs
}
};

let missing_sleds: BTreeSet<SledUuid> = self

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does omdb need to show this information? Seems like for support it would be better than grepping through logs.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that'd be really nice. There was a discussion about that here -> #9412 (comment)

I think this work may need its own issue though. Thoughts?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, a followup sounds good.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread nexus/src/app/update.rs
Comment on lines 712 to 718
let expected_sleds = self
.datastore()
.sled_list_all_batched(
opctx,
SledFilter::SpsUpdatedByReconfigurator,
)
.await?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm so this will do a live db query, but inventory is cached. Does this mean that when a sled is newly commissioned, contact_support might be true for a short period of time?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, probably. This is also true for other checks in various circumstances. I'll add these caveats to the support runbooks. It's worth mentioning that users would only hit this case if they wish to perform an update or check the update status immediately after a sled is newly commissioned, which is unlikely?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that sounds good. I do feel like in the limit this is worth fixing, but it's pretty far down the list.

Comment thread nexus/src/app/update.rs Outdated
Comment on lines +748 to +751
async fn component_version_counts(
&self,
status: internal_views::UpdateStatus,
) -> Result<BTreeMap<String, usize>, Error> {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two notes:

  1. This doesn't need to either be async or take &self, I think.
  2. Small nit: thoughts on reworking this to take a reference to UpdateStatus?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you're right. I extracted the bits that I needed but didn't check the signature to see if it still made sense. I'll fix this

Comment thread nexus/src/app/update.rs Outdated
.sleds
.iter()
.filter(|sled| {
// `unknown()` returns the represenation of the update status

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
// `unknown()` returns the represenation of the update status
// `unknown()` returns the representation of the update status

@karencfv karencfv left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look! I think I've addressed all your comments, let me know if I missed anything :)

Comment thread nexus/src/app/update.rs Outdated
.iter()
.filter(|sled| {
// `unknown()` returns the represenation of the update status
// for a given sled ID that isn't present in inventory.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep

let Some(inv) = sa.last_reconciliation.as_ref() else {
return Some(SledAgentUpdateStatus::unknown(sa.sled_id));
};

I can update the comment to explain this

Comment thread nexus/src/app/update.rs
}
};

let missing_sleds: BTreeSet<SledUuid> = self

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that'd be really nice. There was a discussion about that here -> #9412 (comment)

I think this work may need its own issue though. Thoughts?

Comment thread nexus/src/app/update.rs
Comment on lines +131 to +135
**sled
== internal_views::SledAgentUpdateStatus::unknown(
sled.sled_id,
)
})

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fear that this fix may be a bit more involved than it seems at a first glance #10271 (comment)

Do you mind if I open up a follow up issue for this work?

Comment thread nexus/src/app/update.rs Outdated
Comment on lines +748 to +751
async fn component_version_counts(
&self,
status: internal_views::UpdateStatus,
) -> Result<BTreeMap<String, usize>, Error> {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you're right. I extracted the bits that I needed but didn't check the signature to see if it still made sense. I'll fix this

Comment thread nexus/src/app/update.rs
@karencfv karencfv merged commit c6af9c9 into oxidecomputer:main May 28, 2026
18 checks passed
@karencfv karencfv deleted the find-missing-sleds branch May 28, 2026 20:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Good-enough (mvp) tool for rack health checks to support automated updates

2 participants