[reconfigurator] Remove DatasetIdsBackfillFromDb by jgallagher · Pull Request #7350 · oxidecomputer/omicron

jgallagher · 2025-01-15T18:58:01Z

I believe this is safe to land now that R12 is out the door: any IDs in dataset should have been assigned. #7229 is related, but in that case the dataset table itself is also missing entries (so there's no ID to backfill).

Fixes #6645. I believe this is safe to land now that R12 is out the door: any IDs in `dataset` should have been assigned. #7229 is related, but in that case the `dataset` table itself is also missing entries (so there's no ID to backfill).

jgallagher · 2025-01-15T19:06:44Z

+            // dataset ID we should be using? That would require tracking it in
+            // the relevant parent blueprint structures (zones for zone-related
+            // datasets, presumably disks for the top-level debug/zone root
+            // datasets?).


FWIW: the concern described in this note is not new, but as I was reading this code it came to mind so figured I'd write it down. If this is wrong I'd be happy to edit or remove it.

When we ensure datasets in the sled agent, we call:

omicron/sled-storage/src/manager.rs

Lines 1032 to 1037 in c03eb51

async fn dataset_ensure_internal(

&self,

log: &Logger,

config: &DatasetConfig,

old_dataset: Option<&DatasetProperties>,

) -> DatasetManagementStatus {

This calls:

omicron/sled-storage/src/manager.rs

Lines 1045 to 1052 in c03eb51

match Self::should_skip_dataset_ensure(&log, config, old_dataset) {

Ok(true) => return status,

Ok(false) => (),

Err(err) => {

status.err = Some(err.to_string());

return status;

}

}

To see if there is actually work to do.

One of the errors here is a UUID mismatch:

omicron/sled-storage/src/manager.rs

Lines 1002 to 1008 in c03eb51

if old_id != config.id {

return Err(Error::UuidMismatch {

name: config.name.full_name(),

old: old_id.into_untyped_uuid(),

new: config.id.into_untyped_uuid(),

});

}

To put another way: we do currently pass the dataset UUID to the sled agent, at least, so if there's a UUID mismatch in the sled it should be detected and throw an error.

... anyway, I think it makes sense to track, in the blueprint, "this zone is using this dataset UUID". Then we could probably add a DatasetUuid to the PartialDatasetConfig.

But this doesn't need to be a blocker for this PR, either.

I'm not sure any of that is relevant to where this comment is, because this code is all in the planner, right? I'm worried about a bug in the planner where we have a situation like this:

blueprint has an in-service Nexus dataset on sled N, zpool Z

planner (incorrectly!) tries to place a new Nexus zone on sled N, zpool Z

this particular code will not create a new dataset for that Nexus zone, but will instead reuse the ID from the existing in-service Nexus dataset on that zpool

At that point, we'd tell the sled-agent to launch a new Nexus zone but reuse the existing Nexus dataset, which presumably would be fine as far as all the checks you pointed out are concerned?

I don't think (?) this is a situation we can actually hit in practice, given other constraints in the planner, but I'm not 100% sure of that and would strongly prefer that we could tell locally whether we're doing the right thing.

Oops, Nexus is a bad example because it doesn't have a durable dataset and filesystem pool kinds are unique per zone since they include the zone ID. So s/Nexus/Clickhouse/ (or any other zone kind that has a durable dataset) in the above comment.

Ah, gotcha, I thought the comment implied that we'd be creating a "new UUID, but re-use the old dataset within the sled agent" - that's why I was pointing out, the sled agent would reject this scenario.

I agree, if we re-use the old dataset UUID, we'd re-use the old dataset. I actually think it's possible there are situations where we want this explicitly -- this gives us the ability to act independently between zones and datasets, and e.g. re-assign old datasets to new zones.

Regardless, I think this is okay as you suggested. With the current arguments to this function, we're asking for a "dataset of a particular type" to exist, not a "dataset with a particular UUID", and it's doing what it's supposed to. But I still agree that we can make the management of datasets more explicit by referencing dataset UUIDs within the blueprint explicitly, when specifying zone config.

smklein

LGTM, I'm assuming https://github.com/oxidecomputer/customer-support/blob/main/runbooks/rack-update/release-12/ledger_check.sh ran successfully?

jgallagher · 2025-01-15T20:46:55Z

LGTM, I'm assuming https://github.com/oxidecomputer/customer-support/blob/main/runbooks/rack-update/release-12/ledger_check.sh ran successfully?

On all the upgrades performed so far, yes. I'm inclined to go ahead and merge and if it fails on a subsequent upgrade we address that at the time (ideally by fixing up the system as needed, less ideally by restoring this code if necessary).

jgallagher requested review from smklein and sunshowers January 15, 2025 18:58

jgallagher commented Jan 15, 2025

View reviewed changes

smklein approved these changes Jan 15, 2025

View reviewed changes

Merge branch 'main' into john/remove-dataset-id-backfill

83387ce

jgallagher merged commit 017830d into main Jan 17, 2025

jgallagher deleted the john/remove-dataset-id-backfill branch January 17, 2025 19:55

jgallagher mentioned this pull request Jan 24, 2025

Change dataset table to only hold Crucible datasets #7386

Merged

jgallagher mentioned this pull request Feb 4, 2025

[reconfigurator] blueprint builder: backfill filesystem_pool #7450

Merged

jgallagher mentioned this pull request Aug 22, 2025

[reconfigurator] builder/planner could handle backwards compatibility around datasets and disks up front #7082

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[reconfigurator] Remove DatasetIdsBackfillFromDb#7350

[reconfigurator] Remove DatasetIdsBackfillFromDb#7350
jgallagher merged 2 commits into
mainfrom
john/remove-dataset-id-backfill

jgallagher commented Jan 15, 2025

Uh oh!

jgallagher Jan 15, 2025

Uh oh!

smklein Jan 15, 2025

Uh oh!

jgallagher Jan 15, 2025

Uh oh!

jgallagher Jan 15, 2025 •

edited

Loading

Uh oh!

smklein Jan 15, 2025

Uh oh!

smklein left a comment

Uh oh!

jgallagher commented Jan 15, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	async fn dataset_ensure_internal(
	&self,
	log: &Logger,
	config: &DatasetConfig,
	old_dataset: Option<&DatasetProperties>,
	) -> DatasetManagementStatus {

	match Self::should_skip_dataset_ensure(&log, config, old_dataset) {
	Ok(true) => return status,
	Ok(false) => (),
	Err(err) => {
	status.err = Some(err.to_string());
	return status;
	}
	}

	if old_id != config.id {
	return Err(Error::UuidMismatch {
	name: config.name.full_name(),
	old: old_id.into_untyped_uuid(),
	new: config.id.into_untyped_uuid(),
	});
	}

Conversation

jgallagher commented Jan 15, 2025

Uh oh!

jgallagher Jan 15, 2025

Choose a reason for hiding this comment

Uh oh!

smklein Jan 15, 2025

Choose a reason for hiding this comment

Uh oh!

jgallagher Jan 15, 2025

Choose a reason for hiding this comment

Uh oh!

jgallagher Jan 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

smklein Jan 15, 2025

Choose a reason for hiding this comment

Uh oh!

smklein left a comment

Choose a reason for hiding this comment

Uh oh!

jgallagher commented Jan 15, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jgallagher Jan 15, 2025 •

edited

Loading