Skip to content

[reconfigurator] Remove DatasetIdsBackfillFromDb#7350

Merged
jgallagher merged 2 commits into
mainfrom
john/remove-dataset-id-backfill
Jan 17, 2025
Merged

[reconfigurator] Remove DatasetIdsBackfillFromDb#7350
jgallagher merged 2 commits into
mainfrom
john/remove-dataset-id-backfill

Conversation

@jgallagher

Copy link
Copy Markdown
Contributor

Fixes #6645.

I believe this is safe to land now that R12 is out the door: any IDs in dataset should have been assigned. #7229 is related, but in that case the dataset table itself is also missing entries (so there's no ID to backfill).

Fixes #6645.

I believe this is safe to land now that R12 is out the door: any IDs in
`dataset` should have been assigned. #7229 is related, but in that case
the `dataset` table itself is also missing entries (so there's no ID to
backfill).
// dataset ID we should be using? That would require tracking it in
// the relevant parent blueprint structures (zones for zone-related
// datasets, presumably disks for the top-level debug/zone root
// datasets?).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW: the concern described in this note is not new, but as I was reading this code it came to mind so figured I'd write it down. If this is wrong I'd be happy to edit or remove it.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we ensure datasets in the sled agent, we call:

async fn dataset_ensure_internal(
&self,
log: &Logger,
config: &DatasetConfig,
old_dataset: Option<&DatasetProperties>,
) -> DatasetManagementStatus {

This calls:

match Self::should_skip_dataset_ensure(&log, config, old_dataset) {
Ok(true) => return status,
Ok(false) => (),
Err(err) => {
status.err = Some(err.to_string());
return status;
}
}

To see if there is actually work to do.

One of the errors here is a UUID mismatch:

if old_id != config.id {
return Err(Error::UuidMismatch {
name: config.name.full_name(),
old: old_id.into_untyped_uuid(),
new: config.id.into_untyped_uuid(),
});
}

To put another way: we do currently pass the dataset UUID to the sled agent, at least, so if there's a UUID mismatch in the sled it should be detected and throw an error.

... anyway, I think it makes sense to track, in the blueprint, "this zone is using this dataset UUID". Then we could probably add a DatasetUuid to the PartialDatasetConfig.

But this doesn't need to be a blocker for this PR, either.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure any of that is relevant to where this comment is, because this code is all in the planner, right? I'm worried about a bug in the planner where we have a situation like this:

  • blueprint has an in-service Nexus dataset on sled N, zpool Z
  • planner (incorrectly!) tries to place a new Nexus zone on sled N, zpool Z
  • this particular code will not create a new dataset for that Nexus zone, but will instead reuse the ID from the existing in-service Nexus dataset on that zpool

At that point, we'd tell the sled-agent to launch a new Nexus zone but reuse the existing Nexus dataset, which presumably would be fine as far as all the checks you pointed out are concerned?

I don't think (?) this is a situation we can actually hit in practice, given other constraints in the planner, but I'm not 100% sure of that and would strongly prefer that we could tell locally whether we're doing the right thing.

@jgallagher jgallagher Jan 15, 2025

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, Nexus is a bad example because it doesn't have a durable dataset and filesystem pool kinds are unique per zone since they include the zone ID. So s/Nexus/Clickhouse/ (or any other zone kind that has a durable dataset) in the above comment.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, gotcha, I thought the comment implied that we'd be creating a "new UUID, but re-use the old dataset within the sled agent" - that's why I was pointing out, the sled agent would reject this scenario.

I agree, if we re-use the old dataset UUID, we'd re-use the old dataset. I actually think it's possible there are situations where we want this explicitly -- this gives us the ability to act independently between zones and datasets, and e.g. re-assign old datasets to new zones.

Regardless, I think this is okay as you suggested. With the current arguments to this function, we're asking for a "dataset of a particular type" to exist, not a "dataset with a particular UUID", and it's doing what it's supposed to. But I still agree that we can make the management of datasets more explicit by referencing dataset UUIDs within the blueprint explicitly, when specifying zone config.

@smklein smklein left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jgallagher

Copy link
Copy Markdown
Contributor Author

LGTM, I'm assuming https://github.com/oxidecomputer/customer-support/blob/main/runbooks/rack-update/release-12/ledger_check.sh ran successfully?

On all the upgrades performed so far, yes. I'm inclined to go ahead and merge and if it fails on a subsequent upgrade we address that at the time (ideally by fixing up the system as needed, less ideally by restoring this code if necessary).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[reconfigurator] Stop reading dataset UUIDs from datastore

2 participants