[reconfigurator] Remove DatasetIdsBackfillFromDb#7350
Conversation
| // dataset ID we should be using? That would require tracking it in | ||
| // the relevant parent blueprint structures (zones for zone-related | ||
| // datasets, presumably disks for the top-level debug/zone root | ||
| // datasets?). |
There was a problem hiding this comment.
FWIW: the concern described in this note is not new, but as I was reading this code it came to mind so figured I'd write it down. If this is wrong I'd be happy to edit or remove it.
There was a problem hiding this comment.
When we ensure datasets in the sled agent, we call:
omicron/sled-storage/src/manager.rs
Lines 1032 to 1037 in c03eb51
This calls:
omicron/sled-storage/src/manager.rs
Lines 1045 to 1052 in c03eb51
To see if there is actually work to do.
One of the errors here is a UUID mismatch:
omicron/sled-storage/src/manager.rs
Lines 1002 to 1008 in c03eb51
To put another way: we do currently pass the dataset UUID to the sled agent, at least, so if there's a UUID mismatch in the sled it should be detected and throw an error.
... anyway, I think it makes sense to track, in the blueprint, "this zone is using this dataset UUID". Then we could probably add a DatasetUuid to the PartialDatasetConfig.
But this doesn't need to be a blocker for this PR, either.
There was a problem hiding this comment.
I'm not sure any of that is relevant to where this comment is, because this code is all in the planner, right? I'm worried about a bug in the planner where we have a situation like this:
- blueprint has an in-service Nexus dataset on sled N, zpool Z
- planner (incorrectly!) tries to place a new Nexus zone on sled N, zpool Z
- this particular code will not create a new dataset for that Nexus zone, but will instead reuse the ID from the existing in-service Nexus dataset on that zpool
At that point, we'd tell the sled-agent to launch a new Nexus zone but reuse the existing Nexus dataset, which presumably would be fine as far as all the checks you pointed out are concerned?
I don't think (?) this is a situation we can actually hit in practice, given other constraints in the planner, but I'm not 100% sure of that and would strongly prefer that we could tell locally whether we're doing the right thing.
There was a problem hiding this comment.
Oops, Nexus is a bad example because it doesn't have a durable dataset and filesystem pool kinds are unique per zone since they include the zone ID. So s/Nexus/Clickhouse/ (or any other zone kind that has a durable dataset) in the above comment.
There was a problem hiding this comment.
Ah, gotcha, I thought the comment implied that we'd be creating a "new UUID, but re-use the old dataset within the sled agent" - that's why I was pointing out, the sled agent would reject this scenario.
I agree, if we re-use the old dataset UUID, we'd re-use the old dataset. I actually think it's possible there are situations where we want this explicitly -- this gives us the ability to act independently between zones and datasets, and e.g. re-assign old datasets to new zones.
Regardless, I think this is okay as you suggested. With the current arguments to this function, we're asking for a "dataset of a particular type" to exist, not a "dataset with a particular UUID", and it's doing what it's supposed to. But I still agree that we can make the management of datasets more explicit by referencing dataset UUIDs within the blueprint explicitly, when specifying zone config.
smklein
left a comment
There was a problem hiding this comment.
LGTM, I'm assuming https://github.com/oxidecomputer/customer-support/blob/main/runbooks/rack-update/release-12/ledger_check.sh ran successfully?
On all the upgrades performed so far, yes. I'm inclined to go ahead and merge and if it fails on a subsequent upgrade we address that at the time (ideally by fixing up the system as needed, less ideally by restoring this code if necessary). |
Fixes #6645.
I believe this is safe to land now that R12 is out the door: any IDs in
datasetshould have been assigned. #7229 is related, but in that case thedatasettable itself is also missing entries (so there's no ID to backfill).