[reconfigurator] Avoid deploying expunged datasets to sled agents#7308
Conversation
|
This does not resolve #6177 , and the issues raised by @davepacheco there are still open (namely, the ordering between {dataset,zone} {add,remove} which could cause issues during zone expungement). However, this solves the more scope-limited issue raised by #7304 : If a blueprint thinks a dataset is expunged, we stop telling the sled agents to instantiate it. |
| .datasets | ||
| .into_iter() | ||
| .map(|(id, d)| (id, d.into())) | ||
| .filter_map(|(id, d)| { |
There was a problem hiding this comment.
This is the actual fix - basically, converting from "Blueprint dataset" -> "Deployable dataset" format will filter out expunged datasets.
If this feels too magical, I can remove impl From here and instead create:
impl BlueprintDatasetsConfig {
fn into_in_service_datasets(self) -> DatasetsConfig {
...
}
}(the implementation would be the same, but this could make it more explicit that we're filtering)
There was a problem hiding this comment.
I like your suggestion, seems like a good refactor :)
I agree that this shouldn't change the net effect of the system, without subsequent PRs.
Not sure I agree with this one. IMO this PR inches us closer to "more correct" behavior because it makes the reconfigurator do the right thing with the APIs it has. I think it'll have the same impact on the Sled Agent as "not requesting expunged datasets" because they aren't getting deleted regardless. I think it would make sense to do the following:
At this point: The reconfigurator would be sending a set of info to the sled agents that's correct, but not being carried out correctly, because things aren't being deleted. Then we could:
|
Agreed!
I agree that sequence should work. My worry here stems mainly from various recent discussions (like the R12/#7265 one) where I feel like we struggled to accurately predict how the system would behave in this area, particularly across various releases. #7309 itself is an example of a non-obvious consequence of a bunch of other reasonable-seeming things. I guess another way to say that is that if we land this fix, then a bug (6177) becomes load-bearing for zone expungement to keep working, which feels like a weird place to be. |
|
So, given the extra context in #7500 , I was wrong about this not impacting the underlying system. In cases where a dataset is expunged, but the disk remains, then yes, this has no effect. However, in cases where a dataset is expunged from a disk which is also expunged, then it's critical to omit these datasets when handing them to a sled agent. Otherwise, the reconfigurator can ask the sled agent to deploy a dataset that should be expunged, on a disk that should be expunged. This appears to hang reconfigurator execution. |
This is followup from #7308 and #7500 (comment): * Remove `From<BlueprintZonesConfig> for OmicronZonesConfig`. This included expunged zones, but thankfully had no callers. * Change `BlueprintZonesConfig::to_omicron_zones_config(filter)` to `BlueprintZonesConfig::into_running_omicron_zones_config()` (no `filter` argument). All the callers of this method were passing `BlueprintZoneFilter::ShouldBeRunning`, and I don't think there's a reason to use any other filter? * Remove `From<BlueprintPhysicalDisksConfig> for OmicronPhysicalDisksConfig` (which included expunged disks), and replace it with `BlueprintPhysicalDisksConfig::into_in_service_disks()`. This one _did_ have callers, including the blueprint executor, but I think we've avoided problems because the planner current [drops disks if the incoming planning input says they're not in service](https://github.com/oxidecomputer/omicron/blob/3ae42bf76cb9b55993705b817157e4f60935b6dd/nexus/reconfigurator/planning/src/blueprint_builder/builder.rs#L1090-L1097). I'm not sure that planner behavior is right, and might change with #7286, so it seemed safer to go ahead and fix this now.
If a blueprint has marked a dataset as expunged, stop telling sled agents to deploy it.
Fixes #7304