[reconfigurator] blueprint builder: backfill filesystem_pool#7450
Conversation
|
To test on a4x2, I built a patch on top of this branch that forced RSS to fill in Then, I used (The zone diff here is not particularly useful! The only field that changed is I did not make this new blueprint the target. Instead, I created a different blueprint by manually editing this one, and changed the
and the sled-agent logs from that sled confirm it: Execution succeeded in sending the new zone configs to the other sleds. On each of them, the first and all subsequent I then used omdb to generate a new blueprint. It corrected the one pantry zone I had manually made incorrect, although the diff is nearly useless (again cc @andrewjstone :)): The new dataset addition is correct; for the other two, I compared the raw blueprints side by side to confirm things looked as expected. The dataset and the zone Making this blueprint the target resulted in successful execution, and we saw the same spate of "accepting new config" sled-agent logs on this sled: Also as expected, that sled still has the now-expunged dataset on disk: |
| // It's possible a sled-agent could restart in between when the | ||
| // inventory collection we used for backfilling was created and when we | ||
| // try to send it the new zone configs, and in doing so it could have | ||
| // changed the zpool it chose for any zone type that doesn't have a | ||
| // durable dataset (e.g., Nexus). To emulate this, mutate our collection | ||
| // and change the zpool for sled 2's Nexus to a different zpool; our | ||
| // backfilling should correct this again, even though the parent | ||
| // blueprint has a non-`None` filesystem pool for this Nexus. |
There was a problem hiding this comment.
Nice job with this test; I know this was a bad "hypothetical", it's great to see it being covered so explicitly.
| } | ||
| } | ||
|
|
||
| /// Temporary method to backfill `filesystem_pool` properties for existing |
There was a problem hiding this comment.
(This doesn't need to hold up this PR, just wondering)
Given that this is a "one-time" fix, and we're labeling some of this code as "temporary", do we have a way to identify when it's removable? Presumably, we'd want to delete basically all the code in this PR at some point in the future, as clean-up. But it also seems hard to know when that'll be the case, and harder to remember as we get further away from this merging.
I bring this up partially because it seems hard - we basically need to know that inventory has been successfully collected, that a new blueprint gets created, and that blueprint execution succeeds for at least one release containing this PR. If we miss any of those requirements, it's possible that we still have filesystems needing back-filling. And even if we do back-to-back updates, it's possible that we miss one of those steps.
As I said, I don't want this discussion to hold up this PR, which does look good - but I'm curious what we could do elsewhere in the reconfigurator / as part of the release + update process to help us know when it would be "safe" to delete this code. I know historically we've done stuff like the "release scripts" to "peek into the system state before applying an update", which we could do (e.g., ensure all filesystem pools are backfilled before upgrading to a release that removes this code). Do you think we'll do something similar here?
As said, feel free to push back on me, but I'm wondering if it's worth considering this question now, while the context is still fresh in our minds.
There was a problem hiding this comment.
I bring this up partially because it seems hard - we basically need to know that inventory has been successfully collected, that a new blueprint gets created, and that blueprint execution succeeds for at least one release containing this PR. If we miss any of those requirements, it's possible that we still have filesystems needing back-filling. And even if we do back-to-back updates, it's possible that we miss one of those steps.
I think we can codify this in the R13 runbook, and we can confirm it before removing this code via the reconfigurator state collected during those update windows, right? Similarly to how we confirmed #7350 was okay via the reconfigurator states collected during the R12 updates.
There was a problem hiding this comment.
Yeah, seems viable. I suppose having the human-in-the-loop step is still necessary as a part of update for now, and probably the most cost-effective path forward here.
466bfb1 to
cba6d0e
Compare
This builds on #7449 and, in combination with it, fixes #7229.
When planning a new blueprint, one of the first things we now do is loop over all in-service zones on all active sleds and check that their
filesystem_poolis consistent with the latest inventory collection from that sled. The expectation here is that we'll be filling infilesystem_pool: Nonevalues from older systems (i.e., #7229). There's also an edge case where we might need to correct our own incorrect backfills, if a sled-agent has restarted since the inventory collection was taken and has chosen a different zpool for some zone.If we can't find what the correct filesystem pool should be (e.g., if the sled isn't present in inventory at all, or none of its zpool/dataset combinations match the zone), we leave the
filesystem_poolvalue untouched.I'll put testing notes from a4x2 in a comment below.