Skip to content

[sled-agent] Allow PUT /omicron-zones to accept filesystem_pool patches#7449

Merged
jgallagher merged 7 commits into
mainfrom
john/backfill-filesystem-pool-1
Feb 4, 2025
Merged

[sled-agent] Allow PUT /omicron-zones to accept filesystem_pool patches#7449
jgallagher merged 7 commits into
mainfrom
john/backfill-filesystem-pool-1

Conversation

@jgallagher

Copy link
Copy Markdown
Contributor

This is the first half of #7229. The second half is to teach Reconfigurator to actually fill in the missing filesystem_pool values; that will be upcoming shortly, and I'll put testing notes in it.

…ches

This is the first half of #7229. The second half is to teach
Reconfigurator to actually fill in the missing filesystem_pool values;
that will be upcoming shortly.
Comment thread sled-agent/src/services.rs
Comment thread sled-agent/src/services.rs
Comment thread sled-agent/src/services.rs
Comment thread sled-agent/src/services.rs Outdated
}

#[test]
fn test_fix_7229_zone_config_reconciliation() {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the tests, this looks good

//
// then we don't want to shut the zone down and restart it, because the
// config hasn't actually changed in any meaningful way; this is just
// reconfigurator correcting #7229.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More just out of curiosity than anything: Are there other situations where this is true? Why is #7229 special?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there are any other such situations today, but there could be in the future.

Stepping back for context, if we get to this point, we have a situation where:

  • sled-agent has previously been given an OmicronZoneConfig for a particular zone; in principle, this defines all the information sled-agent needs to configure and run the zone
  • sled-agent has now been given an OmicronZoneConfig for the same zone that differs in some way

Without more information, sled-agent would need to be able to either reconfigure the zone on the fly to adjust for any changes in that config, or (for simplicity) tear the zone down and start it up again with the new configuration. For example, imagine we changed the upstream NTP servers; today, sled-agent has no way of reconfiguring a boundary NTP zone, but we could always shut the zone down and then restart it with the new config.

#7229 is special because changing the config in a way that is consistent with the already-running zone; it's just information that was missing when we started it. If the old OmicronZoneConfig had filesystem_pool: None, sled-agent had to choose some filesystem_pool on its own. The fix for #7229 means the new configs coming in now fill in filesystem_pool: Some(pool_sled_agent_chose). The zone configs are different, but not in a way that necessitates reconfiguring or restarting the zone.

I imagine we'll have future cases similar to this if/when we expand OmicronZoneConfig (e.g., fixing #7214 will add dataset IDs to zone configs in the blueprint; if that makes its way down to OmicronZoneConfig, we'll presumably want to do something similar where if we're just filling in the dataset that the zone is already using, we don't need to reconfigure or restart it).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the context. What I was thinking about reading this comment (from the perspective of a future reader) was how I would know what other cases might be special besides 7229. I think the last sentence sort of gets at the more general "why"; maybe that could be moved up into a summary toward the top of the comment, but given that discussion is captured in the PR that's probably fine for now. Your call!

@jgallagher jgallagher merged commit 413dc9a into main Feb 4, 2025
@jgallagher jgallagher deleted the john/backfill-filesystem-pool-1 branch February 4, 2025 21:15
jgallagher added a commit that referenced this pull request Feb 5, 2025
This builds on #7449 and, in combination with it, fixes #7229.

When planning a new blueprint, one of the first things we now do is loop
over all in-service zones on all active sleds and check that their
`filesystem_pool` is consistent with the latest inventory collection
from that sled. The expectation here is that we'll be filling in
`filesystem_pool: None` values from older systems (i.e., #7229). There's
also an edge case where we might need to correct our own incorrect
backfills, if a sled-agent has restarted since the inventory collection
was taken and has chosen a different zpool for some zone.

If we can't find what the correct filesystem pool should be (e.g., if
the sled isn't present in inventory at all, or none of its zpool/dataset
combinations match the zone), we leave the `filesystem_pool` value
untouched.

I'll put testing notes from a4x2 in a comment below.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants