MGS: Flesh out host boot flash update (and rework update delivery mechanism)#1714
Conversation
`update()` is now `start_update()`, and it returns once the SP acks that an update is coming; the actual update delivery is moved onto a background task, and the caller is responsible for polling the SP to check the status of the update it's receiving.
ahl
left a comment
There was a problem hiding this comment.
looks good; annoying suggestions regarding the shape of the API. In particular I'd suggest MGS generate the UUID rather than having nexus (or RSS) do it.
| /// An identifier for this update. | ||
| /// | ||
| /// This ID applies to this single instance of the API call; it is not an | ||
| /// ID of `image` itself. Multiple API calls with the same `image` should | ||
| /// use different IDs. | ||
| pub id: Uuid, |
There was a problem hiding this comment.
why do we want consumers to create a UUID (and thus introduce the confusion you seek to avoid in the comment)? Could MGS instead generate and return a UUID that corresponds to this asynchronous operation?
There was a problem hiding this comment.
MGS generating the UUID was my gut feeling also, but RM and Andrew talked me out of it 😂. I believe the reasoning was that nexus would generate the ID as part of its intent to start the update, record it to cockroach, then send that down to MGS. @andrewjstone am I remembering that correctly?
There was a problem hiding this comment.
Yes, exactly. I saw this comment come in yesterday and was waiting to pleasantly disagree with Adam today 🤣
There was a problem hiding this comment.
To add more to this, I think it's very likely that we'll want to track updates in CockroachDb and be able to query them from there. Otherwise you have to end up hitting all SPs to get a global status. If you create the SP in MGS, you could always write that UUID when the update finishes, but then it makes any coordination of work across Nexii a bit harder. In short, I don't think it's strictly necessary to shape the API this way, but my gut tells me we'll end up wanting to do it this way eventually.
There was a problem hiding this comment.
I see: we want to put the UUID into CRDB before making the initial call to MGS from Nexus, is that right?
2dc7175 to
24d456a
Compare
andrewjstone
left a comment
There was a problem hiding this comment.
Looks really good @jgallagher . This is a vast improvement in behavior!
Only minor comments from me.
| InProgress(UpdateInProgressStatus), | ||
| /// Returned when an update has completed. | ||
| /// | ||
| /// The SP has no concept of time, so we cannot indicate how recently this |
There was a problem hiding this comment.
This is another reason to record the status in CockroachDb.
| /// An identifier for this update. | ||
| /// | ||
| /// This ID applies to this single instance of the API call; it is not an | ||
| /// ID of `image` itself. Multiple API calls with the same `image` should | ||
| /// use different IDs. | ||
| pub id: Uuid, |
This is the companion to the larger MGS change in oxidecomputer/omicron#1714. From the SP's side, the changes are relatively minor: 1. We removed "update prepare status" in favor of the more general "update status" 2. Update messages now always include an update ID, and for "update abort" we only abort if the ID matches our current in-progress update.
* mgmt-gateway: add update status and IDs This is the companion to the larger MGS change in oxidecomputer/omicron#1714. From the SP's side, the changes are relatively minor: 1. We removed "update prepare status" in favor of the more general "update status" 2. Update messages now always include an update ID, and for "update abort" we only abort if the ID matches our current in-progress update.
* mgmt-gateway: add update status and IDs This is the companion to the larger MGS change in oxidecomputer/omicron#1714. From the SP's side, the changes are relatively minor: 1. We removed "update prepare status" in favor of the more general "update status" 2. Update messages now always include an update ID, and for "update abort" we only abort if the ID matches our current in-progress update.
* mgmt-gateway: add update status and IDs This is the companion to the larger MGS change in oxidecomputer/omicron#1714. From the SP's side, the changes are relatively minor: 1. We removed "update prepare status" in favor of the more general "update status" 2. Update messages now always include an update ID, and for "update abort" we only abort if the ID matches our current in-progress update.
Hopefully this PR is not as big as it looks: about half of it by volume is updating either the OpenAPI spec file or developer-focused CLI apps for interacting with MGS. Ignoring those, the changes included are:
a. update-status returns the UUID of the most current update the SP knows about (either the in-progress update, if one is in progress, or the ID of the most recently completed/aborted update); if the update is in progress, it also includes how far along the update is
b. update-abort aborts an update with a given UUID
a. Removed
UpdatePrepareStatusin favor of the more generalUpdateStatusb.
UpdateandUpdateAbortnow take 16-byte IDs (i.e., the UUIDs accepted by the MGS endpoints)Additionally, the delivery of updates from MGS to the SP has changed based on the matrix conversation I had with @andrewjstone and @rmustacc. Briefly summarizing:
When MGS receives a POST to its update endpoint (which must include a client-provided UUID), it tells the SP to begin preparing for an update. If the SP answer's indicates success, MGS spawns a background task to actually deliver the update, and returns successfully immediately to its client. (If the SP's answer indicates an error, that is also returned to the client, but no background task is spawned.) This avoids the problem I was seeing previously where MGS's delivery of an update could be cancelled if the client disconnected while it was underway.
At any time, any client of any MGS instance may ask for the status of an update being delivered to a specific SP. The SP is the source of truth here; MGS always asks the SP what it thinks its update status is (even if it is currently in the process of delivering an update to that SP). As described above, the SP's response includes both the ID of the update (which the client could use to subsequently abort it, if desired) and progress (if the update is still underway, from the SP's point of view). Currently the SP only remembers the most recent update ID: either the currently-in-progress update, or the most recently completed or aborted ID.