Skip to content

MGS: Flesh out host boot flash update (and rework update delivery mechanism)#1714

Merged
jgallagher merged 16 commits into
mainfrom
mgs-update-components
Sep 16, 2022
Merged

MGS: Flesh out host boot flash update (and rework update delivery mechanism)#1714
jgallagher merged 16 commits into
mainfrom
mgs-update-components

Conversation

@jgallagher

Copy link
Copy Markdown
Contributor

Hopefully this PR is not as big as it looks: about half of it by volume is updating either the OpenAPI spec file or developer-focused CLI apps for interacting with MGS. Ignoring those, the changes included are:

  1. New endpoints in MGS:
    a. update-status returns the UUID of the most current update the SP knows about (either the in-progress update, if one is in progress, or the ID of the most recently completed/aborted update); if the update is in progress, it also includes how far along the update is
    b. update-abort aborts an update with a given UUID
  2. Modifications to the MGS <-> SP messages:
    a. Removed UpdatePrepareStatus in favor of the more general UpdateStatus
    b. Update and UpdateAbort now take 16-byte IDs (i.e., the UUIDs accepted by the MGS endpoints)

Additionally, the delivery of updates from MGS to the SP has changed based on the matrix conversation I had with @andrewjstone and @rmustacc. Briefly summarizing:

When MGS receives a POST to its update endpoint (which must include a client-provided UUID), it tells the SP to begin preparing for an update. If the SP answer's indicates success, MGS spawns a background task to actually deliver the update, and returns successfully immediately to its client. (If the SP's answer indicates an error, that is also returned to the client, but no background task is spawned.) This avoids the problem I was seeing previously where MGS's delivery of an update could be cancelled if the client disconnected while it was underway.

At any time, any client of any MGS instance may ask for the status of an update being delivered to a specific SP. The SP is the source of truth here; MGS always asks the SP what it thinks its update status is (even if it is currently in the process of delivering an update to that SP). As described above, the SP's response includes both the ID of the update (which the client could use to subsequently abort it, if desired) and progress (if the update is still underway, from the SP's point of view). Currently the SP only remembers the most recent update ID: either the currently-in-progress update, or the most recently completed or aborted ID.

@ahl ahl left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good; annoying suggestions regarding the shape of the API. In particular I'd suggest MGS generate the UUID rather than having nexus (or RSS) do it.

Comment thread gateway/src/http_entrypoints.rs Outdated
Comment on lines +508 to +513
/// An identifier for this update.
///
/// This ID applies to this single instance of the API call; it is not an
/// ID of `image` itself. Multiple API calls with the same `image` should
/// use different IDs.
pub id: Uuid,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we want consumers to create a UUID (and thus introduce the confusion you seek to avoid in the comment)? Could MGS instead generate and return a UUID that corresponds to this asynchronous operation?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MGS generating the UUID was my gut feeling also, but RM and Andrew talked me out of it 😂. I believe the reasoning was that nexus would generate the ID as part of its intent to start the update, record it to cockroach, then send that down to MGS. @andrewjstone am I remembering that correctly?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly. I saw this comment come in yesterday and was waiting to pleasantly disagree with Adam today 🤣

@andrewjstone andrewjstone Sep 16, 2022

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To add more to this, I think it's very likely that we'll want to track updates in CockroachDb and be able to query them from there. Otherwise you have to end up hitting all SPs to get a global status. If you create the SP in MGS, you could always write that UUID when the update finishes, but then it makes any coordination of work across Nexii a bit harder. In short, I don't think it's strictly necessary to shape the API this way, but my gut tells me we'll end up wanting to do it this way eventually.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see: we want to put the UUID into CRDB before making the initial call to MGS from Nexus, is that right?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup.

Comment thread gateway-messages/src/lib.rs
Comment thread gateway/src/http_entrypoints.rs Outdated
Comment thread gateway/src/http_entrypoints.rs Outdated
Comment thread gateway/src/http_entrypoints.rs Outdated
@jgallagher jgallagher force-pushed the mgs-update-components branch from 2dc7175 to 24d456a Compare September 16, 2022 15:22

@andrewjstone andrewjstone left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really good @jgallagher . This is a vast improvement in behavior!

Only minor comments from me.

InProgress(UpdateInProgressStatus),
/// Returned when an update has completed.
///
/// The SP has no concept of time, so we cannot indicate how recently this

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another reason to record the status in CockroachDb.

Comment thread gateway-messages/src/lib.rs
Comment thread gateway-sp-comms/src/communicator.rs Outdated
Comment on lines +508 to +513
/// An identifier for this update.
///
/// This ID applies to this single instance of the API call; it is not an
/// ID of `image` itself. Multiple API calls with the same `image` should
/// use different IDs.
pub id: Uuid,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup.

Comment thread gateway-sp-comms/src/single_sp.rs Outdated
Comment thread gateway-sp-comms/src/single_sp.rs Outdated
@jgallagher jgallagher merged commit d8c5b93 into main Sep 16, 2022
@jgallagher jgallagher deleted the mgs-update-components branch September 16, 2022 19:17
jgallagher added a commit to oxidecomputer/hubris that referenced this pull request Sep 19, 2022
This is the companion to the larger MGS change in
oxidecomputer/omicron#1714. From the SP's side,
the changes are relatively minor:

1. We removed "update prepare status" in favor of the more general
   "update status"
2. Update messages now always include an update ID, and for "update
   abort" we only abort if the ID matches our current in-progress
   update.
jgallagher added a commit to oxidecomputer/hubris that referenced this pull request Sep 19, 2022
* mgmt-gateway: add update status and IDs

This is the companion to the larger MGS change in
oxidecomputer/omicron#1714. From the SP's side,
the changes are relatively minor:

1. We removed "update prepare status" in favor of the more general
   "update status"
2. Update messages now always include an update ID, and for "update
   abort" we only abort if the ID matches our current in-progress
   update.
FawazTirmizi pushed a commit to rivosinc/hubris that referenced this pull request Oct 3, 2022
* mgmt-gateway: add update status and IDs

This is the companion to the larger MGS change in
oxidecomputer/omicron#1714. From the SP's side,
the changes are relatively minor:

1. We removed "update prepare status" in favor of the more general
   "update status"
2. Update messages now always include an update ID, and for "update
   abort" we only abort if the ID matches our current in-progress
   update.
FawazTirmizi pushed a commit to rivosinc/hubris that referenced this pull request Oct 3, 2022
* mgmt-gateway: add update status and IDs

This is the companion to the larger MGS change in
oxidecomputer/omicron#1714. From the SP's side,
the changes are relatively minor:

1. We removed "update prepare status" in favor of the more general
   "update status"
2. Update messages now always include an update ID, and for "update
   abort" we only abort if the ID matches our current in-progress
   update.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants