Skip to content

mgmt-gateway: update host flash#762

Merged
jgallagher merged 2 commits into
masterfrom
mgmt-gateway-host-boot-flash
Sep 14, 2022
Merged

mgmt-gateway: update host flash#762
jgallagher merged 2 commits into
masterfrom
mgmt-gateway-host-boot-flash

Conversation

@jgallagher

Copy link
Copy Markdown
Contributor

This tracks the corresponding changes to MGS messaging merged in oxidecomputer/omicron#1684:

  1. UpdateStart has been broken into UpdatePrepare and UpdatePrepareStatus (which MGS will continue to send periodically until we respond that preparation is done), allowing for updates that have a potentially-long running prep step (like updating host flash, which can take up to several minutes to erase!).
  2. Update messages now include a stream-id that we use to correlate related messages; we reject update messages that don't match our current stream ID.
  3. Add handling for the new UpdateAbort abort message to cancel an in-progress update.

By far the most complex bit of this is 1: I've moved setting the system timer out of mgs_gimlet (which previously set it only in relation to flushing serial console uart packets out to MGS) and into main: now the MGS handler only returns the deadline it wants main to set. If we're in the process of prepping for a host flash update (i.e., we need to erase the host flash), we'll set our deadline to 1 tick from now. When it fires, we'll erase 8 sectors (takes about 1 second, worst case), then return and allow main to check for other work. This allows us to continue to be responsive to incoming notifications, importantly network requests (allowing us to respond to the UpdatePrepareStatus messages in a timely way!).

@jgallagher jgallagher requested a review from mkeeter September 8, 2022 20:33
@jgallagher jgallagher force-pushed the mgmt-gateway-host-boot-flash branch from 30c0ca1 to 5129fd1 Compare September 13, 2022 14:32
Comment thread task/mgmt-gateway/src/mgs_gimlet.rs Outdated
most_recent_error: Option<HfError>,
}

impl Default for HostFlashSectorErase {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this Default implementation? It seems a little non-obvious (e.g. I could imagine that the default would be to erase all sectors, instead of none!)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, we need something to the effect of it unless I make HostFlashUpdate::sector_erase an Option. That might be clearer; I'll give it a shot.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed Default and ::start() in favor of ::new() with the parent storing an Option<_> of us in 5959e7a. I think that's more clear and obvious - thanks!

stream_id: u64,
) -> Result<(), ResponseError> {
match self.current_update_stream_id {
None => Err(ResponseError::UpdateNotPrepared),

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: is this the right error code?

Earlier in the PR, UpdateNotPrepared seems to mean "you sent a chunk while the update while still in the UpdatePrepare state", rather than "there is no current update happening at all".

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is a little annoying and will be reworked in the next PR; I think this is a reasonable error code given where this is called, but it's not obvious at all. I'm going to leave this as-is and revisit after the upcoming changes.

@mkeeter mkeeter left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few small comments, but generally looks good

This tracks the corresponding changes to MGS messaging merged in
oxidecomputer/omicron#1684:

1. `UpdateStart` has been broken into `UpdatePrepare` and
   `UpdatePrepareStatus` (which MGS will continue to send periodically
   until we respond that preparation is done), allowing for updates that
   have a potentially-long running prep step (like updating host flash,
   which can take up to several minutes to erase!).
2. Update messages now include a stream-id that we use to correlate
   related messages; we reject update messages that don't match our
   current stream ID.
3. Add handling for the new `UpdateAbort` abort message to cancel an
   in-progress update.

By far the most complex bit of this is 1: I've moved setting the system
timer out of `mgs_gimlet` (which previously set it only in relation to
flushing serial console uart packets out to MGS) and into `main`: now
the MGS handler only returns the deadline it wants `main` to set. If
we're in the process of prepping for a host flash update (i.e., we need
to erase the host flash), we'll set our deadline to 1 tick from now.
When it fires, we'll erase 8 sectors (takes about 1 second, worst case),
then return and allow `main` to check for other work. This allows us to
continue to be responsive to incoming notifications, importantly network
requests (allowing us to respond to the `UpdatePrepareStatus` messages
in a timely way!).
@jgallagher jgallagher force-pushed the mgmt-gateway-host-boot-flash branch from 5129fd1 to 5959e7a Compare September 14, 2022 19:27
@jgallagher jgallagher merged commit 64ea520 into master Sep 14, 2022
@jgallagher jgallagher deleted the mgmt-gateway-host-boot-flash branch September 14, 2022 19:55
FawazTirmizi pushed a commit to rivosinc/hubris that referenced this pull request Oct 3, 2022
* mgmt-gateway: update host flash

This tracks the corresponding changes to MGS messaging merged in
oxidecomputer/omicron#1684:

1. `UpdateStart` has been broken into `UpdatePrepare` and
   `UpdatePrepareStatus` (which MGS will continue to send periodically
   until we respond that preparation is done), allowing for updates that
   have a potentially-long running prep step (like updating host flash,
   which can take up to several minutes to erase!).
2. Update messages now include a stream-id that we use to correlate
   related messages; we reject update messages that don't match our
   current stream ID.
3. Add handling for the new `UpdateAbort` abort message to cancel an
   in-progress update.

By far the most complex bit of this is 1: I've moved setting the system
timer out of `mgs_gimlet` (which previously set it only in relation to
flushing serial console uart packets out to MGS) and into `main`: now
the MGS handler only returns the deadline it wants `main` to set. If
we're in the process of prepping for a host flash update (i.e., we need
to erase the host flash), we'll set our deadline to 1 tick from now.
When it fires, we'll erase 8 sectors (takes about 1 second, worst case),
then return and allow `main` to check for other work. This allows us to
continue to be responsive to incoming notifications, importantly network
requests (allowing us to respond to the `UpdatePrepareStatus` messages
in a timely way!).
FawazTirmizi pushed a commit to rivosinc/hubris that referenced this pull request Oct 3, 2022
* mgmt-gateway: update host flash

This tracks the corresponding changes to MGS messaging merged in
oxidecomputer/omicron#1684:

1. `UpdateStart` has been broken into `UpdatePrepare` and
   `UpdatePrepareStatus` (which MGS will continue to send periodically
   until we respond that preparation is done), allowing for updates that
   have a potentially-long running prep step (like updating host flash,
   which can take up to several minutes to erase!).
2. Update messages now include a stream-id that we use to correlate
   related messages; we reject update messages that don't match our
   current stream ID.
3. Add handling for the new `UpdateAbort` abort message to cancel an
   in-progress update.

By far the most complex bit of this is 1: I've moved setting the system
timer out of `mgs_gimlet` (which previously set it only in relation to
flushing serial console uart packets out to MGS) and into `main`: now
the MGS handler only returns the deadline it wants `main` to set. If
we're in the process of prepping for a host flash update (i.e., we need
to erase the host flash), we'll set our deadline to 1 tick from now.
When it fires, we'll erase 8 sectors (takes about 1 second, worst case),
then return and allow `main` to check for other work. This allows us to
continue to be responsive to incoming notifications, importantly network
requests (allowing us to respond to the `UpdatePrepareStatus` messages
in a timely way!).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants