Skip to content

pick up Propolis 50cb28f5#5906

Merged
gjcolombo merged 3 commits into
mainfrom
gjcolombo/propolis-update
Jun 18, 2024
Merged

pick up Propolis 50cb28f5#5906
gjcolombo merged 3 commits into
mainfrom
gjcolombo/propolis-update

Conversation

@gjcolombo

Copy link
Copy Markdown
Contributor

Update to Propolis commit 50cb28f5 to obtain the following updates:

50cb28f server: separate statuses of inbound and outbound migrations (#661)
61e1481 Don't return None from BlockData::block_for_req when attachment is paused (#711)
79e243b Build and clippy fixes for Rust 1.79 (#712)
3c0f998 Modernize fw_cfg emulation
7a65be9 Update and prune dependencies

50cb28f5 changes the Propolis instance status and migration status APIs to report both inbound and outbound migration status. Update sled agent accordingly.

Tests: cargo nextest; will also run a couple of instances on a dev cluster before merging.

@hawkw hawkw self-requested a review June 17, 2024 20:27
Comment thread sled-agent/src/common/instance.rs Outdated
Comment on lines +121 to +138
// Decide which of Propolis's reported migrations to pay attention to.
// If this Propolis is currently a migration target (evidenced by its ID
// appearing in the `dst_propolis_id` field in the instance record),
// look at the current migration in. Otherwise, look at the reported
// migration out.
let role = if instance_runtime
.dst_propolis_id
.is_some_and(|id| id == propolis_id)
{
MigrationRole::Target
} else {
MigrationRole::Source
};

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic is going to have to change substantially once sled-agent no longer has an InstanceRuntimeState to make this determination. I think we ought to be able to do this instead using the migration_state in InstanceStates, which should also have the ID of the currently active migration (which would let us check whether the migration in or migration out refers to the current migration).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion, thanks! d9b4ae8 should clean this up a bit.

@hawkw

hawkw commented Jun 17, 2024

Copy link
Copy Markdown
Member

these clippy failures should be fixed by #5907, btw, so rebasing should fix that.

@gjcolombo gjcolombo marked this pull request as ready for review June 17, 2024 23:36
@gjcolombo

Copy link
Copy Markdown
Contributor Author

This looks good in local testing (I've gone through a few instance state transitions and checked to make sure that instance reboot works properly now that the fix for oxidecomputer/propolis#709 is in place). Still need to pick up the fix for #5907 in this branch, but otherwise this is ready for review.

@gjcolombo gjcolombo requested a review from hawkw June 17, 2024 23:45
@hawkw

hawkw commented Jun 17, 2024

Copy link
Copy Markdown
Member

looks great! thanks for changing it to use the MigrationState instead of the InstanceRuntimeState — that'll make my life easier later!

@gjcolombo

Copy link
Copy Markdown
Contributor Author

I appear to have upset the migration integration tests; will investigate.

@gjcolombo

Copy link
Copy Markdown
Contributor Author

The problem is an assertion failure caused by d9b4ae8 (newest commit in this PR). The simplest fix may just be to revert that commit, but I'll think about it some more.

The basic issue is that InstanceStates::apply_propolis_observation assumes that the "migration in succeeded" branch of its logic will only be reached once per migration target. (This assumption is encoded in InstanceStates::switch_propolis_id_to_target, which asserts that the new active Propolis ID it's writing isn't None.) Previously, this was ensured by having ObservedPropolisState::new key off of the migration ID in sled agent's copy of the InstanceRuntimeState: once migration succeeds, sled agent clears the instance's migration ID, so subsequent observations will report that no migration is in progress even if Propolis continues to report on the completed migration.

d9b4ae8 breaks this assumption by making ObservedPropolisState use sled agent's MigrationRuntimeState to decide if a migration is in progress. This migration state is not cleared by InstanceStates::clear_migration_ids, so the existing "migration succeeded" sequence no longer closes the door on itself properly. The fix is unfortunately not as simple as clearing more fields in clear_migration_ids--the MigrationRuntimeState needs to be Some at the end of apply_propolis_observation, or sled agent won't send updated migration state to Nexus.

This is, of course, way more complicated than it needs to be: sled agent shouldn't be in the business of updating instance runtime state at all, and if it weren't there would be no assertions to trip. Unfortunately, we're not there quite yet, so we'll need to stitch something together to keep this working in the meantime.

@gjcolombo gjcolombo force-pushed the gjcolombo/propolis-update branch from d9b4ae8 to 26cb441 Compare June 18, 2024 16:25
The instance runtime state machine in sled agent requires that an
inbound migration appear to be completed only once (so that there's only
one "move the destination Propolis ID to the current Propolis ID and
clear the destination ID" operation per successful migration). This
operation (clearing the instance's active migration ID) is itself what
prevents future Propolis state updates from being interpreted as a
completed migration (once the migration ID is cleared from sled agent's
runtime state, it will no longer match the old ID that Propolis is
reporting).

In the brave new world where sled agent doesn't track instance runtime
state at all, substantially all of this code will be removed: there will
be no instance runtime state to transition to anything; sled agent will
just pass migration statuses straight through to Nexus.
@gjcolombo

Copy link
Copy Markdown
Contributor Author

I haven't rebased this PR on top of #5907 yet (so as not to lose my place in the buildomat queue), but I've done so locally, run cargo xtask clippy, and see no errors.

@gjcolombo gjcolombo force-pushed the gjcolombo/propolis-update branch from 26cb441 to ecc9c33 Compare June 18, 2024 17:21
@hawkw

hawkw commented Jun 18, 2024

Copy link
Copy Markdown
Member

As we discussed on Matrix, I'm fine with the change back to using InstanceRuntimeState for now so that everything can keep working. I'll get rid of the use of InstanceRuntimeState when removing it from sled-agent in #5749.

@gjcolombo gjcolombo enabled auto-merge (squash) June 18, 2024 19:21
@gjcolombo gjcolombo merged commit 7fb6cb4 into main Jun 18, 2024
@gjcolombo gjcolombo deleted the gjcolombo/propolis-update branch June 18, 2024 19:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants