[nexus] add basic support for expunged sled policy and decommissioned sled state#5032
Conversation
Created using spr 1.3.5
Created using spr 1.3.6-beta.1
|
Almost done here, just need to finish up the region allocation tests. (I'll do the db migrations after acceptance.) |
Created using spr 1.3.6-beta.1
andrewjstone
left a comment
There was a problem hiding this comment.
This looks fantastic @sunshowers! Great work. Once you add the DB migrations I'll take a quick look and approve.
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
|
All right, finally ready for review. I had to break up the migration into two, where the second one just drops the column and deletes the type. I believe this is required for idempotence reasons. Would love it if folks had thoughts on the questions as well, but those aren't critical, and otherwise this should be ready to go. Thanks for your patience, I ended up spending a lot of time refactoring and chasing down bugs. |
Created using spr 1.3.6-beta.1
jgallagher
left a comment
There was a problem hiding this comment.
LGTM! Just some minor questions and nits.
Created using spr 1.3.6-beta.1
davepacheco
left a comment
There was a problem hiding this comment.
Very nice! Thanks for doing this. It's a big chunk of work.
| /// This should be updated whenever the schema is changed. For more details, | ||
| /// refer to: schema/crdb/README.adoc | ||
| pub const SCHEMA_VERSION: SemverVersion = SemverVersion::new(36, 0, 0); | ||
| pub const SCHEMA_VERSION: SemverVersion = SemverVersion::new(37, 0, 1); |
There was a problem hiding this comment.
Why do we have two schema versions in this PR?
Why do we use the micro number? I thought schema versions were always supposed to bump the major.
There was a problem hiding this comment.
The reason we need two schema versions is that we have to make two separate transactions to achieve an idempotent result.
The order of operations we need to perform is:
- Add the new
sled_policyandsled_stateenum types. - Add the new
sled_policyandsled_statecolumns. - Update
sled_policyto the new value based on theprovision_statecolumn. - Remove the
provision_statecolumn. - Remove the
sled_provision_stateenum types.
1 and 2 are inherently idempotent, so those are fine.
3 by itself is also fine to repeat.
But consider what happens if step 4 occurs, then the process fails, and we go back to 1 again. We'd repeat 3, but the corresponding column no longer exists so it will fail.
Solving this can be done in one of two ways:
- Find a way to make 3 only perform the update if the
provision_statecolumn exists. This is possible, but somewhat complicated. - Add a transaction commit between 3 and 4. This is the approach I ended up using.
The reason it's 37.0.0 and 37.0.1 is because both updates are part of the same PR. This follows the pattern in #4261.
Created using spr 1.3.6-beta.1
This PR does a few things:
sled_provision_stateto a newsled_policyenum, which has more information (per RFD 457). This PR implements the expunged state, not the graceful removal state.sled_stateenum, which describes Nexus's view of the sled. This PR adds theactiveanddecommissionedstates.Not done here, but in future PRs (to try and keep this PR a manageable size):
time_deletedbecause it has a lifecycle too complicated to be described that way -- instead, we'll add atime_decommissionedfield: Replace sled's time_deleted with time_decommissioned #5131Questions
sled_listhide decommissioned sleds by default?TODO: