Skip to content

[nexus] add basic support for expunged sled policy and decommissioned sled state#5032

Merged
sunshowers merged 9 commits into
mainfrom
sunshowers/spr/wip-nexus-add-basic-support-for-expunged-sled-policy-and-decommissioned-sled-state
Feb 23, 2024
Merged

[nexus] add basic support for expunged sled policy and decommissioned sled state#5032
sunshowers merged 9 commits into
mainfrom
sunshowers/spr/wip-nexus-add-basic-support-for-expunged-sled-policy-and-decommissioned-sled-state

Conversation

@sunshowers

@sunshowers sunshowers commented Feb 8, 2024

Copy link
Copy Markdown
Contributor

This PR does a few things:

  • Migrates our current sled_provision_state to a new sled_policy enum, which has more information (per RFD 457). This PR implements the expunged state, not the graceful removal state.
  • Adds a sled_state enum, which describes Nexus's view of the sled. This PR adds the active and decommissioned states.
  • Adds internal code to move around between valid states.
  • Makes the blueprint execution code aware of sleds eligible for discretionary services.
  • Adds tests for all of this new stuff, as well as valid and invalid state transitions -- and also makes sure that if we do end up in an invalid state, things don't break down.

Not done here, but in future PRs (to try and keep this PR a manageable size):

Questions

  • Should we make sled_list hide decommissioned sleds by default?
  • Do we need any unique indexes on the sled table -- maybe something like "each non-decommissioned sled ID should have a unique serial number"?

TODO:

  • State transition tests.
  • Blueprint tests.
  • Update all the other places that need the policy.
  • Region allocation tests.
  • Ensure that sled state isn't decommissioned while trying to change sled policy.
  • DB migrations + tests to ensure migrations happen successfully.

Created using spr 1.3.5
Created using spr 1.3.6-beta.1
@sunshowers sunshowers marked this pull request as draft February 17, 2024 03:11
@sunshowers

Copy link
Copy Markdown
Contributor Author

Almost done here, just need to finish up the region allocation tests. (I'll do the db migrations after acceptance.)

Created using spr 1.3.6-beta.1

@andrewjstone andrewjstone left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fantastic @sunshowers! Great work. Once you add the DB migrations I'll take a quick look and approve.

Comment thread nexus/db-model/src/sled.rs Outdated
Comment thread nexus/db-model/src/sled_policy.rs
Comment thread nexus/db-model/src/sled_state.rs
Comment thread nexus/db-queries/src/db/datastore/sled.rs Outdated
Comment thread nexus/db-queries/src/db/datastore/sled.rs Outdated
Comment thread nexus/db-queries/src/db/datastore/sled.rs Outdated
Comment thread nexus/db-queries/src/db/datastore/sled.rs
Comment thread nexus/db-queries/src/db/datastore/sled.rs
Comment thread nexus/db-queries/src/db/datastore/sled.rs
Comment thread nexus/db-queries/src/db/datastore/sled.rs
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
@sunshowers sunshowers changed the title [WIP] [nexus] add basic support for expunged sled policy and decommissioned sled state [nexus] add basic support for expunged sled policy and decommissioned sled state Feb 23, 2024
@sunshowers sunshowers marked this pull request as ready for review February 23, 2024 07:53
@sunshowers

Copy link
Copy Markdown
Contributor Author

All right, finally ready for review. I had to break up the migration into two, where the second one just drops the column and deletes the type. I believe this is required for idempotence reasons.

Would love it if folks had thoughts on the questions as well, but those aren't critical, and otherwise this should be ready to go. Thanks for your patience, I ended up spending a lot of time refactoring and chasing down bugs.

@jgallagher jgallagher left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Just some minor questions and nits.

Comment thread schema/crdb/37.0.0/up02.sql Outdated
Comment thread nexus/db-model/src/sled_policy.rs
Comment thread nexus/db-queries/src/db/datastore/sled.rs
Comment thread nexus/db-queries/src/db/datastore/sled.rs Outdated
Comment thread nexus/db-queries/src/db/datastore/sled.rs
Comment thread nexus/db-queries/src/db/datastore/test_utils.rs
Comment thread nexus/deployment/src/planner.rs Outdated
Created using spr 1.3.6-beta.1
@sunshowers sunshowers enabled auto-merge (squash) February 23, 2024 19:45

@davepacheco davepacheco left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! Thanks for doing this. It's a big chunk of work.

Comment thread nexus/db-queries/src/db/datastore/sled.rs Outdated
Comment thread nexus/deployment/src/planner.rs Outdated
/// This should be updated whenever the schema is changed. For more details,
/// refer to: schema/crdb/README.adoc
pub const SCHEMA_VERSION: SemverVersion = SemverVersion::new(36, 0, 0);
pub const SCHEMA_VERSION: SemverVersion = SemverVersion::new(37, 0, 1);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have two schema versions in this PR?

Why do we use the micro number? I thought schema versions were always supposed to bump the major.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason we need two schema versions is that we have to make two separate transactions to achieve an idempotent result.

The order of operations we need to perform is:

  1. Add the new sled_policy and sled_state enum types.
  2. Add the new sled_policy and sled_state columns.
  3. Update sled_policy to the new value based on the provision_state column.
  4. Remove the provision_state column.
  5. Remove the sled_provision_state enum types.

1 and 2 are inherently idempotent, so those are fine.

3 by itself is also fine to repeat.

But consider what happens if step 4 occurs, then the process fails, and we go back to 1 again. We'd repeat 3, but the corresponding column no longer exists so it will fail.

Solving this can be done in one of two ways:

  1. Find a way to make 3 only perform the update if the provision_state column exists. This is possible, but somewhat complicated.
  2. Add a transaction commit between 3 and 4. This is the approach I ended up using.

The reason it's 37.0.0 and 37.0.1 is because both updates are part of the same PR. This follows the pattern in #4261.

Created using spr 1.3.6-beta.1
@sunshowers sunshowers enabled auto-merge (squash) February 23, 2024 22:50
@sunshowers sunshowers merged commit a6ef7f9 into main Feb 23, 2024
@sunshowers sunshowers deleted the sunshowers/spr/wip-nexus-add-basic-support-for-expunged-sled-policy-and-decommissioned-sled-state branch February 23, 2024 23:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants