[nexus] During DB schema update, avoid re-applying sub-updates#5293
Conversation
| "observed_target_version" => %found_target_version, | ||
| "attempted_target_version" => %target_version, | ||
| ); | ||
| continue; |
There was a problem hiding this comment.
Do we need to set prior_step_version before continueing? If so, could we rework the tests to fail without doing so? (That might be easier after #5274, when we can pass in the schema versions instead of a directory to read.) This sequence seems like it won't work:
- Update to
N.0.0-step.1is successful - Update to
N.0.0-step.2fails; we bail out - Next attempt, we land here and note that
N.0.0-step.2 > N.0.0-step.1, so wecontinue;prior_step_versionis stillNone - We move on to
N.0.0-step.2, and callprepare_schema_update. Becauseprior_step_versionisNone, itsvalid_prior_targetslist only includesN.0.0, notN.0.0-step.1, so it fails.
There was a problem hiding this comment.
May also be worth factoring out into another function to make it so that prior_step_version only needs to be updated once, and it would still be correct. continue with mutable state like this always makes me a little worried.
There was a problem hiding this comment.
Okay, keep me honest here - I'm on board with a way to make this simpler, but:
I think it's okay to keep the prior_step_version as "the latest version we think our Nexus has applied".
@jgallagher , in the case you mention, once we get to that fourth step (calling prepare_schema_update) , you're right that our prior_step_version is None, but the to_version parameter would still be N.0.0-step.2 (we're trying to incrementally work on that step) so it would be included in valid_prior_targets. That would also be the target version stored in the DB, since that's the last thing we updated before failing on the prior attempt to update the schema.
(I think this was true before my merge as well, but less obvious, because I was mutating target_version. Now, I'm using a different variable named target_step_version to make this a little more clear).
There was a problem hiding this comment.
Also, the test I added below really should be catching cases like this (and has, locally, when I've gotten it wrong) because it does trigger this "Fail a step and force retry" behavior.
There was a problem hiding this comment.
to_versionparameter would still beN.0.0-step.2
Is this guaranteed? What if we succeed in updating to N.0.0-step.1, and then prepare_schema_update fails to set the new target version of N.0.0-step.2?
There was a problem hiding this comment.
If that's the case, we'd re-apply "step 1", redoing the work of N.0.0-step.1, which should be fine since individual steps must be idempotent anyway, right?
Also, regardless: I'm taking this feedback, and refactored a bit of this to try to make it easier to follow -- though I think the functionality is the same.
There was a problem hiding this comment.
Yes, you're right. Thanks for sticking with me on this one, and for the refactor - I find it more clearly correct for sure.
|
Just dropping some notes here from chat -- thanks @jgallagher. Here's a somewhat more concrete example of the problem we're trying to solve: say we're at 46.0.0 and we want to move to 47.0.0. Say that transition has 8 steps and that step 6 invalidates the preconditions for step 2. (This is the key. It does not seem common but it might be easy to accidentally do.) Today if this happens, and then Nexus crashes, then when it resumes it'll try to re-execute all the steps, and so it'll re-execute step 2, but it'll fail (by construction, because step 6 has broken step 2's preconditions). Then it'll be stuck unable to move forward or backward. Previous to this PR, we'd set target_version = 47.0.0 at the start of the migration from 46.0.0 to 47.0.0 Then we'd go apply a bunch of SQL transactions, each conditional on the target being 47.0.0. This PR addresses the problem above by including the step number in the target version (so that we store target version As I understand it we're not preventing the system from applying the same update step twice. That can still happen if two Nexus instances are doing it concurrently. The new guarantee that's being provided is that if you're updating from, say, 46.0.0 to 47.0.0 and you're talking about step 6, this step will only ever be applied to a database in one of two schema states:
(whereas previously it could have been applied to a database at 46.0.0 plus steps 1-7 or steps 1-8 of the 46.0.0 -> 47.0.0 migration). |
Totally agreed with this -- fortunately, we do have tests for "single steps being idempotent" already, just not "going backwards from step 6 to step 2". |
sunshowers
left a comment
There was a problem hiding this comment.
Thanks for doing this! So much easier to reason about.
| "observed_target_version" => %found_target_version, | ||
| "attempted_target_version" => %target_version, | ||
| ); | ||
| continue; |
There was a problem hiding this comment.
May also be worth factoring out into another function to make it so that prior_step_version only needs to be updated once, and it would still be correct. continue with mutable state like this always makes me a little worried.
| } | ||
|
|
||
| #[tokio::test] | ||
| async fn schema_version_subcomponents_save_progress() { |
There was a problem hiding this comment.
Does the versions_have_idempotent_up test need to be updated to test by file rather than by directory?
There was a problem hiding this comment.
I think it's still okay as-is -- that particular test, as well as some of the other schema tests, are "testing the schema changes" more than they are "testing the mechanism to apply schema changes".
I say this referencing the following block of code:
omicron/nexus/tests/integration_tests/schema.rs
Lines 141 to 150 in a655ff2
Which appears to already be re-running each individual step, rather than "the whole directory at once".
There was a problem hiding this comment.
Although maybe that loop needs to be inverted? Seems like we want to apply each step times_to_apply times, not the whole set....
There was a problem hiding this comment.
I updated this in 36fdb61 - I think you're right, the test was checking the wrong thing before, and should be fixed now.
There was a problem hiding this comment.
Okay, this requires more elaboration. When I updated this test, I noticed that it actually started failing.
Specifically, this version threw an error:
omicron/schema/crdb/10.0.0/up05.sql
Line 1 in a655ff2
As, once the value is dropped, it no longer exists.
Postgres supports syntax to ALTER TYPE DROP VALUE IF EXISTS: https://www.postgresql.org/docs/current/sql-altertype.html
But CockroachDB does not: https://www.cockroachlabs.com/docs/stable/alter-type
To mitigate:
- I have filed a bug against CockroachDB as a feature request: ALTER TYPE DROP VALUE does not support "IF EXISTS" suffix cockroachdb/cockroach#120801
- In 28e6876 , I make an exemption for this schema version
Fixes #5292