Skip to content

[db_metadata] Parse schema on migration, await backfill#9878

Merged
smklein merged 21 commits into
mainfrom
migration-await-backfill
Feb 28, 2026
Merged

[db_metadata] Parse schema on migration, await backfill#9878
smklein merged 21 commits into
mainfrom
migration-await-backfill

Conversation

@smklein

@smklein smklein commented Feb 18, 2026

Copy link
Copy Markdown
Collaborator

Non-atomic schema changes ARE real and CAN hurt you

  • Adds CRDB-specific preprocessing and sqlparser parsing to each schema change within Nexus
  • Verifies that each change contains "at most one" schema-modifying DDL statement
  • Adds a secondary transaction to each "apply schema change" step, which verifies that the underlying backfill operation has completed. The objective here is to transform "delayed, async DDL statements" into synchronous operations, so we can know "it succeeded fully" or "it failed" before moving on

Part of #9866
Fixes #9888

@david-crespo

Copy link
Copy Markdown
Contributor

Wild, but also way simpler than I expected.

/// verification queries.
///
/// If verification fails, the error propagates immediately — the outer
/// startup retry loop will re-attempt the entire migration.

@david-crespo david-crespo Feb 18, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's worth spelling out (at least in the PR, if not in the comment) exactly how this prevents the problem we saw. What prevents this from running into the same error over and over upon re-running the migration?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without @sunshowers ' fix for #9874: this would continue failing over and over again.

This PR is intended to prevent Nexuses from "crossing the barrier" of an individual schema change (basically: trying to help us treat them as synchronous atomic steps again). It makes no changes for "Max memory/batch sizes" nor changes to attempted retries.

I can update the PR description to make this more clear!

Comment thread nexus/db-model/src/schema_versions.rs Outdated
/// detect them via regex here, in the same pass that removes them.
///
/// This is **only for classification** — the original SQL is always what
/// gets executed against the database.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling out this line for other reviewers — important!

Comment thread nexus/db-model/src/schema_versions.rs Outdated
// Both are valid PostgreSQL, but sqlparser only handles the
// bracket notation. We match `ARRAY` followed by `,` or end-of-
// definition (not `ARRAY[` which is the array literal syntax).
let type_array_re = Regex::new(r"(?i)(\w+)\s+ARRAY\s*([,)\n])").unwrap();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is impressive, but honestly turns me off of the whole sqlparser plan entirely; I didn't realize it couldn't parse our schema.

Is it worth going back to the idea board for other ways to accomplish this?

@david-crespo david-crespo Feb 18, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it is wild, but I would vote against "throw it out" at this point on the grounds that the main failure mode here is a test failure due to SQL not parsing. The scarier failure mode is a modification significant enough to be classified wrong, but I feel like we should be able to rule that out decently well with testing. I also wonder if it's worth considering a different parser like https://crates.io/crates/pg_query (much lower download count but still seems legit) since it's just a dev dep. I'm having the robot try it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not worth it — sounds like we might have to do actual work to get this to work on illumos — but I'll give it a shot and see how it looks.

image

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm currently working on eliminating a chunk of the regex by just pulling in a more recent version of sqlparser. Will look into the difficulty of creating our own dialect next.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a lot better now. Meanwhile $10 later my pg_query attempt is crap.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm gonna try to tweak this further to basically do the following:

  1. Try to remove all DML
  2. Do the crdb-specific pre-processing, but only for the DDL which could have back-filled operations
  3. Treat all other DDL as "Some other DDL", which we don't pass to sqlparser

This should reduce the regex usage further, and hopefully get it into a more bearable territory. The amount of crdb-specific regex will scale with the amount of "actually async backfilling DDL" operations we're catching, but maybe that will tolerable enough to let us punt on a CRDB-specific SQL parser a little longer...

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yep, right about the dep. But the logic of this function is identical between dev and prod because it's based on the hard coded SQL on hand. If you wanted to, you could avoid running this function in production by pre-processing the queries, couldn't you? Like you could do the classification at dev time, write it down, and have a snapshot test that makes sure the classification is up to date with the code.

I believe this is true

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this latest version strips the regex about as far as it can go, by partitioning the world into "DDL we care about parsing" and "Other DDL which we aren't gonna bother parsing at all".

@jgallagher jgallagher Feb 19, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yep, right about the dep. But the logic of this function is identical between dev and prod because it's based on the hard coded SQL on hand. If you wanted to, you could avoid running this function in production by pre-processing the queries, couldn't you? Like you could do the classification at dev time, write it down, and have a snapshot test that makes sure the classification is up to date with the code.

I believe this is true

I like this idea a lot. If we something like an expectorate test that this code filled in to maintain a static list of migration steps that need particular async backfill followup, we'd get:

  • SQL parsing out of prod and into tests - prod can read a compiled form of the static list
  • A chance to confirm in every PR review that the parser is catching all the new indices being created (I wouldn't want to rely on this at all - that's the whole point of tooling - but it seems good that there's an opportunity for us to review)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I went ahead and did that, and updated the docs. sqlparser and regex are now dev-dependencies, and we have an EXPECTORATE test that generates .verify.sql files which sit alongside the migrations.

This also retroactively added checks for all migrations, which I figured would be worthwhile for perusal. However, if we want me to drop that generation (e.g., only generate the .verify.sql files for relatively new migrations) I could do that too.

@smklein smklein force-pushed the migration-await-backfill branch from f792e27 to 0bf9368 Compare February 18, 2026 21:48
@smklein smklein changed the base branch from main to sqlparser February 18, 2026 21:48
@smklein smklein marked this pull request as ready for review February 20, 2026 15:34
Comment on lines +2452 to +2454
// ---------------------------------------------------------------
// Integration tests for schema change verification
// ---------------------------------------------------------------

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to include a fair number of unit tests in this PR, but this sequence of integration tests is pretty dang important for the behavior we're trying to test.

These cover the verification queries:

  • CREATE INDEX IF NOT EXISTS
    • Checking that the verification query passes/fails if the index exists/doesn't exist.
    • Checking that concurrent invocations of index creation can cause some callers to "skip" the backfill, and that verification acts as a back-stop.
  • ALTER TABLE ADD CONSTRAINT
    • Checking that the verification query passes/fails if the constraint exists/doesn't exist
    • Checking that concurrent invocations of constraint creation can cause some callers to "skip" the backfill, and that verification acts as a back-stop.

Additionally, these cover some DDL changes that have backfill, but don't need verification queries:

  • ADD COLUMN IF NOT EXISTS ... NOT NULL DEFAULT
    • Confirms that concurrent callers block each other out, so no verification query is needed
  • ALTER COLUMN ... SET NOT NULL
    • Confirms that concurrent callers block each other out, so no verification query is needed

(an earlier revision of this PR emitted verification queries for these last two cases, but that has since been removed, as they are not necessary)

Base automatically changed from sqlparser to main February 23, 2026 19:11
///
/// Permits only `[a-zA-Z0-9_]` characters, which covers all valid
/// SQL identifiers and our enum variant values.
#[doc(hidden)]

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This struct, and SchemaChangeInfo, are pretty much only used at test-time, but they're used by multiple test modules, so it's handy to have them exposed publicly. I could probably use some cargo manipulation to do a "testing" feature flag, but for the moment I just opted to "use them in cfg(test) stuff, and keep them doc(hidden)".

// NOT VALID constraints are synchronous — no async
// validation job runs, so there is no race to guard
// against and no verification query is needed.
if *not_valid {

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have exactly one migration which adds a constraint with the NOT VALID keyword (it was one of yours, @david-crespo !) but it seems like a legit use-case, so I added it here.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably because only Claude knows about these arcane SQL features. I would never have come up with it myself.

///
/// Returns trimmed, non-empty statement fragments with comments
/// stripped.
fn split_and_strip_sql(sql: &str) -> Vec<String> {

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was "strip comments" and "split statements" as two separate functions, but they kinda need to know about each other.

I know there's a ton of string manipulation going on here, but:

  • I have a lot of tests for it
  • I added proptests for it
  • It'll only be running at test-time (EXPECTORATE time) anyway

@smklein

smklein commented Feb 24, 2026

Copy link
Copy Markdown
Collaborator Author

Merged with #9889 , now using the max_sql_memory_mib argument. Thanks @sunshowers

@david-crespo

david-crespo commented Feb 27, 2026

Copy link
Copy Markdown
Contributor

For posterity, I had gpt-5.3-codex and Opus 4.6 go through the CockroachDB source on our fork and make the argument that the verification queries in this PR work to ensure that a create index operation is complete when we think it's complete. We did this while we were debugging this issue and working out the solution, but I think it's worth documenting the result of that analysis canonically.

It's long, so I'm including it here collapsed so it doesn't take on more importance than it deserves. I think it is a very good account of our understanding of this change at this time. This was produced by ping-ponging back and forth between the two models, having Codex put together the initial argument, then Claude evaluate it and make it more readable, and back and forth a few more times.

Full argument for this change

This PR changes Nexus migrations from "execute DDL step and immediately move on" to "execute DDL step, then run a generated verification query, and only then advance migration state/version." The reason is #9866/#9888: for backfill-prone DDL, Cockroach can acknowledge the statement before asynchronous schema work has fully completed.

The rest of this comment argues that the verification predicates used by the PR are correct — that they become true only after Cockroach has actually finished the relevant async schema-change work, making them valid barriers before Nexus advances migration state. The argument is against release-22.1-oxide; all code links below are permalinks to the current tip of that branch.

Why CREATE INDEX IF NOT EXISTS needs a barrier

CockroachDB schema changes are asynchronous. When you run CREATE INDEX, Cockroach adds the index to the table descriptor as a mutation (an in-progress schema change) and returns. The actual work — backfilling data into the index, validating uniqueness constraints — happens asynchronously in the schema changer.

The problem is that IF NOT EXISTS is too permissive about what counts as "exists." In create_index.go#L659-L669, CREATE INDEX calls FindIndexWithName, which searches with AddMutations: true and DropMutations: true (table_desc.go#L361-L366). If it finds a match — even an index that's mid-backfill — IF NOT EXISTS returns success immediately. So a second Nexus replica running the same migration can get "success" before the index is actually usable.

Why crdb_internal.table_indexes is a correct barrier

The PR verifies index creation by checking crdb_internal.table_indexes. This virtual table is stricter than IF NOT EXISTS: it iterates indexes via catalog.ForEachIndex with default IndexOpts (crdb_internal.go#L2751-L2764), which excludes mutations. ForEachIndex filters out any index where Adding() or Dropped() is true unless you opt in (table_elements.go#L532-L540). So crdb_internal.table_indexes shows only public indexes — those that have graduated from mutation status.

The remaining question is whether an index becomes public only after the backfill and validation work is actually done. It does. The schema changer's lifecycle runs in three phases (schema_changer.go#L1959-L1971): first advance mutation state, then runBackfill (which includes data backfill, merge from a temporary index that captured concurrent writes, and validateIndexesbackfill.go#L1931-L1940, backfill.go#L2072-L2073), and only if that succeeds, done(). If runBackfill errors, done() is never called and the index stays in mutation state.

done() is the function that actually makes the index public: it calls MakeMutationComplete for each mutation (schema_changer.go#L1588), which for index-add mutations calls AddSecondaryIndex to insert the index into the public index list (structured.go#L1613-L1623), then trims the mutation from the descriptor (schema_changer.go#L1729-L1731).

So the chain is: an index appears in crdb_internal.table_indexes only if it's public, it becomes public only in done(), and done() runs only after backfill and validation succeed. Checking this table is a valid barrier for the IF NOT EXISTS race.

Why SHOW CONSTRAINTS ... validated = true is a correct barrier for ADD CONSTRAINT

The same pattern applies to constraint creation. The PR verifies ADD CONSTRAINT by checking SHOW CONSTRAINTS for validated = true.

SHOW CONSTRAINTS projects pg_constraint.convalidated as the validated column (show_table.go#L196-L210). convalidated is set to !con.Unvalidated (pg_catalog.go#L1006-L1014). And the constraint detail code marks any constraint whose Validity is not Validated — including Validating — as Unvalidated (table.go#L360-L362, table.go#L382-L384, table.go#L418-L420). So validated = true requires Validity == Validated, and a constraint that's still in the Validating phase reads as validated = false.

The transition to Validated happens in the same done()MakeMutationComplete path: the function finds constraints in Validating state and flips them to Validated (structured.go#L1629-L1634, structured.go#L1646-L1652, structured.go#L1664-L1670). As with indexes, done() only runs after runBackfill — which includes validateConstraints (backfill.go#L341-L360) — completes successfully.

So for both verification forms used by the PR, the predicate becoming true corresponds to a post-condition stronger than "DDL statement returned OK" — it means completion in the schema changer's state machine sense, which is the property needed to avoid the #9866 / #9888 class of false-progress migration outcomes.

Safety vs. liveness under descriptor caching

The verification query may run on a different CRDB node than the one that executed the DDL. Descriptor lease/cache lag on other nodes can produce temporary false negatives, where a verification query doesn't yet see the now-public index or validated constraint. This affects liveness (Nexus retries the verification) but not safety (Nexus never advances migration state prematurely). The dangerous direction would be a false positive — seeing a public index before backfill completes — which the argument above rules out: an index or constraint can only appear as public after done() atomically commits the updated descriptor, and no stale cache can conjure a public index that was never written.

Scope and caveats

This argument covers the legacy schema changer job path. Our fork defaults use_declarative_schema_changer to on (exec_util.go#L477-L489), which means the declarative path is used for "supported statements in implicit transactions" and falls back to legacy otherwise. Nexus runs migration DDL inside explicit transactions (via transaction_retry_wrapper), so the legacy schema changer is always the active path for these statements. The declarative schema changer uses different code paths and state machines; this comment doesn't address it, but it also isn't exercised here. There are also other MakeMutationComplete call sites for scenarios like in-transaction schema changes on newly created tables and TRUNCATE on empty tables, but those aren't the async existing-table path behind #9866.

One concern worth addressing explicitly: could there be a window during done() where the index has been removed from the mutations list but not yet added to the public list (or vice versa), creating a moment where a verification query sees an inconsistent descriptor? No — done() modifies the descriptor in-memory (both MakeMutationComplete and trimming mutation entries), then writes and commits it in a single transaction via WriteDescToBatch + txn.Run (schema_changer.go#L1729-L1731, schema_changer.go#L1852-L1856). External readers see either the pre-commit descriptor (index still a mutation) or the post-commit descriptor (index public, mutation trimmed), never a half-state.

Finally, this proof covers the two verification predicates the PR actually uses: crdb_internal.table_indexes for index existence and SHOW CONSTRAINTS ... validated = true for constraint validation. If future work adds verification for other backfill-prone DDL types, each new predicate would need its own argument.

@smklein smklein enabled auto-merge (squash) February 27, 2026 21:34
@smklein smklein merged commit 5a06ffa into main Feb 28, 2026
17 checks passed
@smklein smklein deleted the migration-await-backfill branch February 28, 2026 03:40
@smklein smklein mentioned this pull request Mar 2, 2026
smklein added a commit that referenced this pull request Mar 3, 2026
Fixing a conflict between
#9898 and
#9878

(New tests + new postgres -> changes how error messages got exposed)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Asynchronous CRDB schema migrations can break our migration engine

3 participants