Asynchronous CRDB schema migrations can break our migration engine #9888

@smklein

See: #9866 for a lot more context, and an example of the specific case here, but:

  • There are some schema-changing CRDB operations, such as CREATE INDEX IF NOT EXISTS ..., which operate non-atomically, in effect running in two phases: (1) creating the metadata that represents the index, and (2) backfilling data to populate that index.
  • When a single Nexus performs this operation, we've observed experimentally that it appears to block, waiting for both phases to occur.
  • However, when multiple Nexuses perform this operation, the first one will "wait for the backfill", but the others won't. If phase (1) is complete but phase (2) is still in progress, a second Nexus issuing CREATE INDEX IF NOT EXISTS ... will happily return "success!" quickly and move on, even though the index hasn't been fully created (the first Nexus is still awaiting the backfill).
  • This is especially bad because backfilling can fail, and if this happens, the index is reverted.
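The sequence above can be sketched as a toy model. This is only an illustration of the race as described in this issue, not CockroachDB's real behavior or API: `ToyDatabase`, `IndexState`, `create_index_if_not_exists`, `finish_backfill`, and `simulate` are all hypothetical names, and the "version bump" stands in for whatever Nexus records once it believes a step is done.

```python
from dataclasses import dataclass
from enum import Enum


class IndexState(Enum):
    ABSENT = "absent"
    BACKFILLING = "backfilling"  # phase (1) done: metadata exists, data not populated
    READY = "ready"              # phase (2) done: backfill finished


@dataclass
class ToyDatabase:
    """Hypothetical stand-in for the database; not a CockroachDB API."""
    index_state: IndexState = IndexState.ABSENT
    schema_version: int = 1

    def create_index_if_not_exists(self) -> bool:
        """Return True if this caller started (and must await) the backfill.

        Return False if the index metadata already existed, in which case
        the caller gets "success" immediately without waiting.
        """
        if self.index_state is IndexState.ABSENT:
            self.index_state = IndexState.BACKFILLING
            return True
        return False

    def finish_backfill(self, succeeded: bool) -> None:
        # A failed backfill reverts the index, as described above.
        self.index_state = IndexState.READY if succeeded else IndexState.ABSENT


def simulate(backfill_succeeds: bool) -> ToyDatabase:
    db = ToyDatabase()
    first_waits = db.create_index_if_not_exists()   # first Nexus: starts the backfill
    second_waits = db.create_index_if_not_exists()  # second Nexus: returns right away
    assert first_waits and not second_waits

    # The second Nexus believes the step is complete and records the new version.
    db.schema_version = 2

    # Only now does the first Nexus's backfill resolve.
    db.finish_backfill(backfill_succeeds)
    return db


if __name__ == "__main__":
    db = simulate(backfill_succeeds=False)
    print(db.schema_version, db.index_state.value)  # prints "2 absent"
```

In the failing case the recorded version says the step was applied while the index is gone, which is the mismatch this issue is about.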

All this is to say: async, backfilling operations can break our schema changes. They violate assumptions we make about those changes (they should be transactional! we should only progress past each step once it's complete!) and make half-broken schema changes possible, even though the "version scheme" in Nexus might suggest that everything is happily updated.
