See: #9866 for a lot more context, and an example of the specific case here, but:
- There are some schema-changing CRDB operations, such as `CREATE INDEX IF NOT EXISTS ...`, which operate non-atomically, effectively in two phases: (1) creating the metadata that represents the index, and (2) backfilling data to populate that index.
- When a single Nexus performs this operation, we've observed experimentally that it appears to block, waiting for both phases to occur.
- However, when multiple Nexuses perform this operation, the first one will "wait for the backfill", but the others won't. If phase (1) is complete but phase (2) is still in progress, a second Nexus issuing `CREATE INDEX IF NOT EXISTS ...` will happily return "success!" quickly and move on, even though the index hasn't been fully created (by the first Nexus, which is still awaiting the backfill).
- This is especially bad because backfilling can fail, and if this happens, the index is reverted.
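
The sequence described in the bullets above can be sketched with a toy in-memory model. This is only an illustration of the observed behavior, not CockroachDB's implementation: `ToyDatabase`, its method names, the `"backfilling"`/`"ready"` states, and the index name `my_index` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDatabase:
    """Toy model of a database whose index creation has two phases."""

    # Index name -> "backfilling" or "ready"; an absent key means no index.
    indexes: dict = field(default_factory=dict)

    def create_index_if_not_exists(self, name: str) -> str:
        """Phase (1) only: record the metadata unless it already exists.

        Returns "started" if this caller created the metadata (and so would
        wait on the backfill), or "already exists" if the metadata was
        already present, whether or not the backfill has finished.
        """
        if name in self.indexes:
            return "already exists"
        self.indexes[name] = "backfilling"
        return "started"

    def finish_backfill(self, name: str, ok: bool) -> None:
        """Phase (2): mark the index ready, or revert it on failure."""
        if ok:
            self.indexes[name] = "ready"
        else:
            del self.indexes[name]


db = ToyDatabase()

# First Nexus: creates the metadata, then would block on the backfill.
first = db.create_index_if_not_exists("my_index")

# Second Nexus: arrives mid-backfill, sees the metadata, reports success.
second = db.create_index_if_not_exists("my_index")
state_seen_by_second = db.indexes["my_index"]

# The backfill then fails, and the index is reverted.
db.finish_backfill("my_index", ok=False)
index_exists_after_failure = "my_index" in db.indexes
```

In this model the second caller moves on while the index is still `"backfilling"`, and after the failed backfill the index no longer exists at all, even though the second caller already saw "success".
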
All this is to say: async, backfilling operations can break our schema changes. They break assumptions we have about our schema changes (they should be transactional! we should only progress past each step once it's complete!) and make it possible for half-broken schema changes to appear, even though the "version scheme" in Nexus might make it appear like everything is happily updated.
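
To make the "only progress past each step once it's complete" assumption concrete, here is a minimal sketch of a completion gate. It is an assumption-laden illustration, not what Nexus does today: `wait_until_complete` and `apply_step` are hypothetical names, and `is_complete` is a placeholder for whatever check would confirm that a backfill has actually finished.

```python
import time
from typing import Callable


def wait_until_complete(
    is_complete: Callable[[], bool],
    timeout_s: float = 5.0,
    poll_interval_s: float = 0.01,
) -> bool:
    """Poll `is_complete` until it returns True or the timeout expires.

    Returns True if completion was observed, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_complete():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)


def apply_step(
    run_step: Callable[[], None],
    is_complete: Callable[[], bool],
    timeout_s: float = 5.0,
) -> None:
    """Run one schema step and refuse to advance until it is verifiably done."""
    run_step()
    if not wait_until_complete(is_complete, timeout_s=timeout_s):
        raise RuntimeError("schema step did not complete; not advancing")
```

With a gate like this, a Nexus that receives a quick "success!" would still have to observe the step as complete before moving to the next one, and a step that never completes raises an error instead of silently advancing.
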