Feature Request: Online DDL cut-over backoff + forced completion #14530

@shlomi-noach

Feature Description

An Online DDL ALTER TABLE completes by cutting over from the original table to the shadow table. This final step involves holding table locks, and has a timeout.

On very busy tables, the operation will time out. The Online DDL scheduler then reattempts after 1 minute. Under sustained load this can mean repeated attempts over hours at 1-minute intervals. This is both wasteful and harmful. It's harmful because 15sec of every minute is spent attempting to acquire locks, which interferes with traffic even more.

We want to offer two opposed changes at the same time:

  1. A backoff mechanism: first retry in 1min, then in, say, 5min, then 10min, 30min, 1hr, and then keep retrying at 1hr intervals (precise values subject to change).
  2. A way to require a brute-force cut-over. This involves:
     • A pre-determined brute-force cut-over duration: counting from the moment of the first cut-over attempt, after the given duration the Online DDL attempts a brute-force cut-over (described in the last bullet).
     • And/or a SQL command such as ALTER VITESS_MIGRATION ... DO THE THING AND BRUTE FORCE CUT OVER NOW PLEASE
     • Brute-force cut-over is implemented by identifying any queries + transactions holding locks on the migrated table. When in brute-force mode, the cut-over mechanism attempts to kill the related queries/connections.

Industry solutions typically attempt to kill any non-replication long-running queries. We want to be smart and only affect relevant queries. We also want to identify transactions that hold locks on the table but are not running any specific query on the table at the moment, and possibly not running any query at all.
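
As a rough illustration of "only affect relevant queries", the following Go sketch picks kill targets from a snapshot of connections. Everything here is hypothetical: the `connInfo` type, the `planKills` function, and the policy itself are assumptions for discussion, not an existing implementation. It also assumes the snapshot has already been collected; on MySQL 8.0 the lock ownership could plausibly be derived from `performance_schema.metadata_locks` and `performance_schema.data_locks`, but that query is left out.

```go
package main

import "fmt"

// connInfo is a hypothetical snapshot of one MySQL connection.
type connInfo struct {
	ID            int64    // processlist ID, as accepted by KILL
	Query         string   // statement currently running; empty if idle
	InTransaction bool     // true if a multi-statement transaction is open
	LockedTables  []string // tables on which this connection holds locks
}

// holdsLockOn reports whether the connection holds a lock on the table.
func holdsLockOn(c connInfo, table string) bool {
	for _, t := range c.LockedTables {
		if t == table {
			return true
		}
	}
	return false
}

// planKills returns the KILL statements to issue for a brute-force cut-over
// of the given table. Connections that hold no lock on the table are left
// alone, as is the migration's own connection (selfID).
//
// Assumed policy: a standalone running query is stopped with KILL QUERY;
// anything else holding a lock (an open transaction, idle or not) needs
// its connection killed, since stopping one statement would not release
// locks held by the surrounding transaction.
func planKills(conns []connInfo, table string, selfID int64) []string {
	var stmts []string
	for _, c := range conns {
		if c.ID == selfID || !holdsLockOn(c, table) {
			continue
		}
		if c.Query != "" && !c.InTransaction {
			stmts = append(stmts, fmt.Sprintf("KILL QUERY %d", c.ID))
			continue
		}
		stmts = append(stmts, fmt.Sprintf("KILL %d", c.ID))
	}
	return stmts
}

func main() {
	conns := []connInfo{
		{ID: 7, Query: "RENAME TABLE ...", LockedTables: []string{"customer"}},                  // the migration itself
		{ID: 11, Query: "SELECT COUNT(*) FROM customer", LockedTables: []string{"customer"}},    // long-running query
		{ID: 12, InTransaction: true, LockedTables: []string{"customer"}},                       // idle transaction holding locks
		{ID: 13, Query: "SELECT SLEEP(600)"},                                                    // long-running but unrelated
	}
	for _, stmt := range planKills(conns, "customer", 7) {
		fmt.Println(stmt)
	}
}
```

Note that connection 13 is untouched even though it is long-running, which is the difference from the "kill anything long-running" approach, while connection 12 is targeted even though it runs no query at the moment.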

Use Case(s)

Online DDL on busy systems
