This issue will be the tracking space for all things vitess Online DDL. Note that this issue is created after some substantial work is done:
#6547 served as a long running tracking point; pasting some of #6547 content here for background, purpose and intentions.
TL;DR
Automate away all the complexity of schema migrations. Users issue:
alter with 'gh-ost' table example modify id bigint not null;
alter with 'pt-osc' table example modify id bigint not null
or
$ vtctl -topo_implementation etcd2 -topo_global_server_address localhost:2379 -topo_global_root /vitess/global \
ApplySchema -sql "alter with 'gh-ost' table example modify id bigint unsigned not null" commerce
$ vtctl -topo_implementation etcd2 -topo_global_server_address localhost:2379 -topo_global_root /vitess/global \
ApplySchema -sql "alter with 'pt-osc' table example modify id bigint unsigned not null" commerce
(syntax subject to change, see #6782 )
and vitess will schedule an online schema change operation to run on all relevant shards, then proceed to apply the change via gh-ost on all shards.
The ALTER TABLE problem
First, to iterate the problem: schema changes have always been a problem with MySQL; a straight ALTER is a blocking operation; a ONLINE ALTER is only "online" on the master/primary, but is effectively blocking on replicas. Online schema change tools like pt-online-schema-change and gh-ost overcome these limitations by emulating an ALTER on a "ghost" table, which is populated from the original table, then swapped in its space.
Traditionally, online schema changes are considered to be "risky". Trigger based migrations add significant load onto the master server, and their cut-over phase is known to be a dangerous point. gh-ost was created at GitHub to address these concerns, and successfully eliminated concerns for operational risks: with gh-ost the load on the master is low, and well controlled, and the cut-over phase is known to cause no locking issues. gh-ost comes with different risks: it applies data changes programmatically, thus the issue of data integrity is of utmost importance. Another note of concern is data traffic: going out from MySQL into gh-ost and back into MySQL (as opposed to all-in MySQL in pt-online-schema-change).
This way or the other, running an online schema change is typically a manual operation. A human being will schedule the migration, kick it running, monitor it, possibly cut-over. In a sharded environment, a developer's request to ALTER TABLE explodes to n different migrations, each needs to be scheduled, kicked, monitored & tracked.
Sharded environments are obviously common for vitess users and so these users feel the pain more than others.
Schema migration cycle & steps
Schema management is a process that begins with the user designing a schema change, and ends with the schema being applied in production. This is a breakdown of schema management steps as I know them:
- Design code
- Publish changes (pull request)
- Review
- Formalize migration command (the specific
ALTER TABLE or pt-online-schema-change or gh-ost command)
- Locate: where in production should this migration run?
- Schedule
- Execute
- Audit/monitor
- Cut-over/complete
- Cleanup
- Notify user
- Deploy & merge
What we propose to address
Vitess's architecture uniquely positions it to be able to automate away much of the process. Specifically:
- Formalize migration command: turning an
ALTER TABLE statement into a gh-ost/pt-osc invocation is super useful if done by vitess, since vitess can not only validate schema/params, but also can provide credentials, apply throttling logic, can instruct gh-ost on how to communicate progress via hooks, etc.
- Locate: given schema/table,
vitess just knows where the table is located. It knows if the schema is sharded. It knows who the shards are, who the shards masters are. It knows where to run gh-ost. Last, vitess can tell us which replicas we can use for throttling.
- Schedule: vitess is again in a unique position to schedule migrations. The fact someone asks for a migration to run does not mean the migration should start right away. For example, a shard may already be running an earlier migration. Running two migrations at a time is less than ideal, and it's best to wait out the first migration before beginning the second. A scheduling mechanism is both useful to running the migrations in optimal order/sequence, as well as providing feedback to the user ("your migration is on hold because this and that", or "your migration is 2nd in queue to run")
- Execute:
vttablet is the ideal entity to run a migration; can read instructions from topo server and can write progress to topo server. vitess is aware of possible master failovers and can request a re-execute is a migration is so interrupted mid process.
- Audit/monitor:
vtctld API can offer endpoints to track status of a migration (e.g. "in progress on -80, in queue on 80-"). It may offer progress pct and ETA.
- cut-over/complete: in my experience with
gh-ost, the cut-over phase is safe to automate away. If running a migration during a resharding operation, then we may need to coordinate cut-over between upstream and downstream migrations.
- cleanup: the old table needs to be dropped;
vttablet runs a table lifecycle service (aka garbage collector) to clean up those tables.
This issue will be the tracking space for all things vitess Online DDL. Note that this issue is created after some substantial work is done:
release-8.0#6547 served as a long running tracking point; pasting some of #6547 content here for background, purpose and intentions.
TL;DR
Automate away all the complexity of schema migrations. Users issue:
or
$ vtctl -topo_implementation etcd2 -topo_global_server_address localhost:2379 -topo_global_root /vitess/global \ ApplySchema -sql "alter with 'gh-ost' table example modify id bigint unsigned not null" commerce $ vtctl -topo_implementation etcd2 -topo_global_server_address localhost:2379 -topo_global_root /vitess/global \ ApplySchema -sql "alter with 'pt-osc' table example modify id bigint unsigned not null" commerce(syntax subject to change, see #6782 )
and vitess will schedule an online schema change operation to run on all relevant shards, then proceed to apply the change via
gh-oston all shards.The ALTER TABLE problem
First, to iterate the problem: schema changes have always been a problem with MySQL; a straight
ALTERis a blocking operation; aONLINE ALTERis only "online" on the master/primary, but is effectively blocking on replicas. Online schema change tools likept-online-schema-changeandgh-ostovercome these limitations by emulating anALTERon a "ghost" table, which is populated from the original table, then swapped in its space.Traditionally, online schema changes are considered to be "risky". Trigger based migrations add significant load onto the master server, and their cut-over phase is known to be a dangerous point.
gh-ostwas created at GitHub to address these concerns, and successfully eliminated concerns for operational risks: withgh-ostthe load on the master is low, and well controlled, and the cut-over phase is known to cause no locking issues.gh-ostcomes with different risks: it applies data changes programmatically, thus the issue of data integrity is of utmost importance. Another note of concern is data traffic: going out from MySQL intogh-ostand back into MySQL (as opposed to all-in MySQL inpt-online-schema-change).This way or the other, running an online schema change is typically a manual operation. A human being will schedule the migration, kick it running, monitor it, possibly cut-over. In a sharded environment, a developer's request to
ALTER TABLEexplodes tondifferent migrations, each needs to be scheduled, kicked, monitored & tracked.Sharded environments are obviously common for
vitessusers and so these users feel the pain more than others.Schema migration cycle & steps
Schema management is a process that begins with the user designing a schema change, and ends with the schema being applied in production. This is a breakdown of schema management steps as I know them:
ALTER TABLEorpt-online-schema-changeorgh-ostcommand)What we propose to address
Vitess's architecture uniquely positions it to be able to automate away much of the process. Specifically:
ALTER TABLEstatement into agh-ost/pt-oscinvocation is super useful if done by vitess, since vitess can not only validate schema/params, but also can provide credentials, apply throttling logic, can instructgh-oston how to communicate progress via hooks, etc.vitessjust knows where the table is located. It knows if the schema is sharded. It knows who the shards are, who the shards masters are. It knows where to rungh-ost. Last,vitesscan tell us which replicas we can use for throttling.vttabletis the ideal entity to run a migration; can read instructions fromtoposerver and can write progress totoposerver.vitessis aware of possible master failovers and can request a re-execute is a migration is so interrupted mid process.vtctldAPI can offer endpoints to track status of a migration (e.g. "in progress on-80, in queue on80-"). It may offer progress pct and ETA.gh-ost, the cut-over phase is safe to automate away. If running a migration during a resharding operation, then we may need to coordinate cut-over between upstream and downstream migrations.vttabletruns a table lifecycle service (aka garbage collector) to clean up those tables.