Online DDL: tracking issue #6926

@shlomi-noach

This issue will be the tracking space for all things vitess Online DDL. Note that this issue was created after some substantial work had already been done:

#6547 served as a long-running tracking point; pasting some of #6547's content here for background, purpose and intent.

TL;DR

Automate away all the complexity of schema migrations. Users issue:

alter with 'gh-ost' table example modify id bigint not null;

alter with 'pt-osc' table example modify id bigint not null;

or

$ vtctl -topo_implementation etcd2 -topo_global_server_address localhost:2379 -topo_global_root /vitess/global \
    ApplySchema -sql "alter with 'gh-ost' table example modify id bigint unsigned not null" commerce

$ vtctl -topo_implementation etcd2 -topo_global_server_address localhost:2379 -topo_global_root /vitess/global \
    ApplySchema -sql "alter with 'pt-osc' table example modify id bigint unsigned not null" commerce

(syntax subject to change, see #6782)

and vitess will schedule an online schema change operation on all relevant shards, then proceed to apply the change via gh-ost (or pt-osc) on each shard.

The ALTER TABLE problem

First, to restate the problem: schema changes have always been a pain point with MySQL. A straight ALTER is a blocking operation; an ONLINE ALTER is only "online" on the master/primary, but is effectively blocking on replicas. Online schema change tools like pt-online-schema-change and gh-ost overcome these limitations by emulating the ALTER on a "ghost" table, which is populated from the original table and then swapped into its place.
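
For background, here is a minimal sketch of that ghost-table technique. The table names follow gh-ost's conventions; the chunk range is illustrative, and the real tools copy rows in controlled chunks while propagating ongoing changes (via triggers in pt-online-schema-change, via the binary log in gh-ost) and guard the cut-over with locking:

-- create an empty "ghost" table with the original structure, then alter it
CREATE TABLE _example_gho LIKE example;
ALTER TABLE _example_gho MODIFY id BIGINT NOT NULL;

-- copy existing rows in small chunks (a single chunk shown here)
INSERT IGNORE INTO _example_gho SELECT * FROM example WHERE id BETWEEN 1 AND 1000;

-- cut-over: atomically swap the tables; the old table is dropped later
RENAME TABLE example TO _example_del, _example_gho TO example;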

Traditionally, online schema changes are considered "risky". Trigger-based migrations add significant load onto the master server, and their cut-over phase is known to be a dangerous point. gh-ost was created at GitHub to address these concerns, and successfully eliminated those operational risks: with gh-ost the load on the master is low and well controlled, and the cut-over phase causes no locking issues. gh-ost comes with different risks: it applies data changes programmatically, so data integrity is of utmost importance. Another concern is data traffic: rows flow out of MySQL into gh-ost and back into MySQL (as opposed to staying entirely within MySQL with pt-online-schema-change).
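
To illustrate where the trigger load comes from, here is a simplified sketch of the kind of trigger pt-online-schema-change installs on the original table (names and columns are illustrative; the real tool generates AFTER INSERT, AFTER UPDATE and AFTER DELETE triggers). Every write to the table now fires a second, synchronous write on the master:

-- simplified pt-osc-style trigger: mirror each new row into the ghost table
CREATE TRIGGER pt_osc_example_ins AFTER INSERT ON example
FOR EACH ROW
  REPLACE INTO _example_gho (id, payload) VALUES (NEW.id, NEW.payload);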

One way or the other, running an online schema change is typically a manual operation. A human being schedules the migration, kicks it off, monitors it, and possibly initiates the cut-over. In a sharded environment, a developer's request to ALTER TABLE explodes into n different migrations, each of which needs to be scheduled, started, monitored and tracked.

Sharded environments are obviously common among vitess users, so these users feel this pain more than most.

Schema migration cycle & steps

Schema management is a process that begins with the user designing a schema change, and ends with the schema being applied in production. This is a breakdown of schema management steps as I know them:

  1. Design code
  2. Publish changes (pull request)
  3. Review
  4. Formalize migration command (the specific ALTER TABLE or pt-online-schema-change or gh-ost command)
  5. Locate: where in production should this migration run?
  6. Schedule
  7. Execute
  8. Audit/monitor
  9. Cut-over/complete
  10. Cleanup
  11. Notify user
  12. Deploy & merge

What we propose to address

Vitess's architecture uniquely positions it to automate away much of this process. Specifically:

  • Formalize migration command: turning an ALTER TABLE statement into a gh-ost/pt-osc invocation is super useful if done by vitess, since vitess can not only validate the schema and parameters, but also provide credentials, apply throttling logic, instruct gh-ost on how to communicate progress via hooks, etc. (see the sketch after this list).
  • Locate: given a schema/table, vitess just knows where the table is located. It knows whether the schema is sharded, what the shards are, and who the shard masters are. It knows where to run gh-ost. Lastly, vitess can tell us which replicas we can use for throttling.
  • Schedule: vitess is again in a unique position to schedule migrations. The fact that someone asks for a migration to run does not mean the migration should start right away. For example, a shard may already be running an earlier migration. Running two migrations at a time is less than ideal, and it's best to wait out the first migration before beginning the second. A scheduling mechanism is useful both for running migrations in an optimal order/sequence and for providing feedback to the user ("your migration is on hold because this and that", or "your migration is 2nd in queue to run").
  • Execute: vttablet is the ideal entity to run a migration; it can read instructions from the topo server and write progress back to it. vitess is aware of possible master failovers and can request a re-execution if a migration is interrupted mid-process.
  • Audit/monitor: the vtctld API can offer endpoints to track the status of a migration (e.g. "in progress on -80, in queue on 80-"). It may offer a progress percentage and an ETA.
  • Cut-over/complete: in my experience with gh-ost, the cut-over phase is safe to automate away. If a migration runs during a resharding operation, we may need to coordinate the cut-over between the upstream and downstream migrations.
  • Cleanup: the old table needs to be dropped; vttablet runs a table lifecycle service (aka garbage collector) to clean up those tables.
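
To make the "Formalize" and "Locate" points above concrete, the per-shard invocation vitess generates might look something like the following gh-ost command. This is only a sketch: the flags shown are real gh-ost flags, but the hostname, credentials, values and hooks path are placeholders, and which flags vitess will actually choose is undecided:

$ gh-ost \
    --host=tablet-101.example.com --port=3306 \
    --user=ghost_user --password="${GHOST_PASSWORD}" \
    --database=commerce --table=example \
    --alter="modify id bigint unsigned not null" \
    --allow-on-master \
    --max-load=Threads_running=100 \
    --throttle-control-replicas="replica-1:3306,replica-2:3306" \
    --hooks-path=/vt/ghost-hooks \
    --execute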
