Describing safe, blocking, pure-mysql cut-over phase

### UPDATE
#82 overrides this. This is no longer in use.

---

_Finally_ here's a blocking cut-over phase that will, at worst case (connections die throughout cut-over), create a table-outage (easily reversed).
#### Here are the steps to a safe solution:

We note different connections as `C1`, `C2`, ... `Cn`
We assume original table is `tbl`, ghost table is `ghost`.

In the below we note `C1`, `C18`, `C19` as out own, controlling connections. We first assume no error in the below flow:
- C1: `lock tables tbl write`
- C2, C3, ..., C17: normal app connections, issuing `insert`, `delete`, `update` on `tbl`. Because of the lock, they are naturally blocked.
- We apply those last event we need to apply onto `ghost`. No new events are coming our way because `tbl` is blocked.
- C18: checking that C1 is still alive, then `rename table tbl to tbl_old`. This gets blocked.
- C19: checking to see that C18's `rename` is in place (via `show processlist`), **and** that C1 is still alive; then issues: `rename table ghost to tbl`. This gets blocked.
- (meanwhile more queries approach `tbl`, it doesn't matter, they all get deprioritized, same as C3...C17)
- C1: `unlock tables`
#### What just happened? Let's first explain some stuff:
- C18's `rename` gets prioritized over the DMLs, even though it came later. That is how MySQL prioritizes queries on metadata-locked tables.
- C18 checks C1 is still alive, but as before, there's always the chance C1 will die just at the wrong time -- we're going to address that.
- C19 is interested to see that C18 began execution, but potentially C18 will crash by the time C19 actually issues its own rename -- we're going to address that
- C19's query sounds weird. At that time `tbl` still exists. You'd expect it to fail immediately -- but it does not. It's valid. This is because `tbl`'s metadata lock is in use.
- C19 gets prioritized over all the DMLs, but is known to be behind C18. The two stay in same order of arrival. So, C18 is known to execute before C19.
- When C1 unlocks, C18 executes first.
- Metadata lock is still in place on `tbl` even though it doesn't actually exist, because of C19.
- C19 operates next.
- Finally all the DMLs execute.
#### What happens on failures?
- If C1 dies just as C18 is about to issue the `rename`, we get an outage: `tbl` is renamed to `tbl_old`, and the queries get released and complain the table is just not there.
  - In such case C19 will not initiate because it is executed _after_ C18 **and** checks that C1 is alive -- which turns to be untrue. So no C19.
  - So we know we have outage, and we quickly `rename tbl_old to tbl;` and go drink coffee, then give the entire process another try.
  - The outage is unfortunate, but does not put our data in danger.
- If C1 happens to die just as C19 is about to issue its `rename`, there's no data integrity issue: at this point we've already asserted the tables are in sync. As C1 dies, C18 will immediately `rename tbl to tbl_old`. An outage will occur, but not for long, because C19 will next issue `rename ghost to tbl`, and close the gap. We suffered a minor outage, but no rollback. We roll forward.
- If C18 happens to die just as C19 is about to issue its `rename`, nothing bad happens: C19 is still blocking for as long as C1 is running. We find out C18 died, and release C1. C19 attempts to `rename ghost to tbl`, but `tbl` exists (we assume C18 failed) and the query fails. The metadata lock is released and all the queries resume operation on the original `tbl`. The queries suffered a short block, but resume operation automatically. The operation failed but without error. We will need to try the entire cycle again.
- If both C1 and C18 fail at the time C19 is about to begin its `rename`, same as above.
- If C18 fails as C19 is already in place, same as above.
- If C1 fails as C19 is already in place, it's as good as having it issue the unlock tables. We're happy.
- If C19 fails at any given point, we suffer outage. We revert by `rename tbl_old to tbl`.

I'm grateful to reviews over this logic.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Describing safe, blocking, pure-mysql cut-over phase #65

UPDATE

Here are the steps to a safe solution:

What just happened? Let's first explain some stuff:

What happens on failures?

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Describing safe, blocking, pure-mysql cut-over phase #65

Description

UPDATE

Here are the steps to a safe solution:

What just happened? Let's first explain some stuff:

What happens on failures?

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions