UPDATE
#82 overrides this. This is no longer in use.
Finally, here's a blocking cut-over phase that will, in the worst case (connections dying throughout cut-over), create a table outage (easily reversed).
Here are the steps to a safe solution:
We note different connections as C1, C2, ... Cn.
We assume the original table is `tbl`, the ghost table is `ghost`.
In the below we note C1, C18, C19 as our own, controlling connections. We first assume no errors in the flow below:
- C1: `lock tables tbl write`
- C2, C3, ..., C17: normal app connections, issuing `insert`, `delete`, `update` on `tbl`. Because of the lock, they are naturally blocked.
- We apply those last events we need to apply onto `ghost`. No new events are coming our way because `tbl` is blocked.
- C18: checking that C1 is still alive, then `rename table tbl to tbl_old`. This gets blocked.
- C19: checking to see that C18's `rename` is in place (via `show processlist`), and that C1 is still alive; then issues: `rename table ghost to tbl`. This gets blocked.
- (meanwhile more queries approach `tbl`; it doesn't matter, they all get deprioritized, same as C3...C17)
- C1: `unlock tables`
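The net effect of the steps above can be sketched as a toy model in Python (my own illustration, not MySQL internals): once C1 unlocks, statements waiting on `tbl`'s metadata lock are granted with pending `rename`s first, each group keeping its order of arrival.

```python
# Toy model of the metadata-lock queue during cut-over (illustration only,
# not MySQL internals): pending RENAMEs are granted before pending DMLs,
# and arrival order is preserved within each group.

def release_order(waiting):
    """waiting: list of (connection, statement) tuples in arrival order.
    Returns the order in which statements run after `unlock tables`."""
    renames = [w for w in waiting if w[1].startswith("rename")]
    dmls = [w for w in waiting if not w[1].startswith("rename")]
    return renames + dmls

waiting = [
    ("C2",  "update tbl set ..."),
    ("C18", "rename table tbl to tbl_old"),
    ("C3",  "insert into tbl ..."),
    ("C19", "rename table ghost to tbl"),
    ("C4",  "delete from tbl ..."),
]

order = [conn for conn, _ in release_order(waiting)]
print(order)  # ['C18', 'C19', 'C2', 'C3', 'C4']
```

Even though C2 arrived before C18, its DML runs last; C18's `rename` runs first, C19's second, exactly the ordering the cut-over relies on.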
What just happened? Let's first explain some stuff:
- C18's `rename` gets prioritized over the DMLs, even though it came later. That is how MySQL prioritizes queries on metadata-locked tables.
- C18 checks C1 is still alive, but as before, there's always the chance C1 will die at just the wrong time -- we're going to address that.
- C19 is interested to see that C18 began execution, but potentially C18 will crash by the time C19 actually issues its own `rename` -- we're going to address that.
- C19's query sounds weird. At that time `tbl` still exists. You'd expect it to fail immediately -- but it does not. It's valid. This is because `tbl`'s metadata lock is in use.
- C19 gets prioritized over all the DMLs, but is known to be behind C18. The two stay in the same order of arrival. So, C18 is known to execute before C19.
- When C1 unlocks, C18 executes first.
- The metadata lock is still in place on `tbl` even though it doesn't actually exist, because of C19.
- C19 operates next.
- Finally all the DMLs execute.
What happens on failures?
- If C1 dies just as C18 is about to issue the `rename`, we get an outage: `tbl` is renamed to `tbl_old`, and the queries get released and complain the table is just not there.
  - In such a case C19 will not initiate, because it executes after C18 and checks that C1 is alive -- which turns out to be untrue. So no C19.
  - So we know we have an outage, and we quickly `rename tbl_old to tbl;` and go drink coffee, then give the entire process another try.
  - The outage is unfortunate, but does not put our data in danger.
- If C1 happens to die just as C19 is about to issue its `rename`, there's no data integrity issue: at this point we've already asserted the tables are in sync. As C1 dies, C18 will immediately `rename tbl to tbl_old`. An outage will occur, but not for long, because C19 will next issue `rename ghost to tbl`, and close the gap. We suffered a minor outage, but no rollback. We roll forward.
- If C18 happens to die just as C19 is about to issue its `rename`, nothing bad happens: C19 stays blocked for as long as C1 is running. We find out C18 died, and release C1. C19 attempts to `rename ghost to tbl`, but `tbl` exists (we assume C18 failed) and the query fails. The metadata lock is released and all the queries resume operation on the original `tbl`. The queries suffered a short block, but resume operation automatically. The operation failed but without error. We will need to try the entire cycle again.
- If both C1 and C18 fail at the time C19 is about to begin its `rename`, same as above.
- If C18 fails when C19 is already in place, same as above.
- If C1 fails when C19 is already in place, it's as good as having it issue the `unlock tables`. We're happy.
- If C19 fails at any given point, we suffer an outage. We revert by `rename tbl_old to tbl`.
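The failure cases above reduce to a small decision table. Here's a sketch in Python (my own summary of the cases; the function name and return strings are invented, this is not gh-ost code) mapping which controlling connections died to the recovery action:

```python
# Sketch of the failure analysis above (names and strings are my own).
# `c19_issued` means C19 already got its `rename ghost to tbl` queued
# behind C18's rename before the failure happened.

def recovery_action(failed, c19_issued=False):
    """failed: set of dead controlling connections among {'C1','C18','C19'}.
    Returns the recovery action per the cases above."""
    if "C19" in failed:
        # No one is left to close the gap: revert and retry the whole cycle.
        return "outage: rename tbl_old to tbl, retry"
    if "C18" in failed:
        # C19's rename fails because tbl still exists; DMLs resume on tbl.
        return "no outage: C19 rename fails, DMLs resume on tbl, retry"
    if "C1" in failed:
        if c19_issued:
            # C18 then C19 execute in order: brief outage, roll forward.
            return "minor outage: roll forward"
        # C19 never initiates (its liveness check on C1 fails): revert.
        return "outage: rename tbl_old to tbl, retry"
    return "success: cut-over complete"
```

Note the check order matters: when both C1 and C18 die, the C18 branch wins, matching the "same as above" cases -- the operation fails cleanly and DMLs resume on `tbl`.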
I'm grateful for reviews of this logic.