I could use some help with a downtime we encountered using gh-ost today.
At first we had no problems with the replica-only tests or with migrating minor tables, but we just tried it on what we thought could be a meaty table. At 500k rows, it is much closer to one of our smaller tables than to our largest. It's not actively or heavily written to, but it is very read-heavy, and I think that's what got us into trouble.
gh-ost was able to clone the table and waited for me to drop the postpone flag. Once I did, MySQL on the writer server became unresponsive. Even when I logged into the server and tried to connect directly, my MySQL connection would hang. This is a very beefy machine, and the load average didn't budge, so it wasn't a matter of the hardware stalling; it was definitely the way MySQL was processing.
We are using Percona MySQL 5.7 with thread_handling=pool-of-threads enabled (a result of our upgrade from 5.6). Going by the description in #82, I think we starved the thread pool once gh-ost tried to lock this hot table for the rename.
I don't think the statements issued before gh-ost applied for the lock stalled us, but perhaps all of the queries immediately after gh-ost applied for the lock did. From the outside, they began piling up, waiting for gh-ost to execute the atomic rename. We could see that happening as a pile-up of our unicorn processes. gh-ost didn't seem to time out applying for the lock ("INFO Tables locked"), but it never renamed the table. It seemed to restart the sync process once more (it restarted the "# Migrating ..." block), and that's when I killed it manually.
But I wonder if the prioritization of the actual rename got lost within MySQL and most queries were waiting for that to finish so gh-ost would give up the lock. That is my initial diagnosis anyway.
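To make the starvation I suspect concrete, here is a toy Python sketch. It is not MySQL or gh-ost code, and every name in it (`simulate`, `query`, `rename`, `POOL_SIZE`, `WAIT_TIMEOUT`) is made up for illustration. It only assumes a fixed-size worker pool in which blocked queries hold every worker while the one task that could release the lock sits in the queue behind them.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 2        # hypothetical pool size, far smaller than a real server's
WAIT_TIMEOUT = 0.5   # seconds a blocked "query" waits before giving up


def simulate():
    """Return the order of events when blocked queries fill a fixed pool."""
    table_unlocked = threading.Event()  # set once the "rename" has run
    events = []
    events_lock = threading.Lock()

    def log(message):
        with events_lock:
            events.append(message)

    def query(n):
        # Holds a pool worker while waiting for the table lock to be released.
        if table_unlocked.wait(timeout=WAIT_TIMEOUT):
            log(f"query {n} ran")
        else:
            log(f"query {n} timed out")

    def rename():
        # The only task that releases the lock, but it needs a free worker.
        log("rename ran")
        table_unlocked.set()

    with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
        for n in range(POOL_SIZE):
            pool.submit(query, n)   # these occupy every worker
        pool.submit(rename)         # queued behind the blocked queries
    return events


if __name__ == "__main__":
    for event in simulate():
        print(event)
```

In this sketch the rename cannot start until at least one blocked query gives up and frees a worker, so the first recorded event is always a timeout. Whether Percona's thread pool behaved this way in our incident is still only my guess.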
I need to do some digging to try and figure out if this is perhaps a configuration tuning issue. Perhaps we need to reconsider moving back from pool-of-threads to one-thread-per-connection. But I was hoping I could share that experience and see if you folks had some insight to help point me in the right direction. Maybe someone else has already run into this exact issue and could share their solution.
The thread pool size is certainly one setting to re-evaluate, but I think we may also need to look into the thread pool high-priority options described at https://www.percona.com/doc/percona-server/LATEST/performance/threadpool.html
One direct question I wanted to ask: if we decided to try the two step rename with --cut-over=two-step does it require the same locking that is currently happening in the atomic rename? I know we could error queries between those two steps executing, but I wonder if that is another option worth exploring, or if I should be focusing all of my energy on MySQL's configuration.