reset-master operation: wait for replication to stop#762
shlomi-noach merged 10 commits into master
go/inst/instance_dao.go
Outdated
ioThreadRunning := (m.GetString("Slave_IO_Running") == "Yes")
sqlThreadRunning := (m.GetString("Slave_SQL_Running") == "Yes")
replicationThreadsRunning = ioThreadRunning && sqlThreadRunning
ioThreadRunning = (m.GetString("Slave_IO_Running") == "Yes")
One issue with this: it treats Slave_*_Running = Yes as the boolean check for deciding whether replication is fully running or fully stopped.
I'm not sure what other values Slave_SQL_Running can take, but Slave_IO_Running can also be "Connecting". So that's at least one case where this check would say that replication is not running even though the IO thread is running (or is starting, or is trying to connect).
That's a good point.
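To illustrate the point above, here is a minimal sketch (not the code in this PR) of treating the thread status as more than a yes/no value. The names `ThreadState`, `ClassifyThreadState` and `FullyStopped` are hypothetical and do not exist in orchestrator; the only assumption about MySQL is that `Slave_IO_Running` can report `Yes`, `No` or `Connecting`, as noted in the review comment.

```go
package main

// ThreadState is a hypothetical classification of a replication thread,
// introduced here for illustration only.
type ThreadState int

const (
	ThreadStopped    ThreadState = iota // status column reports "No"
	ThreadConnecting                    // status column reports "Connecting"
	ThreadRunning                       // status column reports "Yes"
	ThreadUnknown                       // any other value; treated as not stopped
)

// ClassifyThreadState maps a Slave_IO_Running / Slave_SQL_Running value
// to a ThreadState instead of collapsing it into a boolean.
func ClassifyThreadState(value string) ThreadState {
	switch value {
	case "No":
		return ThreadStopped
	case "Connecting":
		return ThreadConnecting
	case "Yes":
		return ThreadRunning
	default:
		return ThreadUnknown
	}
}

// FullyStopped reports true only when both threads are explicitly stopped.
// A "Connecting" or unrecognized value counts as not stopped.
func FullyStopped(ioValue, sqlValue string) bool {
	return ClassifyThreadState(ioValue) == ThreadStopped &&
		ClassifyThreadState(sqlValue) == ThreadStopped
}
```

With this shape, `FullyStopped("Connecting", "No")` is false, whereas a check based only on `== "Yes"` would report both threads as not running.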
Co-Authored-By: shlomi-noach <shlomi-noach@github.com>
I will merge this PR even though we haven't yet verified the reasoning for the crash. Reiterating an internal issue, @ggunson suggests the retries can be the cause of the crash.
We have yet to reproduce this reliably, and will not be working on it actively in the short term.
As pointed out by @ggunson, `ErrantGTIDResetMaster()` issues a `reset master` immediately following a `stop slave` operation, but without verifying that replication has indeed stopped; e.g. the SQL thread could still be busy.

We've seen crashes in production when running `reset master`.

In this PR we actively wait (or time out) for replication to stop before running `reset master`.