This repository was archived by the owner on Feb 18, 2025. It is now read-only.

Support for FailMasterPromotionOnLagMinutes#1115

Merged
shlomi-noach merged 22 commits into master from fail-promotion-lag-seconds
May 2, 2020

Conversation

@shlomi-noach
Collaborator

Fixes #83

Scenario: an M -> R topology has been running with replica R broken for a few hours without anyone noticing. Then master M fails.

Current behavior: orchestrator promotes R. But by doing so we lose:

  • potentially hours' worth of relay logs obtained by R but not executed (e.g. because of some SQL error), or
  • the ability to recover hours' worth of binary logs from the master M.

The new config variable FailMasterPromotionOnLagMinutes tells orchestrator to fail a promotion if the candidate replica, at the time it is chosen, is found to be lagging by at least that many minutes (lag >= FailMasterPromotionOnLagMinutes).

cc @sougou @mcrauwel

@mcrauwel
Contributor

mcrauwel commented Apr 6, 2020

thanks for the mention @shlomi-noach! I will be testing it out!

shlomi-noach merged commit 4f6635e into master May 2, 2020
shlomi-noach deleted the fail-promotion-lag-seconds branch May 2, 2020 08:07

Development

Successfully merging this pull request may close these issues.

Slaves lagging by couple of hours are elected as master by orchestrator

2 participants