Failover recovery, graceful takeover don't work w/binlog #824
Description
We set up a test environment with 1 master and 2 slaves. (We were trying, obviously without success, to have a setup where Orchestrator would work "out of the box".)
Here's the Orchestrator setup file: orchestrator.conf.json.txt
We're using ProxySQL to send plain SELECTs (those without FOR UPDATE) to the slaves and everything else to the master.
Originally, failovers did not work at all. We added read_only=1 to the MySQL config files and added the pre-failover hook recommended by Percona, after which failover at least started to work. We didn't do anything else to tell Orchestrator about ProxySQL. (According to the Percona article, the post-failover hook they give is no longer needed.)
In the config:
"PreGracefulTakeoverProcesses": [
"/tmp/prefailover.sh"
],
and the /tmp/prefailover.sh script (here, 10.42.42.42 is the VIP of keepalived for the 2 ProxySQL instances):
#!/bin/bash
# Variable exposed by Orchestrator
OldMaster=$ORC_FAILED_HOST
PROXYSQL_HOST="10.42.42.42"

# stop accepting connections to old master
(
    echo 'UPDATE mysql_servers SET STATUS="OFFLINE_SOFT" WHERE hostname="'"$OldMaster"'";'
    echo "LOAD MYSQL SERVERS TO RUNTIME;"
) | mysql -vvv -uivan -p**** -h "${PROXYSQL_HOST}" -P6032

# wait while connections are still active and we are in the grace period
CONNUSED=$(mysql -uivan -p**** -h "${PROXYSQL_HOST}" -P6032 -e 'SELECT IFNULL(SUM(ConnUsed),0) FROM stats_mysql_connection_pool WHERE status="OFFLINE_SOFT" AND srv_host="'"$OldMaster"'"' -B -N 2> /dev/null)
TRIES=0
while [ "$CONNUSED" -ne 0 ] && [ "$TRIES" -ne 20 ]
do
    CONNUSED=$(mysql -uivan -p**** -h "${PROXYSQL_HOST}" -P6032 -e 'SELECT IFNULL(SUM(ConnUsed),0) FROM stats_mysql_connection_pool WHERE status="OFFLINE_SOFT" AND srv_host="'"$OldMaster"'"' -B -N 2> /dev/null)
    TRIES=$((TRIES+1))
    if [ "$CONNUSED" -ne 0 ]; then
        sleep 0.05
    fi
done
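For anyone reusing this hook: the polling loop gives up silently after 20 tries of 0.05 s (about one second), and it does not distinguish "drained" from "the mysql query failed and returned nothing". A minimal sketch of the same wait, factored into a function that reports the outcome through its exit status, might look like the following. The names `wait_for_drain` and the probe-command argument are hypothetical and are not part of Orchestrator or ProxySQL; the probe is assumed to print a single integer, as the `stats_mysql_connection_pool` query above does.

```shell
#!/bin/bash
# Sketch only: wait_for_drain is a hypothetical helper, not an Orchestrator
# or ProxySQL facility.
#
# wait_for_drain PROBE_CMD MAX_TRIES SLEEP_SECONDS
#   PROBE_CMD      name of a command/function that prints the number of
#                  connections still in use (one integer on stdout)
#   MAX_TRIES      how many times to poll before giving up
#   SLEEP_SECONDS  pause between polls
# Returns 0 once the probe prints 0; returns 1 if the retry budget runs out
# or the probe prints anything that is not a non-negative integer.
wait_for_drain() {
    local probe_cmd=$1 max_tries=$2 delay=$3
    local tries=0 used

    while [ "$tries" -lt "$max_tries" ]; do
        used=$("$probe_cmd")
        case $used in
            ''|*[!0-9]*) return 1 ;;   # empty or non-numeric output: fail
        esac
        if [ "$used" -eq 0 ]; then
            return 0                   # no connections left on the old master
        fi
        tries=$((tries + 1))
        sleep "$delay"
    done
    return 1                           # still busy after max_tries polls
}
```

With the mysql query wrapped in a function (say, a hypothetical `probe_conn_used`), the hook could call `wait_for_drain probe_conn_used 20 0.05` and log or act on a non-zero exit status instead of continuing unconditionally.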
Now, if we kill the master, Orchestrator will eventually (after 5 minutes) promote a slave and get everything working again. When the former master is brought back up, however, Orchestrator never brings it back into replication; it has to be made a slave manually.
When we try to do a graceful master takeover with a slave from the CLI, it refuses, saying: ERROR Relocating 1 replicas of stg1wpplatmysql04:3306 below stg1wpplatmysql03:3306 turns to be too complex; please do it manually.
When we try with the GUI (by dragging it "on top of" the master), it also refuses, saying: Desginated instance stg1wpplatgarbd02:3306 cannot take over all of its siblings. Error: 2019-03-01 12:13:19 ERROR Relocating 1 replicas of stg1wpplatmysql04:3306 below stg1wpplatgarbd02:3306 turns to be too complex; please do it manually. We also get the following in the log:
Mar 1 10:39:28 stg1wpplatdbmgr01 orchestrator: 2019-03-01 10:39:28 INFO moveReplicasViaGTID: Will move 1 replicas below stg1wpplatmysql03:3306 via GTID
However, we're not using GTID. When we query Orchestrator through the API, it reports:
# curl -s http://localhost:3000/api/problems | jq
[
  {
    "Key": {
      "Hostname": "stg1wpplatmysql04",
      "Port": 3306
    },
    "InstanceAlias": "",
    "Uptime": 78686,
    "ServerID": 2,
    "ServerUUID": "1dcecf18-3b05-11e9-8732-0050568411f5",
    "Version": "5.7.23-23-57-log",
    "VersionComment": "Percona XtraDB Cluster (GPL), Release rel23, Revision f5578f0, WSREP version 31.31, wsrep_31.31",
    "FlavorName": "Percona",
    "ReadOnly": false,
    "Binlog_format": "ROW",
    "BinlogRowImage": "FULL",
    "LogBinEnabled": true,
    "LogSlaveUpdatesEnabled": false,
    "SelfBinlogCoordinates": {
      "LogFile": "mysql-bin.000007",
      "LogPos": 1628009,
      "Type": 0
    },
    ...
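Since the full output is long, a small filter can help compare the replication-relevant flags across instances. The sketch below assumes `jq` is installed (it is already used in the command above) and reads only fields that appear in the output shown here; the function name `summarize_problems` and the `read_only=` / `log_bin=` / `log_slave_updates=` labels are my own, chosen to match the corresponding MySQL variable names.

```shell
#!/bin/bash
# Sketch only: summarize_problems is a hypothetical helper.
# Reads the JSON array returned by /api/problems on stdin and prints one
# line per instance with the fields relevant to binlog-based relocation.
# Only fields visible in the output quoted above are used.
summarize_problems() {
    jq -r '.[] |
        "\(.Key.Hostname):\(.Key.Port) read_only=\(.ReadOnly) log_bin=\(.LogBinEnabled) log_slave_updates=\(.LogSlaveUpdatesEnabled) binlog_format=\(.Binlog_format)"'
}
```

It would be used as `curl -s http://localhost:3000/api/problems | summarize_problems`, which prints one line per reported instance.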
I don't know whether this is a documentation issue, a bug, or a combination of the two. (I strongly suspect there's some configuration that would fix this if we only knew how to do it, which would make it a doc issue, I suppose.)