Failover recovery, graceful takeover don't work w/binlog

We set up a test environment with 1 master and 2 slaves.  (We were trying, obviously without success, to have a setup where Orchestrator would work "out of the box".)

Here's the Orchestrator setup file: [orchestrator.conf.json.txt](https://github.com/github/orchestrator/files/2920453/orchestrator.conf.json.txt)

We're using ProxySQL to send plain SELECTs (those without `FOR UPDATE`) to the slaves and everything else to the master.

Originally, failovers did not work at all.  We added `read_only=1` to the MySQL config files and added the [pre-failover hook](https://blog.pythian.com/graceful-master-switchover-proxysql-orchestrator/) recommended by Percona, and failover at least started to work.  We didn't do anything else to tell Orchestrator about ProxySQL.  (According to the Percona article, the post-failover hook they give is no longer needed.)

In the config:
```  
"PreGracefulTakeoverProcesses": [
  "/tmp/prefailover.sh"
],
```
and the /tmp/prefailover.sh script (here, 10.42.42.42 is the VIP of keepalived for the 2 ProxySQL instances):
```
#!/bin/bash
 
# Variable exposed by Orchestrator
OldMaster=$ORC_FAILED_HOST
PROXYSQL_HOST="10.42.42.42"
 
# stop accepting connections to old master
(
echo 'UPDATE mysql_servers SET STATUS="OFFLINE_SOFT" WHERE hostname="'"$OldMaster"'";'
echo "LOAD MYSQL SERVERS TO RUNTIME;"
) | mysql -vvv -uivan -p**** -h ${PROXYSQL_HOST} -P6032
 
# wait while connections are still active and we are in the grace period
CONNUSED=$(mysql -uivan -p**** -h "${PROXYSQL_HOST}" -P6032 -e 'SELECT IFNULL(SUM(ConnUsed),0) FROM stats_mysql_connection_pool WHERE status="OFFLINE_SOFT" AND srv_host="'"$OldMaster"'"' -B -N 2> /dev/null)
TRIES=0
# an empty result (query failed) is treated as 0 so the loop ends instead of erroring
while [ "${CONNUSED:-0}" -ne 0 ] && [ "$TRIES" -ne 20 ]
do
  CONNUSED=$(mysql -uivan -p**** -h "${PROXYSQL_HOST}" -P6032 -e 'SELECT IFNULL(SUM(ConnUsed),0) FROM stats_mysql_connection_pool WHERE status="OFFLINE_SOFT" AND srv_host="'"$OldMaster"'"' -B -N 2> /dev/null)
  TRIES=$((TRIES+1))
  if [ "${CONNUSED:-0}" -ne 0 ]; then
    sleep 0.05
  fi
done
```
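For anyone adapting the hook, the polling loop could be pulled out into a small reusable function. This is only a sketch: the name `wait_for_drain` and its arguments are our own invention, not something from Orchestrator or the linked article. Unlike the script above, it keeps waiting when the query returns nothing.

```shell
#!/bin/bash

# Hypothetical helper (not part of Orchestrator or ProxySQL):
# poll a command until it prints 0 or the attempt budget runs out.
#   $1 = maximum number of polls
#   $2 = delay between polls, in seconds
#   remaining arguments = command that prints the number of connections in use
wait_for_drain() {
  local max_tries=$1 delay=$2
  shift 2
  local tries=0 used
  while [ "$tries" -lt "$max_tries" ]; do
    # An empty or failed result counts as "unknown", so we keep waiting.
    used=$("$@" 2> /dev/null)
    if [ "$used" = "0" ]; then
      return 0
    fi
    tries=$((tries + 1))
    sleep "$delay"
  done
  return 1
}

# Example, assuming a function count_used_connections that wraps the mysql query above:
# wait_for_drain 20 0.05 count_used_connections || echo "old master still has connections" >&2
```

A non-zero return status means the connections had not drained after the last poll, which the hook could log before letting the takeover continue.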

Now, if we kill the master, Orchestrator eventually (after 5 minutes) promotes a slave and gets everything working again.  When the former master is brought back up, Orchestrator never brings it back into replication; it has to be made a slave manually.

When we try to do a graceful master takeover with a slave from the CLI, it refuses, saying `ERROR Relocating 1 replicas of stg1wpplatmysql04:3306 below stg1wpplatmysql03:3306 turns to be too complex; please do it manually.`

When we try with the GUI (by dragging the slave "on top of" the master), it also refuses, saying `Desginated instance stg1wpplatgarbd02:3306 cannot take over all of its siblings. Error: 2019-03-01 12:13:19 ERROR Relocating 1 replicas of stg1wpplatmysql04:3306 below stg1wpplatgarbd02:3306 turns to be too complex; please do it manually.`  We also get the following in the log:
```
Mar  1 10:39:28 stg1wpplatdbmgr01 orchestrator: 2019-03-01 10:39:28 INFO moveReplicasViaGTID: Will move 1 replicas below stg1wpplatmysql03:3306 via GTID
```
_However_, we're not using GTID.  When we query Orchestrator through its API, it reports:
```
# curl -s http://localhost:3000/api/problems | jq
[
  {
    "Key": {
      "Hostname": "stg1wpplatmysql04",
      "Port": 3306
    },
    "InstanceAlias": "",
    "Uptime": 78686,
    "ServerID": 2,
    "ServerUUID": "1dcecf18-3b05-11e9-8732-0050568411f5",
    "Version": "5.7.23-23-57-log",
    "VersionComment": "Percona XtraDB Cluster (GPL), Release rel23, Revision f5578f0, WSREP version 31.31, wsrep_31.31",
    "FlavorName": "Percona",
    "ReadOnly": false,
    "Binlog_format": "ROW",
    "BinlogRowImage": "FULL",
    "LogBinEnabled": true,
    "LogSlaveUpdatesEnabled": false,
    "SelfBinlogCoordinates": {
      "LogFile": "mysql-bin.000007",
      "LogPos": 1628009,
      "Type": 0
    },
...
```
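To make that output easier to scan, the replication-related fields can be reduced to one line per instance. This is a sketch under assumptions: the function name `summarize_problems` is made up, it needs `jq`, and it only reads fields that are visible in the output above. Whether these are the fields that decide the relocation is a guess on our part.

```shell
#!/bin/bash

# Hypothetical helper: read /api/problems JSON on stdin and print one
# tab-separated line per instance:
#   hostname, port, ReadOnly, LogBinEnabled, LogSlaveUpdatesEnabled
summarize_problems() {
  jq -r '.[] | [
    .Key.Hostname,
    (.Key.Port | tostring),
    (.ReadOnly | tostring),
    (.LogBinEnabled | tostring),
    (.LogSlaveUpdatesEnabled | tostring)
  ] | @tsv'
}

# Usage, with the same endpoint as above:
# curl -s http://localhost:3000/api/problems | summarize_problems
```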

I don't know whether this is a documentation issue, a bug, or a combination of the two.  (I strongly suspect there is some configuration that would fix this if we only knew what it was, which would make it a doc issue, I suppose.)

Failover recovery, graceful takeover don't work w/binlog #824