Hi folks,
I was testing a 3-node deployment of a TiDB cluster with default settings and observed a situation where TiDB stopped serving requests for 2 minutes 40 seconds. It happened when a leader got separated from its peers. The unavailability window is too wide, so I believe this is a bug rather than expected behavior.
The TiDB version is:
rystsov@acceptor1:/mnt/perseus/tikv$ ./tidb-latest-linux-amd64/bin/pd-server --version
Git Commit Hash: f5744d7b52aa4793b84cfdcd4efae1fc9a9bac6b
UTC Build Time: 2017-02-17 09:18:31
rystsov@acceptor1:/mnt/perseus/tikv$ ./tidb-latest-linux-amd64/bin/tikv-server --version
Git Commit Hash: eb185b3babc476080306fef7c05b7673c1342455
UTC Build Time: 2017-02-17 08:12:57
Rustc Version: 1.17.0-nightly (ba7cf7cc5 2017-02-11)
rystsov@acceptor1:/mnt/perseus/tikv$ ./tidb-latest-linux-amd64/bin/tidb-server -V
Git Commit Hash: a8d185d8cb8485e1a124919d0df8b10a16bc6e40
UTC Build Time: 2017-02-17 08:50:53
The client app opened a connection to each of the nodes and continuously ran the following loop for each of them:
- read a value by a key
- if the value wasn't set, initialize it to 0
- increment the value
- write it back
- increment a number of successful iterations
- repeat the loop
Each connection used its own key to avoid collisions. If an error occurred during the loop, the client closed the current connection, opened a new one, and began the next iteration.
Once a second it dumped the aggregated number of successful iterations over the last second, both for the whole cluster and per node.
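The per-connection loop above can be sketched as follows. This is a minimal illustration, assuming a simple key-value interface: a plain dict and a hypothetical StoreError stand in for the real store and its failures, while the actual client (see the linked repo) talks to TiDB over the MySQL protocol.

```python
class StoreError(Exception):
    """Stand-in for any error returned by the store during an operation."""

def run_iterations(store, key, rounds):
    """Run the read/increment/write loop, counting successful iterations."""
    successes = 0
    for _ in range(rounds):
        try:
            value = store.get(key)   # read a value by key
            if value is None:        # if the value wasn't set, start from 0
                value = 0
            store[key] = value + 1   # increment and write it back
            successes += 1           # count the successful iteration
        except StoreError:
            # The real client closes the connection on error, opens a new
            # one, and begins the next iteration; here we simply continue.
            continue
    return successes

store = {}
print(run_iterations(store, "conn-1", 5))  # 5 successful iterations
print(store["conn-1"])                     # final value: 5
```

In the real test each connection uses its own key, so the per-key counters never conflict with each other.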
When I separated the leader (10.0.0.7) from its peers with the following commands:
sudo iptables -A INPUT -s 10.0.0.5 -j DROP
sudo iptables -A INPUT -s 10.0.0.6 -j DROP
sudo iptables -A OUTPUT -d 10.0.0.5 -j DROP
sudo iptables -A OUTPUT -d 10.0.0.6 -j DROP
the cluster became unavailable for more than two minutes (the rate of successful iterations dropped to zero).
Please see this repo for the client's code, more information about the incident, and the repro steps: https://github.com/rystsov/perseus/tree/master/tidb