Hi folks,
I was testing a 3-node deployment of a TiDB cluster with default settings and observed a situation where TiDB stopped serving requests for 2 minutes 40 seconds. It happened when a leader got separated from its peers. The unavailability window is too wide, so I believe this is a bug rather than expected behavior.
The TiDB version is:
rystsov@acceptor1:/mnt/perseus/tikv$ ./tidb-latest-linux-amd64/bin/pd-server --version
Git Commit Hash: f5744d7b52aa4793b84cfdcd4efae1fc9a9bac6b
UTC Build Time: 2017-02-17 09:18:31
rystsov@acceptor1:/mnt/perseus/tikv$ ./tidb-latest-linux-amd64/bin/tikv-server --version
Git Commit Hash: eb185b3babc476080306fef7c05b7673c1342455
UTC Build Time: 2017-02-17 08:12:57
Rustc Version: 1.17.0-nightly (ba7cf7cc5 2017-02-11)
rystsov@acceptor1:/mnt/perseus/tikv$ ./tidb-latest-linux-amd64/bin/tidb-server -V
Git Commit Hash: a8d185d8cb8485e1a124919d0df8b10a16bc6e40
UTC Build Time: 2017-02-17 08:50:53
The client app opened a connection to each of the nodes and continuously ran the following loop for each of them:
- read a value by a key
- if the value wasn't set, initialize it to 0
- increment the value
- write it back
- increment a number of successful iterations
- repeat the loop
Each connection used its own key to avoid collisions. If an error occurred during the loop, the client closed the current connection, opened a new one, and began the next iteration.
Once a second it dumped the aggregated number of successful iterations over the last second, both for the whole cluster and per node.
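The per-connection loop above can be sketched as follows. This is a minimal illustration, assuming a simple key-value interface: a plain dict and a hypothetical StoreError stand in for the real store and its failures, while the actual client (see the linked repo) talks to TiDB over the MySQL protocol.

```python
class StoreError(Exception):
    """Stand-in for any error returned by the store during an operation."""

def run_iterations(store, key, rounds):
    """Run the read/increment/write loop, counting successful iterations."""
    successes = 0
    for _ in range(rounds):
        try:
            value = store.get(key)   # read a value by key
            if value is None:        # if the value wasn't set, start from 0
                value = 0
            store[key] = value + 1   # increment and write it back
            successes += 1           # count the successful iteration
        except StoreError:
            # The real client closes the connection on error, opens a new
            # one, and begins the next iteration; here we simply continue.
            continue
    return successes

store = {}
print(run_iterations(store, "conn-1", 5))  # 5 successful iterations
print(store["conn-1"])                     # final value: 5
```

In the real test each connection uses its own key, so the per-key counters never conflict with each other.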
When I separated the leader (10.0.0.7) from its peers with the following commands:
sudo iptables -A INPUT -s 10.0.0.5 -j DROP
sudo iptables -A INPUT -s 10.0.0.6 -j DROP
sudo iptables -A OUTPUT -d 10.0.0.5 -j DROP
sudo iptables -A OUTPUT -d 10.0.0.6 -j DROP
the cluster became unavailable for more than two minutes (the rate of successful iterations dropped to zero).
Please see this repo for the client's code, more information about the incident, and the repro steps: https://github.com/rystsov/perseus/tree/master/tidb