TiDB stopped serving requests for 2m40s when a leader got separated from the rest of the nodes #2676

@rystsov

Description

Hi folks,

I was testing a three-node deployment of a TiDB cluster with default settings and observed a situation where TiDB stopped serving requests for 2 minutes 40 seconds. It happened when a leader got separated from its peers. The unavailability window is too wide, so I believe this is a bug rather than expected behavior.

The versions of the TiDB components are:

rystsov@acceptor1:/mnt/perseus/tikv$ ./tidb-latest-linux-amd64/bin/pd-server --version
Git Commit Hash: f5744d7b52aa4793b84cfdcd4efae1fc9a9bac6b
UTC Build Time:  2017-02-17 09:18:31

rystsov@acceptor1:/mnt/perseus/tikv$ ./tidb-latest-linux-amd64/bin/tikv-server --version
Git Commit Hash: eb185b3babc476080306fef7c05b7673c1342455
UTC Build Time:  2017-02-17 08:12:57
Rustc Version:   1.17.0-nightly (ba7cf7cc5 2017-02-11)

rystsov@acceptor1:/mnt/perseus/tikv$ ./tidb-latest-linux-amd64/bin/tidb-server -V
Git Commit Hash: a8d185d8cb8485e1a124919d0df8b10a16bc6e40
UTC Build Time:  2017-02-17 08:50:53

The client app opened a connection to each of the nodes and continuously ran the following loop for each of them:

  1. read a value by a key
  2. if the value wasn't set then set it to 0
  3. increment the value
  4. write it back
  5. increment a number of successful iterations
  6. repeat the loop

Each connection used its own key to avoid collisions. If an error occurred during the loop, the client closed the current connection, opened a new one, and began the next iteration.

Once a second it dumped the aggregated number of successful iterations for the last second, per cluster and per node.
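The client loop described above can be sketched as follows. This is a minimal Python illustration with a hypothetical `get`/`put` key-value client interface and an in-memory stand-in for a TiDB connection; the real client code is in the linked perseus repo:

```python
import time

def run_iteration(kv, key):
    """One loop iteration: read, initialize if absent, increment, write back."""
    value = kv.get(key)      # 1. read a value by a key
    if value is None:        # 2. if the value wasn't set, set it to 0
        value = 0
    value += 1               # 3. increment the value
    kv.put(key, value)       # 4. write it back
    return value             # 5. caller counts this as a successful iteration

def client_loop(connect, key, successes, stop):
    """Per-connection driver: on any error, reconnect and keep going.

    `connect` is a hypothetical factory returning a client with get/put;
    `successes` collects timestamps of successful iterations, from which the
    per-second rate report can be aggregated.
    """
    kv = connect()
    while not stop():
        try:
            run_iteration(kv, key)
            successes.append(time.time())
        except Exception:
            # close the current connection and open a new one
            kv = connect()

class DictKV:
    """In-memory stand-in for a TiDB connection, for illustration only."""
    def __init__(self):
        self.data = {}
    def get(self, key):
        return self.data.get(key)
    def put(self, key, value):
        self.data[key] = value
```

When the partition described below hits the node a connection points at, every `run_iteration` against it fails, its per-second success count drops to zero, and the loop keeps reconnecting until the cluster recovers.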

When I separated a leader (10.0.0.7) from its peers with the following commands:

sudo iptables -A INPUT -s 10.0.0.5 -j DROP
sudo iptables -A INPUT -s 10.0.0.6 -j DROP
sudo iptables -A OUTPUT -d 10.0.0.5 -j DROP
sudo iptables -A OUTPUT -d 10.0.0.6 -j DROP

the cluster became unavailable for more than two minutes (the rate of successful iterations dropped to zero).

Please see this repo for the client's code, more information about the incident, and the repro steps: https://github.com/rystsov/perseus/tree/master/tidb
