
After recovery from a TiKV failure, another TiKV node was detected as a slow store #18605

@jolynejo

Description


Bug Report

What version of TiKV are you using?

v8.5.2

What operating system and CPU are you using?

TiDB Cloud
16 vCPU, 64 GiB
2000 GiB Storage × 18 Nodes
Standard storage

Steps to reproduce

1. Simulate a random TiKV node failure, then recover the node after 10 minutes.
2. Continuously run a simulated workload during and after the failure.
3. After db-tikv-5 recovered from the failure, QPS dropped by 50%. db-tikv-2 was detected as a slow store and had its leaders evicted.
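For triage, the slow-store state and the evict-slow-store scheduler can be inspected with PD Control. The sketch below assumes access to the cluster via `tiup ctl`; the PD address is a placeholder, and the exact field names in the `store` output may vary by version — these are illustrative commands, not output captured from this cluster:

```shell
# List all stores; each store's status includes a slow score reported by
# TiKV (a score approaching 100 is what triggers slow-store detection).
tiup ctl:v8.5.2 pd -u http://<pd-addr>:2379 store

# Show active schedulers to confirm whether evict-slow-store-scheduler
# is currently evicting leaders from a store.
tiup ctl:v8.5.2 pd -u http://<pd-addr>:2379 scheduler show

# If leaders were evicted from a store that is actually healthy, the
# scheduler can be removed manually as a mitigation (note that PD may
# re-add it when a store is detected as slow again).
tiup ctl:v8.5.2 pd -u http://<pd-addr>:2379 scheduler remove evict-slow-store-scheduler
```

This only helps confirm which store PD considered slow and when; it does not explain why db-tikv-2's slow score rose after db-tikv-5's recovery.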

What did you expect?

  1. Stable workload QPS after failure recovery.
  2. No unexpected slow-store leader eviction on healthy nodes.

What happened?

  1. QPS dropped by 50% for several minutes.
    (See the attached monitoring screenshots.)

Metadata

Labels

affects-7.5 — This bug affects the 7.5.x (LTS) versions.
affects-8.1 — This bug affects the 8.1.x (LTS) versions.
affects-8.5 — This bug affects the 8.5.x (LTS) versions.
report/customer — Customers have encountered this bug.
severity/major
type/bug — The issue is confirmed as a bug.
