Description
Development Task
When a TiKV peer becomes unavailable, region leaders on other nodes are prevented from entering hibernation due to the following logic:
tikv/components/raftstore/src/store/fsm/peer.rs
Lines 2487 to 2495 in d50767d
```rust
if res.is_none() /* hibernate_region is false */ ||
    !self.fsm.peer.check_after_tick(self.fsm.hibernate_state.group_state(), res.unwrap()) ||
    (self.fsm.peer.is_leader() && !self.all_agree_to_hibernate())
{
    self.register_raft_base_tick();
    // We need pd heartbeat tick to collect down peers and pending peers.
    self.register_pd_heartbeat_tick();
    return;
}
```
This forces all leaders to remain active, causing a massive surge in Raft heartbeat traffic. In a cluster with a high number of Regions, this surge consumes excessive CPU in the Raftstore, potentially leading to CPU exhaustion and performance degradation.
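To make the effect of the quoted condition concrete, here is a minimal sketch that models it outside of TiKV. The `LeaderView` struct, its fields, and the `idle_leader` helper are hypothetical and exist only for illustration. The sketch also assumes that agreement requires a vote from every peer in the Region, so a peer that is down can never contribute its vote; the real `all_agree_to_hibernate` may differ in detail.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified view of one Region leader (not TiKV's real types).
struct LeaderView {
    hibernate_enabled: bool, // stands in for `res.is_some()`
    idle_after_tick: bool,   // stands in for `check_after_tick(..)` returning true
    is_leader: bool,
    peers: Vec<u64>,         // ids of all peers in the Region
    agreed: HashSet<u64>,    // peers that have agreed to hibernate
}

impl LeaderView {
    /// Assumption: agreement requires every peer to have voted.
    fn all_agree_to_hibernate(&self) -> bool {
        self.peers.iter().all(|id| self.agreed.contains(id))
    }

    /// Mirrors the shape of the quoted condition: `true` means the peer
    /// re-registers its ticks instead of hibernating.
    fn must_keep_ticking(&self) -> bool {
        !self.hibernate_enabled
            || !self.idle_after_tick
            || (self.is_leader && !self.all_agree_to_hibernate())
    }
}

/// Builds an idle leader of a 3-peer Region; `agreed` lists the peers that voted.
fn idle_leader(agreed: &[u64]) -> LeaderView {
    LeaderView {
        hibernate_enabled: true,
        idle_after_tick: true,
        is_leader: true,
        peers: vec![1, 2, 3],
        agreed: agreed.iter().copied().collect(),
    }
}

fn main() {
    let healthy = idle_leader(&[1, 2, 3]);
    let one_down = idle_leader(&[1, 2]); // peer 3 is down and never votes
    println!("healthy keeps ticking: {}", healthy.must_keep_ticking());
    println!("one peer down keeps ticking: {}", one_down.must_keep_ticking());
}
```

Under these assumptions, an otherwise idle leader hibernates only when all three peers have agreed. With one peer missing, the third clause keeps the leader ticking indefinitely, which matches the behavior reported in this issue.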
One typical scenario is described below:
The cluster has 3 TiKV instances, and each instance hosts 30w (300,000) Regions.
Heartbeat messages sent before one TiKV instance goes down:
Heartbeat messages sent after one TiKV instance goes down:
Raftstore CPU usage before one TiKV instance goes down:
Raftstore CPU usage after one TiKV instance goes down:

