Raftstore CPU Exhaustion: leaders prevented from hibernating after one TiKV instance goes down #19070

## Development Task

When a TiKV peer becomes unavailable, region leaders on other nodes are prevented from entering hibernation due to the following logic:
https://github.com/tikv/tikv/blob/d50767d80798053594eb5bd3ccf01ee3fc79f462/components/raftstore/src/store/fsm/peer.rs#L2487-L2495

This forces all leaders to remain active, causing a massive surge in Raft heartbeat traffic. In a cluster with a high number of Regions, this surge consumes excessive CPU in the Raftstore, potentially leading to CPU exhaustion and performance degradation.
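To give a feel for the scale, here is a rough back-of-the-envelope sketch of the heartbeat rate produced by leaders that stay awake. The function name, the leader count, the follower count, and the tick settings are illustrative assumptions, not values measured from this cluster or read from its configuration.

```rust
/// Rough estimate of heartbeat messages per second sent by the awake leaders
/// on one store. Every parameter is an assumption chosen for illustration.
fn heartbeats_per_sec(
    awake_leaders: u64,
    followers_per_leader: u64,
    base_tick_ms: u64,
    heartbeat_ticks: u64,
) -> f64 {
    // A leader sends one heartbeat to each follower every `heartbeat_ticks` base ticks.
    let interval_secs = (base_tick_ms * heartbeat_ticks) as f64 / 1000.0;
    (awake_leaders * followers_per_leader) as f64 / interval_secs
}

fn main() {
    // Hypothetical numbers: 100,000 awake leaders, 2 followers each,
    // a 1 s base tick, and a heartbeat every 2 ticks.
    let awake = heartbeats_per_sec(100_000, 2, 1000, 2);
    // If every leader were hibernating, no periodic heartbeats would be sent.
    let hibernating = heartbeats_per_sec(0, 2, 1000, 2);
    println!("awake: {awake} msg/s, hibernating: {hibernating} msg/s");
}
```

The estimate only counts leader-to-follower heartbeats; responses and the per-Region tick processing add further Raftstore work on top of it.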

One typical scenario is described below:
The cluster has 3 TiKV instances, and each instance holds 300,000 (30w) Regions.
Heartbeat messages sent before one TiKV instance goes down:

<img width="1884" height="358" alt="Image" src="https://github.com/user-attachments/assets/8e4c233f-4744-46a8-ab14-ce1423e4ea14" />

Heartbeat messages sent after one TiKV instance goes down:

![Image](https://github.com/user-attachments/assets/0c785210-cea4-4cbf-8c37-da543ddb84ea)


Raftstore CPU usage before one TiKV instance goes down:

<img width="3740" height="788" alt="Image" src="https://github.com/user-attachments/assets/590f8970-f7af-4220-9c43-901194b2f447" />

Raftstore CPU usage after one TiKV instance goes down:

<img width="3762" height="810" alt="Image" src="https://github.com/user-attachments/assets/be05e4f0-c415-4bce-86dc-4c0ecb345126" />




Referenced code (components/raftstore/src/store/fsm/peer.rs, lines 2487-2495):

	if res.is_none() /* hibernate_region is false */
	    || !self.fsm.peer.check_after_tick(self.fsm.hibernate_state.group_state(), res.unwrap())
	    || (self.fsm.peer.is_leader() && !self.all_agree_to_hibernate())
	{
	    self.register_raft_base_tick();
	    // We need pd heartbeat tick to collect down peers and pending peers.
	    self.register_pd_heartbeat_tick();
	    return;
	}
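
The effect of the third clause can be illustrated with a small standalone model. This is a simplified sketch, not the TiKV implementation: the `PeerVote` type and both functions are hypothetical stand-ins, and it assumes that an unreachable peer never reports agreement to hibernate.

```rust
/// Hypothetical view of another peer's answer, as seen by the leader.
#[derive(Clone, Copy, PartialEq, Debug)]
enum PeerVote {
    Agreed,
    NoResponse, // assumed state of a peer on the TiKV instance that is down
}

/// Toy model: the leader may hibernate only if every other peer agreed.
fn all_agree_to_hibernate(votes: &[PeerVote]) -> bool {
    votes.iter().all(|v| *v == PeerVote::Agreed)
}

/// Mirrors the shape of the quoted condition: true means the peer
/// re-registers its ticks instead of hibernating.
fn must_keep_ticking(
    hibernate_enabled: bool,
    idle_after_tick: bool,
    is_leader: bool,
    votes: &[PeerVote],
) -> bool {
    !hibernate_enabled || !idle_after_tick || (is_leader && !all_agree_to_hibernate(votes))
}

fn main() {
    let healthy = [PeerVote::Agreed, PeerVote::Agreed];
    let one_down = [PeerVote::Agreed, PeerVote::NoResponse];

    // All followers reachable and idle: the leader can hibernate.
    println!("healthy keeps ticking: {}", must_keep_ticking(true, true, true, &healthy));
    // One follower unreachable: the leader keeps ticking indefinitely.
    println!("one down keeps ticking: {}", must_keep_ticking(true, true, true, &one_down));
}
```

Under these assumptions, a single unreachable peer is enough to keep the leader of every affected Region awake, which matches the behavior reported in this issue.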
