Raftstore CPU exhaustion: leaders prevented from hibernating after one TiKV instance goes down #19070

@mayjiang0203

Development Task

When a TiKV peer becomes unavailable, region leaders on other nodes are prevented from entering hibernation due to the following logic:

if res.is_none() /* hibernate_region is false */ ||
    !self.fsm.peer.check_after_tick(self.fsm.hibernate_state.group_state(), res.unwrap()) ||
    (self.fsm.peer.is_leader() && !self.all_agree_to_hibernate())
{
    self.register_raft_base_tick();
    // We need pd heartbeat tick to collect down peers and pending peers.
    self.register_pd_heartbeat_tick();
    return;
}

This forces all leaders to remain active and keep ticking, causing a massive surge in Raft heartbeat traffic. In a cluster with a large number of Regions, this surge consumes excessive Raftstore CPU and can lead to CPU exhaustion and performance degradation.
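The effect of the last clause in the condition can be illustrated with a small standalone model. This is only a sketch, not TiKV code: the LeaderModel type and its methods are hypothetical, and it assumes that all_agree_to_hibernate() is true only once every follower has agreed to hibernate, which an unreachable peer can never do.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified stand-in for a Region leader's view of its peers.
/// None of these names come from TiKV; they exist only for this sketch.
pub struct LeaderModel {
    /// IDs of every peer in the Region, including the leader itself.
    pub peers: Vec<u64>,
    /// ID of the leader peer.
    pub leader_id: u64,
    /// Followers that have agreed to hibernate so far.
    pub agreed: HashSet<u64>,
}

impl LeaderModel {
    pub fn new(leader_id: u64, peers: Vec<u64>) -> Self {
        LeaderModel {
            peers,
            leader_id,
            agreed: HashSet::new(),
        }
    }

    /// Record a hibernate agreement from a reachable follower.
    /// IDs that do not belong to the Region are ignored.
    pub fn record_agreement(&mut self, peer_id: u64) {
        if self.peers.contains(&peer_id) {
            self.agreed.insert(peer_id);
        }
    }

    /// Assumed semantics: every follower must have agreed.
    pub fn all_agree_to_hibernate(&self) -> bool {
        self.peers
            .iter()
            .filter(|&&id| id != self.leader_id)
            .all(|id| self.agreed.contains(id))
    }

    /// Mirrors the leader clause of the quoted condition:
    /// `true` means the leader re-registers its ticks and stays awake.
    pub fn must_keep_ticking(&self) -> bool {
        !self.all_agree_to_hibernate()
    }
}
```

In this model, a leader with peers 1, 2, 3 whose followers 2 and 3 both agree can hibernate, whereas the same leader with follower 3 down never collects the missing agreement, so must_keep_ticking() stays true for as long as the peer is unavailable.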

One typical scenario is described below:
The cluster has 3 TiKV instances, and each instance holds 300,000 Regions.
Heartbeat messages sent before one TiKV instance goes down:

Image

Heartbeat messages sent after one TiKV instance goes down:

Image

Raftstore CPU usage before one TiKV instance goes down:

Image

Raftstore CPU usage after one TiKV instance goes down:

Image
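For a rough sense of scale, the heartbeat load from awake leaders can be estimated with simple arithmetic. The helper below is a hypothetical back-of-the-envelope sketch, not a measurement: the function name, the 3-replica layout, and the heartbeat interval are all assumptions, and it counts only leader-to-follower heartbeats, not the responses.

```rust
/// Hypothetical estimate of leader-to-follower heartbeat messages per second.
///
/// * `awake_leaders`: number of Region leaders that are not hibernating.
/// * `followers_per_leader`: followers each leader heartbeats (2 with 3 replicas).
/// * `heartbeat_interval_secs`: assumed time between heartbeats of one leader.
pub fn heartbeats_per_sec(
    awake_leaders: u64,
    followers_per_leader: u64,
    heartbeat_interval_secs: f64,
) -> f64 {
    (awake_leaders * followers_per_leader) as f64 / heartbeat_interval_secs
}
```

Assuming 3 replicas per Region (so 300,000 Regions cluster-wide in the scenario above) and an assumed 2-second heartbeat interval, heartbeats_per_sec(300_000, 2, 2.0) evaluates to 300,000 messages per second across the cluster once no leader can hibernate, compared with close to zero when idle Regions are hibernating.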

Labels

affects-7.5: This bug affects the 7.5.x (LTS) versions.
affects-8.5: This bug affects the 8.5.x (LTS) versions.
contribution: This PR is from a community contributor.
type/enhancement: The issue or PR belongs to an enhancement.
