raftstore: avoid early hibernate if pending on applying logs when restart by LykxSassinator · Pull Request #18236 · tikv/tikv

LykxSassinator · 2025-02-20T10:15:50Z

What is changed and how it works?

Issue Number: Close #18233

What's Changed:

In previous work #16239, we introduced the busy_on_apply state to indicate
whether a Peer still has pending Raft logs to apply after a restart.

However, this approach misses a corner case: if the Peer quickly enters the
hibernate state after restarting, the busy_on_apply state may not be updated
in a timely manner. As a result, the Node fails to update the count of regions
pending apply and keeps reporting an incorrect is_busy == true state to PD.
Consequently, this can slow down the rolling-restart progress more than expected.

Therefore, this PR addresses this issue by updating the applied state in on_apply_res.
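
The idea described above can be sketched with a small, self-contained model. This is not the actual TiKV code: `PeerModel`, `StoreModel`, `pending_apply_count`, `register`, and `restarted` are hypothetical names introduced for illustration, and the rule "busy while any peer is counted" is a simplifying assumption. Only `busy_on_apply`, `on_apply_res`, `applied_index`, `committed_index`, and `is_busy` are taken from the description.

```rust
/// Hypothetical, simplified model of a peer after restart (not TiKV's real `Peer`).
#[derive(Debug)]
pub struct PeerModel {
    pub applied_index: u64,
    pub committed_index: u64,
    /// True while the peer still has committed but unapplied logs after the restart.
    pub busy_on_apply: bool,
    /// True once the peer has stopped ticking.
    pub hibernated: bool,
}

/// Hypothetical store-level counter of peers that are still busy applying.
#[derive(Debug, Default)]
pub struct StoreModel {
    pub pending_apply_count: usize,
}

impl StoreModel {
    /// Register a restarted peer and count it if it still has logs to apply.
    pub fn register(&mut self, peer: &PeerModel) {
        if peer.busy_on_apply {
            self.pending_apply_count += 1;
        }
    }

    /// Assumption of this sketch: the store is busy while any peer is counted.
    pub fn is_busy(&self) -> bool {
        self.pending_apply_count > 0
    }
}

impl PeerModel {
    /// Build a peer as it might look right after a restart.
    pub fn restarted(applied_index: u64, committed_index: u64) -> Self {
        PeerModel {
            applied_index,
            committed_index,
            busy_on_apply: applied_index < committed_index,
            hibernated: false,
        }
    }

    /// The apply result itself clears the flag, so the update does not
    /// depend on the peer still ticking.
    pub fn on_apply_res(&mut self, new_applied_index: u64, store: &mut StoreModel) {
        self.applied_index = new_applied_index;
        if self.busy_on_apply && self.applied_index >= self.committed_index {
            self.busy_on_apply = false;
            store.pending_apply_count = store.pending_apply_count.saturating_sub(1);
        }
    }
}
```

In this model, a peer that hibernates before its logs are applied still clears `busy_on_apply` when the apply result arrives, so `is_busy()` eventually returns false even though no further tick runs.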

Fix the bug where some hibernated peers, marked with `busy_on_apply == true`,
cannot be reset to normal even though `applied_index == committed_index`.

Related changes

PR to update pingcap/docs / pingcap/docs-cn:
Need to cherry-pick to the release branch

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No code

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Release note

Fix the bug where some hibernated peers, marked with `busy_on_apply == true`,
cannot be reset to normal even though `applied_index == committed_index`.

…tart. Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>

Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>

overvenus · 2025-02-20T15:06:19Z

components/raftstore/src/store/peer.rs

                && !self.is_leader()
                // Keep ticking if it's waiting for snapshot.
                && !self.wait_data
+                // Keep ticking if it still has some pending and unapplied raft logs.


How about checking and updating busy_on_apply in on_apply_res? Continued ticking might wake up hibernated peers.

Both acceptable but I prefer this choice.

IMO, since restarting a single node will wake up all hibernated regions on that node, applying this change will not introduce higher pressure than the peak load during this period. Therefore, simply ticking until the unapplied state has been updated is acceptable.
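
For reference, the tick-based variant discussed in this thread can be sketched as a guard on the "may stop ticking" decision. This is a simplified illustration, not the code in components/raftstore/src/store/peer.rs: `TickState` and `may_stop_ticking` are hypothetical names, and the assumption that the added condition tests `busy_on_apply` is inferred from the comment in the diff, which does not show the condition itself.

```rust
/// Hypothetical subset of the state consulted before a follower stops ticking.
pub struct TickState {
    pub is_leader: bool,
    pub wait_data: bool,
    /// Assumed flag: true while committed logs are still waiting to be applied.
    pub busy_on_apply: bool,
}

/// Returns true if the peer may stop ticking (hibernate) in this simplified model.
pub fn may_stop_ticking(s: &TickState) -> bool {
    !s.is_leader
        // Keep ticking if it's waiting for snapshot.
        && !s.wait_data
        // Keep ticking if it still has some pending and unapplied raft logs.
        && !s.busy_on_apply
}
```

Under this guard a peer with unapplied logs keeps ticking, which costs some extra ticks after restart but guarantees the tick handler gets a chance to observe the updated apply state.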

I'm fine with the current approach as well. I agree that the on_apply_res approach will allow peers to go into hibernation earlier, saving some ticks. But the difference might not matter a lot because the code change only affects the first few minutes after a TiKV restart, when the server is still in the busy apply stage. During this time, the server has no leaders and some extra ticks likely won't be a significant issue.

Besides saving ticks, I am also considering maintainability. Since the Hibernate region-related code is already complex, it's better to leave it unchanged. To me, it seems more intuitive to check and update the busy_on_apply field in on_apply_res rather than in on_raft_base_tick.

Also, why was it originally checked in on_raft_base_tick? Is there a reason I'm missing?

To safely unify the initialization of the check mechanism for the busy_on_apply state, the following cases will be covered, specifically after the peers are restarted and their roles (either "follower" or "leader") have been confirmed:

Peers with applied_index == committed_index will be checked only once.

Peers with applied_index < committed_index can be updated using the Raft base tick.
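
The two cases above can be written down as a small classification. This is only a sketch under assumptions: `BusyCheck` and `classify` are hypothetical names, and the second case is read as the peer still lagging behind, i.e. `applied_index` strictly less than `committed_index`.

```rust
/// Hypothetical classification of how a restarted peer's busy state gets checked.
#[derive(Debug, PartialEq, Eq)]
pub enum BusyCheck {
    /// Nothing is pending, so a single check is enough.
    CheckOnce,
    /// Logs are still pending, so the state is re-checked on later Raft base ticks.
    RecheckOnTick,
}

/// Classify a peer by comparing its applied and committed indexes.
pub fn classify(applied_index: u64, committed_index: u64) -> BusyCheck {
    // applied_index is not expected to exceed committed_index; `>=` is used
    // defensively so that such a peer is never treated as lagging.
    if applied_index >= committed_index {
        BusyCheck::CheckOnce
    } else {
        BusyCheck::RecheckOnTick
    }
}
```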

Besides saving ticks, I am also considering maintainability. Since the Hibernate region-related code is already complex, it's better to leave it unchanged.

Accepted. This point is reasonable, as it avoids increasing the complexity of the Hibernate mechanism.

hbisheng · 2025-02-24T09:26:54Z

components/raftstore/src/store/peer.rs

                && !self.is_leader()
                // Keep ticking if it's waiting for snapshot.
                && !self.wait_data
+                // Keep ticking if it still has some pending and unapplied raft logs.


I'm fine with the current approach as well. I agree that the on_apply_res approach will allow peers to go into hibernation earlier, saving some ticks. But the difference might not matter a lot because the code change only affects the first few minutes after a TiKV restart, when the server is still in the busy apply stage. During this time, the server has no leaders and some extra ticks likely won't be a significant issue.

Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>

hbisheng · 2025-03-05T02:13:17Z

Glad to see the new approach looks simpler

ti-chi-bot · 2025-03-05T02:13:37Z

@hbisheng: Your lgtm message is repeated, so it is ignored.

ti-chi-bot · 2025-03-05T05:09:55Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hbisheng, overvenus

Needs approval from an approver in each of these files:

~~OWNERS~~ [overvenus]


ti-chi-bot · 2025-03-05T05:09:57Z

[LGTM Timeline notifier]

Timeline:

2025-02-24 09:31:28.931539807 +0000 UTC m=+261836.884698075: ☑️ agreed by hbisheng.
2025-03-05 05:09:56.93814062 +0000 UTC m=+418310.067060361: ☑️ agreed by overvenus.

ti-chi-bot · 2025-04-14T09:04:56Z

In response to a cherrypick label: new pull request created to branch release-8.5: #18393.

…tart (#18236) (#18393) close #18233 Fix the bug where some hibernated peers, marked with `busy_on_apply == true`, cannot be reset to normal even though `applied_index == committed_index`. Signed-off-by: lucasliang <nkcs_lykx@hotmail.com> Co-authored-by: lucasliang <nkcs_lykx@hotmail.com>

close tikv#18233 Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>

ti-chi-bot · 2025-06-30T08:45:13Z

In response to a cherrypick label: new pull request created to branch release-7.5: #18601.
But this PR has conflicts, please resolve them!

ti-chi-bot · 2025-06-30T08:45:15Z

In response to a cherrypick label: new pull request created to branch release-8.1: #18602.

…tart (#18236) (#18601) close #18233 Fix the bug where some hibernated peers, marked with `busy_on_apply == true`, cannot be reset to normal even though `applied_index == committed_index`. Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io> Signed-off-by: lucasliang <nkcs_lykx@hotmail.com> Co-authored-by: lucasliang <nkcs_lykx@hotmail.com>

raftstore: avoid early hibernate if pending on applying logs when res…

1e92825

…tart. Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>

ti-chi-bot bot added the release-note-none, dco-signoff: yes, and size/M labels Feb 20, 2025

LykxSassinator requested review from hbisheng and overvenus February 20, 2025 10:43

Add more cases.

02acc24

Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>

overvenus reviewed Feb 20, 2025

hbisheng approved these changes Feb 24, 2025

ti-chi-bot bot added the needs-1-more-lgtm label Feb 24, 2025

LykxSassinator requested a review from overvenus March 3, 2025 09:53

Address comments.

0d1366c

Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>

ti-chi-bot bot added the release-note label and removed the release-note-none label Mar 4, 2025

LykxSassinator requested a review from hbisheng March 4, 2025 07:28

Merge branch 'master' into fix_early_hibernate

4f86bf2

Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>

hbisheng approved these changes Mar 5, 2025

overvenus approved these changes Mar 5, 2025

ti-chi-bot bot added the lgtm and approved labels and removed the needs-1-more-lgtm label Mar 5, 2025

ti-chi-bot bot merged commit 6ec92c8 into tikv:master Mar 5, 2025
8 checks passed

ti-chi-bot bot added this to the Pool milestone Mar 5, 2025

LykxSassinator deleted the fix_early_hibernate branch March 5, 2025 06:48

ti-chi-bot bot added the needs-cherry-pick-release-8.5 label Apr 14, 2025

ti-chi-bot mentioned this pull request Apr 14, 2025

raftstore: avoid early hibernate if pending on applying logs when restart (#18236) #18393

Merged

9 tasks

ti-chi-bot pushed a commit to ti-chi-bot/tikv that referenced this pull request Jun 30, 2025

This is an automated cherry-pick of tikv#18236

f3ae3aa

close tikv#18233 Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>

ti-chi-bot mentioned this pull request Jun 30, 2025

raftstore: avoid early hibernate if pending on applying logs when restart (#18236) #18601

Merged

9 tasks

ti-chi-bot mentioned this pull request Jun 30, 2025

raftstore: avoid early hibernate if pending on applying logs when restart (#18236) #18602

Open

9 tasks

This was referenced Dec 11, 2025

Raft peers may get stuck in busy apply state post TiKV startup #18233

Closed

raftstore: fix the corner case if entering hiberate state without correctly clearing busy_on_apply state. #19199

Merged

ti-chi-bot mentioned this pull request Dec 12, 2025

raftstore: fix the corner case if entering hiberate state without correctly clearing busy_on_apply state. (#19199) #19202

Merged

9 tasks
