Skip to content

raftstore: avoid early hibernate if pending on applying logs when restart#18236

Merged
ti-chi-bot[bot] merged 4 commits intotikv:masterfrom
LykxSassinator:fix_early_hibernate
Mar 5, 2025
Merged

raftstore: avoid early hibernate if pending on applying logs when restart#18236
ti-chi-bot[bot] merged 4 commits intotikv:masterfrom
LykxSassinator:fix_early_hibernate

Conversation

@LykxSassinator
Copy link
Contributor

@LykxSassinator LykxSassinator commented Feb 20, 2025

What is changed and how it works?

Issue Number: Close #18233

What's Changed:

In previous work #16239, we introduced the busy_on_apply state to indicate
whether a Peer is pending the application of pending Raft logs upon restart.

However, this approach misses a corner case: if the Peer quickly enters the
hibernate state after restarting, the busy_on_apply state may not be updated
in a timely manner. This results in the Node failed to update the count of pending
applying regions, continuously reporting an incorrect is_busy == true state to PD.
Consequently, this can slow down the rolling-restart progress more than expected.

Therefore, this PR addresses this issue by updating the applied state in on_apply_res.

Fix the bug where some hibernated peers, marked with `busy_on_apply == true`, 
cannot be reset with normal even thought the `applied_index == committed_index`.

Related changes

  • PR to update pingcap/docs/pingcap/docs-cn:
  • Need to cherry-pick to the release branch

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Release note

Fix the bug where some hibernated peers, marked with `busy_on_apply == true`, 
cannot be reset with normal even thought the `applied_index == committed_index`.

…tart.

Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>
@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has signed the dco. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 20, 2025
Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>
&& !self.is_leader()
// Keep ticking if it's waiting for snapshot.
&& !self.wait_data
// Keep ticking if it still has some pending and unapplied raft logs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about check and update busy_on_apply on on_apply_res? Keeping tick might wake up hibernated peers.

Copy link
Contributor Author

@LykxSassinator LykxSassinator Feb 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both acceptable but I prefer this choice.

IMO, since restarting a single node will wake up all hibernated regions on that node, applying this change will not introduce higher pressure than the peak load during this period. Therefore, simply ticking until the unapplied state has been updated is acceptable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with the current approach as well. I agree that the on_apply_res approach will allow peers to go into hibernation earlier, saving some ticks. But the difference might not matter a lot because the code change only affects the first few minutes after a TiKV restart, when the server is still in the busy apply stage. During this time, the server has no leaders and some extra ticks likely won't be a significant issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides saving ticks, I am also considering maintainability. Since the Hibernate region-related code is already complex, it's better to leave it unchanged. To me, it seems more intuitive to check and update the busy_on_apply field in on_apply_res rather than in on_raft_base_tick.

Also, why was it originally checked in on_raft_base_tick? Is there a reason I'm missing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To safely unify the initialization of the check mechanism for the busy_on_apply state, the following cases will be covered, specifically after the peers are restarted and their roles (either "follower" or "leader") have been confirmed:

  1. Peers with applied_index == committed_index will be checked only once.
  2. Peers with applied_index <= committed_index can be updated using the Raft base tick.

Besides saving ticks, I am also considering maintainability. Since the Hibernate region-related code is already complex, it's better to leave it unchanged.

Accepted. This point is reasonable and acceptable, as it avoids increasing the complexity of the Hibernate mechanism.

&& !self.is_leader()
// Keep ticking if it's waiting for snapshot.
&& !self.wait_data
// Keep ticking if it still has some pending and unapplied raft logs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with the current approach as well. I agree that the on_apply_res approach will allow peers to go into hibernation earlier, saving some ticks. But the difference might not matter a lot because the code change only affects the first few minutes after a TiKV restart, when the server is still in the busy apply stage. During this time, the server has no leaders and some extra ticks likely won't be a significant issue.

@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Feb 24, 2025
@LykxSassinator LykxSassinator requested a review from overvenus March 3, 2025 09:53
Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>
@ti-chi-bot ti-chi-bot bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Mar 4, 2025
@LykxSassinator LykxSassinator requested a review from hbisheng March 4, 2025 07:28
Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>
@hbisheng
Copy link
Member

hbisheng commented Mar 5, 2025

Glad to see the new approach looks simpler

@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Mar 5, 2025

@hbisheng: Your lgtm message is repeated, so it is ignored.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Mar 5, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hbisheng, overvenus

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added lgtm approved and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Mar 5, 2025
@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Mar 5, 2025

[LGTM Timeline notifier]

Timeline:

  • 2025-02-24 09:31:28.931539807 +0000 UTC m=+261836.884698075: ☑️ agreed by hbisheng.
  • 2025-03-05 05:09:56.93814062 +0000 UTC m=+418310.067060361: ☑️ agreed by overvenus.

@ti-chi-bot ti-chi-bot bot merged commit 6ec92c8 into tikv:master Mar 5, 2025
8 checks passed
@ti-chi-bot ti-chi-bot bot added this to the Pool milestone Mar 5, 2025
@LykxSassinator LykxSassinator deleted the fix_early_hibernate branch March 5, 2025 06:48
@ti-chi-bot ti-chi-bot bot added the needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. label Apr 14, 2025
@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created to branch release-8.5: #18393.

ti-chi-bot bot pushed a commit that referenced this pull request Apr 16, 2025
…tart (#18236) (#18393)

close #18233

Fix the bug where some hibernated peers, marked with `busy_on_apply == true`, 
cannot be reset with normal even thought the `applied_index == committed_index`.

Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>

Co-authored-by: lucasliang <nkcs_lykx@hotmail.com>
@LykxSassinator LykxSassinator added needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch. needs-cherry-pick-release-8.1 Should cherry pick this PR to release-8.1 branch. and removed needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch. needs-cherry-pick-release-8.1 Should cherry pick this PR to release-8.1 branch. labels Jun 30, 2025
ti-chi-bot pushed a commit to ti-chi-bot/tikv that referenced this pull request Jun 30, 2025
close tikv#18233

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created to branch release-7.5: #18601.
But this PR has conflicts, please resolve them!

@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created to branch release-8.1: #18602.

ti-chi-bot bot pushed a commit that referenced this pull request Jul 15, 2025
…tart (#18236) (#18601)

close #18233

Fix the bug where some hibernated peers, marked with `busy_on_apply == true`, 
cannot be reset with normal even thought the `applied_index == committed_index`.

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>

Co-authored-by: lucasliang <nkcs_lykx@hotmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved dco-signoff: yes Indicates the PR's author has signed the dco. lgtm needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch. needs-cherry-pick-release-8.1 Should cherry pick this PR to release-8.1 branch. needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Raft peers may get stuck in busy apply state post TiKV startup

4 participants