raftstore: avoid early hibernate if pending on applying logs when restart#18236
Conversation
…tart. Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>
Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>
| && !self.is_leader() | ||
| // Keep ticking if it's waiting for snapshot. | ||
| && !self.wait_data | ||
| // Keep ticking if it still has some pending and unapplied raft logs. |
There was a problem hiding this comment.
How about check and update busy_on_apply on on_apply_res? Keeping tick might wake up hibernated peers.
There was a problem hiding this comment.
Both acceptable but I prefer this choice.
IMO, since restarting a single node will wake up all hibernated regions on that node, applying this change will not introduce higher pressure than the peak load during this period. Therefore, simply ticking until the unapplied state has been updated is acceptable.
There was a problem hiding this comment.
I'm fine with the current approach as well. I agree that the on_apply_res approach will allow peers to go into hibernation earlier, saving some ticks. But the difference might not matter a lot because the code change only affects the first few minutes after a TiKV restart, when the server is still in the busy apply stage. During this time, the server has no leaders and some extra ticks likely won't be a significant issue.
There was a problem hiding this comment.
Besides saving ticks, I am also considering maintainability. Since the Hibernate region-related code is already complex, it's better to leave it unchanged. To me, it seems more intuitive to check and update the busy_on_apply field in on_apply_res rather than in on_raft_base_tick.
Also, why was it originally checked in on_raft_base_tick? Is there a reason I'm missing?
There was a problem hiding this comment.
To safely unify the initialization of the check mechanism for the busy_on_apply state, the following cases will be covered, specifically after the peers are restarted and their roles (either "follower" or "leader") have been confirmed:
- Peers with
applied_index==committed_indexwill be checked only once. - Peers with
applied_index<=committed_indexcan be updated using the Raft base tick.
Besides saving ticks, I am also considering maintainability. Since the Hibernate region-related code is already complex, it's better to leave it unchanged.
Accepted. This point is reasonable and acceptable, as it avoids increasing the complexity of the Hibernate mechanism.
| && !self.is_leader() | ||
| // Keep ticking if it's waiting for snapshot. | ||
| && !self.wait_data | ||
| // Keep ticking if it still has some pending and unapplied raft logs. |
There was a problem hiding this comment.
I'm fine with the current approach as well. I agree that the on_apply_res approach will allow peers to go into hibernation earlier, saving some ticks. But the difference might not matter a lot because the code change only affects the first few minutes after a TiKV restart, when the server is still in the busy apply stage. During this time, the server has no leaders and some extra ticks likely won't be a significant issue.
Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>
Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>
|
Glad to see the new approach looks simpler |
|
@hbisheng: Your lgtm message is repeated, so it is ignored. DetailsIn response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: hbisheng, overvenus The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
In response to a cherrypick label: new pull request created to branch |
close tikv#18233 Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
|
In response to a cherrypick label: new pull request created to branch |
|
In response to a cherrypick label: new pull request created to branch |
…tart (#18236) (#18601) close #18233 Fix the bug where some hibernated peers, marked with `busy_on_apply == true`, cannot be reset with normal even thought the `applied_index == committed_index`. Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io> Signed-off-by: lucasliang <nkcs_lykx@hotmail.com> Co-authored-by: lucasliang <nkcs_lykx@hotmail.com>
What is changed and how it works?
Issue Number: Close #18233
What's Changed:
In previous work #16239, we introduced the
busy_on_applystate to indicatewhether a
Peeris pending the application of pending Raft logs upon restart.However, this approach misses a corner case: if the
Peerquickly enters thehibernatestate after restarting, thebusy_on_applystate may not be updatedin a timely manner. This results in the Node failed to update the count of pending
applying regions, continuously reporting an incorrect
is_busy == truestate to PD.Consequently, this can slow down the rolling-restart progress more than expected.
Therefore, this PR addresses this issue by updating the applied state in
on_apply_res.Related changes
pingcap/docs/pingcap/docs-cn:Check List
Tests
Side effects
Release note