Skip to content

raftstore: optimize entry cache eviction to prevent append rejections under memory pressure (#17488)#17793

Closed
ti-chi-bot wants to merge 2 commits intotikv:release-8.1from
ti-chi-bot:cherry-pick-17488-to-release-8.1
Closed

raftstore: optimize entry cache eviction to prevent append rejections under memory pressure (#17488)#17793
ti-chi-bot wants to merge 2 commits intotikv:release-8.1from
ti-chi-bot:cherry-pick-17488-to-release-8.1

Conversation

@ti-chi-bot
Copy link
Member

This is an automated cherry-pick of #17488

What is changed and how it works?

Issue Number: Close #17392, #17537

What's Changed:

This PR addresses the issue where, under memory pressure, a follower  
directly rejects append requests instead of first attempting to free  
up memory by evicting the entry cache.  

One potential solution is for the follower, before rejecting append  
requests, to notify other peers on the same node (including leaders  
and followers) to evict their entry caches. This would require a  
global module that coordinates peers. When many followers on the  
same node receive append requests under memory pressure, they would  
notify the module, which would then send messages to a sufficient  
number of peers to trigger entry cache eviction and report the  
results back to the followers. This module could also be triggered  
by a leader on the same node when it's receiving write requests  
under memory pressure. However, this solution is a bit complex and  
has a low ROI.  

A more practical solution is to optimize the conditions for  
triggering entry cache eviction, so it occurs earlier, avoiding  
append rejection.  

Why doesn't the current version trigger cache eviction earlier?  

Currently, entry cache eviction is checked either by the  
`raft_log_gc_tick_interval` (default 3 seconds) or during  
`handle_raft_committed_entries`. The trigger conditions for eviction  
are:  
1. Total memory usage has reached the high water mark.  
2. Entry cache memory usage has reached the  
`evict_cache_on_memory_ratio` (default 0.1, not visible to the  
customer).  

On the other hand, append rejection is triggered when:  
1. Total memory usage has reached the high water mark.  
2. The total memory usage of the entry cache, raft messages, and  
applying entries has reached the `reject_messages_on_memory_ratio`  
(default 0.2).  

The issue is that when the first condition is met, the second  
condition for append rejection may be triggered earlier.  

The solution proposed in this PR is to modify the first condition  
for cache eviction. Instead of waiting for memory usage to fully  
reach the high water mark, eviction will be triggered when memory  
usage is **near** the high water mark.  

This change should not introduce significant performance overhead  
because eviction is only triggered when memory usage is near the  
high water mark, and it helps prevent the more disruptive append  
rejections.  

Related changes

  • PR to update pingcap/docs/pingcap/docs-cn:
  • Need to cherry-pick to the release branch

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Release note

None

hhwyt added 2 commits November 8, 2024 07:49
… under memory pressure

Signed-off-by: hhwyt <hhwyt1@gmail.com>
…ntries

Non-dangle entries are copied from EntryCache, so their memory usage should also be included in the accounting.

Signed-off-by: hhwyt <hhwyt1@gmail.com>
@ti-chi-bot ti-chi-bot added contribution This PR is from a community contributor. dco-signoff: yes Indicates the PR's author has signed the dco. ok-to-test Indicates a PR is ready to be tested. release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. type/cherry-pick-for-release-8.1 This PR is cherry-picked to release-8.1 from a source PR. labels Nov 8, 2024
@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Nov 8, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign ninglin-p for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Dec 1, 2024

@ti-chi-bot: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-unit-test 8387b66 link true /test pull-unit-test

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@ti-chi-bot ti-chi-bot bot added do-not-merge/cherry-pick-not-approved and removed cherry-pick-approved Cherry pick PR approved by release team. labels Dec 13, 2024
@ti-chi-bot ti-chi-bot bot added cherry-pick-approved Cherry pick PR approved by release team. and removed do-not-merge/cherry-pick-not-approved labels Dec 26, 2024
@ti-chi-bot ti-chi-bot bot removed the cherry-pick-approved Cherry pick PR approved by release team. label Feb 14, 2025
@ti-chi-bot ti-chi-bot closed this Feb 14, 2025
@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Feb 14, 2025

This cherry pick PR is for a release branch and has not yet been approved by triage owners.
Adding the do-not-merge/cherry-pick-not-approved label.

To merge this cherry pick:

  1. It must be approved by the approvers firstly.
  2. AFTER it has been approved by approvers, please wait for the cherry-pick merging approval from triage owners.
Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contribution This PR is from a community contributor. dco-signoff: yes Indicates the PR's author has signed the dco. do-not-merge/cherry-pick-not-approved ok-to-test Indicates a PR is ready to be tested. release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. type/cherry-pick-for-release-8.1 This PR is cherry-picked to release-8.1 from a source PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants