raftstore: optimize entry cache eviction to prevent append rejections under memory pressure (#17488) #17792

Merged
ti-chi-bot[bot] merged 2 commits into tikv:release-7.5 from ti-chi-bot:cherry-pick-17488-to-release-7.5
Dec 12, 2024

Conversation

@ti-chi-bot
Member

This is an automated cherry-pick of #17488

What is changed and how it works?

Issue Number: Close #17392, #17537

What's Changed:

This PR addresses the issue where, under memory pressure, a follower  
directly rejects append requests instead of first attempting to free  
up memory by evicting the entry cache.  

One potential solution is for the follower, before rejecting append  
requests, to notify other peers on the same node (including leaders  
and followers) to evict their entry caches. This would require a  
global module that coordinates peers. When many followers on the  
same node receive append requests under memory pressure, they would  
notify the module, which would then send messages to a sufficient  
number of peers to trigger entry cache eviction and report the  
results back to the followers. This module could also be triggered  
by a leader on the same node when it's receiving write requests  
under memory pressure. However, this solution is a bit complex and  
has a low ROI.  

A more practical solution is to optimize the conditions for  
triggering entry cache eviction, so it occurs earlier, avoiding  
append rejection.  

Why doesn't the current version trigger cache eviction earlier?  

Currently, entry cache eviction is checked either on every  
`raft_log_gc_tick_interval` tick (default 3 seconds) or during  
`handle_raft_committed_entries`. Eviction is triggered when both of  
the following conditions hold:  
1. Total memory usage has reached the high water mark.  
2. Entry cache memory usage has reached the  
`evict_cache_on_memory_ratio` (default 0.1, not visible to the  
customer).  

Append rejection, on the other hand, is triggered when both of the  
following conditions hold:  
1. Total memory usage has reached the high water mark.  
2. The total memory usage of the entry cache, raft messages, and  
applying entries has reached the `reject_messages_on_memory_ratio`  
(default 0.2).  
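
The two sets of conditions above can be sketched as a pair of predicates. This is a minimal illustration and not TiKV's actual code: the `MemorySnapshot` struct, its field names, and the function names are hypothetical, and the sketch assumes each ratio is measured against total memory usage, which the description does not state explicitly.

```rust
/// Hypothetical snapshot of the memory counters mentioned above (bytes).
/// Field names are illustrative and are not TiKV's actual types.
#[derive(Clone, Copy, Debug)]
pub struct MemorySnapshot {
    pub total_usage: u64,
    pub high_water: u64,
    pub entry_cache: u64,
    pub raft_messages: u64,
    pub applying_entries: u64,
}

// Defaults quoted in the description.
pub const EVICT_CACHE_ON_MEMORY_RATIO: f64 = 0.1;
pub const REJECT_MESSAGES_ON_MEMORY_RATIO: f64 = 0.2;

/// Condition 1 for both checks: total usage has reached the high water mark.
pub fn reaches_high_water(m: &MemorySnapshot) -> bool {
    m.total_usage >= m.high_water
}

/// Eviction: high water reached AND the entry cache alone exceeds its ratio.
/// Assumption: the ratio is relative to total memory usage.
pub fn should_evict_entry_cache(m: &MemorySnapshot) -> bool {
    reaches_high_water(m)
        && m.entry_cache as f64 >= m.total_usage as f64 * EVICT_CACHE_ON_MEMORY_RATIO
}

/// Rejection: high water reached AND entry cache + raft messages +
/// applying entries together exceed their ratio.
pub fn should_reject_append(m: &MemorySnapshot) -> bool {
    let raft_related = m.entry_cache + m.raft_messages + m.applying_entries;
    reaches_high_water(m)
        && raft_related as f64 >= m.total_usage as f64 * REJECT_MESSAGES_ON_MEMORY_RATIO
}
```

In this sketch both predicates share the same first condition, so neither can return true while total usage is below the high water mark.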

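The issue is that both checks share the same first condition. Once  
total memory usage reaches the high water mark, the second condition  
for append rejection may already be met, so appends can be rejected  
before cache eviction has had a chance to run.  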

The solution proposed in this PR is to modify the first condition  
for cache eviction. Instead of waiting for memory usage to fully  
reach the high water mark, eviction will be triggered when memory  
usage is **near** the high water mark.  
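
A minimal sketch of the relaxed eviction condition follows. The names are hypothetical, and the `NEAR_HIGH_WATER_RATIO` value of 0.9 is an assumption made for illustration only, since the description does not say how "near" is defined. As above, the sketch assumes the cache ratio is measured against total memory usage.

```rust
/// Hypothetical fraction of the high water mark that counts as "near".
/// The description does not state the real value; 0.9 is an assumption.
pub const NEAR_HIGH_WATER_RATIO: f64 = 0.9;

/// Default quoted in the description.
pub const EVICT_CACHE_ON_MEMORY_RATIO: f64 = 0.1;

/// Original first condition, still used for append rejection.
pub fn reaches_high_water(total_usage: u64, high_water: u64) -> bool {
    total_usage >= high_water
}

/// Relaxed first condition used for eviction in this sketch.
pub fn near_high_water(total_usage: u64, high_water: u64) -> bool {
    total_usage as f64 >= high_water as f64 * NEAR_HIGH_WATER_RATIO
}

/// Eviction fires once usage is near the high water mark and the entry
/// cache exceeds its ratio, which can happen before rejection is possible.
pub fn should_evict_entry_cache(total_usage: u64, high_water: u64, entry_cache: u64) -> bool {
    near_high_water(total_usage, high_water)
        && entry_cache as f64 >= total_usage as f64 * EVICT_CACHE_ON_MEMORY_RATIO
}
```

With a high water mark of 1000 units and usage at 950, `should_evict_entry_cache(950, 1000, 150)` holds while `reaches_high_water(950, 1000)` does not, so under these assumptions eviction gets a window in which to free memory before append rejection can trigger.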

This change should not introduce significant performance overhead  
because eviction is only triggered when memory usage is near the  
high water mark, and it helps prevent the more disruptive append  
rejections.  

Related changes

  • PR to update pingcap/docs/pingcap/docs-cn:
  • Need to cherry-pick to the release branch

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Release note

None

@ti-chi-bot ti-chi-bot added the contribution, dco-signoff: yes, ok-to-test, release-note-none, size/XL, and type/cherry-pick-for-release-7.5 labels Nov 8, 2024
@hbisheng
Member

@hhwyt Looks like there's a merge conflict. Could you fix it when you get a chance?

@hhwyt
Contributor

hhwyt commented Dec 10, 2024

> @hhwyt Looks like there's a merge conflict. Could you fix it when you get a chance?

Sure, I'll resolve it.

… under memory pressure (tikv#17488)

Signed-off-by: hhwyt <hhwyt1@gmail.com>
@hhwyt hhwyt force-pushed the cherry-pick-17488-to-release-7.5 branch from ad2a0ad to 0c65b42 Compare December 12, 2024 02:47
@hhwyt
Contributor

hhwyt commented Dec 12, 2024

/cc @glorv @hbisheng

@ti-chi-bot ti-chi-bot bot requested review from glorv and hbisheng December 12, 2024 02:55
@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm and approved labels Dec 12, 2024
@ti-chi-bot
Contributor

ti-chi-bot bot commented Dec 12, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: glorv, hbisheng

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the lgtm label and removed the needs-1-more-lgtm label Dec 12, 2024
@ti-chi-bot
Contributor

ti-chi-bot bot commented Dec 12, 2024

[LGTM Timeline notifier]

Timeline:

  • 2024-12-12 02:55:13.88388366 +0000 UTC m=+493503.972686206: ☑️ agreed by glorv.
  • 2024-12-12 02:56:17.379386048 +0000 UTC m=+493567.468188592: ☑️ agreed by hbisheng.

@ti-chi-bot
Contributor

ti-chi-bot bot commented Dec 12, 2024

@ti-chi-bot: Your PR was out of date, I have automatically updated it for you.

If a CI test fails, just re-trigger the failed test and the bot will merge the PR for you after CI passes.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@hhwyt
Contributor

hhwyt commented Dec 12, 2024

/retest

@ti-chi-bot ti-chi-bot bot merged commit 6a319ce into tikv:release-7.5 Dec 12, 2024
@glorv glorv deleted the cherry-pick-17488-to-release-7.5 branch December 13, 2024 07:18
Labels

approved cherry-pick-approved Cherry pick PR approved by release team. contribution This PR is from a community contributor. dco-signoff: yes Indicates the PR's author has signed the dco. lgtm ok-to-test Indicates a PR is ready to be tested. release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. type/cherry-pick-for-release-7.5 This PR is cherry-picked to release-7.5 from a source PR.

4 participants