In-memory Engine: optimize auto evict to reduce latency spike by overvenus · Pull Request #18130 · tikv/tikv

overvenus · 2025-01-16T08:29:42Z

What is changed and how it works?

Issue Number: Close #18093

What's Changed:

Tested with read-only workload.

Left: This PR
Right: nightly 2025-01-13

Optimize auto evict to minimize coprocessor request latency by avoiding
periodic eviction and loading of regions with high MVCC amplification.

* Do not evict cached regions when memory has not reached the
  `stop-load-threshold`.
* Replace MVCC amplification-based auto evict strategy with a simple
  moving average of coprocessor requests. This is because
  `RegionWriteCfCopDetail` only reflects MVCC amplification in IME,
  not in RocksDB, making low MVCC amplification a poor indicator for eviction.
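
The two bullets above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `SimpleMovingAverage` and `should_consider_eviction` are hypothetical names, and the fixed-size window stands in for TiKV's `Smoother`, whose internals are not shown here.

```rust
use std::collections::VecDeque;

/// Hypothetical fixed-window simple moving average; not TiKV's `Smoother`.
struct SimpleMovingAverage {
    window: usize,
    samples: VecDeque<f64>,
    sum: f64,
}

impl SimpleMovingAverage {
    fn new(window: usize) -> Self {
        assert!(window > 0, "window must be non-zero");
        Self {
            window,
            samples: VecDeque::with_capacity(window),
            sum: 0.0,
        }
    }

    /// Records one sample, dropping the oldest once the window is full.
    fn observe(&mut self, value: f64) {
        if self.samples.len() == self.window {
            if let Some(oldest) = self.samples.pop_front() {
                self.sum -= oldest;
            }
        }
        self.samples.push_back(value);
        self.sum += value;
    }

    /// Average of the retained samples, or 0.0 when nothing was recorded.
    fn avg(&self) -> f64 {
        if self.samples.is_empty() {
            0.0
        } else {
            self.sum / self.samples.len() as f64
        }
    }
}

/// Auto evict is only considered once memory usage has reached the
/// stop-load threshold (both arguments are in bytes in this sketch).
fn should_consider_eviction(memory_usage: usize, stop_load_threshold: usize) -> bool {
    memory_usage >= stop_load_threshold
}
```

Feeding the per-interval coprocessor request count of a region into `observe` yields a smoothed rate that does not depend on MVCC amplification, which is the property the second bullet asks for.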

Related changes

Need to cherry-pick to the release branch

Check List

Tests

Unit test

Release note

Optimize auto evict to minimize coprocessor request latency by avoiding periodic eviction and loading of regions with high MVCC amplification.

Signed-off-by: Neil Shen <overvenus@gmail.com>

glorv

Mostly LGTM, but I think request count alone is a very poor metric for region eviction.

glorv · 2025-01-17T06:47:07Z

components/in_memory_engine/src/region_stats.rs

+                let is_cop_requests_low = crs.sma_cop_requests_avg
+                    <= avg_cop_requests / SMA_COP_REQUEST_AVG_FILTER_FACTOR;
+
+                let is_cop_requests_reliable =


Does this mean that a region with no requests can be evicted at all?

`is_cop_requests_low` means that a region with only a few requests becomes an eviction candidate.
`is_cop_requests_reliable` means that an eviction candidate can only be evicted after IME has sampled it three times (30 minutes by default).

Could you update the annotations with the above comments or incorporate them into the annotations? This would make the strategy for choosing eviction candidates clearer and easier to understand.

Comments are added in f263f5e
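
The candidate selection discussed in this thread might look like the sketch below. It is written under stated assumptions: the value of `SMA_COP_REQUEST_AVG_FILTER_FACTOR` is not shown in the thread, so `10.0` is a placeholder; `MIN_RELIABLE_SAMPLES`, `CachedRegionStat`, `sample_count`, and `eviction_candidates` are hypothetical names; and `avg_cop_requests` is assumed to be the mean across all cached regions.

```rust
/// Placeholder value; the real constant's value is not shown in the thread.
const SMA_COP_REQUEST_AVG_FILTER_FACTOR: f64 = 10.0;
/// The thread states that a region must have been sampled three times.
const MIN_RELIABLE_SAMPLES: usize = 3;

/// Hypothetical per-region statistics for a cached region.
struct CachedRegionStat {
    region_id: u64,
    sma_cop_requests_avg: f64,
    sample_count: usize,
}

/// Returns the ids of regions that are both rarely requested and sampled
/// often enough for their moving average to be trusted.
fn eviction_candidates(stats: &[CachedRegionStat]) -> Vec<u64> {
    if stats.is_empty() {
        return Vec::new();
    }
    // Assumption: the baseline is the mean of all regions' smoothed rates.
    let avg_cop_requests =
        stats.iter().map(|s| s.sma_cop_requests_avg).sum::<f64>() / stats.len() as f64;

    stats
        .iter()
        .filter(|crs| {
            let is_cop_requests_low = crs.sma_cop_requests_avg
                <= avg_cop_requests / SMA_COP_REQUEST_AVG_FILTER_FACTOR;
            let is_cop_requests_reliable = crs.sample_count >= MIN_RELIABLE_SAMPLES;
            is_cop_requests_low && is_cop_requests_reliable
        })
        .map(|crs| crs.region_id)
        .collect()
}
```

In this sketch a region with zero requests still has to pass the reliability check, so a freshly loaded region is not evicted before its average has had time to settle.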

glorv · 2025-01-17T06:52:41Z

components/in_memory_engine/src/region_stats.rs

+        });
+    }
+
+    crss.sort_by(|a, b| {


Ideally, we should cache the regions with the highest `requests * (avg_kv_per_req_before_cache - avg_kv_per_req_after_cache)`, since this largely reflects the resource usage saved after loading. Thus, I think QPS or request_count is an even poorer metric here.

We can iteratively improve the eviction algorithm. After all, the master implementation is inaccurate, and this PR reduces the latency spikes, as the tests show.
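
For reference, the scoring idea proposed in this thread could be sketched as below. This is not part of the PR; `RegionLoadStat`, `estimated_saving`, and `rank_by_saving` are hypothetical names, and the sketch assumes the before/after averages are already available per region.

```rust
/// Hypothetical per-region inputs for the proposed score.
struct RegionLoadStat {
    region_id: u64,
    requests: f64,
    avg_kv_per_req_before_cache: f64,
    avg_kv_per_req_after_cache: f64,
}

impl RegionLoadStat {
    /// requests * (avg_kv_per_req_before_cache - avg_kv_per_req_after_cache):
    /// an estimate of the key-value scans saved by keeping the region cached.
    fn estimated_saving(&self) -> f64 {
        self.requests * (self.avg_kv_per_req_before_cache - self.avg_kv_per_req_after_cache)
    }
}

/// Orders region ids from the largest estimated saving to the smallest.
fn rank_by_saving(mut stats: Vec<RegionLoadStat>) -> Vec<u64> {
    stats.sort_by(|a, b| b.estimated_saving().total_cmp(&a.estimated_saving()));
    stats.into_iter().map(|s| s.region_id).collect()
}
```

With this score, a region that receives few requests but saves many scanned keys per request can outrank a busy region that gains little from caching, which request count alone cannot express.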

glorv · 2025-01-17T06:55:46Z

components/in_memory_engine/src/region_manager.rs

            in_gc: AtomicBool::new(source_meta.in_gc.load(Ordering::Relaxed)),
            is_written: AtomicBool::new(source_meta.is_written.load(Ordering::Relaxed)),
            evict_info: None,
+            average_cop_requests: Arc::new(Mutex::new(Default::default())),


So after a region split, all child regions' stats will be reset. It would be better to derive the stats from the parent region.
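
One way to carry the statistics over on split is sketched below. It is only an illustration of the suggestion: `CopRequestsStat`, `RegionMeta`, and `derive_child` are hypothetical, simplified stand-ins, and only the `average_cop_requests` field name comes from the diff above.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the smoothed coprocessor request statistics.
#[derive(Clone, Debug, Default, PartialEq)]
struct CopRequestsStat {
    avg: f64,
    sample_count: usize,
}

/// Simplified stand-in for the cached region metadata.
struct RegionMeta {
    region_id: u64,
    average_cop_requests: Arc<Mutex<CopRequestsStat>>,
}

impl RegionMeta {
    /// Builds a child region's metadata that starts from a copy of the
    /// parent's statistics instead of `Default::default()`.
    fn derive_child(&self, child_region_id: u64) -> RegionMeta {
        let inherited = self.average_cop_requests.lock().unwrap().clone();
        RegionMeta {
            region_id: child_region_id,
            // A fresh Arc, so parent and child evolve independently afterwards.
            average_cop_requests: Arc::new(Mutex::new(inherited)),
        }
    }
}
```

Whether each child should inherit the full parent rate or a share of it is a separate design question that this sketch does not settle.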

glorv · 2025-01-17T06:57:41Z

components/in_memory_engine/src/region_manager.rs

-#[derive(Debug)]
+/// Estimates the smoothed coprocessor request rate over the last hour using a
+/// simple moving average.
+pub(crate) type CopRequestsSMA = Smoother<f64, COP_REQUEST_SMA_RECORD_COUNT, ONE_HOUR_IN_SECS, 0>;


Suggested change

-pub(crate) type CopRequestsSMA = Smoother<f64, COP_REQUEST_SMA_RECORD_COUNT, ONE_HOUR_IN_SECS, 0>;
+pub(crate) type CopRequestsSma = Smoother<f64, COP_REQUEST_SMA_RECORD_COUNT, ONE_HOUR_IN_SECS, 0>;

Follow the Rust naming convention.

LykxSassinator

Overall LGTM

components/in_memory_engine/src/region_stats.rs

Signed-off-by: Neil Shen <overvenus@gmail.com>

ti-chi-bot · 2025-01-22T06:57:56Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: glorv, LykxSassinator

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [LykxSassinator,glorv]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot · 2025-01-22T06:57:59Z

[LGTM Timeline notifier]

Timeline:

2025-01-22 03:44:05.550732613 +0000 UTC m=+238772.881652012: ☑️ agreed by glorv.
2025-01-22 06:57:58.086839174 +0000 UTC m=+250405.417758573: ☑️ agreed by LykxSassinator.

LykxSassinator · 2025-01-22T06:59:05Z

/retest

ti-chi-bot · 2025-01-22T06:59:22Z

@overvenus: Your PR was out of date, I have automatically updated it for you.

If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

overvenus · 2025-01-23T02:59:58Z

/test pull-unit-test

ti-chi-bot · 2025-01-23T03:00:01Z

@overvenus: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test pull-unit-test

The following commands are available to trigger optional jobs:

/debug pull-integration-test

Use /test all to run the following jobs that were automatically triggered:

tikv/tikv/pull_unit_test

Details

In response to this:

/test pull-unit-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

overvenus · 2025-01-23T12:43:39Z

/test all

glorv · 2025-01-23T12:55:55Z

@overvenus there is a lint error

Signed-off-by: Neil Shen <overvenus@gmail.com>

…into ime/fix-auto-load-evict-bounce Signed-off-by: Neil Shen <overvenus@gmail.com>

overvenus · 2025-01-23T13:56:27Z

/merge

ti-chi-bot · 2025-01-23T13:56:30Z

@overvenus: We have migrated to builtin LGTM and approve plugins for reviewing.

👉 Please use /approve when you want approve this pull request.

The changes announcement: Proposal: Strengthen configuration change approval.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

ti-chi-bot · 2025-02-18T06:56:22Z

In response to a cherrypick label: new pull request created to branch release-8.5: #18228.

#18228)

close #18093

Optimize auto evict to minimize coprocessor request latency by avoiding periodic eviction and loading of regions with high MVCC amplification.

* Do not evict cached regions when memory has not reached the `stop-load-threshold`.
* Replace MVCC amplification-based auto evict strategy with a simple moving average of coprocessor requests. This is because `RegionWriteCfCopDetail` only reflects MVCC amplification in IME, not in RocksDB, making low MVCC amplification a poor indicator for eviction.

Signed-off-by: Neil Shen <overvenus@gmail.com>
Co-authored-by: Neil Shen <overvenus@gmail.com>

overvenus added 6 commits January 16, 2025 15:41

In-memory Engine: restrict InMemoryEngineConfigManager visibility

b1e9938

Signed-off-by: Neil Shen <overvenus@gmail.com>

In-memory Engine: do not evict non-idle high amplified regions

d52cfb7

Signed-off-by: Neil Shen <overvenus@gmail.com>

Evict regions based on SMA coprocessor requests

d1d3219

Signed-off-by: Neil Shen <overvenus@gmail.com>

Consider SMA unreliable period

aa1829d

Signed-off-by: Neil Shen <overvenus@gmail.com>

Only auto evict when reaching stop load

00b4bd7

Signed-off-by: Neil Shen <overvenus@gmail.com>

Clean up

100f7e6

Signed-off-by: Neil Shen <overvenus@gmail.com>

glorv reviewed Jan 17, 2025

LykxSassinator reviewed Jan 17, 2025

Address comments

f263f5e

Signed-off-by: Neil Shen <overvenus@gmail.com>

glorv approved these changes Jan 22, 2025

ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Jan 22, 2025

LykxSassinator approved these changes Jan 22, 2025

ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Jan 22, 2025

Merge branch 'master' into ime/fix-auto-load-evict-bounce

be6692a

Fix clippy warnings

90ae5dc

Signed-off-by: Neil Shen <overvenus@gmail.com>

Merge remote-tracking branch 'github/ime/fix-auto-load-evict-bounce' …

7cbb1be

…into ime/fix-auto-load-evict-bounce Signed-off-by: Neil Shen <overvenus@gmail.com>

ti-chi-bot bot merged commit 743fbd6 into tikv:master Jan 23, 2025
8 checks passed

ti-chi-bot bot added this to the Pool milestone Jan 23, 2025

overvenus added the needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. label Feb 18, 2025

ti-chi-bot mentioned this pull request Feb 18, 2025

In-memory Engine: optimize auto evict to reduce latency spike (#18130) #18228

Merged

1 task
