concurrency_manager: check update_max_ts against a limit by ekexium · Pull Request #17917 · tikv/tikv

ekexium · 2024-12-02T13:39:27Z

What is changed and how it works?

Issue Number: close #17916

What's Changed:

concurrency_manager: add safety boundary for max_ts updates

Add `max_ts_limit` to prevent unreasonable timestamp updates. The limit is synchronized with the PD timestamp periodically. Configure via max_ts_allowance_secs and max_ts_sync_interval_secs.

Updates from PD bypass this limit.
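
To make the description above concrete, here is a minimal sketch of a limit-checked max_ts update. It is not the code from this PR: `MaxTsTracker`, `InvalidMaxTsUpdate`, and `force_update_from_pd` are hypothetical names, timestamps are treated as opaque `u64` values, and the allowance is assumed to be expressed in the same unit as the timestamp.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical error describing a rejected update.
#[derive(Debug, PartialEq, Eq)]
pub struct InvalidMaxTsUpdate {
    pub attempted_ts: u64,
    pub limit: u64,
}

/// Simplified stand-in for the concurrency manager (assumption: plain u64 timestamps).
pub struct MaxTsTracker {
    max_ts: AtomicU64,
    /// 0 means "no limit has been set yet", so the check is skipped.
    max_ts_limit: AtomicU64,
}

impl MaxTsTracker {
    pub fn new() -> Self {
        Self {
            max_ts: AtomicU64::new(0),
            max_ts_limit: AtomicU64::new(0),
        }
    }

    /// Called by a periodic sync task: limit = PD timestamp + allowance.
    pub fn set_max_ts_limit(&self, pd_ts: u64, allowance: u64) {
        self.max_ts_limit
            .store(pd_ts.saturating_add(allowance), Ordering::SeqCst);
    }

    /// Updates that originate from requests are checked against the limit.
    pub fn update_max_ts(&self, new_ts: u64) -> Result<(), InvalidMaxTsUpdate> {
        let limit = self.max_ts_limit.load(Ordering::SeqCst);
        if limit > 0 && new_ts > limit {
            return Err(InvalidMaxTsUpdate { attempted_ts: new_ts, limit });
        }
        // max_ts only ever moves forward.
        self.max_ts.fetch_max(new_ts, Ordering::SeqCst);
        Ok(())
    }

    /// Updates that come from PD itself bypass the limit (hypothetical method name).
    pub fn force_update_from_pd(&self, pd_ts: u64) {
        self.max_ts.fetch_max(pd_ts, Ordering::SeqCst);
    }

    pub fn max_ts(&self) -> u64 {
        self.max_ts.load(Ordering::SeqCst)
    }
}
```

With a limit of 1050 (PD timestamp 1000 plus allowance 50), an update to 1050 is accepted and 1051 is rejected, while a PD-originated update to 2000 still goes through.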

Metric:

Benchmark of update_max_ts:
master

update_max_ts           time:   [3.7480 ns 3.7481 ns 3.7483 ns]

This PR

update_max_ts           time:   [4.1907 ns 4.1909 ns 4.1911 ns]

Read operations typically take hundreds of microseconds, so the added ~0.44 ns per call (about 12% on this micro-benchmark) should have little practical impact.

Related changes

PR to update pingcap/docs/pingcap/docs-cn:
Need to cherry-pick to the release branch

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No code

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Release note

Introduce a max-ts checker.
Introduce config items: storage.max-ts-drift-allowance, storage.max-ts-sync-interval, and storage.action-on-invalid-max-ts.

ekexium · 2024-12-03T09:04:00Z

src/storage/errors.rs

It could be better to introduce a new error type, but it is inappropriate for cherry-picking to older versions.

Signed-off-by: ekexium <eke@fastmail.com>

components/concurrency_manager/src/lib.rs

cfzjywxk · 2024-12-04T01:57:29Z

components/concurrency_manager/src/lib.rs

+        let limit = self.max_ts_limit.load(Ordering::SeqCst);
+        if limit > 0 && new_ts > limit {
+            if self.panic_on_invalid_max_ts {
+                panic!(


If we panic here directly, there would not be enough information about the KV request. Is there a way to print the related request information?

Passing context information into the concurrency manager is not feasible, so printing related information would require us to panic on the caller side or in upper layers. There are currently 17 callers; even if we address them individually, we cannot guarantee consistent panic handling for future code changes. What do you think?

One approach is to pass a tracker to the update_max_ts method (or just get it via with_tracker, as elsewhere). The tracker contains some execution context information, and if a panic occurs, the related tracker information can be printed for debugging.

I explored the implementation but adding the required boilerplate code wouldn't be worth it for this error message. Since most cases only need the command type and timestamps, I opted to pass a string instead, which also makes the PR more suitable for cherry-picking to older versions.

After this change, TiKV's availability depends on one more factor: fetching timestamps from PD. If there is a network partition between TiKV and the PD leader for a period, update_max_ts could be rejected continuously, introducing a potential stability risk.

This is still a challenging trade-off between stability risk and correctness risk, difficult to decide.

BTW, if the current TiKV is isolated from the PD leader for a period but communication between regions remains normal, what is TiKV's existing behavior?

We can implement a failsafe: The max_ts check will be suspended after consecutive TSO fetch failures (defined by time period or failure count) until the limit is updated.
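
A rough sketch of the failure-count variant of this failsafe is shown below. It is only an illustration under assumptions: `LimitWithFailsafe`, `on_tso_fetched`, `on_tso_fetch_failed`, and the `max_failures` threshold are hypothetical names, and a limit of 0 is used to mean "check suspended".

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};

/// Hypothetical limit holder that suspends the check after repeated TSO fetch failures.
pub struct LimitWithFailsafe {
    /// 0 means the check is disabled (never set, or suspended by the failsafe).
    limit: AtomicU64,
    consecutive_failures: AtomicU32,
    /// Assumed threshold: number of consecutive failures that triggers suspension.
    max_failures: u32,
}

impl LimitWithFailsafe {
    pub fn new(max_failures: u32) -> Self {
        Self {
            limit: AtomicU64::new(0),
            consecutive_failures: AtomicU32::new(0),
            max_failures,
        }
    }

    /// A successful TSO fetch resets the failure counter and refreshes the limit.
    pub fn on_tso_fetched(&self, pd_ts: u64, allowance: u64) {
        self.consecutive_failures.store(0, Ordering::SeqCst);
        self.limit
            .store(pd_ts.saturating_add(allowance), Ordering::SeqCst);
    }

    /// A failed fetch bumps the counter; past the threshold the check is suspended
    /// until the next successful fetch sets a fresh limit.
    pub fn on_tso_fetch_failed(&self) {
        let failures = self.consecutive_failures.fetch_add(1, Ordering::SeqCst) + 1;
        if failures >= self.max_failures {
            self.limit.store(0, Ordering::SeqCst);
        }
    }

    /// Returns true when the update may proceed.
    pub fn is_allowed(&self, new_ts: u64) -> bool {
        let limit = self.limit.load(Ordering::SeqCst);
        limit == 0 || new_ts <= limit
    }
}
```

The trade-off discussed in this thread is visible here: while the check is suspended, availability is preserved but an unreasonable timestamp would no longer be caught.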

components/concurrency_manager/src/lib.rs

Signed-off-by: ekexium <eke@fastmail.com>

cfzjywxk · 2024-12-04T08:40:01Z

src/storage/config.rs

    pub txn_status_cache_capacity: usize,
    pub memory_quota: ReadableSize,
+    /// Maximum max_ts deviation allowed from PD timestamp (in seconds)
+    #[online_config(skip)]


These configurations need to be changeable online.

Making the sync interval dynamically adjustable would add unnecessary complexity, while I've already made the other two parameters configurable at runtime.

Signed-off-by: ekexium <eke@fastmail.com>

MyonKeminta · 2024-12-05T07:10:28Z

components/concurrency_manager/src/lib.rs

+    pub fn update_max_ts(
+        &self,
+        new_ts: TimeStamp,
+        source: Option<String>,


Why not accept &str? Passing "...".to_owned() looks like it could introduce additional overhead.

Ah... I see. Perhaps we can have an enum type like UpdateMaxSource that carries the important fields of the request or operation, and format it into a string only when necessary.

One problem I see is that this approach somewhat breaks a design principle: we impose the knowledge of users or callers on the concurrency manager, by defining the enum. It could bring some performance gains. Do you think it's worth it?

🤔
How about:

fn update_max_ts(&self, new_ts: TimeStamp, source: impl Display)

So the enum type can be defined outside the concurrency manager module(?)

And then it doesn't need to be a single type (e.g., CDC can have its source representation in cdc module, while txn related requests can have an enum defined in the tikv module)

Using the generic arguments seems a good idea.
Will format_args be enough so we don't have to define types everywhere?
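
The `impl Display` idea and the `format_args` question can be illustrated with a small sketch. This is an assumption-laden example, not the merged code: `Checker` and `TxnSource` are hypothetical, and timestamps are plain `u64`. The point is that the source is only formatted on the error path, and that `format_args!` produces a value implementing `Display`, so callers do not have to define a type.

```rust
use std::fmt::{self, Display};
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical checker with a fixed limit, kept minimal for the example.
pub struct Checker {
    max_ts: AtomicU64,
    limit: AtomicU64,
}

impl Checker {
    pub fn new(limit: u64) -> Self {
        Self {
            max_ts: AtomicU64::new(0),
            limit: AtomicU64::new(limit),
        }
    }

    /// `source` is generic, so the checker needs no knowledge of its callers.
    pub fn update_max_ts(&self, new_ts: u64, source: impl Display) -> Result<(), String> {
        let limit = self.limit.load(Ordering::SeqCst);
        if limit > 0 && new_ts > limit {
            // The source is turned into a string only when the check fails.
            return Err(format!(
                "invalid max_ts update: ts={}, limit={}, source={}",
                new_ts, limit, source
            ));
        }
        self.max_ts.fetch_max(new_ts, Ordering::SeqCst);
        Ok(())
    }
}

/// Hypothetical source type that a caller module could define for itself.
pub enum TxnSource {
    Prewrite { start_ts: u64 },
    CheckTxnStatus { start_ts: u64 },
}

impl Display for TxnSource {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            TxnSource::Prewrite { start_ts } => write!(f, "prewrite start_ts={}", start_ts),
            TxnSource::CheckTxnStatus { start_ts } => {
                write!(f, "check_txn_status start_ts={}", start_ts)
            }
        }
    }
}
```

A caller can pass a `&str`, a value such as `TxnSource::Prewrite { start_ts: 199 }`, or `format_args!("cdc region {}", region_id)`; in the last case no `String` is allocated unless the update is rejected.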

MyonKeminta · 2024-12-05T07:23:30Z

components/concurrency_manager/src/lib.rs

+        }
+        let new_ts = new_ts.into_inner();
+        let limit = self.max_ts_limit.load(Ordering::SeqCst);
+        if limit > 0 && new_ts > limit {


Should we consider the case where PD and TiKV are network-isolated and the limit is not updated in time? How about adding an additional gap based on the time elapsed since the last successful set_max_ts_limit call? We can first check new_ts > limit as the fast path, and only when new_ts does exceed the limit go on to check new_ts > limit + gap, so that the overhead of getting the system time is avoided in most cases.

Yeah ~~already added just as your suggestion 😁~~ I used a different approach that sets a valid lifetime of each limit, which is more strict but also a bit more complex 🤔.
PTAL
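
For reference, the reviewer's fast-path/slow-path suggestion (not the lifetime-based approach the author says was adopted) could look roughly like the sketch below. Everything here is an assumption for illustration: `ElasticLimit` is a hypothetical type, timestamps are treated as milliseconds, and the clock is injected as a closure so that it is read only on the slow path.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;
use std::time::Instant;

/// Hypothetical limit that grows with the time elapsed since the last sync.
pub struct ElasticLimit {
    /// Limit in milliseconds; 0 means "not set", so the check is skipped.
    limit_ms: AtomicU64,
    /// When the limit was last refreshed successfully.
    last_sync: Mutex<Option<Instant>>,
}

impl ElasticLimit {
    pub fn new() -> Self {
        Self {
            limit_ms: AtomicU64::new(0),
            last_sync: Mutex::new(None),
        }
    }

    pub fn set_max_ts_limit(&self, limit_ms: u64, now: Instant) {
        *self.last_sync.lock().unwrap() = Some(now);
        self.limit_ms.store(limit_ms, Ordering::SeqCst);
    }

    /// `now` is a closure so the clock is only consulted on the slow path.
    pub fn exceeds_limit(&self, new_ts_ms: u64, now: impl FnOnce() -> Instant) -> bool {
        let limit = self.limit_ms.load(Ordering::SeqCst);
        // Fast path: a single atomic load and comparison.
        if limit == 0 || new_ts_ms <= limit {
            return false;
        }
        // Slow path: widen the limit by the time elapsed since the last sync.
        let gap_ms = match *self.last_sync.lock().unwrap() {
            Some(synced_at) => now().saturating_duration_since(synced_at).as_millis() as u64,
            None => 0,
        };
        new_ts_ms > limit.saturating_add(gap_ms)
    }
}
```

In production code the closure would typically be `Instant::now`; the two fields are updated separately here, so a real implementation would need to consider the race between them.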

Signed-off-by: ekexium <eke@fastmail.com>

ti-chi-bot · 2024-12-24T05:26:54Z

In response to a cherrypick label: new pull request created to branch release-8.5: #18047.

ti-chi-bot · 2024-12-24T05:26:54Z

In response to a cherrypick label: new pull request created to branch release-7.5: #18049.

ti-chi-bot · 2024-12-24T05:26:54Z

In response to a cherrypick label: new pull request created to branch release-7.1: #18048.

ti-chi-bot · 2024-12-24T05:26:56Z

In response to a cherrypick label: new pull request created to branch release-6.5: #18050.

ti-chi-bot · 2024-12-24T05:26:56Z

In response to a cherrypick label: new pull request created to branch release-8.1: #18051.

close tikv#17916 Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io> remove config; log instead of panic Signed-off-by: ekexium <eke@fastmail.com>

Signed-off-by: ekexium <eke@fastmail.com>

…8047) close #17916 concurrency_manager: add safety boundary for max_ts updates For this cherry-pick: only log error, do not return error or panic. Add `max_ts_limit` to prevent unreasonable timestamp updates. The limit is synchronized with PD timestamp periodically. Updates from PD bypass this limit. Signed-off-by: ekexium <eke@fastmail.com> Co-authored-by: ekexium <eke@fastmail.com>

…8051) close #17916 concurrency_manager: add safety boundary for max_ts updates For this cherry-pick: only log error, do not return error or panic. Add `max_ts_limit` to prevent unreasonable timestamp updates. The limit is synchronized with PD timestamp periodically. Updates from PD bypass this limit. Signed-off-by: ekexium <eke@fastmail.com> Co-authored-by: ekexium <eke@fastmail.com> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>

…8048) close #17916 concurrency_manager: add safety boundary for max_ts updates For this cherry-pick: only log error, do not return error or panic. Add `max_ts_limit` to prevent unreasonable timestamp updates. The limit is synchronized with PD timestamp periodically. Updates from PD bypass this limit. Signed-off-by: ekexium <eke@fastmail.com> Co-authored-by: ekexium <eke@fastmail.com> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>

…8049) close #17916 concurrency_manager: add safety boundary for max_ts updates For this cherry-pick: only log error, do not return error or panic. Add `max_ts_limit` to prevent unreasonable timestamp updates. The limit is synchronized with PD timestamp periodically. Updates from PD bypass this limit. Signed-off-by: ekexium <eke@fastmail.com> Co-authored-by: ekexium <eke@fastmail.com> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>

…8050) close #17916 concurrency_manager: add safety boundary for max_ts updates For this cherry-pick: only log error, do not return error or panic. Add `max_ts_limit` to prevent unreasonable timestamp updates. The limit is synchronized with PD timestamp periodically. Updates from PD bypass this limit. Signed-off-by: ekexium <eke@fastmail.com> Co-authored-by: ekexium <eke@fastmail.com> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>

* concurrency_manager: check update_max_ts against a limit (tikv#17917)
  close tikv#17916
  concurrency_manager: add safety boundary for max_ts updates
  Add `max_ts_limit` to prevent unreasonable timestamp updates. The limit is synchronized with PD timestamp periodically. Configure via max_ts_allowance_secs and max_ts_sync_interval_secs. Updates from PD bypass this limit.
  Signed-off-by: ekexium <eke@fastmail.com>
  Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
  Signed-off-by: ekexium <eke@fastmail.com>
* concurrency_manager: double check via PD TSO before reporting error of invalid max-ts update (tikv#18057)
  close tikv#18055
  concurrency_manager: double check via PD TSO before reporting error of invalid max-ts update
  Signed-off-by: ekexium <eke@fastmail.com>
* concurrency_manager: make max-ts checker more robust (tikv#18080)
  ref tikv#18055
  When validating max-ts updates, do not report error or panic unless confirmed by PD TSO. This reduces both false positive and false negative cases.
  Signed-off-by: ekexium <eke@fastmail.com>
  Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
  Signed-off-by: ekexium <eke@fastmail.com>
* config: rename config items for max-ts checker (tikv#18118)
  ref tikv#17916
  config: rename config items for max-ts checker
  Signed-off-by: ekexium <eke@fastmail.com>
* concurrency-manager: do not assert in concurrency manager (tikv#18329)
  ref tikv#17916
  Do not assert in concurrency manager.
  Signed-off-by: ekexium <eke@fastmail.com>
* delete unexpected files from cherry-picking
  Signed-off-by: ekexium <eke@fastmail.com>
---------
Signed-off-by: ekexium <eke@fastmail.com>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>

ekexium force-pushed the feat-max-ts-checker branch 4 times, most recently from ae2f39c to 2717f55 Compare December 3, 2024 08:16

ti-chi-bot bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 3, 2024

ekexium requested a review from you06 December 3, 2024 08:18

ekexium force-pushed the feat-max-ts-checker branch from 2717f55 to 55b44cc Compare December 3, 2024 08:27

ekexium requested a review from cfzjywxk December 3, 2024 08:28

ekexium commented Dec 3, 2024

feat: check update_max_ts against a limit

c1ec81f

Signed-off-by: ekexium <eke@fastmail.com>

ekexium force-pushed the feat-max-ts-checker branch from 55b44cc to c1ec81f Compare December 3, 2024 10:32

cfzjywxk requested a review from MyonKeminta December 4, 2024 01:51

cfzjywxk reviewed Dec 4, 2024

refactor: remove update_max_ts_from_pd

80d46bc

Signed-off-by: ekexium <eke@fastmail.com>

cfzjywxk reviewed Dec 4, 2024

ekexium added 3 commits December 4, 2024 18:58

pass source when updating max_ts

ab017d3

Signed-off-by: ekexium <eke@fastmail.com>

allow dynamic change of max_ts_allowance and panic_on_invalid_max_ts

0834149

Signed-off-by: ekexium <eke@fastmail.com>

failsafe and benchmark

7f2691f

Signed-off-by: ekexium <eke@fastmail.com>

ekexium force-pushed the feat-max-ts-checker branch from 3f16766 to 7f2691f Compare December 5, 2024 07:22

MyonKeminta reviewed Dec 5, 2024

opt: save an atomic load in the common case

677688f

Signed-off-by: ekexium <eke@fastmail.com>

ekexium force-pushed the feat-max-ts-checker branch from 5d84906 to 677688f Compare December 5, 2024 08:18

chore: upgrade pprof

452a828

Signed-off-by: ekexium <eke@fastmail.com>

This was referenced Dec 24, 2024

concurrency_manager: check update_max_ts against a limit (#17917) #18047

Merged

concurrency_manager: check update_max_ts against a limit (#17917) #18048

Merged

ti-chi-bot mentioned this pull request Dec 24, 2024

concurrency_manager: check update_max_ts against a limit (#17917) #18049

Merged

9 tasks

This was referenced Dec 24, 2024

concurrency_manager: check update_max_ts against a limit (#17917) #18050

Merged

concurrency_manager: check update_max_ts against a limit (#17917) #18051

Merged

ekexium added a commit to ti-chi-bot/tikv that referenced this pull request Dec 24, 2024

cherry pick tikv#17917: remove configs and only log errors

b08db20

Signed-off-by: ekexium <eke@fastmail.com>

ekexium added a commit to ti-chi-bot/tikv that referenced this pull request Dec 24, 2024

cherry pick tikv#17917: remove configs and only log errors

2d40b5f

Signed-off-by: ekexium <eke@fastmail.com>

ekexium added a commit to ti-chi-bot/tikv that referenced this pull request Dec 24, 2024

cherry pick tikv#17917: remove configs and only log errors

a7ced27

Signed-off-by: ekexium <eke@fastmail.com>

ekexium added a commit to ti-chi-bot/tikv that referenced this pull request Dec 24, 2024

cherry pick tikv#17917: remove configs and only log errors

c2bcfa2

Signed-off-by: ekexium <eke@fastmail.com>

This was referenced Dec 25, 2024

doc: configs of max-ts checker pingcap/docs-cn#19380

Merged

more than one of tikv panic when injection pd leader timeoffset (ahead 5mins) #18055

Closed

qiancai mentioned this pull request Feb 12, 2025

doc: configs of max-ts checker pingcap/docs#20250

Merged

17 tasks

5 participants
