Skip to content

concurrency_manager: check update_max_ts against a limit#17917

Merged
ti-chi-bot[bot] merged 25 commits intotikv:masterfrom
ekexium:feat-max-ts-checker
Dec 23, 2024
Merged

concurrency_manager: check update_max_ts against a limit#17917
ti-chi-bot[bot] merged 25 commits intotikv:masterfrom
ekexium:feat-max-ts-checker

Conversation

@ekexium
Copy link
Contributor

@ekexium ekexium commented Dec 2, 2024

What is changed and how it works?

Issue Number: close #17916

What's Changed:

concurrency_manager: add safety boundary for max_ts updates

Add `max_ts_limit` to prevent unreasonable timestamp updates. The limit is 
synchronized with PD timestamp periodically. Configure via max_ts_allowance_secs
 and max_ts_sync_interval_secs.

Updates from PD bypass this limit.

Metric:
image

Benchmark of update_max_ts:
master

update_max_ts           time:   [3.7480 ns 3.7481 ns 3.7483 ns]

This PR

update_max_ts           time:   [4.1907 ns 4.1909 ns 4.1911 ns]

Read operations typically need hundreds of microseconds, so I suppose this regression has little impact.

Related changes

  • PR to update pingcap/docs/pingcap/docs-cn:
  • Need to cherry-pick to the release branch

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Release note

Introduce a max-ts checker.
Introduce config items: storage.max-ts-drift-allowance, storage.max-ts-sync-interval, and storage.action-on-invalid-max-ts.

@ti-chi-bot ti-chi-bot bot added dco-signoff: yes Indicates the PR's author has signed the dco. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Dec 2, 2024
@ekexium ekexium force-pushed the feat-max-ts-checker branch 4 times, most recently from ae2f39c to 2717f55 Compare December 3, 2024 08:16
@ti-chi-bot ti-chi-bot bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 3, 2024
@ekexium ekexium requested a review from you06 December 3, 2024 08:18
@ekexium ekexium force-pushed the feat-max-ts-checker branch from 2717f55 to 55b44cc Compare December 3, 2024 08:27
@ekexium ekexium requested a review from cfzjywxk December 3, 2024 08:28
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be better to introduce a new error type, but it is inappropriate for cherry-picking to older versions.

Signed-off-by: ekexium <eke@fastmail.com>
@ekexium ekexium force-pushed the feat-max-ts-checker branch from 55b44cc to c1ec81f Compare December 3, 2024 10:32
@cfzjywxk cfzjywxk requested a review from MyonKeminta December 4, 2024 01:51
let limit = self.max_ts_limit.load(Ordering::SeqCst);
if limit > 0 && new_ts > limit {
if self.panic_on_invalid_max_ts {
panic!(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There would be not enough information for the kv requests if panic here directly, is there a way to print the related request information?

Copy link
Contributor Author

@ekexium ekexium Dec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Passing context information into the concurrency manager is not feasible. So to print related information will require us to panic in the caller side or upper layers. There are 17 callers currently, even if we address them individually, we cannot guarantee consistent panic handling for future code changes. What do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One approach is to pass a tracker to the update_max_ts method(or just get it by with_tracker like elsewhere). The tracker contains some execution context information, and if a panic occurs, the related tracker information can be printed for debugging.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I explored the implementation but adding the required boilerplate code wouldn't be worth it for this error message. Since most cases only need the command type and timestamps, I opted to pass a string instead, which also makes the PR more suitable for cherry-picking to older versions.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After this change, TiKV's availability state introduces an additional influencing factor: fetching timestamps from PD. If there is a network partition between TiKV and PD leader for a period, the update_max_ts could be rejected continuously, introducing a potential stability risk.

This is still a challenging trade-off between stability risk and correctness risk, difficult to decide.

BTW, if the current TiKV is isolated from the PD leader for a period but communication between regions remains normal, what is TiKV's existing behavior?

Copy link
Contributor Author

@ekexium ekexium Dec 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can implement a failsafe: The max_ts check will be suspended after consecutive TSO fetch failures (defined by time period or failure count) until the limit is updated.

Signed-off-by: ekexium <eke@fastmail.com>
pub txn_status_cache_capacity: usize,
pub memory_quota: ReadableSize,
/// Maximum max_ts deviation allowed from PD timestamp (in seconds)
#[online_config(skip)]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This configurations need to be changed online.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making the sync interval dynamically adjustable would add unnecessary complexity, while I've already made the other two parameters configurable at runtime.

Signed-off-by: ekexium <eke@fastmail.com>
Signed-off-by: ekexium <eke@fastmail.com>
@ekexium ekexium force-pushed the feat-max-ts-checker branch from 3f16766 to 7f2691f Compare December 5, 2024 07:22
pub fn update_max_ts(
&self,
new_ts: TimeStamp,
source: Option<String>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not receiving &str? Passing "...".to_owned() looks potentially introducing additional overhead.

Copy link
Contributor

@MyonKeminta MyonKeminta Dec 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah... I see. Perhaps we can have a enum type like UpdateMaxSource carrying the important fields of that request or operation, and format it into string only when necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One problem I see is that this approach somewhat breaks a design principle: we impose the knowledge of users or callers on the concurrency manager, by defining the enum. It could bring some performance gains. Do you think it's worth it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔
How about:

fn update_max_ts(&self, new_ts: TimeStamp, source: impl Display)

So the enum type can be defined outside the concurrency manager module(?)

Copy link
Contributor

@MyonKeminta MyonKeminta Dec 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And then it doesn't need to be a single type (e.g., CDC can have its source representation in cdc module, while txn related requests can have an enum defined in the tikv module)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the generic arguments seems a good idea.
Will format_args be enough so we don't have to define types everywhere?

}
let new_ts = new_ts.into_inner();
let limit = self.max_ts_limit.load(Ordering::SeqCst);
if limit > 0 && new_ts > limit {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we consider the case that PD and TiKV is network-isolated and the limit is not updated in time? How about adding an additional gap according to the duration that has been elapsed since the last successful set_max_ts_limit call? We can first check new_ts > limit as the fast path, and check new_ts > limit + gap then if the former failed, so that the overhead of getting the system time can be avoided in most cases.

Copy link
Contributor Author

@ekexium ekexium Dec 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah already added just as your suggestion 😁 I used a different approach that sets a valid lifetime of each limit, which is more strict but also a bit more complex 🤔.
PTAL

Signed-off-by: ekexium <eke@fastmail.com>
@ekexium ekexium force-pushed the feat-max-ts-checker branch from 5d84906 to 677688f Compare December 5, 2024 08:18
Signed-off-by: ekexium <eke@fastmail.com>
@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created to branch release-8.5: #18047.

@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created to branch release-7.5: #18049.

@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created to branch release-7.1: #18048.

@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created to branch release-6.5: #18050.

@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created to branch release-8.1: #18051.

ekexium added a commit to ti-chi-bot/tikv that referenced this pull request Dec 24, 2024
close tikv#17916

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>

remove config; log instead of panic

Signed-off-by: ekexium <eke@fastmail.com>
ekexium added a commit to ti-chi-bot/tikv that referenced this pull request Dec 24, 2024
Signed-off-by: ekexium <eke@fastmail.com>
ekexium added a commit to ti-chi-bot/tikv that referenced this pull request Dec 24, 2024
Signed-off-by: ekexium <eke@fastmail.com>
ekexium added a commit to ti-chi-bot/tikv that referenced this pull request Dec 24, 2024
Signed-off-by: ekexium <eke@fastmail.com>
ekexium added a commit to ti-chi-bot/tikv that referenced this pull request Dec 24, 2024
Signed-off-by: ekexium <eke@fastmail.com>
ti-chi-bot bot pushed a commit that referenced this pull request Jan 8, 2025
…8047)

close #17916

concurrency_manager: add safety boundary for max_ts updates

For this cherry-pick: only log error, do not return error or panic.

Add `max_ts_limit` to prevent unreasonable timestamp updates. The limit is 
synchronized with PD timestamp periodically.

Updates from PD bypass this limit.

Signed-off-by: ekexium <eke@fastmail.com>

Co-authored-by: ekexium <eke@fastmail.com>
ti-chi-bot bot added a commit that referenced this pull request Jan 8, 2025
…8051)

close #17916

concurrency_manager: add safety boundary for max_ts updates

For this cherry-pick: only log error, do not return error or panic.

Add `max_ts_limit` to prevent unreasonable timestamp updates. The limit is 
synchronized with PD timestamp periodically.

Updates from PD bypass this limit.

Signed-off-by: ekexium <eke@fastmail.com>

Co-authored-by: ekexium <eke@fastmail.com>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ti-chi-bot bot added a commit that referenced this pull request Jan 8, 2025
…8048)

close #17916

concurrency_manager: add safety boundary for max_ts updates

For this cherry-pick: only log error, do not return error or panic.

Add `max_ts_limit` to prevent unreasonable timestamp updates. The limit is 
synchronized with PD timestamp periodically.

Updates from PD bypass this limit.

Signed-off-by: ekexium <eke@fastmail.com>

Co-authored-by: ekexium <eke@fastmail.com>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ti-chi-bot bot added a commit that referenced this pull request Jan 8, 2025
…8049)

close #17916

concurrency_manager: add safety boundary for max_ts updates

For this cherry-pick: only log error, do not return error or panic.

Add `max_ts_limit` to prevent unreasonable timestamp updates. The limit is 
synchronized with PD timestamp periodically.

Updates from PD bypass this limit.

Signed-off-by: ekexium <eke@fastmail.com>

Co-authored-by: ekexium <eke@fastmail.com>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ti-chi-bot bot added a commit that referenced this pull request Jan 9, 2025
…8050)

close #17916

concurrency_manager: add safety boundary for max_ts updates

For this cherry-pick: only log error, do not return error or panic.

Add `max_ts_limit` to prevent unreasonable timestamp updates. The limit is 
synchronized with PD timestamp periodically.

Updates from PD bypass this limit.

Signed-off-by: ekexium <eke@fastmail.com>

Co-authored-by: ekexium <eke@fastmail.com>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
bufferflies pushed a commit to bufferflies/tikv that referenced this pull request Sep 19, 2025
* concurrency_manager: check update_max_ts against a limit (tikv#17917)

close tikv#17916

concurrency_manager: add safety boundary for max_ts updates

Add `max_ts_limit` to prevent unreasonable timestamp updates. The limit is
synchronized with PD timestamp periodically. Configure via max_ts_allowance_secs
 and max_ts_sync_interval_secs.

Updates from PD bypass this limit.

Signed-off-by: ekexium <eke@fastmail.com>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
Signed-off-by: ekexium <eke@fastmail.com>

* concurrency_manager: double check via PD TSO before reporting error of invalid max-ts update (tikv#18057)

close tikv#18055

concurrency_manager: double check via PD TSO before reporting error of invalid max-ts update

Signed-off-by: ekexium <eke@fastmail.com>

* concurrency_manager: make max-ts checker more robust (tikv#18080)

ref tikv#18055

When validating max-ts updates, do not report error or panic unless confirmed by PD TSO.
This reduces both false positive and false negative cases.

Signed-off-by: ekexium <eke@fastmail.com>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
Signed-off-by: ekexium <eke@fastmail.com>

* config: rename config items for max-ts checker (tikv#18118)

ref tikv#17916

config: rename config items for max-ts checker

Signed-off-by: ekexium <eke@fastmail.com>

* concurrency-manager: do not assert in concurrency manager (tikv#18329)

ref tikv#17916

Do not assert in concurrency manager.

Signed-off-by: ekexium <eke@fastmail.com>

* delete unexpected files from cherry-picking

Signed-off-by: ekexium <eke@fastmail.com>

---------

Signed-off-by: ekexium <eke@fastmail.com>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved dco-signoff: yes Indicates the PR's author has signed the dco. lgtm needs-cherry-pick-release-6.5 Should cherry pick this PR to release-6.5 branch. needs-cherry-pick-release-7.1 Should cherry pick this PR to release-7.1 branch. needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch. needs-cherry-pick-release-8.1 Should cherry pick this PR to release-8.1 branch. needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Enhancement: Add timestamp safety boundary to prevent unreasonable max_ts updates

5 participants