concurrency_manager: check update_max_ts against a limit#17917
concurrency_manager: check update_max_ts against a limit#17917ti-chi-bot[bot] merged 25 commits intotikv:masterfrom
Conversation
ae2f39c to
2717f55
Compare
2717f55 to
55b44cc
Compare
src/storage/errors.rs
Outdated
There was a problem hiding this comment.
It could be better to introduce a new error type, but it is inappropriate for cherry-picking to older versions.
Signed-off-by: ekexium <eke@fastmail.com>
55b44cc to
c1ec81f
Compare
| let limit = self.max_ts_limit.load(Ordering::SeqCst); | ||
| if limit > 0 && new_ts > limit { | ||
| if self.panic_on_invalid_max_ts { | ||
| panic!( |
There was a problem hiding this comment.
There would be not enough information for the kv requests if panic here directly, is there a way to print the related request information?
There was a problem hiding this comment.
Passing context information into the concurrency manager is not feasible. So to print related information will require us to panic in the caller side or upper layers. There are 17 callers currently, even if we address them individually, we cannot guarantee consistent panic handling for future code changes. What do you think?
There was a problem hiding this comment.
One approach is to pass a tracker to the update_max_ts method(or just get it by with_tracker like elsewhere). The tracker contains some execution context information, and if a panic occurs, the related tracker information can be printed for debugging.
There was a problem hiding this comment.
I explored the implementation but adding the required boilerplate code wouldn't be worth it for this error message. Since most cases only need the command type and timestamps, I opted to pass a string instead, which also makes the PR more suitable for cherry-picking to older versions.
There was a problem hiding this comment.
After this change, TiKV's availability state introduces an additional influencing factor: fetching timestamps from PD. If there is a network partition between TiKV and PD leader for a period, the update_max_ts could be rejected continuously, introducing a potential stability risk.
This is still a challenging trade-off between stability risk and correctness risk, difficult to decide.
BTW, if the current TiKV is isolated from the PD leader for a period but communication between regions remains normal, what is TiKV's existing behavior?
There was a problem hiding this comment.
We can implement a failsafe: The max_ts check will be suspended after consecutive TSO fetch failures (defined by time period or failure count) until the limit is updated.
Signed-off-by: ekexium <eke@fastmail.com>
src/storage/config.rs
Outdated
| pub txn_status_cache_capacity: usize, | ||
| pub memory_quota: ReadableSize, | ||
| /// Maximum max_ts deviation allowed from PD timestamp (in seconds) | ||
| #[online_config(skip)] |
There was a problem hiding this comment.
This configurations need to be changed online.
There was a problem hiding this comment.
Making the sync interval dynamically adjustable would add unnecessary complexity, while I've already made the other two parameters configurable at runtime.
Signed-off-by: ekexium <eke@fastmail.com>
Signed-off-by: ekexium <eke@fastmail.com>
Signed-off-by: ekexium <eke@fastmail.com>
3f16766 to
7f2691f
Compare
| pub fn update_max_ts( | ||
| &self, | ||
| new_ts: TimeStamp, | ||
| source: Option<String>, |
There was a problem hiding this comment.
Why not receiving &str? Passing "...".to_owned() looks potentially introducing additional overhead.
There was a problem hiding this comment.
Ah... I see. Perhaps we can have a enum type like UpdateMaxSource carrying the important fields of that request or operation, and format it into string only when necessary.
There was a problem hiding this comment.
One problem I see is that this approach somewhat breaks a design principle: we impose the knowledge of users or callers on the concurrency manager, by defining the enum. It could bring some performance gains. Do you think it's worth it?
There was a problem hiding this comment.
🤔
How about:
fn update_max_ts(&self, new_ts: TimeStamp, source: impl Display)So the enum type can be defined outside the concurrency manager module(?)
There was a problem hiding this comment.
And then it doesn't need to be a single type (e.g., CDC can have its source representation in cdc module, while txn related requests can have an enum defined in the tikv module)
There was a problem hiding this comment.
Using the generic arguments seems a good idea.
Will format_args be enough so we don't have to define types everywhere?
| } | ||
| let new_ts = new_ts.into_inner(); | ||
| let limit = self.max_ts_limit.load(Ordering::SeqCst); | ||
| if limit > 0 && new_ts > limit { |
There was a problem hiding this comment.
Should we consider the case that PD and TiKV is network-isolated and the limit is not updated in time? How about adding an additional gap according to the duration that has been elapsed since the last successful set_max_ts_limit call? We can first check new_ts > limit as the fast path, and check new_ts > limit + gap then if the former failed, so that the overhead of getting the system time can be avoided in most cases.
There was a problem hiding this comment.
Yeah already added just as your suggestion 😁 I used a different approach that sets a valid lifetime of each limit, which is more strict but also a bit more complex 🤔.
PTAL
Signed-off-by: ekexium <eke@fastmail.com>
5d84906 to
677688f
Compare
Signed-off-by: ekexium <eke@fastmail.com>
|
In response to a cherrypick label: new pull request created to branch |
|
In response to a cherrypick label: new pull request created to branch |
|
In response to a cherrypick label: new pull request created to branch |
|
In response to a cherrypick label: new pull request created to branch |
|
In response to a cherrypick label: new pull request created to branch |
close tikv#17916 Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io> remove config; log instead of panic Signed-off-by: ekexium <eke@fastmail.com>
Signed-off-by: ekexium <eke@fastmail.com>
Signed-off-by: ekexium <eke@fastmail.com>
Signed-off-by: ekexium <eke@fastmail.com>
Signed-off-by: ekexium <eke@fastmail.com>
…8047) close #17916 concurrency_manager: add safety boundary for max_ts updates For this cherry-pick: only log error, do not return error or panic. Add `max_ts_limit` to prevent unreasonable timestamp updates. The limit is synchronized with PD timestamp periodically. Updates from PD bypass this limit. Signed-off-by: ekexium <eke@fastmail.com> Co-authored-by: ekexium <eke@fastmail.com>
…8051) close #17916 concurrency_manager: add safety boundary for max_ts updates For this cherry-pick: only log error, do not return error or panic. Add `max_ts_limit` to prevent unreasonable timestamp updates. The limit is synchronized with PD timestamp periodically. Updates from PD bypass this limit. Signed-off-by: ekexium <eke@fastmail.com> Co-authored-by: ekexium <eke@fastmail.com> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
…8048) close #17916 concurrency_manager: add safety boundary for max_ts updates For this cherry-pick: only log error, do not return error or panic. Add `max_ts_limit` to prevent unreasonable timestamp updates. The limit is synchronized with PD timestamp periodically. Updates from PD bypass this limit. Signed-off-by: ekexium <eke@fastmail.com> Co-authored-by: ekexium <eke@fastmail.com> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
…8049) close #17916 concurrency_manager: add safety boundary for max_ts updates For this cherry-pick: only log error, do not return error or panic. Add `max_ts_limit` to prevent unreasonable timestamp updates. The limit is synchronized with PD timestamp periodically. Updates from PD bypass this limit. Signed-off-by: ekexium <eke@fastmail.com> Co-authored-by: ekexium <eke@fastmail.com> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
…8050) close #17916 concurrency_manager: add safety boundary for max_ts updates For this cherry-pick: only log error, do not return error or panic. Add `max_ts_limit` to prevent unreasonable timestamp updates. The limit is synchronized with PD timestamp periodically. Updates from PD bypass this limit. Signed-off-by: ekexium <eke@fastmail.com> Co-authored-by: ekexium <eke@fastmail.com> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
* concurrency_manager: check update_max_ts against a limit (tikv#17917) close tikv#17916 concurrency_manager: add safety boundary for max_ts updates Add `max_ts_limit` to prevent unreasonable timestamp updates. The limit is synchronized with PD timestamp periodically. Configure via max_ts_allowance_secs and max_ts_sync_interval_secs. Updates from PD bypass this limit. Signed-off-by: ekexium <eke@fastmail.com> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com> Signed-off-by: ekexium <eke@fastmail.com> * concurrency_manager: double check via PD TSO before reporting error of invalid max-ts update (tikv#18057) close tikv#18055 concurrency_manager: double check via PD TSO before reporting error of invalid max-ts update Signed-off-by: ekexium <eke@fastmail.com> * concurrency_manager: make max-ts checker more robust (tikv#18080) ref tikv#18055 When validating max-ts updates, do not report error or panic unless confirmed by PD TSO. This reduces both false positive and false negative cases. Signed-off-by: ekexium <eke@fastmail.com> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com> Signed-off-by: ekexium <eke@fastmail.com> * config: rename config items for max-ts checker (tikv#18118) ref tikv#17916 config: rename config items for max-ts checker Signed-off-by: ekexium <eke@fastmail.com> * concurrency-manager: do not assert in concurrency manager (tikv#18329) ref tikv#17916 Do not assert in concurrency manager. Signed-off-by: ekexium <eke@fastmail.com> * delete unexpected files from cherry-picking Signed-off-by: ekexium <eke@fastmail.com> --------- Signed-off-by: ekexium <eke@fastmail.com> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
What is changed and how it works?
Issue Number: close #17916
What's Changed:
Metric:

Benchmark of
update_max_ts:master
This PR
Read operations typically need hundreds of microseconds, so I suppose this regression has little impact.
Related changes
pingcap/docs/pingcap/docs-cn:Check List
Tests
Side effects
Release note