server/lock_manager: Support replacing lock by key to reduce RPC overhead of update_wait_for when the waiting queue is very long#17451

Closed
MyonKeminta wants to merge 16 commits into tikv:master from MyonKeminta:m/deadlock-oom-fix

Conversation

@MyonKeminta
Contributor

@MyonKeminta MyonKeminta commented Aug 28, 2024

What is changed and how it works?

Issue Number: Close #17394

What's Changed:

server/lock_manager: Support replacing lock by key to reduce RPC overhead of `update_wait_for` when the waiting queue is very long

* Make deadlock detector distinguish keys instead of key hashes
* Add `replace_lock_by_key` to support re-detecting all waiters on a specific key and a specific lock
* Add `ReplaceLocksByKeys` type to the `DeadlockRequest` in the protobuf to support doing the new kind of check remotely

This is expected to solve the issue that frequently updating a key whose lock waiting queue is very long might cause TiKV OOM due to messages accumulating in a channel.
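The replace-by-key idea can be sketched roughly as follows. This is a minimal, hypothetical Rust sketch, not TiKV's actual `DetectTable`; the field layout and the exact signature of `replace_lock_by_key` are illustrative (only the names `replace_lock_by_key` and `key_reverse_map` come from this PR):

```rust
use std::collections::HashMap;

// Hypothetical wait-for table keyed by full keys rather than key hashes,
// so all waiters on one key can be re-pointed at a new lock holder in a
// single call instead of one `update_wait_for` RPC per waiter.
#[derive(Default)]
struct DetectTable {
    // waiting txn -> (blocking txn, key it waits on)
    wait_for: HashMap<u64, (u64, Vec<u8>)>,
    // key -> txns currently waiting on it (reverse index)
    key_reverse_map: HashMap<Vec<u8>, Vec<u64>>,
}

impl DetectTable {
    fn register(&mut self, waiter: u64, holder: u64, key: Vec<u8>) {
        self.wait_for.insert(waiter, (holder, key.clone()));
        self.key_reverse_map.entry(key).or_default().push(waiter);
    }

    // Replace the lock on `key`: every waiter now waits for `new_holder`.
    // Returns the affected waiters, which would then be re-checked for
    // deadlock in one pass rather than via per-waiter messages.
    fn replace_lock_by_key(&mut self, key: &[u8], new_holder: u64) -> Vec<u64> {
        let waiters = self.key_reverse_map.get(key).cloned().unwrap_or_default();
        for w in &waiters {
            if let Some(entry) = self.wait_for.get_mut(w) {
                entry.0 = new_holder;
            }
        }
        waiters
    }
}
```

The point is that the cost of a lock replacement becomes one table operation per key instead of one message per queued waiter.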

Related changes

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below) (TODO)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Release note

Fix an issue that frequently updating a row while many transactions are waiting on its lock might sometimes cause TiKV OOM

Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
@ti-chi-bot ti-chi-bot bot added the release-note, dco-signoff: yes, and size/XXL labels Aug 28, 2024
…correct

Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
@MyonKeminta MyonKeminta marked this pull request as draft August 28, 2024 05:57
@ti-chi-bot ti-chi-bot bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 28, 2024
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
@MyonKeminta
Contributor Author

/release

@sre-bot
Contributor

sre-bot commented Aug 28, 2024

500,
);
}

Collaborator

@cfzjywxk cfzjywxk Sep 3, 2024

Do we have an upgrade test case to verify that the behaviours are as expected when upgrading tikv-servers?

Contributor Author

It has now been done manually.

Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
@MyonKeminta
Contributor Author

/release

Contributor

@ekexium ekexium left a comment

The PR itself LGTM.
Have we conducted any throughput benchmarks comparing this implementation to the original one? Additionally, micro-benchmarks could help us determine the component's maximum performance capacity.
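As a starting point for such micro-benchmarks, the cost-per-call idea can be sketched with the standard library alone. This helper is illustrative only; the benchmarks later posted in this PR come from criterion via `cargo bench -p tests --bench deadlock_detector`, not from code like this:

```rust
use std::time::Instant;

// Illustrative std-only micro-benchmark helper: runs `f` a fixed number
// of times and reports the mean cost per call in nanoseconds.
fn bench_ns_per_iter<F: FnMut()>(name: &str, iters: u32, mut f: F) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    let ns = start.elapsed().as_nanos() as f64 / iters as f64;
    println!("{name}: {ns:.1} ns/iter");
    ns
}
```

A criterion bench adds warm-up, statistical change detection, and outlier reporting on top of this basic loop.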

@sre-bot
Contributor

sre-bot commented Sep 3, 2024

Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
@MyonKeminta MyonKeminta marked this pull request as ready for review September 3, 2024 12:57
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 3, 2024
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
@cfzjywxk cfzjywxk requested review from you06 and zyguan September 4, 2024 15:13
detect_table.clean_up(txn.into());
None
}
(_, true) => {
Contributor

This match is a bit weird; how about:

match req.get_tp() {
    DeadlockRequestType::CleanUpWaitFor => {
        if req.has_replace_locks_by_keys() {
            // replace_locks_by_keys logic
        } else {
            // old logic
        }
    }
}

@ti-chi-bot ti-chi-bot bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 27, 2024
@ti-chi-bot
Contributor

ti-chi-bot bot commented Sep 27, 2024

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Comment on lines 585 to +586
self.wait_for_map.clear();
self.key_reverse_map.clear();
Member

Forgot to shrink? An empty hashmap may still occupy memory.
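The reviewer's point: `clear()` empties a `HashMap` but keeps its allocated buckets, so a map that once held many entries still occupies that memory until it is shrunk. A small illustrative helper (`clear_and_shrink` is not TiKV code):

```rust
use std::collections::HashMap;
use std::hash::Hash;

// Clear the map and then release its retained bucket allocation back to
// the allocator; without shrink_to_fit, capacity stays at its high-water mark.
fn clear_and_shrink<K: Eq + Hash, V>(map: &mut HashMap<K, V>) {
    map.clear();
    map.shrink_to_fit();
}
```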

@ti-chi-bot ti-chi-bot bot added the needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. label Nov 14, 2024
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
@MyonKeminta
Contributor Author

Benchmarks

cargo bench -p tests --bench deadlock_detector
Gnuplot not found, using plotters backend
bench_dense_detect_without_cleanup/Config { n: 10, range: 10, ttl: 100000000s }
                        time:   [10.724 us 11.120 us 11.552 us]
                        change: [-17.155% +22.163% +64.753%] (p = 0.36 > 0.05)
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
bench_dense_detect_without_cleanup/Config { n: 10, range: 100, ttl: 100000000s }
                        time:   [30.282 us 30.784 us 31.059 us]
                        change: [+13.147% +16.133% +18.699%] (p = 0.00 < 0.05)
                        Performance has regressed.
bench_dense_detect_without_cleanup/Config { n: 10, range: 1000, ttl: 100000000s }
                        time:   [140.32 us 153.44 us 159.31 us]
                        change: [-18.998% +3.2548% +30.293%] (p = 0.80 > 0.05)
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) low mild
bench_dense_detect_without_cleanup/Config { n: 10, range: 10000, ttl: 100000000s }
                        time:   [8.1210 us 20.725 us 29.401 us]
                        change: [-49.527% +7.0155% +171.54%] (p = 0.88 > 0.05)
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
bench_dense_detect_without_cleanup/Config { n: 10, range: 100000, ttl: 100000000s }
                        time:   [8.4774 us 9.0053 us 9.3103 us]
                        change: [+92.856% +108.55% +124.40%] (p = 0.00 < 0.05)
                        Performance has regressed.
bench_dense_detect_without_cleanup/Config { n: 10, range: 1000000, ttl: 100000000s }
                        time:   [6.7621 us 7.0729 us 7.2675 us]
                        change: [+72.810% +79.399% +85.969%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) high mild
bench_dense_detect_without_cleanup/Config { n: 10, range: 10000000, ttl: 100000000s }
                        time:   [10.089 us 10.501 us 11.550 us]
                        change: [+112.08% +342.11% +802.07%] (p = 0.09 > 0.05)
                        No change in performance detected.
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) low mild
  1 (10.00%) high severe
bench_dense_detect_without_cleanup/Config { n: 10, range: 100000000, ttl: 100000000s }
                        time:   [9.8447 us 10.390 us 10.950 us]
                        change: [+97.254% +235.22% +499.62%] (p = 0.04 < 0.05)
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe

bench_dense_detect_with_cleanup/Config { n: 10, range: 1000, ttl: 1ms }
                        time:   [9.0705 us 9.1736 us 9.2422 us]
                        change: [+84.786% +87.079% +89.794%] (p = 0.00 < 0.05)
                        Performance has regressed.
bench_dense_detect_with_cleanup/Config { n: 10, range: 1000, ttl: 3ms }
                        time:   [12.324 us 12.379 us 12.443 us]
                        change: [+17.887% +18.956% +19.876%] (p = 0.00 < 0.05)
                        Performance has regressed.
bench_dense_detect_with_cleanup/Config { n: 10, range: 1000, ttl: 5ms }
                        time:   [17.529 us 17.573 us 17.644 us]
                        change: [+10.552% +10.941% +11.331%] (p = 0.00 < 0.05)
                        Performance has regressed.
bench_dense_detect_with_cleanup/Config { n: 10, range: 1000, ttl: 10ms }
                        time:   [28.987 us 29.191 us 29.308 us]
                        change: [+5.3008% +6.2587% +7.1333%] (p = 0.00 < 0.05)
                        Performance has regressed.
bench_dense_detect_with_cleanup/Config { n: 10, range: 1000, ttl: 100ms }
                        time:   [120.22 us 124.16 us 125.74 us]
                        change: [-8.0532% +5.5688% +21.993%] (p = 0.50 > 0.05)
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) low mild
bench_dense_detect_with_cleanup/Config { n: 10, range: 1000, ttl: 500ms }
                        time:   [140.86 us 155.11 us 161.97 us]
                        change: [-31.140% -13.518% +10.684%] (p = 0.26 > 0.05)
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) low mild
bench_dense_detect_with_cleanup/Config { n: 10, range: 1000, ttl: 1s }
                        time:   [142.32 us 154.59 us 160.57 us]
                        change: [-32.598% -12.598% +15.456%] (p = 0.35 > 0.05)
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) low mild
bench_dense_detect_with_cleanup/Config { n: 10, range: 1000, ttl: 3s }
                        time:   [139.10 us 150.11 us 155.37 us]
                        change: [-31.773% -12.046% +16.189%] (p = 0.38 > 0.05)
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) low mild

There is an obvious regression in the deadlock detector.

@MyonKeminta MyonKeminta marked this pull request as draft November 19, 2024 09:55
@ti-chi-bot ti-chi-bot bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 19, 2024
@MyonKeminta MyonKeminta added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 19, 2024
@MyonKeminta MyonKeminta marked this pull request as ready for review November 19, 2024 09:56
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 19, 2024
MyonKeminta added a commit to MyonKeminta/tikv that referenced this pull request Nov 20, 2024
…sts (tikv#17500)

close tikv#17394

lock_manager: Skip updating lock wait info for non-fair-locking requests

This is a simpler and lower-risk fix for the OOM issue tikv#17394 on released branches, as an alternative solution to tikv#17451.
In this way, for acquire_pessimistic_lock requests without fair locking enabled, update_wait_for becomes a no-op, so that if fair locking is globally disabled, the behavior is equivalent to versions before 7.0.

Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
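The gist of that alternative fix can be sketched as below. The types and field names here are hypothetical (TiKV's real lock-manager plumbing differs); only the skip-when-fair-locking-is-disabled check is the point:

```rust
// Hypothetical sketch: requests that did not enable fair locking skip the
// wait-for update entirely, so nothing accumulates in the detector's channel.
struct LockWaitContext {
    // Whether fair locking is enabled for this request (illustrative field).
    fair_locking_enabled: bool,
}

fn update_wait_for(ctx: &LockWaitContext, pending: &mut Vec<String>, info: String) {
    if !ctx.fair_locking_enabled {
        // No-op, matching pre-7.0 behavior when fair locking is disabled.
        return;
    }
    pending.push(info);
}
```

This trades the per-key optimization of #17451 for a much smaller behavioral change on released branches.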
ti-chi-bot bot pushed a commit that referenced this pull request Nov 20, 2024
…sts (#17500) (#17870)

close #17394

lock_manager: Skip updating lock wait info for non-fair-locking requests

This is a simpler and lower-risk fix for the OOM issue #17394 on released branches, as an alternative solution to #17451.
In this way, for acquire_pessimistic_lock requests without fair locking enabled, update_wait_for becomes a no-op, so that if fair locking is globally disabled, the behavior is equivalent to versions before 7.0.

Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
ti-chi-bot bot pushed a commit that referenced this pull request Nov 20, 2024
…sts (#17500) (#17869)

close #17394

lock_manager: Skip updating lock wait info for non-fair-locking requests

This is a simpler and lower-risk fix for the OOM issue #17394 on released branches, as an alternative solution to #17451.
In this way, for acquire_pessimistic_lock requests without fair locking enabled, update_wait_for becomes a no-op, so that if fair locking is globally disabled, the behavior is equivalent to versions before 7.0.

Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>

Co-authored-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
@ekexium
Contributor

ekexium commented Mar 20, 2025

Is this abandoned?

@MyonKeminta
Contributor Author

Is this abandoned?

Currently there is no plan to continue this, as its overhead is large and we have found it hard to determine whether it is acceptable.

@ti-chi-bot
Contributor

ti-chi-bot bot commented Sep 16, 2025

@MyonKeminta: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
pull-unit-test | 9504bd8 | link | true | /test pull-unit-test
pull-check-deps | 9504bd8 | link | true | /test pull-check-deps

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

  • approved
  • dco-signoff: yes. Indicates the PR's author has signed the dco.
  • do-not-merge/hold. Indicates that a PR should not merge because someone has issued a /hold command.
  • needs-1-more-lgtm. Indicates a PR needs 1 more LGTM.
  • needs-cherry-pick-release-7.1. Should cherry pick this PR to release-7.1 branch.
  • needs-cherry-pick-release-7.5. Should cherry pick this PR to release-7.5 branch.
  • needs-cherry-pick-release-8.1. Should cherry pick this PR to release-8.1 branch.
  • needs-cherry-pick-release-8.5. Should cherry pick this PR to release-8.5 branch.
  • needs-rebase. Indicates a PR cannot be merged because it has merge conflicts with HEAD.
  • release-note. Denotes a PR that will be considered when it comes time to generate release notes.
  • size/XXL. Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Memory usage may grow unexpectedly and causes OOM in some high-concurrency lock contention scenarios

7 participants