copr: improve observability of total suspend time #19178

Merged
ti-chi-bot[bot] merged 2 commits into tikv:master from hbisheng:copr-suspend
Dec 10, 2025

@hbisheng
Member

@hbisheng hbisheng commented Dec 8, 2025

What is changed and how it works?

Issue Number: Close #19179

What's Changed:
- Include coprocessor suspend time in the existing copr wait time metric
- Add panels for semaphore waiting time and the number of copr tasks
  waiting on the semaphore.
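
As a rough illustration of the two semaphore measurements listed above (time spent waiting and the number of tasks currently waiting), the sketch below uses a guard that is created when a task starts waiting and dropped when it stops. `SemaphoreMetrics`, `WaitGuard`, and their methods are hypothetical names built only on the Rust standard library. They stand in for a gauge and a histogram and are not the metric names or code from this PR.

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical stand-ins for a "tasks waiting" gauge and a
/// "wait duration" histogram. Not TiKV's actual metric types.
#[derive(Default)]
struct SemaphoreMetrics {
    waiters: AtomicI64,
    wait_samples: Mutex<Vec<Duration>>,
}

/// Lives for as long as one task is waiting for a permit.
struct WaitGuard<'a> {
    metrics: &'a SemaphoreMetrics,
    started: Instant,
}

impl SemaphoreMetrics {
    /// Call when a task begins waiting; the gauge goes up by one.
    fn start_wait(&self) -> WaitGuard<'_> {
        self.waiters.fetch_add(1, Ordering::Relaxed);
        WaitGuard { metrics: self, started: Instant::now() }
    }

    /// Current number of waiting tasks.
    fn waiting(&self) -> i64 {
        self.waiters.load(Ordering::Relaxed)
    }

    /// Number of wait durations recorded so far.
    fn samples(&self) -> usize {
        self.wait_samples.lock().unwrap().len()
    }
}

impl Drop for WaitGuard<'_> {
    /// When the wait ends (permit acquired or task cancelled), lower the
    /// gauge and record how long the wait lasted.
    fn drop(&mut self) {
        self.metrics.waiters.fetch_sub(1, Ordering::Relaxed);
        self.metrics
            .wait_samples
            .lock()
            .unwrap()
            .push(self.started.elapsed());
    }
}
```

Tying the bookkeeping to `Drop` means the waiter count also goes back down if a waiting task is dropped early, which is one reason a guard is a common choice for this kind of gauge.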

Background:
- `total_suspend_time` measures how long a coprocessor task is not
  actually being processed. This can be prolonged for several reasons,
  including YATP scheduling wait and waiting for the concurrency limiter
  semaphore. Previously, `total_suspend_time` was only visible in TiKV's
  slow query logs.
- Semaphore waiting is one of the main contributors to long suspend
  times. The concurrency limiter exists to prioritize completing a limited
  number of heavy tasks rather than spreading work thin across too many
  tasks, which could cause all of them to make slow progress. In this
  mechanism, when a copr task runs for more than 5 ms, it must acquire a
  semaphore permit before continuing. The total number of permits
  defaults to the number of CPU cores.
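
The permit rule described above can be modelled in a few lines. This is a minimal, single-threaded sketch under stated assumptions, not the limiter in TiKV: `Limiter`, `Task`, `Step`, `LIGHT_TASK_BUDGET`, and `default_permits` are hypothetical names, time advances in fixed slices instead of real polling, and `std::thread::available_parallelism` is used as an approximation of "number of CPU cores".

```rust
use std::time::Duration;

/// Threshold taken from the description: a task that has run longer than
/// this must hold a permit to keep running.
const LIGHT_TASK_BUDGET: Duration = Duration::from_millis(5);

/// Assumed default: one permit per unit of available parallelism.
fn default_permits() -> usize {
    std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1)
}

/// Hypothetical model of the concurrency limiter.
#[derive(Debug)]
struct Limiter {
    free_permits: usize,
}

#[derive(Debug, Default)]
struct Task {
    run_time: Duration,     // time spent actually being processed
    suspend_time: Duration, // time spent waiting for a permit
    has_permit: bool,
}

#[derive(Debug, PartialEq)]
enum Step {
    Ran,
    Suspended,
}

impl Limiter {
    fn new(permits: usize) -> Self {
        Limiter { free_permits: permits }
    }

    /// Advance `task` by one time slice. Light tasks run freely; a task past
    /// the budget needs a permit, and accumulates suspend time if none is free.
    fn step(&mut self, task: &mut Task, slice: Duration) -> Step {
        if task.run_time > LIGHT_TASK_BUDGET && !task.has_permit {
            if self.free_permits == 0 {
                task.suspend_time += slice;
                return Step::Suspended;
            }
            self.free_permits -= 1;
            task.has_permit = true;
        }
        task.run_time += slice;
        Step::Ran
    }

    /// Return the task's permit, if it holds one.
    fn finish(&mut self, task: &mut Task) {
        if task.has_permit {
            task.has_permit = false;
            self.free_permits += 1;
        }
    }
}
```

With a single permit and two tasks advanced in 3 ms slices, both run freely until they pass 5 ms. The first one to continue takes the permit, and the second accumulates suspend time until `finish` releases it. That accumulated wait is the kind of delay the new panels are meant to surface.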

Grafana example:
[Grafana screenshots not reproduced: 20251210-143717, image]

Related changes

  • PR to update pingcap/docs/pingcap/docs-cn:
  • Need to cherry-pick to the release branch

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Release note

None

What’s changed?
- Include coprocessor suspend time in the existing copr wait time metric
- Add metrics for semaphore waiting time and the number of copr tasks
  waiting on the semaphore.

Background:
- `total_suspend_time` measures how long a coprocessor task is not
actually being processed. This can be prolonged for several reasons,
including YATP scheduling wait and waiting for the concurrency limiter
semaphore. Previously, `total_suspend_time` was only visible in TiKV’s
slow query logs.
- Semaphore waiting is one of the main contributors to long suspend
times. The concurrency limiter exists to prioritize completing a limited
number of heavy tasks rather than spreading work thin across too many
tasks, which could cause all of them to make slow progress. In this
mechanism, when a copr task runs for more than 5 ms, it must acquire a
semaphore permit before continuing. The total number of permits defaults
to the number of CPU cores.

Signed-off-by: Bisheng Huang <hbisheng@gmail.com>
@ti-chi-bot ti-chi-bot bot added the do-not-merge/needs-linked-issue, release-note-none, dco-signoff: yes, and size/XXL labels Dec 8, 2025
@ti-chi-bot ti-chi-bot bot added the approved and needs-1-more-lgtm (Indicates a PR needs 1 more LGTM.) labels Dec 9, 2025
@glorv glorv added the needs-cherry-pick-release-8.5 label Dec 9, 2025
Contributor

@hhwyt hhwyt left a comment

LGTM

@ti-chi-bot ti-chi-bot bot added the lgtm label and removed the needs-1-more-lgtm label Dec 9, 2025
@ti-chi-bot
Contributor

ti-chi-bot bot commented Dec 9, 2025

[LGTM Timeline notifier]

Timeline:

  • 2025-12-09 02:40:53.461825601 +0000 UTC m=+922398.275603173: ☑️ agreed by glorv.
  • 2025-12-09 08:15:17.430568392 +0000 UTC m=+942462.244345964: ☑️ agreed by hhwyt.

@ti-chi-bot
Contributor

ti-chi-bot bot commented Dec 10, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: glorv, hhwyt, LykxSassinator

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [LykxSassinator,glorv]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@glorv
Contributor

glorv commented Dec 10, 2025

/retest

Signed-off-by: Bisheng Huang <hbisheng@gmail.com>
@hbisheng
Member Author

/retest

2 similar comments
@hbisheng
Member Author

/retest

@hbisheng
Member Author

/retest

@ti-chi-bot ti-chi-bot bot merged commit 294b472 into tikv:master Dec 10, 2025
9 checks passed
@ti-chi-bot ti-chi-bot bot added this to the Pool milestone Dec 10, 2025
ti-chi-bot pushed a commit to ti-chi-bot/tikv that referenced this pull request Dec 10, 2025
close tikv#19179

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-8.5: #19195.
But this PR has conflicts, please resolve them!

ti-chi-bot bot pushed a commit that referenced this pull request Dec 11, 2025
close #19179

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
Signed-off-by: Bisheng Huang <hbisheng@gmail.com>

Co-authored-by: Bisheng Huang <hbisheng@gmail.com>

Labels

  • approved
  • dco-signoff: yes (Indicates the PR's author has signed the dco.)
  • lgtm
  • needs-cherry-pick-release-8.5 (Should cherry pick this PR to release-8.5 branch.)
  • release-note-none (Denotes a PR that doesn't merit a release note.)
  • size/XXL (Denotes a PR that changes 1000+ lines, ignoring generated files.)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve observability for coprocessor suspend time and semaphore contention

5 participants