Improve observability for coprocessor suspend time and semaphore contention #19179

@hbisheng

Description

Coprocessor (copr) tasks may accumulate a long total suspend time, that is, time during which a cop task is not actively being processed. Today, this information is only available in TiKV’s slow-query logs, which makes copr latency problems difficult to diagnose from metrics alone.

Suspend time can be caused by multiple factors, including:

  • Yatp scheduling delay
  • Waiting for the copr concurrency-limiter semaphore

Semaphore waiting, in particular, is a common contributor. The concurrency limiter is designed to finish a limited number of heavy tasks quickly rather than let too many tasks run concurrently and all make slow progress. Once a cop task has run for more than 5 ms, it must acquire a semaphore permit before continuing. The number of permits defaults to the number of CPU cores.
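To make the limiter behavior concrete, here is a minimal sketch of the rule described above, using only the Rust standard library. It is not TiKV's implementation: `Permits`, `FREE_RUN`, `needs_permit`, and `default_permits` are hypothetical names, and a blocking `Condvar` stands in for whatever asynchronous primitive the real limiter uses. The point is only that `acquire` is the natural place to measure per-task semaphore wait.

```rust
use std::sync::{Condvar, Mutex};
use std::time::{Duration, Instant};

/// Hypothetical counting semaphore (illustration only, not TiKV code).
pub struct Permits {
    available: Mutex<usize>,
    cv: Condvar,
}

impl Permits {
    pub fn new(n: usize) -> Self {
        Permits { available: Mutex::new(n), cv: Condvar::new() }
    }

    /// Blocks until a permit is free and returns how long the caller waited.
    pub fn acquire(&self) -> Duration {
        let start = Instant::now();
        let mut avail = self.available.lock().unwrap();
        while *avail == 0 {
            avail = self.cv.wait(avail).unwrap();
        }
        *avail -= 1;
        start.elapsed()
    }

    /// Returns a permit and wakes one waiter, if any.
    pub fn release(&self) {
        *self.available.lock().unwrap() += 1;
        self.cv.notify_one();
    }

    pub fn available(&self) -> usize {
        *self.available.lock().unwrap()
    }
}

/// How long a task may run without a permit (5 ms, per the description).
pub const FREE_RUN: Duration = Duration::from_millis(5);

/// A task needs a permit only after running for MORE than `FREE_RUN`.
pub fn needs_permit(ran_for: Duration) -> bool {
    ran_for > FREE_RUN
}

/// Default permit count: the number of CPU cores (falls back to 1).
pub fn default_permits() -> usize {
    std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}
```

In this sketch, a task would call `needs_permit` with its accumulated run time and, if it returns `true`, call `acquire` once and record the returned duration as its semaphore wait.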

To improve observability, we should expose:

  1. Total suspend time as a metric (not just in slow-query logs)
  2. Semaphore waiting time per task
  3. Number of cop tasks currently waiting for a semaphore permit

These metrics will make it easier to diagnose whether copr latency is caused by scheduling pressure, semaphore contention, or other factors.
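As a rough illustration of the three proposed signals, the sketch below keeps them in plain atomics. All names (`CoprMetrics`, `WaitGuard`, `begin_wait`, `observe_suspend`) are hypothetical, and a real change would presumably use TiKV's existing metrics facilities (for example histograms and gauges) rather than raw counters.

```rust
use std::sync::atomic::{AtomicI64, AtomicU64, Ordering};
use std::time::{Duration, Instant};

/// Hypothetical holder for the three proposed signals (not TiKV code).
#[derive(Default)]
pub struct CoprMetrics {
    suspend_nanos_total: AtomicU64,  // 1. total suspend time
    sem_wait_nanos_total: AtomicU64, // 2. semaphore wait, summed over tasks
    sem_waiters: AtomicI64,          // 3. tasks currently waiting for a permit
}

/// Marks one task as waiting; dropping it records the wait and
/// decrements the waiter count.
pub struct WaitGuard<'a> {
    metrics: &'a CoprMetrics,
    start: Instant,
}

impl CoprMetrics {
    /// Call right before a task starts waiting for a permit.
    pub fn begin_wait(&self) -> WaitGuard<'_> {
        self.sem_waiters.fetch_add(1, Ordering::Relaxed);
        WaitGuard { metrics: self, start: Instant::now() }
    }

    /// Call when a task finishes, with its total suspend time.
    pub fn observe_suspend(&self, d: Duration) {
        self.suspend_nanos_total
            .fetch_add(d.as_nanos() as u64, Ordering::Relaxed);
    }

    pub fn waiters(&self) -> i64 {
        self.sem_waiters.load(Ordering::Relaxed)
    }

    pub fn suspend_total(&self) -> Duration {
        Duration::from_nanos(self.suspend_nanos_total.load(Ordering::Relaxed))
    }

    pub fn sem_wait_total(&self) -> Duration {
        Duration::from_nanos(self.sem_wait_nanos_total.load(Ordering::Relaxed))
    }
}

impl Drop for WaitGuard<'_> {
    fn drop(&mut self) {
        let waited = self.start.elapsed().as_nanos() as u64;
        self.metrics.sem_waiters.fetch_sub(1, Ordering::Relaxed);
        self.metrics
            .sem_wait_nanos_total
            .fetch_add(waited, Ordering::Relaxed);
    }
}
```

Tying the waiter count to a guard's lifetime means the count is decremented even if a task is dropped while still waiting, so the gauge should not drift upward when requests are cancelled.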
