Description
Coprocessor (copr) tasks can accumulate a long total suspend time, which is the time during which a cop task is not actively being processed. Today, this information is only available in TiKV’s slow-query logs, which makes it difficult to diagnose copr latency problems from metrics alone.
Suspend time can be caused by multiple factors, including:
- Yatp scheduling delay
- Waiting for the copr concurrency-limiter semaphore
Semaphore waiting, in particular, is a common contributor. The concurrency limiter is designed to prioritize completing a limited number of heavy tasks rather than letting many tasks run concurrently and all progress slowly. Once a cop task has run for more than 5 ms, it must acquire a semaphore permit before continuing, and the number of permits defaults to the number of CPU cores.
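To make the limiter behavior concrete, here is a minimal model of it in Python with `asyncio`. It is only an illustration of the rule described above, not TiKV's Rust implementation: `LimitedTask`, `FREE_RUN_SECS`, `DEFAULT_PERMITS`, and `demo` are hypothetical names, `asyncio.sleep` stands in for real work, and the assumption that a task holds its permit until it finishes is mine.

```python
import asyncio
import os
import time

FREE_RUN_SECS = 0.005                  # 5 ms threshold from the description
DEFAULT_PERMITS = os.cpu_count() or 1  # permits default to the CPU core count


class LimitedTask:
    """Hypothetical model of a cop task under the concurrency limiter."""

    def __init__(self, semaphore):
        self.semaphore = semaphore
        self.run_time = 0.0    # time spent doing work so far
        self.wait_time = 0.0   # time spent waiting for a permit
        self.holds_permit = False

    async def step(self, work_secs):
        # Light tasks run freely; past the threshold a permit is required.
        if not self.holds_permit and self.run_time >= FREE_RUN_SECS:
            start = time.monotonic()
            await self.semaphore.acquire()
            self.wait_time += time.monotonic() - start
            self.holds_permit = True
        await asyncio.sleep(work_secs)  # stand-in for real processing
        self.run_time += work_secs

    def finish(self):
        # Assumption: the permit is held until the task completes.
        if self.holds_permit:
            self.semaphore.release()
            self.holds_permit = False


async def demo(permits=DEFAULT_PERMITS, task_count=2):
    """Run identical tasks and return their semaphore wait times, sorted."""
    semaphore = asyncio.Semaphore(permits)
    tasks = [LimitedTask(semaphore) for _ in range(task_count)]

    async def run(task):
        await task.step(0.006)  # light phase, crosses the 5 ms threshold
        await task.step(0.05)   # heavy phase, needs a permit first
        task.finish()

    await asyncio.gather(*(run(t) for t in tasks))
    return sorted(t.wait_time for t in tasks)
```

Running `asyncio.run(demo(permits=1, task_count=2))` should show one task with a wait close to zero and the other waiting for roughly the length of the first task's heavy phase, which is the kind of suspend time this issue wants to surface.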
To improve observability, we should expose:
- Total suspend time as a metric (not just in slow-query logs)
- Semaphore waiting time per task
- Number of cop tasks currently waiting for a semaphore permit
These metrics will make it easier to diagnose whether copr latency is caused by scheduling pressure, semaphore contention, or other factors.
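As a rough sketch of what the three proposed metrics would record, the following uses plain in-memory containers instead of a real metrics library. `CoprMetrics`, its field names, and the injectable `clock` parameter are all hypothetical. Suspend time is computed as wall time minus processing time, following the definition given in this description.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class CoprMetrics:
    """Hypothetical in-memory stand-ins for the three proposed metrics."""

    suspend_secs: list = field(default_factory=list)         # total suspend time per task
    semaphore_wait_secs: list = field(default_factory=list)  # semaphore wait per task
    semaphore_waiters: int = 0                               # tasks currently waiting

    @contextmanager
    def waiting_for_permit(self, clock=time.monotonic):
        # Wrap the permit acquisition: count the waiter and time the wait.
        self.semaphore_waiters += 1
        start = clock()
        try:
            yield
        finally:
            self.semaphore_waiters -= 1
            self.semaphore_wait_secs.append(clock() - start)

    def observe_task(self, wall_secs, process_secs):
        # Suspend time: the part of the task's lifetime not spent processing.
        self.suspend_secs.append(max(0.0, wall_secs - process_secs))
```

In a real implementation the two lists would most likely be histograms and the counter a gauge. Comparing the suspend-time and semaphore-wait distributions would then indicate how much of the suspend time comes from permit contention and how much from other sources such as scheduling delay.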