kv: add new kv.lock_table.wait_queue.waiting metric #66343
nvb wants to merge 1 commit into cockroachdb:master
Conversation
This metric tracks the number of requests waiting in lock wait-queues in the lock table. It would have been helpful to have during a recent customer investigation, because not all requests that wait in lock wait-queues will eventually push and end up in a txn wait-queue.
sumeerbhola left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @nvanbenschoten)
pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 148 at r1 (raw file):
```go
w.lockWaitQueueWaiters.Inc(1)
defer w.lockWaitQueueWaiters.Dec(1)
```
Would this lead to unnecessary worries if the total number was high but they were waiting on a large number of keys?
Would it be better to have metrics for the longest and mean queue length and the longest wait duration? Keeping these up-to-date in an incremental manner would be costly, but given that we sample metrics at a low frequency, we could compute them non-incrementally at that same low frequency. I suppose we'd also need to iterate across all replicas' lock tables -- we could also log information about the replicas whose wait durations or queue lengths exceed some threshold, to complement the aggregated-across-all-replicas information in the metrics.
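The non-incremental approach suggested above can be sketched as follows. This is a hypothetical illustration, not CockroachDB's implementation: the `waitQueue`, `lockTableSnapshot`, and `computeQueueStats` names are invented for this example, and the idea is simply to scan all queues once at metric-sampling time instead of maintaining max/mean on every enqueue and dequeue.

```go
package main

import "fmt"

// waitQueue is a hypothetical stand-in for a per-key lock wait-queue.
type waitQueue struct{ waiters int }

// lockTableSnapshot holds the per-key queues for one replica's lock table.
type lockTableSnapshot struct{ queues []waitQueue }

// queueStats is recomputed from scratch at each low-frequency metrics
// sample, as the review comment suggests, rather than being maintained
// incrementally on every enqueue/dequeue.
type queueStats struct {
	maxLen int
	mean   float64
}

func computeQueueStats(lt lockTableSnapshot) queueStats {
	var s queueStats
	total := 0
	for _, q := range lt.queues {
		total += q.waiters
		if q.waiters > s.maxLen {
			s.maxLen = q.waiters
		}
	}
	if n := len(lt.queues); n > 0 {
		s.mean = float64(total) / float64(n)
	}
	return s
}

func main() {
	lt := lockTableSnapshot{queues: []waitQueue{{waiters: 1}, {waiters: 4}, {waiters: 1}}}
	s := computeQueueStats(lt)
	fmt.Println(s.maxLen, s.mean) // 4 2
}
```

The cost of the scan is paid only at sampling time, so a high enqueue/dequeue rate adds no per-operation overhead; the trade-off is that short-lived queue spikes between samples go unobserved.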
These were good points, thanks for raising them. Adding just this single metric was useful, but also not nearly enough. The lockTable state has been lacking observability for a while. I opened #67350 to improve upon this PR and introduce more comprehensive metric tracking at this level to address the general observability issue.
67350: kv/concurrency: expose lock table counters, introduce timeseries metrics r=sumeerbhola,andreimatei,dhartunian a=nvanbenschoten

Replaces #66343.

This change exposes access to a collection of counters that describe the state of a lockTable, including information on locks and on lock wait-queues. Some of these counters are plumbed up through debug endpoints, providing access to this information in the DB console and through debug.zips.

It then introduces the following four timeseries metrics:
- `kv.concurrency.locks`
- `kv.concurrency.locks_with_wait_queues`
- `kv.concurrency.lock_wait_queue_waiters`
- `kv.concurrency.max_lock_wait_queue_waiters_for_lock`

These are computed as aggregations over all of the ranges on a given store using the new lock table counters. They are collected periodically, every 10s. The process of collecting these metrics is made efficient enough that I do not have major concerns about calling this once every 10s per range (from `computePeriodicMetrics`). The vast majority of ranges in large clusters will have empty lock tables, so this will only add about 50ns per replica in `updateReplicationGauges`. Even on a node with 100k ranges, this will only take about 5ms.

```
name                             time/op
LockTableMetrics/locks=0-16      43.0ns ± 4%
LockTableMetrics/locks=1-16      76.1ns ± 3%
LockTableMetrics/locks=16-16      389ns ± 3%
LockTableMetrics/locks=256-16    5.58µs ± 5%
LockTableMetrics/locks=4096-16   89.6µs ± 4%

name                             alloc/op
LockTableMetrics/locks=0-16      0.00B
LockTableMetrics/locks=1-16      0.00B
LockTableMetrics/locks=16-16     0.00B
LockTableMetrics/locks=256-16    0.00B
LockTableMetrics/locks=4096-16   0.00B

name                             allocs/op
LockTableMetrics/locks=0-16      0.00
LockTableMetrics/locks=1-16      0.00
LockTableMetrics/locks=16-16     0.00
LockTableMetrics/locks=256-16    0.00
LockTableMetrics/locks=4096-16   0.00
```

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
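The aggregation described above can be illustrated with a small sketch. This is not the actual CockroachDB code: `LockTableMetrics` and `aggregate` are hypothetical names, and the field set only mirrors the four timeseries metrics listed in the PR description. The point is that sums and a running max fold per-range counters into store-level gauges, and that empty lock tables cost almost nothing per replica.

```go
package main

import "fmt"

// LockTableMetrics is a hypothetical mirror of the per-range counters the
// PR exposes; the field names are illustrative, not the exact API.
type LockTableMetrics struct {
	Locks               int64 // kv.concurrency.locks
	LocksWithWaitQueues int64 // kv.concurrency.locks_with_wait_queues
	Waiters             int64 // kv.concurrency.lock_wait_queue_waiters
	MaxWaitersForLock   int64 // kv.concurrency.max_lock_wait_queue_waiters_for_lock
}

// aggregate folds one range's counters into the store-level totals, the way
// a periodic metrics pass would walk every replica on the store.
func aggregate(store *LockTableMetrics, r LockTableMetrics) {
	store.Locks += r.Locks
	store.LocksWithWaitQueues += r.LocksWithWaitQueues
	store.Waiters += r.Waiters
	if r.MaxWaitersForLock > store.MaxWaitersForLock {
		store.MaxWaitersForLock = r.MaxWaitersForLock
	}
}

func main() {
	ranges := []LockTableMetrics{
		{Locks: 2, LocksWithWaitQueues: 1, Waiters: 3, MaxWaitersForLock: 3},
		{}, // most ranges have empty lock tables
		{Locks: 1, Waiters: 1, MaxWaitersForLock: 1},
	}
	var store LockTableMetrics
	for _, r := range ranges {
		aggregate(&store, r)
	}
	fmt.Println(store.Locks, store.Waiters, store.MaxWaitersForLock) // 3 4 3
}
```

Note that three of the four metrics are plain sums, while the max-waiters metric takes a maximum across ranges rather than a sum, which is why it needs separate handling in the fold.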