kv: add new kv.lock_table.wait_queue.waiting metric#66343

Closed
nvb wants to merge 1 commit into cockroachdb:master from nvb:nvanbenschoten/waitQueueCountMetric

Conversation

Contributor

@nvb nvb commented Jun 11, 2021

This metric tracks the number of requests waiting in lock wait-queues in the
lock table. It would have been helpful to have during a recent customer
investigation, because not all requests that wait in lock wait-queues will
eventually push and end up in a txn wait-queue.

@nvb nvb requested a review from aayushshah15 June 11, 2021 00:10
@cockroach-teamcity
Member

This change is Reviewable

@nvb nvb force-pushed the nvanbenschoten/waitQueueCountMetric branch from 100db44 to 33518b7 Compare June 11, 2021 03:16
Collaborator

@sumeerbhola sumeerbhola left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @nvanbenschoten)


pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 148 at r1 (raw file):

	w.lockWaitQueueWaiters.Inc(1)
	defer w.lockWaitQueueWaiters.Dec(1)

Would this lead to unnecessary worries if the total number was high but they were waiting on a large number of keys?
Would it be better to have metrics for the longest and mean queue lengths and the longest wait duration? Keeping these up to date incrementally would be costly, but since we sample metrics at a low frequency, we could compute them non-incrementally at that same low frequency. I suppose we'd also need to iterate across all replicas' lock tables -- we could also log information about the replicas whose wait durations or queue lengths exceed some threshold, to complement the aggregated-across-all-replicas information in the metrics.
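The non-incremental approach suggested here could look roughly like the following minimal Go sketch. The types and function names (`lockWaitQueue`, `queueMetrics`, `computeQueueMetrics`) are illustrative assumptions, not CockroachDB's actual identifiers:

```go
package main

import "fmt"

// lockWaitQueue is a hypothetical, simplified stand-in for a lock's
// wait-queue: we only track how many requests are currently waiting.
type lockWaitQueue struct {
	waiters int
}

// queueMetrics holds the aggregates discussed above: longest queue,
// mean queue length, and total waiters.
type queueMetrics struct {
	maxQueueLen  int
	meanQueueLen float64
	totalWaiters int
}

// computeQueueMetrics walks all wait-queues in one pass, as would
// happen at metric-scrape time, rather than maintaining the
// aggregates incrementally on every enqueue/dequeue.
func computeQueueMetrics(queues []lockWaitQueue) queueMetrics {
	var m queueMetrics
	for _, q := range queues {
		m.totalWaiters += q.waiters
		if q.waiters > m.maxQueueLen {
			m.maxQueueLen = q.waiters
		}
	}
	if len(queues) > 0 {
		m.meanQueueLen = float64(m.totalWaiters) / float64(len(queues))
	}
	return m
}

func main() {
	queues := []lockWaitQueue{{waiters: 1}, {waiters: 5}, {waiters: 0}}
	m := computeQueueMetrics(queues)
	fmt.Println(m.maxQueueLen, m.meanQueueLen, m.totalWaiters) // 5 2 6
}
```

Because the pass runs only at the sampling interval, its cost is amortized away even when the number of queues is large.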

@nvb
Copy link
Copy Markdown
Contributor Author

nvb commented Jul 8, 2021

These were good points, thanks for raising them. Adding just this single metric was useful, but also not nearly enough. The lockTable state has been lacking observability for a while. I opened #67350 to improve upon this PR and introduce more comprehensive metric tracking at this level to address the general observability issue.

@nvb nvb closed this Jul 8, 2021
@nvb nvb deleted the nvanbenschoten/waitQueueCountMetric branch July 8, 2021 20:01
craig bot pushed a commit that referenced this pull request Jul 16, 2021
67350: kv/concurrency: expose lock table counters, introduce timeseries metrics r=sumeerbhola,andreimatei,dhartunian a=nvanbenschoten

Replaces #66343.

This change exposes access to a collection of counters that describe the state of a lockTable, including information on locks and on lock wait-queues. Some of these counters are plumbed up through debug endpoints, providing access to this information in the DB console and through debug.zips.

It then introduces the following four timeseries metrics:
- `kv.concurrency.locks`
- `kv.concurrency.locks_with_wait_queues`
- `kv.concurrency.lock_wait_queue_waiters`
- `kv.concurrency.max_lock_wait_queue_waiters_for_lock`

These are computed as aggregations over all of the ranges on a given store using the new lock table counters. They are collected periodically - every 10s.
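As a rough illustration of the aggregation described above (not the actual CockroachDB implementation), a periodic per-store pass over per-replica counters might look like the sketch below. The `lockTableCounters` fields are illustrative names chosen to mirror the four metrics:

```go
package main

import "fmt"

// lockTableCounters loosely mirrors the per-replica counters the
// change exposes; the field names here are illustrative, not the
// actual CockroachDB identifiers.
type lockTableCounters struct {
	locks                      int64
	locksWithWaitQueues        int64
	lockWaitQueueWaiters       int64
	maxWaitQueueWaitersForLock int64
}

// storeMetrics aggregates counters across all replicas on a store,
// the way a periodic (e.g. every-10s) metrics pass might: sums for
// the count metrics, a max for the per-lock waiter maximum.
func storeMetrics(replicas []lockTableCounters) lockTableCounters {
	var agg lockTableCounters
	for _, c := range replicas {
		agg.locks += c.locks
		agg.locksWithWaitQueues += c.locksWithWaitQueues
		agg.lockWaitQueueWaiters += c.lockWaitQueueWaiters
		if c.maxWaitQueueWaitersForLock > agg.maxWaitQueueWaitersForLock {
			agg.maxWaitQueueWaitersForLock = c.maxWaitQueueWaitersForLock
		}
	}
	return agg
}

func main() {
	replicas := []lockTableCounters{
		{locks: 2, locksWithWaitQueues: 1, lockWaitQueueWaiters: 3, maxWaitQueueWaitersForLock: 3},
		{}, // most replicas have empty lock tables
		{locks: 1, locksWithWaitQueues: 1, lockWaitQueueWaiters: 1, maxWaitQueueWaitersForLock: 1},
	}
	agg := storeMetrics(replicas)
	fmt.Println(agg.locks, agg.lockWaitQueueWaiters, agg.maxWaitQueueWaitersForLock) // 3 4 3
}
```

The empty-replica case matters for the cost argument: aggregating an all-zero counter struct is nearly free, which is why the per-replica overhead stays small on large clusters.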

The process of collecting these metrics is made efficient enough that I do not have major concerns about calling this once every 10s per range (from `computePeriodicMetrics`). The vast majority of ranges in large clusters will have empty lock tables, so this will only add about 50ns per replica in `updateReplicationGauges`. Even on a node with 100k ranges, this will only take about 5ms.

```
name                            time/op
LockTableMetrics/locks=0-16     43.0ns ± 4%
LockTableMetrics/locks=1-16     76.1ns ± 3%
LockTableMetrics/locks=16-16     389ns ± 3%
LockTableMetrics/locks=256-16   5.58µs ± 5%
LockTableMetrics/locks=4096-16  89.6µs ± 4%

name                            alloc/op
LockTableMetrics/locks=0-16      0.00B
LockTableMetrics/locks=1-16      0.00B
LockTableMetrics/locks=16-16     0.00B
LockTableMetrics/locks=256-16    0.00B
LockTableMetrics/locks=4096-16   0.00B

name                            allocs/op
LockTableMetrics/locks=0-16       0.00
LockTableMetrics/locks=1-16       0.00
LockTableMetrics/locks=16-16      0.00
LockTableMetrics/locks=256-16     0.00
LockTableMetrics/locks=4096-16    0.00
```

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
