
kvserver: quota pool observability #79756

@tbg

Description

Is your feature request related to a problem? Please describe.

We know of several instances in which the behavior of the quota pool was poorly understood.
The quota pool exports no metrics, and there is no way to proactively pull information out of it.

Common questions are:

  • is the quota pool throttling, and if so, how severely?
  • is a given quota pool ignoring followers, and which ones?

Describe the solution you'd like

Metrics:

  • Slow acquisitions, i.e. a Gauge that tracks all ongoing acquisitions that have
    triggered logSlowRaftProposalQuotaAcquisition.
  • Quota pool wait times, i.e. a latency histogram into which each Acquire() call records its latency. This measures how pervasively the quota pool throttles across the replicas.
  • Quota pool allocations, i.e. a bytes histogram into which each allocation (by any replicaID) is recorded. The rate of the sum effectively provides a throughput for the range, and the distribution also breaks down raft proposal sizes.
  • the number of times the quota pool ignores a follower for the purposes of proposal quota enforcement, with one counter metric (or prometheus label) for each reason:
    • follower inactive (lastUpdateTimes check)
    • no healthy conn (ConnHealth check)
    • below base index
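
A minimal sketch of how these metrics could be declared, using the standard Prometheus Go client for concreteness (CockroachDB would declare them through its internal metric registry instead; all metric names below are hypothetical):

    package kvserver

    import "github.com/prometheus/client_golang/prometheus"

    var (
        // Ongoing acquisitions that have triggered
        // logSlowRaftProposalQuotaAcquisition; Inc() when the log fires,
        // Dec() when the acquisition completes.
        slowQuotaAcquisitions = prometheus.NewGauge(prometheus.GaugeOpts{
            Name: "quota_pool_slow_acquisitions",
            Help: "Ongoing quota acquisitions above the slow-acquisition threshold.",
        })
        // Each Acquire() call records how long it waited.
        quotaWaitSeconds = prometheus.NewHistogram(prometheus.HistogramOpts{
            Name:    "quota_pool_wait_seconds",
            Help:    "Latency of quota pool Acquire() calls.",
            Buckets: prometheus.ExponentialBuckets(0.001, 2, 16),
        })
        // Each allocation records its size; rate(sum) approximates range
        // throughput, and the buckets break down raft proposal sizes.
        quotaAllocBytes = prometheus.NewHistogram(prometheus.HistogramOpts{
            Name:    "quota_pool_allocation_bytes",
            Help:    "Size of individual quota pool allocations.",
            Buckets: prometheus.ExponentialBuckets(64, 4, 10),
        })
        // Incremented each time a follower is ignored, labeled by reason.
        quotaIgnoredFollowers = prometheus.NewCounterVec(prometheus.CounterOpts{
            Name: "quota_pool_ignored_followers_total",
            Help: "Followers ignored for proposal quota enforcement, by reason.",
        }, []string{"reason"}) // follower_inactive | no_healthy_conn | below_base_index
    )

The per-reason counter would then be bumped at each check site, e.g. quotaIgnoredFollowers.WithLabelValues("follower_inactive").Inc() next to the lastUpdateTimes check.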

Observability via kvserverpb.RangeInfo:

  • ApproximateQuota() (already there)
  • Capacity
  • quota release queue length and base index
    // The base index is the index up to (including) which quota was already
    // released. That is, the first element in quotaReleaseQueue below is
    // released as the base index moves up by one, etc.
    proposalQuotaBaseIndex uint64
    // Once the leader observes a proposal come 'out of Raft', we add the size
    // of the associated command to a queue of quotas we have yet to release
    // back to the quota pool. At that point ownership of the quota is
    // transferred from r.mu.proposals to this queue.
    // We'll release the respective quota once all replicas have persisted the
    // corresponding entry into their logs (or once we give up waiting on some
    // replica because it looks like it's dead).
    quotaReleaseQueue []*quotapool.IntAlloc
  • the current result of the follower-ignore logic as applied to each follower, i.e. a slice of [replicaID, ignoreReason] pairs mirroring the metrics requested above.
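
A hypothetical Go sketch of the additional fields (the real change would extend the kvserverpb.RangeInfo proto; the names below are illustrative, not the actual definitions):

    package kvserverpb

    // QuotaPoolInfo sketches the per-range quota pool state that RangeInfo
    // could carry.
    type QuotaPoolInfo struct {
        ApproximateQuota       uint64 // already exposed today
        Capacity               uint64 // configured maximum of the pool
        ReleaseQueueLen        int64  // len(quotaReleaseQueue)
        ProposalQuotaBaseIndex uint64 // see the comment above
        // One entry per follower currently excluded from quota enforcement.
        IgnoredFollowers []IgnoredFollower
    }

    // IgnoredFollower mirrors the per-reason counters requested under Metrics.
    type IgnoredFollower struct {
        ReplicaID    int32
        IgnoreReason string // follower_inactive | no_healthy_conn | below_base_index
    }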

Problem ranges:

  • highlight slow proposal quota acquisitions (>15s), as these are definitely a problem. (An empty quota pool is not a problem by itself - it is the expected steady state when a follower is slightly slower than the rest.)
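
A minimal sketch of that check, assuming a hypothetical value that tracks the start time of the replica's oldest ongoing quota acquisition (zero if there is none):

    package kvserver

    import "time"

    // slowQuotaThreshold mirrors the >15s threshold proposed above.
    const slowQuotaThreshold = 15 * time.Second

    // isProblemRange reports whether the range should be surfaced on the
    // problem-ranges page; oldestAcquisitionStart is a hypothetical value
    // the quota pool would need to track.
    func isProblemRange(oldestAcquisitionStart, now time.Time) bool {
        return !oldestAcquisitionStart.IsZero() &&
            now.Sub(oldestAcquisitionStart) > slowQuotaThreshold
    }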

Logging:

  • over each (say) 10s interval, save (keyed by distinct rangeID+replicaID, in a store-wide map) up to (say) ten raft statuses for followers that the quota pool did not consider for quota release. Then, every ~10 seconds, print a message if there is anything to print (in the common case there is not), e.g.:

quota pool not enforced for:
r100/5: follower_inactive base=150 status=[insert raft status]
r100/7: base_index base=160 status=[...]
r200/2: no_healthy_conn [...]
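
A sketch of the store-wide aggregation, assuming a hypothetical ignoredFollowerLog type whose Flush() is driven by a periodic loop elsewhere; in CockroachDB the output would go through the log package rather than fmt:

    package kvserver

    import (
        "fmt"
        "strings"
        "sync"
    )

    const maxEntriesPerInterval = 10

    type ignoredFollowerKey struct {
        rangeID   int64
        replicaID int32
    }

    type ignoredFollowerLog struct {
        mu      sync.Mutex
        entries map[ignoredFollowerKey]string // reason + raft status summary
    }

    // Record notes that the quota pool skipped a follower, keeping at most
    // maxEntriesPerInterval distinct (rangeID, replicaID) pairs per interval.
    func (l *ignoredFollowerLog) Record(rangeID int64, replicaID int32, detail string) {
        l.mu.Lock()
        defer l.mu.Unlock()
        if l.entries == nil {
            l.entries = make(map[ignoredFollowerKey]string)
        }
        k := ignoredFollowerKey{rangeID, replicaID}
        if _, ok := l.entries[k]; !ok && len(l.entries) >= maxEntriesPerInterval {
            return // cap reached; drop further distinct pairs this interval
        }
        l.entries[k] = detail
    }

    // Flush emits one message for the interval, or nothing if no follower
    // was skipped (the common case).
    func (l *ignoredFollowerLog) Flush() {
        l.mu.Lock()
        entries := l.entries
        l.entries = nil
        l.mu.Unlock()
        if len(entries) == 0 {
            return
        }
        var sb strings.Builder
        sb.WriteString("quota pool not enforced for:\n")
        for k, detail := range entries {
            fmt.Fprintf(&sb, "r%d/%d: %s\n", k.rangeID, k.replicaID, detail)
        }
        fmt.Print(sb.String())
    }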

Describe alternatives you've considered

Additional context

The above suggestion was written up quickly, and shouldn't be considered final or flawless.

Jira issue: CRDB-15931
