admission: additional observability

In order of importance and/or done-ness:
- [x] #87883;
- [x] #87424;
- [x] https://github.com/cockroachdb/cockroach/issues/89814;
- [x] https://github.com/cockroachdb/cockroach/pull/93217;
- [x] #88076;
  - [ ] Reduce the cardinality of `Pri * WorkQueues` by:
    - [ ] Skipping segmentation of the `errored`, `requested` and `admitted` counts.
    - [ ] Only introducing a new histogram for `NormalPri` for relevant work queues. Since we already have histograms for the full work queue across all requests, comparing the two tells us how foreground load behaves versus everything else. Alternatively, add segmented histograms for only three buckets: NormalPri, everything higher than NormalPri, and everything lower than NormalPri.
    - [ ] Only introducing the segmentation for the kv and kv-stores queues.
  - [ ] Backport to 22.2.
- [x] Metric capturing compaction bandwidth out of L0 (which is used to generate write tokens in admission control), or metric capturing tokens generated directly;
- [x] Introduce metrics for how many L0 tokens are being consumed/produced
- [x] Fix math.MaxInt64 scale for available IO tokens metric when unlimited tokens are present;
- [x] https://github.com/cockroachdb/cockroach/issues/92673
- [x] Fix the AC queue histograms to also include requests that don't get queued; otherwise the percentiles are wrong
- [x] Metric capturing IO tokens consumed by bypassed work;
- [x] Make sure the flow control graphs are present in the Overload dashboard (the flow token wait time histograms, admitted rates, blocked replication streams, flow token deductions in bytes)
- [ ] Export scheduling latency metric even if elastic cpu limiter is disabled (useful in v22.2)
- [ ] Fix units in the Overload dashboard (https://github.com/cockroachdb/cockroach/issues/110056)
- [ ] Metric capturing how long CPU slots are held on average, to reason about sudden changes in slot count while the total incoming request rate for slots stays unchanged. When requests hold onto slots for longer (due to longer IO wait times, or increased latch/lock wait times), we need to increase the slot count to service incoming work. Sometimes this increase is not fast enough and the result is AC queuing like we saw in https://github.com/cockroachlabs/support/issues/1921.
- [ ] We could log the {max,min} slot count and {max,min} runnable goroutine count every second, or export metrics for them. In internal experimentation we find ourselves reaching for these numbers;
- [ ] https://github.com/cockroachdb/cockroach/issues/96495

Jira issue: CRDB-16641

Epic CRDB-25469

admission: additional observability #82743

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

admission: additional observability #82743

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions