admission: additional observability #82743
Closed
Labels: A-admission-control, C-enhancement, T-admission-control
Description
In order of importance and/or done-ness:
- schedulerlatency: export Go scheduling latency metric #87883;
- kvserver: add IOThreshold to metrics #87424;
- admission: export elastic CPU utilization % as a metric, natively #89814;
- ui: add go scheduling latency graph to Overload dashboard #93217;
- admission: Split stats for WorkQueueMetrics #88076;
  - Reduce the cardinality of `Pri * WorkQueues` by:
    - Skip segmenting the `errored`, `requested` and `admitted` counts
    - Only introduce a new histogram for `NormalPri` for relevant work queues. Since we have the histograms for the full work queue across all requests, it can tell us how foreground load behaves and how everything else does. Alternatively add segmentation histogram for only NormalPri, everything higher than NormalPri, everything less than NormalPri.
    - Only introduce the segmentation for kv and kv-stores queues.
  - Backport to 22.2.
- Metric capturing compaction bandwidth out of L0 (which is used to generate write tokens in admission control), or metric capturing tokens generated directly;
- Introduce metrics for how many L0 tokens are being consumed/produced
- Fix math.MaxInt64 scale for available IO tokens metric when unlimited tokens are present;
- admission: better observability of slot adjustment behavior #92673
- Fix the AC queue histograms to also include requests that don't get queued, otherwise the percentiles are wrong
- Metric capturing IO tokens consumed by bypassed work;
- Make sure the flow control graphs are present in the Overload dashboard (the flow token wait time histograms, admitted rates, blocked replication streams, flow token deductions in bytes)
- Export scheduling latency metric even if elastic cpu limiter is disabled (useful in v22.2)
- Fix units in the Overload dashboard (ui: fix overload dashboard units #110056)
- Metric capturing how long CPU slots are held on average, to reason about sudden changes in slot count where the total incoming request rate for slots stays unchanged. When requests hold onto slots for longer (due to longer IO wait times, or increased latch/lock wait times), we need to increase slot count to service incoming work. Sometimes this increase is not fast enough and there's resulting AC queuing like we saw in https://github.com/cockroachlabs/support/issues/1921.
- We could log the {max,min} slot count and {max,min} runnable goroutine count every second, or export metrics for it. In internal experimentation we find ourselves reaching for it;
- admission: CPU metrics for high concurrency scenarios #96495
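
To illustrate the item above about AC queue histograms excluding requests that never queue, here is a minimal sketch of how the percentiles shift once those requests are recorded with a zero wait. This is not CockroachDB code: `percentile`, `withBypassed`, `exampleWaits` and the sample numbers are all hypothetical, and the nearest-rank percentile is only one of several reasonable definitions.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// percentile returns the nearest-rank p-th percentile (1 <= p <= 100) of
// waits. It returns 0 for an empty input. Hypothetical helper.
func percentile(waits []time.Duration, p int) time.Duration {
	if len(waits) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), waits...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	// Integer ceiling of p/100 * n, to avoid floating-point rounding.
	rank := (p*len(sorted) + 99) / 100
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// withBypassed adds one zero-wait sample for each request that was admitted
// without queueing, so the distribution covers all requests.
func withBypassed(queued []time.Duration, admittedImmediately int) []time.Duration {
	all := append([]time.Duration(nil), queued...)
	for i := 0; i < admittedImmediately; i++ {
		all = append(all, 0)
	}
	return all
}

// exampleWaits returns made-up data: 10 requests that each queued for 50ms.
func exampleWaits() []time.Duration {
	waits := make([]time.Duration, 10)
	for i := range waits {
		waits[i] = 50 * time.Millisecond
	}
	return waits
}

func main() {
	queued := exampleWaits()
	// Assume 90 further requests were admitted without queueing.
	all := withBypassed(queued, 90)

	fmt.Println("p50, queued only: ", percentile(queued, 50))
	fmt.Println("p50, all requests:", percentile(all, 50))
	fmt.Println("p99, all requests:", percentile(all, 99))
}
```

With this made-up data, the median over queued requests alone is 50ms, while the median over all 100 requests is zero, because 90% of them never waited. Only the tail (p99 here) still reflects the 50ms queueing delay.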
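
The reasoning behind the slot hold time metric above can be sketched with Little's law: the mean number of slots in use equals the arrival rate multiplied by the mean hold time. The snippet below is only an illustration under that steady-state assumption. `slotsInUse` and the numbers are hypothetical and do not describe how admission control actually sizes slots.

```go
package main

import "fmt"

// slotsInUse applies Little's law (L = lambda * W): the mean number of
// occupied slots is the arrival rate times the mean hold time.
// The hold time is given in milliseconds. Hypothetical helper.
func slotsInUse(arrivalsPerSec, meanHoldMillis float64) float64 {
	return arrivalsPerSec * meanHoldMillis / 1000
}

func main() {
	const rate = 2000.0 // requests per second, unchanged in both cases

	// Hold time grows from 2ms to 10ms (for example, due to longer IO waits).
	fmt.Println("slots needed at 2ms hold: ", slotsInUse(rate, 2))
	fmt.Println("slots needed at 10ms hold:", slotsInUse(rate, 10))
}
```

At a constant 2000 requests per second, a rise in mean hold time from 2ms to 10ms raises the slots needed from 4 to 20. A hold time metric would make this kind of shift visible even when the request rate graph looks flat.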
Jira issue: CRDB-16641
Epic CRDB-25469