admission: CPU slot adjustment and utilization metrics#95007
admission: CPU slot adjustment and utilization metrics#95007craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
irfansharif
left a comment
There was a problem hiding this comment.
If you haven't sanity checked these metrics by hand, next week you could look at the results of the weekly admission-control/multitenant-fairness/read-heavy tests on the prometheus instance the test-eng folks have running. Or actually, that test spits out a prometheus dump in the weekly artifacts (I'm not sure if they get nuked on successful runs).
pkg/util/admission/granter.go
Outdated
| } | ||
| kvSlotAdjusterIncrements = metric.Metadata{ | ||
| Name: "admission.granter.slot_adjuster_increments.kv", | ||
| Help: "Number of increments of the total KV slots ", |
There was a problem hiding this comment.
Extra space at the end of this help text and the one below.
| KVTotalSlots: metric.NewGauge(totalSlots), | ||
| KVUsedSlots: metric.NewGauge(addName(workKindString(KVWork), usedSlots)), | ||
| // TODO(sumeer): remove moderate load slots and soft slots code and | ||
| // metrics. |
There was a problem hiding this comment.
Refer to #88032? Which will be closed once removed.
sumeerbhola
left a comment
There was a problem hiding this comment.
Thanks for the pointer.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @irfansharif)
pkg/util/admission/grant_coordinator.go line 976 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Refer to #88032? Which will be closed once removed.
Done
pkg/util/admission/granter.go line 760 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Extra space at the end of this help text and the one below.
Done
Our existing metrics are gauges (total and used slots) which don't give us
insight into what is happening at smaller time scales. This creates
uncertainty when we observe admission queueing but the gauge samples show
total slots consistenly greater than used slots. Additionally, if total
slots is steady during queuing, it doesn't tell us whether that was because
of roughly matching increments or decrements, or no increments/decrements.
The following metrics are added:
- admission.granter.slots_exhausted_duration.kv: cumulative duration when
the slots were exhausted. This can give insight into how much exhaustion
was occurring. It is insufficient to tell us whether 0.5sec/sec of
exhaustion is due to a long 500ms of exhaustion and then non-exhaustion
or alternating 1ms of exhaustion and non-exhaustion. But this is an
improvement over what we have.
- admission.granter.slot_adjuster_{increments,decrements}.kv: Counts the
increments and decrements of the total slots.
- admission.granter.cpu_load_{short,long}_period_duration.kv: cumulative
duration of short and long ticks, as indicated by the period in the
CPULoad callback. We don't expect long period ticks when admission
control is active (and we explicitly disable enforcement during long
period ticks), but it helps us eliminate some hypothesis during
incidents (e.g. long period ticks alternating with short period ticks
causing a slow down in how fast we increment slots). Additionally, the
sum of the rate of these two, if significantly < 1, would indicate that
CPULoad frequency is lower than expected, say due to CPU overload.
Fixes cockroachdb#92673
Epic: none
Release note: None
|
bors r=irfansharif |
|
Build succeeded: |
|
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from 774df8d to blathers/backport-release-22.1-95007: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 22.1.x failed. See errors above. error creating merge commit from 774df8d to blathers/backport-release-22.2-95007: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 22.2.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
95590: admission: remove soft/moderate load slots r=irfansharif a=irfansharif We originally introduced these notions in admission control (#78519) for additional threads for Pebble compaction compression. We envisioned granting these "squishy" slots to background activities and permit work only under periods of low load. In working through #86638 (as part of \#75066), we observed experimentally that the moderate-slots count was not sensitive enough to scheduling latency, and consequently latency observed by foreground traffic. Elastic CPU tokens, the kind now being used for backups, offer an alternative to soft slots. We've since replaced uses of soft slots with elastic CPU tokens. This PR just removes the now dead-code code around soft/moderate load slots (it's better to minimize the number of mechanisms in the admission package). Fixes #88032. Release note: None --- First commit is from #95007. Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Our existing metrics are gauges (total and used slots) which don't give us insight into what is happening at smaller time scales. This creates uncertainty when we observe admission queueing but the gauge samples show total slots consistenly greater than used slots. Additionally, if total slots is steady during queuing, it doesn't tell us whether that was because of roughly matching increments or decrements, or no increments/decrements.
The following metrics are added:
Fixes #92673
Epic: none
Release note: None