admission: token bucket in kvStoreTokenGranter should be replenished every 1ms #91509
Description
kvStoreTokenGranter keeps track of tokens corresponding to "L0 bandwidth" (a single value computed by ioLoadListener based on flush bandwidth into L0 and compaction bandwidth out of L0), and disk bandwidth (also computed by ioLoadListener and used for elastic work). The ioLoadListener computes tokens for a 15s interval and then doles them out to kvStoreTokenGranter at 250ms intervals. This 250ms interval is problematic in that it can result in high latency and prevents latency isolation (it does not prevent throughput isolation).
As a simple example, consider a scenario where each request needs a 1 byte token, and 1000 tokens are added every 250ms. There is a uniform arrival rate of 2000 high priority requests/s, i.e., 500 requests uniformly distributed over each 250ms interval, and a uniform arrival rate of 10,000 low priority requests/s, i.e., 2500 requests uniformly distributed over each 250ms interval. There are more than enough tokens to fully satisfy the high priority requests (they use only 50% of the tokens), but not enough for the low priority requests. Ignore the fact that the latter will result in indefinite queue growth in the admission control WorkQueue. At a particular 250ms tick, the token bucket will go from 0 tokens to 1000 tokens. Any queued high priority requests will be immediately granted their tokens, until there are no queued high priority requests. Then, since there is always a large number of low priority requests waiting, they will be granted until 0 tokens remain. Now we have a 250ms duration until the next replenishment and 0 tokens, so any high priority requests arriving will have to wait. The maximum wait time is 250ms.
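This arithmetic can be reproduced with a toy discrete-time simulation (a sketch using made-up structure and the example's parameters, not CockroachDB code): 1ms steps, a 1000-token refill every 250ms, and strict priority granting. In steady state, a high priority request arriving just after the bucket empties waits essentially the full 250ms interval.

```go
package main

import "fmt"

// simulate is a toy discrete-time model (1ms steps) of the scenario above:
// 1000 tokens refilled every 250ms, 2 high-pri and 10 low-pri requests
// arriving per ms, each needing 1 token, strict priority granting. It
// returns the maximum observed high priority wait in ms.
func simulate(simMs int) int {
	const (
		tickIntervalMs = 250  // replenishment interval in ms
		tokensPerTick  = 1000 // tokens added (and clamped to) per tick
		highPerMs      = 2    // 2000 high priority requests/s
		lowPerMs       = 10   // 10000 low priority requests/s
	)
	tokens := 0
	var highQ, lowQ []int // arrival times (ms) of queued requests
	maxHighWait := 0
	for t := 0; t < simMs; t++ {
		if t%tickIntervalMs == 0 {
			tokens = tokensPerTick // refill; unused tokens are clamped away
		}
		for i := 0; i < highPerMs; i++ {
			highQ = append(highQ, t)
		}
		for i := 0; i < lowPerMs; i++ {
			lowQ = append(lowQ, t)
		}
		// Strict priority: grant high-pri first, then low-pri, until the
		// bucket is empty.
		for tokens > 0 && len(highQ) > 0 {
			if w := t - highQ[0]; w > maxHighWait {
				maxHighWait = w
			}
			highQ, tokens = highQ[1:], tokens-1
		}
		for tokens > 0 && len(lowQ) > 0 {
			lowQ, tokens = lowQ[1:], tokens-1
		}
	}
	return maxHighWait
}

func main() {
	fmt.Println("max high-pri wait (ms):", simulate(10000))
	// prints: max high-pri wait (ms): 249
}
```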
If replenishment were running at 1ms intervals, the maximum wait time would be 1ms, which is probably good enough for latency isolation even for transactions with many statements and BatchRequests (each of which could see that 1ms latency increase). There are two things to keep in mind when making this change:
- We cannot run 1ms ticks for unloaded systems: we have tried this before for goroutine scheduler runnable monitoring -- see the comment below. A simple solution in our case would be to run at the usual 250ms interval when there are unlimited tokens, and at 1ms otherwise.
cockroach/pkg/util/goschedstats/runnable.go, lines 57 to 68 in 1b4aa43:

```go
// We sample the number of runnable goroutines once per samplePeriodShort or
// samplePeriodLong (if the system is underloaded). Using samplePeriodLong can
// cause sluggish response to a load spike, from the perspective of
// RunnableCountCallback implementers (admission control), so it is not ideal.
// We support this behavior only because we have observed 5-10% of cpu
// utilization on CockroachDB nodes that are doing no other work, even though
// 1ms polling (samplePeriodShort) is extremely cheap. The cause may be a poor
// interaction with processor idle state
// https://github.com/golang/go/issues/30740#issuecomment-471634471. See
// #66881.
const samplePeriodShort = time.Millisecond
const samplePeriodLong = 250 * time.Millisecond
```

- The replenishment logic clamps the tokens at the increment value: see the snippets below.
cockroach/pkg/util/admission/granter.go, lines 546 to 549 in 1b4aa43:

```go
if sg.availableIOTokens > tokens {
	// Clamp to tokens.
	sg.availableIOTokens = tokens
}
```

cockroach/pkg/util/admission/granter.go, lines 558 to 560 in 1b4aa43:

```go
if sg.elasticDiskBWTokensAvailable > tokens {
	sg.elasticDiskBWTokensAvailable = tokens
}
```

This clamping avoids accumulating unused tokens, which would allow a huge burst later (we do not want that). There are two risks in keeping this clamping with token replenishment at 1ms intervals:
- The tokens given for a 1ms interval may not be enough to admit even a single request: this is not an issue in our implementation, since the granter will hand out tokens as long as there are > 0 tokens, and will let the token count go negative.
- Wasted tokens with bursty workloads: if the traffic is bursty at time scales slightly larger than 1ms, say 10ms of no traffic followed by a 1ms burst, then the tokens added during the 10ms of no traffic will be wasted because of the clamping, and we will admit less work. We should simply use the previous replenishment interval as the burst multiplier. That is, when adding t tokens at a 1ms interval, allow the total tokens to accumulate up to 250*t.
Jira issue: CRDB-21298
Epic: CRDB-25469