
admission: change sgc.allocateIOOnTick interval #103736

Merged
craig[bot] merged 1 commit into cockroachdb:master from bananabrick:lower_token_set
Jun 7, 2023

Conversation

@bananabrick
Contributor

@bananabrick bananabrick commented May 22, 2023

We currently allocate tokens for the kvStoreTokenGranter
every 250ms. This PR moves this allocation interval to
1ms. Since unloaded systems use too much CPU with a 1ms
interval, we implement a mechanism to switch back and forth
between a 250ms interval, when the system is unloaded, and
a 1ms interval, when the system is loaded. We maintain a
given interval for this allocation over every adjustment
interval. That is, we make the decision to switch to a
different allocation interval at the end of the adjustment
interval.

Fixes: #91509
Epic: none
Release note: None
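For concreteness, the switching scheme in this commit message can be sketched as below. The constants and names are illustrative only (they are not necessarily the identifiers used in the PR); the key point is that the tick count per adjustment interval is derived from the tick duration, so the two never drift apart.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical constants mirroring the intervals discussed in this PR.
const (
	adjustmentInterval   = 15 * time.Second
	unloadedTickDuration = 250 * time.Millisecond // coarse ticks, cheap on CPU
	loadedTickDuration   = 1 * time.Millisecond   // fine-grained ticks under load
)

// ticksInAdjustmentInterval derives the number of ticks per 15s adjustment
// interval from the current tick duration.
func ticksInAdjustmentInterval(d time.Duration) int64 {
	return int64(adjustmentInterval / d)
}

func main() {
	fmt.Println(ticksInAdjustmentInterval(unloadedTickDuration)) // 60
	fmt.Println(ticksInAdjustmentInterval(loadedTickDuration))   // 15000
}
```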

@cockroach-teamcity
Member

This change is Reviewable

@bananabrick
Contributor Author

Needs some cleanup before review.

Contributor Author

@bananabrick bananabrick left a comment


Looking for feedback on the structure and some of the logic I used here. Mostly implements all of the suggestions in #91509. I owe some ioLoadListener tests.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/util/admission/granter.go line 525 at r1 (raw file):

	maxTokensAllowed := ioTokens * burstMultiplier
	if prevAvailableIOTokens > maxTokensAllowed {
		// Only allow availableIOTokens to increase up to maxTokensAllowed when the

Not sure if we need this. I was concerned about times where a large number of tokens are available, say 100000, with a 250 ms token allocation interval, and then we move to a 1ms token allocation interval, and call setAvailableTokens with the ioTokens == 300. In such a case, even though we technically allow availableTokens to go up to 250*300=75000, I don't think we should.
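The arithmetic in this scenario can be made concrete with a tiny sketch (the burstMultiplier value of 250 is assumed here from the 250ms/1ms ratio in the thread; the function is illustrative, not the PR's actual implementation):

```go
package main

import "fmt"

// Assumed from the discussion: the burst multiplier equals the number of 1ms
// ticks in the old 250ms allocation interval.
const burstMultiplier = 250

// clampAvailable sketches the clamp under discussion: carried-over
// availableIOTokens may not exceed ioTokens * burstMultiplier.
func clampAvailable(prevAvailable, ioTokens int64) int64 {
	maxTokensAllowed := ioTokens * burstMultiplier
	if prevAvailable > maxTokensAllowed {
		return maxTokensAllowed
	}
	return prevAvailable
}

func main() {
	// The scenario above: 100000 tokens left over from a 250ms interval,
	// then setAvailableTokens is called with ioTokens == 300 at 1ms ticks.
	fmt.Println(clampAvailable(100000, 300)) // 75000, i.e. 250*300
}
```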

@bananabrick bananabrick marked this pull request as ready for review May 23, 2023 04:44
@bananabrick bananabrick requested a review from a team as a code owner May 23, 2023 04:44
@bananabrick bananabrick changed the title admission: increase frequency of calls to setAvailableTokens admission: change sgc.allocateIOOnTick interval May 23, 2023
@bananabrick bananabrick requested a review from irfansharif May 23, 2023 04:45
Contributor

@irfansharif irfansharif left a comment


Haven't looked at the test changes yet, will do after our call. Left some comments and some nits that you should feel free to ignore. The structure overall looks right to me.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bananabrick)


pkg/util/admission/grant_coordinator.go line 81 at r2 (raw file):

	sgc.closeCh = make(chan struct{})
	metrics := sgc.pebbleMetricsProvider.GetPebbleMetrics()
	// System starts of unloaded/with unlimited tokens, so use the unloaded values.

Nit: "starts off"


pkg/util/admission/grant_coordinator.go line 119 at r2 (raw file):

				if ticks%currIntervalData.ticksInAdjustmentInterval == 0 {
					if timeElapsed := time.Since(currTime); timeElapsed - (15 * time.Second) > time.Millisecond {
						// TODO(bananabrick): Get rid of this.

Leaving this comment thread here to remind us to get rid of it + the currTime state.


pkg/util/admission/grant_coordinator.go line 132 at r2 (raw file):
This could use a code comment. BTW, looking through this code block, if we're in the systemUnloaded regime, would we only switch to the !systemUnloaded regime after 15s? Which in turn dictates the switching between unloadedIntervalData and loadedIntervalData? We want to switch more frequently than that, like your commit message says:

That is, we make the decision to switch to a different allocation interval at the end of the adjustment interval.

There's also an oddity here with multiple stores, some that might be unloaded and others loaded. We're actually just using the tick regime corresponding to the very last store in this iteration. Perhaps what we want is to pick the unloaded tick regime iff all of the stores have unlimited tokens. Worth correcting and adding a comment.


pkg/util/admission/grant_coordinator.go line 146 at r2 (raw file):

					}
					ticker.Reset(currIntervalData.tickDuration)
					ticks = 0

Mind explaining to me why we're setting this to 0? Is it so that we're not slow by up to 250ms in passing the ticks%currIntervalData.ticksInAdjustmentInterval == 0 check in a subsequent loop iteration? If so, I don't think that's important -- 15s +/- 250ms is as coarse a granularity for token changes as 15s +/- 0ms.


pkg/util/admission/io_load_listener.go line 222 at r2 (raw file):

//     explicitly don't want a huge burst every 15s.
//   - For loaded systems, a replenishment rate equal to
//     totalNumByteTokens/15000(once per ma), with a burst capped at

Add a bullet point here that explains why the burst is capped to the same amount independent of the replenishment rate.

Nit: s/ma/ms.
Nit: there's a linebreak in the comment block that now makes it two comment blocks.


pkg/util/admission/io_load_listener.go line 259 at r2 (raw file):

// compactions).
//
// At most a 250ms interval to hand out the computed tokens is due to the

Nit: Something's off with the start of this sentence.


pkg/util/admission/io_load_listener.go line 272 at r2 (raw file):

//
// We use a 1ms interval for handing out tokens, to avoid upto 250ms wait times
// for high priority requests. See https://github.com/cockroachdb/cockroach/issues/91509

Nit: Here and in the commit message, it's friendlier to future readers to just inline the commentary instead of linking to soon-to-{closed,stale} github issue.


pkg/util/admission/io_load_listener.go line 279 at r2 (raw file):

// Metadata associated with a single tick in an adjustment interval.
type tickIntervalData struct {
	ticksInAdjustmentInterval int64

Nit: This parameter is purely a function of the adjustmentInterval (a constant 15s) and the tickDuration (which varies between 250ms and 1ms), so it can be removed or made a function on this tickIntervalData type.

Very optional nit: Feel free to also use the time.Duration type for adjustmentInterval (i.e. 15 * time.Second), which maybe is more readable than the opaque 15 it is today.


pkg/util/admission/io_load_listener.go line 338 at r2 (raw file):

}

// allocateTokensTick gives out 1/currTicksInAdjustmentInterval of the

Nit: currTicksInAdjustmentInterval is not a variable declared below. You mean currTickIntervalData.ticksInAdjustmentInterval, right? Ditto for currIoTokenTickDuration.

@sumeerbhola sumeerbhola requested a review from a team May 23, 2023 14:32
Contributor

@irfansharif irfansharif left a comment


Looked at the tests.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bananabrick)


pkg/util/admission/granter.go line 525 at r1 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

Not sure if we need this. I was concerned about times where a large number of tokens are available, say 100000, with a 250 ms token allocation interval, and then we move to a 1ms token allocation interval, and call setAvailableTokens with the ioTokens == 300. In such a case, even though we technically allow availableTokens to go up to 250*300=75000, I don't think we should.

(Ignoring elastic disk bandwidth tokens, writing this for myself.)

1. So when starting from availableIOTokens=100000 and going through setAvailableTokens(300, 1ms), we end up with availableIOTokens=300 instead of availableIOTokens=75000. Previously, when all we had was the 250ms period, is it correct to say our call would've been setAvailableTokens(75000, 250ms)? So this behavior is changing now - at this point we're reducing the write burst.

2. a. In the next tick interval, if it's 1ms, and availableIOTokens=300, setAvailableTokens(299, 1ms) will result in availableIOTokens=74750. I'm using 299 instead of 300 because of the non-inclusive maxTokensAvailable, but that can be changed.

2. b. In the next tick interval, if it's 250ms, and availableIOTokens=300, setAvailableTokens(75000, 250ms) will result in availableIOTokens=75000.

So I'm not sure what the logic in step 1. is achieving.


pkg/util/admission/granter.go line 535 at r2 (raw file):

		sg.coordMu.availableIOTokens = ioTokens
	} else if sg.coordMu.availableIOTokens > maxTokensAllowed {
		sg.coordMu.availableIOTokens = ioTokens * burstMultiplier

s/ioTokens * burstMultiplier/maxTokensAllowed so the clamping logic is easier to read at a glance. Ditto below for s/elasticDiskBandwidthTokens * burstMultiplier/maxDiskBWTokensAllowed.

Nit: make maxTokensAllowed inclusive, so change the conditional to sg.coordMu.availableIOTokens >= maxTokensAllowed.


pkg/util/admission/granter_test.go line 247 at r2 (raw file):

			d.ScanArgs(t, "io-tokens", &ioTokens)
			d.ScanArgs(t, "elastic-disk-bw-tokens", &elasticTokens)
			if d.HasArg("tick-rate") {

Nit: it's more accurate to call this "tick-interval" given we're soliciting a duration.

Contributor Author

@bananabrick bananabrick left a comment


Addressed the high level comments. I'll address the nits tomorrow morning. I just realized that I didn't exercise the scenario described in the issue in a test. I'll write that next, and write tests for the ioLoadListener once I get some responses to my questions here.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif)


pkg/util/admission/grant_coordinator.go line 132 at r2 (raw file):

BTW, looking through this code block, if we're in the systemUnloaded regime, would we only switch to the !systemUnloaded regime after 15s?

We want to switch more frequently than that, like your commit message says:

On every tick we check if ticks%currIntervalData.ticksInAdjustmentInterval == 0, which means that we're at the end of an adjustment interval, and then we set the new value of the systemUnloaded (now systemLoaded) variable.

Are you referring to the first iteration of the loop, where the system is unloaded and it will take 15 seconds to switch to a new regime?

Perhaps what we want is to pick the unloaded tick regime iff all of the stores have unlimited tokens.

That's a good point. Done.


pkg/util/admission/grant_coordinator.go line 146 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Mind explaining to me why we're setting this to 0? Is it so that we're not slow by up to 250ms in passing the ticks%currIntervalData.ticksInAdjustmentInterval == 0 check in a subsequent loop iteration? If so, I don't think that's important -- 15s +/- 250ms is as coarse a granularity for token changes as 15s +/- 0ms.

I think I set it to 0 because I'm using the currIntervalData.ticksInCurrentInterval to make sure that we maintain an adjustment interval of 15 seconds.

If we tick at 250ms, which has 60 ticks in the adjustment interval (15s), then at the end of the first adjustment interval we're at 60 ticks. Then let's say we switch to a 1ms tick rate, which has 15000 ticks in the adjustment interval (15s). But since ticks is already 60, we'll hit 15000 ticks in 15000-60 ms. So, yeah, the adjustment interval will be off by 60ms; it'll be 15s - 60ms.

I guess it's small, but we allocate tokens for an adjustment interval based on the ticks in the adjustment interval. So I think we'll fail to allocate totalNumByteTokens * 60 / 15000 tokens. Not completely sure if this is a problem. I think I'll gain more context on this once I understand the totalNumByteTokens calculation in the ioLoadListener.
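The carry-over effect described here can be quantified with a small sketch (the constants are the ones discussed in this thread; the helper names are made up for illustration):

```go
package main

import "fmt"

// From the discussion: 60 slow (250ms) ticks or 15000 fast (1ms) ticks per
// 15s adjustment interval.
const (
	slowTicksPerInterval = 60
	fastTicksPerInterval = 15000
)

// ticksUntilBoundary: if the tick counter carries over `carried` ticks from
// the previous regime instead of being reset to 0, the next
// ticks%fastTicksPerInterval == 0 boundary arrives early by that many ticks.
func ticksUntilBoundary(carried int64) int64 {
	return fastTicksPerInterval - carried
}

// unallocatedTokens: tokens never handed out in the shortened interval, for a
// hypothetical totalNumByteTokens.
func unallocatedTokens(totalNumByteTokens, carried int64) int64 {
	return totalNumByteTokens * carried / fastTicksPerInterval
}

func main() {
	fmt.Println(ticksUntilBoundary(slowTicksPerInterval))         // 14940, i.e. interval shortened by 60ms
	fmt.Println(unallocatedTokens(1500000, slowTicksPerInterval)) // 6000 tokens unallocated
}
```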


pkg/util/admission/granter.go line 525 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

(Ignoring elastic disk bandwidth tokens, writing this for myself.)

1. So when starting from availableIOTokens=100000 and going through setAvailableTokens(300, 1ms), we end up with availableIOTokens=300 instead of availableIOTokens=75000. Previously, where all we had was the 250ms period, is it correct to say our call would've been setAvailableTokens(75000, 250ms). So this behavior is changing now - at this point we're reducing the write burst.

2. a. In the next tick interval, if it's 1ms, and availableIOTokens=300, setAvailableTokens(299, 1ms) will result in availableIOTokens=74750. I'm using 299 instead of 230 because of the non-inclusive maxTokensAvailable, but that can be changed.

2. b. In the next tick interval, if it's 250ms, and availableIOTokens=300, setAvailableTokens(75000, 250ms) will result in availableIOTokens=75000.

So I'm not sure what the logic in step 1. is achieving.

1. So when starting from availableIOTokens=100000 and going through setAvailableTokens(300, 1ms), we end up with availableIOTokens=300 instead of availableIOTokens=75000. Previously, where all we had was the 250ms period, is it correct to say our call would've been setAvailableTokens(75000, 250ms). So this behavior is changing now - at this point we're reducing the write burst.

Based on my understanding of bursts, I believe we're reducing the burst which can happen during that interval. But we'll still allocate 75000 tokens over the next 250ms.

2. a. In the next tick interval, if it's 1ms, and availableIOTokens=300, setAvailableTokens(299, 1ms) will result in availableIOTokens=74750. I'm using 299 instead of 230 because of the non-inclusive maxTokensAvailable, but that can be changed.

2. b. In the next tick interval, if it's 250ms, and availableIOTokens=300, setAvailableTokens(75000, 250ms) will result in availableIOTokens=75000.

I'm currently only allowing the tick interval to change at the end of every adjustment interval. So, we'd only go from a 1ms tick to a 250ms tick at the end of an adjustment interval, which is ~15000 ticks away at this point assuming that we switched from 250 ms to 1ms during point 1.


I think I agree that we should cap bursts to 250 * ioTokens on a call to setAvailableTokens with some ioTokens. My concern was that we'd be granting too many additional tokens, but 250ms worth of additional tokens isn't that much. I'm going to get rid of the special casing I have here. What do you think?


pkg/util/admission/io_load_listener.go line 222 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Add a bullet point here that explains why the burst is capped to the same amount independent of the replenishment rate.

Nit: s/ma/ms.
Nit: there's a linebreak in the comment block that now makes it two comment blocks.

Done.

@blathers-crl blathers-crl bot requested a review from irfansharif May 24, 2023 07:08
Collaborator

@sumeerbhola sumeerbhola left a comment


Reviewed all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bananabrick and @irfansharif)


pkg/util/admission/grant_coordinator.go line 132 at r2 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

BTW, looking through this code block, if we're in the systemUnloaded regime, would we only switch to the !systemUnloaded regime after 15s?

We want to switch more frequently than that, like your commit message says:

On every tick we check, if ticks%currIntervalData.ticksInAdjustmentInterval == 0 which means that we're at the end of an adjustment interval. And set the new value of the systemUnloaded(now systemLoaded) variable.

Are you referring the first iteration of the loop where the system is unloaded and it will take 15 seconds to switch to a new regime.

Perhaps what we want is to pick the unloaded tick regime iff all of the stores have unlimited tokens.

That's a good point. Done.

I was imagining something simple where pebbleMetricsTick would return a bool (slowTicks) if the store is ok with slow ticks. We would AND the bool across all stores. Then allocateIOTokensTick would accept the resulting bool. So this ticker loop only needs a bit of tweaking, like the following:

	go func() {
		var ticks int64
		ticker := time.NewTicker(slowIOTokenTickDuration)
		done := false
		tickDuration := slowIOTokenTickDuration
		ticksInAdjustmentInterval := slowTicksInAdjustmentInterval
		slowTicks := true
		for !done {
			select {
			case <-ticker.C:
				ticks++
				if ticks%ticksInAdjustmentInterval == 0 {
					prevSlowTicks := slowTicks
					slowTicks = true
					metrics := sgc.pebbleMetricsProvider.GetPebbleMetrics()
					if len(metrics) != sgc.numStores {
						log.Warningf(ctx,
							"expected %d store metrics and found %d metrics", sgc.numStores, len(metrics))
					}
					for _, m := range metrics {
						if unsafeGc, ok := sgc.gcMap.Load(int64(m.StoreID)); ok {
							gc := (*GrantCoordinator)(unsafeGc)
							slowTicks = slowTicks && gc.pebbleMetricsTick(ctx, m)
							iotc.UpdateIOThreshold(m.StoreID, gc.ioLoadListener.ioThreshold)
						} else {
							log.Warningf(ctx,
								"seeing metrics for unknown storeID %d", m.StoreID)
						}
					}
					if slowTicks != prevSlowTicks {
					   // TODO: update tickDuration and ticksInAdjustmentInterval and call Ticker.Reset.
					}
				}
				sgc.gcMap.Range(func(_ int64, unsafeGc unsafe.Pointer) bool {
					gc := (*GrantCoordinator)(unsafeGc)
					gc.allocateIOTokensTick(slowTicks)
					// true indicates that iteration should continue after the
					// current entry has been processed.
					return true
				})
			case <-sgc.closeCh:
				done = true
			}
		}
		ticker.Stop()
	}()

pkg/util/admission/granter.go line 503 at r3 (raw file):

	ioTokens int64,
	elasticDiskBandwidthTokens int64,
	currTickIntervalData tickIntervalData,

I think all this interval logic should be lifted out of kvStoreTokenGranter. Earlier there was one token value which was both the replenishment amount and the max burst size. Now there will be two values, the amount to add, and the bucket capacity. So this becomes

func (sg *kvStoreTokenGranter) setAvailableTokens(
	ioTokens int64, ioTokensCapacity int64, elasticDiskBandwidthTokens int64, elasticDiskBandwidthTokensCapacity int64,
) (ioTokensUsed int64) {
	sg.coord.mu.Lock()
	defer sg.coord.mu.Unlock()
	ioTokensUsed = sg.startingIOTokens - sg.coordMu.availableIOTokens
	// It is possible for availableIOTokens to be negative because of
	// tookWithoutPermission or because tryGet will satisfy requests until
	// availableIOTokens become <= 0. We want to remember this previous
	// over-allocation.
	sg.subtractTokensLocked(-ioTokens, true)
	if sg.coordMu.availableIOTokens > ioTokensCapacity {
		// Clamp to tokens.
		sg.coordMu.availableIOTokens = ioTokensCapacity
	}
	sg.startingIOTokens = sg.coordMu.availableIOTokens

	sg.coordMu.elasticDiskBWTokensAvailable += elasticDiskBandwidthTokens
	if sg.coordMu.elasticDiskBWTokensAvailable > elasticDiskBandwidthTokensCapacity {
		sg.coordMu.elasticDiskBWTokensAvailable = elasticDiskBandwidthTokensCapacity
	}

	return ioTokensUsed
}

pkg/util/admission/granter.go line 515 at r3 (raw file):

		sg.coordMu.availableIOTokens = ioTokens
	}
	sg.startingIOTokens = ioTokens

this looks like a bug in the old code. Glad you noticed this.


pkg/util/admission/io_load_listener.go line 339 at r3 (raw file):

// allocateTokensTick gives out 1/currTickIntervalData.ticksInAdjustmentInterval
// of the various tokens every currTickIntervalData.tickDuration.
func (io *ioLoadListener) allocateTokensTick(currTickIntervalData tickIntervalData) {

This is roughly what I had in mind. The first allocation after the adjustment calculates the burst (the burst stays the same for the whole adjustment interval), and it gives out the burst size as the allocation. If we are running at 1ms ticks and care about the error introduced by giving out (61/60)*100 = 101.66% of what we calculated, then calculate the burst capacity as 1/61 instead of 1/60.

func (io *ioLoadListener) allocateTokensTick(slowTicks bool) {
	ticksInAdjustmentInterval := slowTicksInAdjustmentInterval
	if !slowTicks {
		ticksInAdjustmentInterval = fastTicksInAdjustmentInterval
	}
	firstCallSinceAdjustment := io.firstCallSinceAdjustment
	if io.firstCallSinceAdjustment {
		// Calculate burst size, once.
		io.firstCallSinceAdjustment = false
		// Call allocateFunc(total, 0, slowTicksInAdjustmentInterval) to get the
		// burst size. If want to be very precise for !slowTicks, call it with
		// slowTicksInAdjustmentInterval+1, to divide by 61, since we are going to
		// give that burst up front.
	}
	if firstCallSinceAdjustment {
		// Use the burst size as the values of toAllocate*
	} else {
		// Call allocateFunc(total, allocated, ticksInAdjustmentInterval) to
		// compute toAllocate*
	}
	// Use toAllocate* to adjust internal state. Then call setAvailableTokens()
      ...

Contributor Author

@bananabrick bananabrick left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @sumeerbhola)


pkg/util/admission/grant_coordinator.go line 132 at r2 (raw file):

Previously, sumeerbhola wrote…

I was imagining something simple where pebbleMetricsTick would return a bool (slowTicks) if the store is ok with slow ticks. We would AND the bool across all stores. Then allocateIOTokensTick would accept the resulting bool. So this ticker loop only needs a bit of tweaking, like the following:

	go func() {
		var ticks int64
		ticker := time.NewTicker(slowIOTokenTickDuration)
		done := false
		tickDuration := slowIOTokenTickDuration
		ticksInAdjustmentInterval := slowTicksInAdjustmentInterval
		slowTicks := true
		for !done {
			select {
			case <-ticker.C:
				ticks++
				if ticks%ticksInAdjustmentInterval == 0 {
					prevSlowTicks := slowTicks
					slowTicks = true
					metrics := sgc.pebbleMetricsProvider.GetPebbleMetrics()
					if len(metrics) != sgc.numStores {
						log.Warningf(ctx,
							"expected %d store metrics and found %d metrics", sgc.numStores, len(metrics))
					}
					for _, m := range metrics {
						if unsafeGc, ok := sgc.gcMap.Load(int64(m.StoreID)); ok {
							gc := (*GrantCoordinator)(unsafeGc)
							slowTicks = slowTicks && gc.pebbleMetricsTick(ctx, m)
							iotc.UpdateIOThreshold(m.StoreID, gc.ioLoadListener.ioThreshold)
						} else {
							log.Warningf(ctx,
								"seeing metrics for unknown storeID %d", m.StoreID)
						}
					}
					if slowTicks != prevSlowTicks {
					   // TODO: update tickDuration and ticksInAdjustmentInterval and call Ticker.Reset.
					}
				}
				sgc.gcMap.Range(func(_ int64, unsafeGc unsafe.Pointer) bool {
					gc := (*GrantCoordinator)(unsafeGc)
					gc.allocateIOTokensTick(slowTicks)
					// true indicates that iteration should continue after the
					// current entry has been processed.
					return true
				})
			case <-sgc.closeCh:
				done = true
			}
		}
		ticker.Stop()
	}()

The only differences from what's suggested here and what I had were:

  1. I was setting ticks = 0. I think we should keep the ticks = 0, to ensure that there are actually 15000 ticks in a 15s adjustment interval if the tick rate is 1 tick/ms. The explanation for why is given below in another comment.
  2. I was directly accessing totalNumByteTokens to determine load rather than using a return value from gc.pebbleMetricsTick. I've made that change.
  3. I'm not doing the check to see if slowTicks != prevSlowTicks. I don't think there's any harm in always resetting the ticker once every 15 seconds.

pkg/util/admission/granter.go line 535 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

s/ioTokens * burstMultiplier/maxTokensAllowed so the clamping logic is easier to read at a glance. Ditto below for s/elasticDiskBandwidthTokens * burstMultiplier/maxDiskBWTokensAllowed.

Nit: make maxTokensAllowed inclusive, so change the conditional to sg.coordMu.availableIOTokens >= maxTokensAllowed.

Got rid of that logic.


pkg/util/admission/granter.go line 503 at r3 (raw file):

Previously, sumeerbhola wrote…

I think all this interval logic should be lifted out of kvStoreTokenGranter. Earlier there was one token value which was both the replenishment amount and the max burst size. Now there will be two values, the amount to add, and the bucket capacity. So this becomes

func (sg *kvStoreTokenGranter) setAvailableTokens(
	ioTokens int64, ioTokensCapacity int64, elasticDiskBandwidthTokens int64, elasticDiskBandwidthTokensCapacity int64,
) (ioTokensUsed int64) {
	sg.coord.mu.Lock()
	defer sg.coord.mu.Unlock()
	ioTokensUsed = sg.startingIOTokens - sg.coordMu.availableIOTokens
	// It is possible for availableIOTokens to be negative because of
	// tookWithoutPermission or because tryGet will satisfy requests until
	// availableIOTokens become <= 0. We want to remember this previous
	// over-allocation.
	sg.subtractTokensLocked(-ioTokens, true)
	if sg.coordMu.availableIOTokens > ioTokensCapacity {
		// Clamp to tokens.
		sg.coordMu.availableIOTokens = ioTokensCapacity
	}
	sg.startingIOTokens = sg.coordMu.availableIOTokens

	sg.coordMu.elasticDiskBWTokensAvailable += elasticDiskBandwidthTokens
	if sg.coordMu.elasticDiskBWTokensAvailable > elasticDiskBandwidthTokensCapacity {
		sg.coordMu.elasticDiskBWTokensAvailable = elasticDiskBandwidthTokensCapacity
	}

	return ioTokensUsed
}

Lifted it out.


pkg/util/admission/granter.go line 515 at r3 (raw file):

Previously, sumeerbhola wrote…

this looks like a bug in the old code. Glad you noticed this.

Done.


pkg/util/admission/io_load_listener.go line 279 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Nit: This parameter is purely a function of the adjustmentInterval (a constant 15s) and the tickDuration (which varies between 250ms and 1ms), so it can be removed or made a function on this tickIntervalData type.

Very optional nit: Feel free to also use the time.Duration type for adjustmentInterval (i.e. 15 * time.Second), which maybe is more readable than the opaque 15 it is today.

Did the first one.


pkg/util/admission/io_load_listener.go line 339 at r3 (raw file):

Previously, sumeerbhola wrote…

This is roughly what I had in mind. The first allocation after the adjustment calculates the burst (the burst stays the same for the whole adjustment interval), and it gives out the burst size as the allocation. If we are running at 1ms ticks and care about the error introduced by giving out (61/60)*100 = 101.66% of what we calculated, then calculate the burst capacity as 1/61 instead of 1/60.

func (io *ioLoadListener) allocateTokensTick(slowTicks bool) {
	ticksInAdjustmentInterval := slowTicksInAdjustmentInterval
	if !slowTicks {
		ticksInAdjustmentInterval = fastTicksInAdjustmentInterval
	}
	firstCallSinceAdjustment := io.firstCallSinceAdjustment
	if io.firstCallSinceAdjustment {
		// Calculate burst size, once.
		io.firstCallSinceAdjustment = false
		// Call allocateFunc(total, 0, slowTicksInAdjustmentInterval) to get the
		// burst size. If want to be very precise for !slowTicks, call it with
		// slowTicksInAdjustmentInterval+1, to divide by 61, since we are going to
		// give that burst up front.
	}
	if firstCallSinceAdjustment {
		// Use the burst size as the values of toAllocate*
	} else {
		// Call allocateFunc(total, allocated, ticksInAdjustmentInterval) to
		// compute toAllocate*
	}
	// Use toAllocate* to adjust internal state. Then call setAvailableTokens()
      ...

I think the code here is getting complicated to account for minor error. I could implement this if you insist, but I think we should keep it simple unless there's major problems with the following scheme:

  1. We call toAllocate := allocateFunc(total, allocated, ticksInAdjustmentInterval) to determine the tokens to give out every tick interval.
  2. We cap the burst to toAllocate * 250. We already have a variable value for the burst, as we set it to the number of tokens given out to the kvStoreTokenGranter.

The only difference from what's suggested is that we don't give out the burst value in the first interval, and the burst remains variable.
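The two-step scheme proposed here can be sketched in a few lines (an even split rather than the PR's actual allocateFunc, and the names are illustrative):

```go
package main

import "fmt"

// perTickAllocation sketches the proposed scheme: spread the interval's
// tokens evenly across its ticks (step 1), and cap the burst to 250 ticks'
// worth of the per-tick allocation (step 2). Simplified to an even split;
// the real allocateFunc accounts for tokens already allocated.
func perTickAllocation(total, ticksInAdjustmentInterval int64) (toAllocate, burstCap int64) {
	toAllocate = total / ticksInAdjustmentInterval
	burstCap = toAllocate * 250
	return toAllocate, burstCap
}

func main() {
	// With 4,500,000 tokens over 15000 1ms ticks: 300 tokens per tick,
	// bursts capped at 75000 (matching the numbers earlier in the thread).
	toAllocate, burstCap := perTickAllocation(4500000, 15000)
	fmt.Println(toAllocate, burstCap) // 300 75000
}
```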


pkg/util/admission/io_load_listener.go line 348 at r3 (raw file):

			// Round up so that we don't accumulate tokens to give in a burst on the
			// last tick.
			toAllocate = (total + currTickIntervalData.ticksInAdjustmentInterval - 1) / currTickIntervalData.ticksInAdjustmentInterval

I'm concerned about the rounding up here.

Assume the tick rate is 1ms. If totalNumByteTokens = 15001 and ticksInAdjustmentInterval = 15000, then toAllocate = 2 because of the rounding up. So over the next 7500 ticks, we'll give out 15000 tokens. So for the next ~7500 ticks, or 7.5 seconds after that, we have 0 tokens to give out. That seems bad. Am I misunderstanding the rounding scheme here?

Even if totalNumByteTokens=150001, I believe there'll be a few seconds at the end of the adjustment interval where we have no tokens to give out.
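The exhaustion concern above can be checked with a small simulation (a hypothetical helper, not code from the PR):

```go
package main

import "fmt"

// firstDryTick simulates handing out ceil(total/ticksInInterval) tokens per
// tick and returns the 1-based tick at which there is nothing left to give,
// or 0 if the tokens last the whole interval.
func firstDryTick(total, ticksInInterval int64) int64 {
	toAllocate := (total + ticksInInterval - 1) / ticksInInterval // round up
	remaining := total
	for tick := int64(1); tick <= ticksInInterval; tick++ {
		give := toAllocate
		if give > remaining {
			give = remaining
		}
		remaining -= give
		if give == 0 {
			return tick
		}
	}
	return 0
}

func main() {
	// total=15001 over 15000 ticks: toAllocate rounds up to 2, so the bucket
	// runs dry roughly halfway through the interval.
	fmt.Println(firstDryTick(15001, 15000)) // 7502
	// total=15000 divides evenly, so tokens last the whole interval.
	fmt.Println(firstDryTick(15000, 15000)) // 0
}
```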

Collaborator

@sumeerbhola sumeerbhola left a comment


I'm ignoring the tests, and assuming @irfansharif will review them.

Reviewed 2 of 7 files at r4, 1 of 3 files at r5, 1 of 1 files at r6, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bananabrick and @irfansharif)


pkg/util/admission/grant_coordinator.go line 132 at r2 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

The only differences from what's suggested here and what I had were:

  1. I was setting ticks = 0. I think we should keep the ticks = 0, to ensure that there are actually 15000 ticks in a 15s adjustment interval if the tick rate is 1 tick/ms. The explanation for why is given below in another comment.
  2. I was directly accessing totalNumByteTokens to determine load rather than using a return value from gc.pebbleMetricsTick. I've made that change.
  3. I'm not doing the check to see if slowTicks != prevTicks. I don't think there's any harm in always resetting the ticker once every 15 seconds.

Ack


pkg/util/admission/grant_coordinator.go line 118 at r6 (raw file):

				ticks++
				if ticks%currTickDuration.ticksInAdjustmentInterval() == 0 {
					abs := func(diff time.Duration) time.Duration {

This abs logic, the panic, and the currTime are going to be deleted, yes? Our code is not responsible for the ticker being slower or faster. Slowness could happen because of severe CPU overload, and that's acceptable.


pkg/util/admission/io_load_listener.go line 339 at r3 (raw file):

The only difference from what's suggested is that we don't give out the burst value in the first interval

Let's consider some scenarios, just so that we understand the implications.

  1. Switching back and forth between unlimited and finite tokens: this happens often during overload. When switching from finite to unlimited the toAllocate value is also effectively unlimited, so it doesn't matter that we are not giving out the burst size in the first tick. The reverse direction is also ok, since the burst size will be lower than what the token bucket has, so the token bucket will just truncate to the burst size.

  2. Decreasing the tokens from one finite value to another (say 2T to T): say B(T) is the new burst size. If the token bucket already has more than B(T) we just reduce the tokens in the token bucket. If the token bucket has less than B(T) we were consuming fast in the previous 15s interval, so there is no reason to give out the burst immediately.

  3. Increasing the tokens from one finite value to another (say T to 2T): If the token bucket was empty there will be a 250ms lag before we can build up to the full burst size. This is also ok.

So this simplification seems fine.

and the burst remains variable.

I don't think this is a good idea. If we run out of tokens from your rounding up example, we'll reduce the burst to zero too. So if the token bucket had some tokens, those will vanish. I think the burst needs to solely be a function of total and ticks.
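The scenarios above can be sketched with a minimal bucket update. This is a hypothetical sketch, not the actual patch; `setAvailableTokens` is the function named in this review, but the signature here is an assumption:

```go
package main

// setAvailableTokens adds this tick's allocation to the token bucket and
// truncates to the burst size. Whether we move between finite and
// unlimited totals (scenario 1) or between two finite totals (scenarios
// 2 and 3), the bucket never holds more than the current burst.
func setAvailableTokens(available, toAllocate, burst int64) int64 {
	available += toAllocate
	if available > burst {
		// Scenario 2: the bucket already holds more than the new,
		// smaller burst size, so truncate rather than carrying the
		// surplus over.
		available = burst
	}
	return available
}
```

Because the burst is a function solely of the total and the tick count, a bucket that temporarily runs dry does not also shrink the burst.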


pkg/util/admission/io_load_listener.go line 285 at r6 (raw file):

// have a 250ms duration until the next replenishment and 0 tokens, so any high
// priority requests arriving will have to wait. The maximum wait time is 250ms.
// for details.

nit: superfluous "for details"


pkg/util/admission/io_load_listener.go line 298 at r6 (raw file):

var unloadedDuration = tickDuration(250 * time.Millisecond)
var loadedDuration = tickDuration(1 * time.Millisecond)

nit: can you change these to consts.
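For reference, the const form might look like the sketch below; `ticksInAdjustmentInterval` is the helper referenced elsewhere in this PR, but its body here is a guess:

```go
package main

import "time"

// tickDuration is the interval between successive token allocations.
type tickDuration time.Duration

// Consts rather than vars, per the nit above.
const (
	unloadedDuration = tickDuration(250 * time.Millisecond)
	loadedDuration   = tickDuration(1 * time.Millisecond)
)

// ticksInAdjustmentInterval returns how many ticks of this duration fit
// into the 15s adjustment interval (60 at 250ms, 15000 at 1ms).
func (t tickDuration) ticksInAdjustmentInterval() int64 {
	return int64(15 * time.Second / time.Duration(t))
}
```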

@bananabrick
Contributor Author

Unusual bugs:

I'm seeing tests where the adjustment interval calculations are off by more than 7 seconds.

=== RUN   TestBackupMixedVersionElements_base_alter_table_alter_primary_key_vanilla/backup/restore_stage_2_of_10/restore_database_with_schema-only#01
[00:14:06]        cumulative.go:2024: testing backup 1 (rollback=false)
[00:14:06]    === RUN   TestBackupMixedVersionElements_base_alter_table_alter_primary_key_vanilla/backup/restore_stage_2_of_10/restore_all_tables_in_database#01
[00:14:06]        cumulative.go:2024: testing backup 1 (rollback=false)
[00:14:06]    === RUN   TestBackupMixedVersionElements_base_alter_table_alter_primary_key_vanilla/backup/restore_stage_2_of_10/restore_database#02
[00:14:06]        cumulative.go:1732: skipping due to randomness
[00:14:06]    === RUN   TestBackupMixedVersionElements_base_alter_table_alter_primary_key_vanilla/backup/restore_stage_2_of_10/restore_database_with_schema-only#02
[00:14:06]        cumulative.go:1732: skipping due to randomness
[00:14:06]    === RUN   TestBackupMixedVersionElements_base_alter_table_alter_primary_key_vanilla/backup/restore_stage_2_of_10/restore_all_tables_in_database#02
[00:14:06]        cumulative.go:2024: testing backup 2 (rollback=true)
[00:14:06]    panic: 7.366021858s

Looking into these. I tested this code out locally, and didn't see adjustment intervals vary by over 100ms.

Collaborator

@sumeerbhola sumeerbhola left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bananabrick and @irfansharif)


pkg/util/admission/io_load_listener.go line 378 at r8 (raw file):

				toAllocate = total - allocated
			}
		}

With 250ms ticks, I did not worry about timing error accumulation, but now that we are at 1ms ticks I think we should handle (small) ticker timing errors (e.g. if the ticker is consistently firing after 1.5ms we don't want to give out 33% fewer tokens). We can do so by counting the remaining ticks based on the current time, and divide the remaining tokens by the remaining ticks. The remaining ticks calculation can happen where we have the ticker and can pass the remainingTicks parameter to allocateTokensTick.

@bananabrick
Contributor Author

Kicked off the CI with some new code. Not worth a look yet.

Contributor Author

@bananabrick bananabrick left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @sumeerbhola)


pkg/util/admission/grant_coordinator.go line 132 at r2 (raw file):

Previously, sumeerbhola wrote…

Ack

Done.


pkg/util/admission/grant_coordinator.go line 118 at r6 (raw file):

Previously, sumeerbhola wrote…

This abs logic, the panic and the currTime is going to be deleted, yes? Our code is not responsible for the ticker being slower or faster. Slowness could happen because of severe CPU overload, and that's acceptable.

Yea, will delete before merge.


pkg/util/admission/grant_coordinator.go line 144 at r11 (raw file):

					}
					// TODO(bananabrick): Remove this.
					systemLoaded = true

Just forcing this to see if CI passes. Will remove before merging.


pkg/util/admission/io_load_listener.go line 339 at r3 (raw file):

I don't think this is a good idea. If we run out of tokens from your rounding up example, we'll reduce the burst to zero too. So if the token bucket had some tokens, those will vanish

That's a good point. Done.


pkg/util/admission/io_load_listener.go line 298 at r6 (raw file):

Previously, sumeerbhola wrote…

nit: can you change these to consts.

Nice.


pkg/util/admission/io_load_listener.go line 378 at r8 (raw file):

Previously, sumeerbhola wrote…

With 250ms ticks, I did not worry about timing error accumulation, but now that we are at 1ms ticks I think we should handle (small) ticker timing errors (e.g. if the ticker is consistently firing after 1.5ms we don't want to give out 33% fewer tokens). We can do so by counting the remaining ticks based on the current time, and divide the remaining tokens by the remaining ticks. The remaining ticks calculation can happen where we have the ticker and can pass the remainingTicks parameter to allocateTokensTick.

Done.

Contributor Author

@bananabrick bananabrick left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @sumeerbhola)


pkg/util/admission/io_load_listener_test.go line 433 at r12 (raw file):

// Tests if the tokenAllocationTicker produces correct adjustment interval
// durations for both loaded and unloaded systems.
func TestTokenAllocationTickerAdjustmentCalculation(t *testing.T) {

This test takes 30 seconds, so we might not want to merge this. Just wrote it to sanity check the tokenAllocationTicker implementation.

@bananabrick bananabrick requested a review from sumeerbhola June 3, 2023 19:16
Collaborator

@sumeerbhola sumeerbhola left a comment


looks good

Reviewed 1 of 4 files at r11, 1 of 2 files at r12, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bananabrick and @irfansharif)


-- commits line 26 at r12:
are these going to squashed down to 1 commit?


pkg/util/admission/grant_coordinator.go line 144 at r11 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

Just forcing this to see if CI passes. Will remove before merging.

I'd like to make a quick pass over the code in its final form.


pkg/util/admission/grant_coordinator.go line 113 at r12 (raw file):

			default:
				if remainingTicks == 0 {
					abs := func(diff time.Duration) time.Duration {

is this panic logic ready to be removed?
We don't want it to accidentally get merged.


pkg/util/admission/io_load_listener.go line 313 at r12 (raw file):

// Start a new adjustment interval. adjustmentStart must be called before tick
// is called. After the initial call, adjustmentStart must also be called if
// tick returns 0, to indicate that a new adjustment interval has started.

... if remainingTicks() returns 0, ...


pkg/util/admission/io_load_listener.go line 465 at r12 (raw file):

	var tokensMaxCapacity int64 = toAllocateByteTokens
	var diskBWTokenMaxCapacity int64 = toAllocateElasticDiskBWTokens
	if loaded {

we should do this unconditionally, i.e., not only in the loaded case. I think that also eliminates the need for the loaded parameter. The possibility that we have given out all tokens before the last tick exists even with 250ms ticks (even though we ignored this prior to this PR) and the burst size should stay the same.

Contributor Author

@bananabrick bananabrick left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @sumeerbhola)


-- commits line 26 at r12:

Previously, sumeerbhola wrote…

are these going to squashed down to 1 commit?

Yes.


pkg/util/admission/grant_coordinator.go line 119 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Leaving this comment thread here to remind us to get rid of it + the currTime state.

Done.


pkg/util/admission/grant_coordinator.go line 113 at r12 (raw file):

Previously, sumeerbhola wrote…

is this panic logic ready to be removed?
We don't want it to accidentally get merged.

Yes.


pkg/util/admission/io_load_listener.go line 313 at r12 (raw file):

Previously, sumeerbhola wrote…

... if remainingTicks() returns 0, ...

Done.

Contributor Author

@bananabrick bananabrick left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @sumeerbhola)


pkg/util/admission/io_load_listener.go line 465 at r12 (raw file):

Previously, sumeerbhola wrote…

we should do this unconditionally, i.e., not only in the loaded case. I think that also eliminates the need for the loaded parameter. The possibility that we have given out all tokens before the last tick exists even with 250ms ticks (even though we ignored this prior to this PR) and the burst size should stay the same.

Done.

@bananabrick bananabrick requested a review from sumeerbhola June 6, 2023 18:49
Collaborator

@sumeerbhola sumeerbhola left a comment


:lgtm:

Reviewed 1 of 7 files at r4, 1 of 3 files at r5, 3 of 5 files at r13, 2 of 2 files at r14, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @bananabrick and @irfansharif)


pkg/util/admission/io_load_listener_test.go line 455 at r14 (raw file):

			diff := abs(timeElapsed - (15 * time.Second))
			if diff > 1*time.Second {
				t.FailNow()

I suspect this will be flaky. You could make tokenAllocationTicker use timeutil.TimeSource which has a ManualTime implementation. You may be able to get the imperfections you desire by calling ManualTime.Advance by say 1.5ms.


pkg/util/admission/io_load_listener_test.go line 473 at r14 (raw file):

	require.Equal(t, 60, int(ticker.remainingTicks()))
	time.Sleep(1 * time.Second)
	// At least one second has passed, we assume that 2 seconds could've passed.

this may be flaky too.

Contributor Author

@bananabrick bananabrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif and @sumeerbhola)


pkg/util/admission/io_load_listener_test.go line 455 at r14 (raw file):

Previously, sumeerbhola wrote…

I suspect this will be flaky. You could make tokenAllocationTicker use timeutil.TimeSource which has a ManualTime implementation. You may be able to get the imperfections you desire by calling ManualTime.Advance by say 1.5ms.

Skipping this for now. I was thinking of skipping it anyway because it's too long, and slows down runs of the AC tests locally. I left a todo to use a timeutil.TimeSource.


pkg/util/admission/io_load_listener_test.go line 473 at r14 (raw file):

Previously, sumeerbhola wrote…

this may be flaky too.

Leaving this for now. Will keep an eye and skip if it starts being flaky.

The reason for this change is noted clearly in #91509.
This commit message summarizes the implementation.

We currently allocate tokens for the kvStoreTokenGranter
every 250ms. This PR moves this allocation interval to
1ms. Since unloaded systems use too much CPU with a 1ms
interval, we implement a mechanism to switch back and forth
between a 250ms interval, when the system is unloaded, and
a 1ms interval when the system is loaded. We maintain a
given interval for this allocation, over every adjustment
interval. That is, we make the decision to switch to a
different allocation interval at the end of the adjustment
interval.

Fixes: #91509
Epic: none
Release note: None
@bananabrick
Contributor Author

bors r+

@craig
Contributor

craig bot commented Jun 7, 2023

👎 Rejected by code reviews

@bananabrick bananabrick dismissed irfansharif’s stale review June 7, 2023 06:59

got one approval, and addressed the comments from this "changes requested" review.

@bananabrick
Contributor Author

bors r+

@craig
Contributor

craig bot commented Jun 7, 2023

Build failed:

@bananabrick
Contributor Author

bors r-

@bananabrick
Contributor Author

bors r+

@craig
Contributor

craig bot commented Jun 7, 2023

Build succeeded:

Successfully merging this pull request may close these issues.

admission: token bucket in kvStoreTokenGranter should be replenished every 1ms
