admission: avoid recursive grant chain by irfansharif · Pull Request #106236 · cockroachdb/cockroach

irfansharif · 2023-07-06T00:39:29Z

In #97599 we introduced a non-blocking admission interface for below-raft, replication admission control. When doing so, we unintentionally violated the 'requester' interface -- when 'requester.granted()' is invoked, the granter expects to admit a single queued request. The code layering made it so that after granting one, when doing post-hoc token adjustments, if we observed the granter was exhausted but now no longer was so, we'd try to grant again. This resulted in admitting work recursively, with a callstack as deep as the admit chain.

Not only is that undesirable design-wise, it also triggered panics in the granter that wasn't expecting more than one request being admitted. Recursively we were incrementing the grant chain index, which overflowed (it was in int8, so happened readily with long enough admit chains), after which we panic-ed when using a negative index to access an array.

We add a test that fails without the changes. The failure can also be triggered by applying the diff below (which reverts to the older, buggy behavior):

dev test pkg/kv/kvserver -f TestFlowControlGranterAdmitOneByOne -v --show-logs

diff --git i/pkg/util/admission/granter.go w/pkg/util/admission/granter.go
index ba42213c375..7c526fbb3d8 100644
--- i/pkg/util/admission/granter.go
+++ w/pkg/util/admission/granter.go
@@ -374,7 +374,7 @@ func (cg *kvStoreTokenChildGranter) storeWriteDone(
 func (cg *kvStoreTokenChildGranter) storeReplicatedWorkAdmittedLocked(
        originalTokens int64, admittedInfo storeReplicatedWorkAdmittedInfo,
 ) (additionalTokens int64) {
-       return cg.parent.storeReplicatedWorkAdmittedLocked(cg.workClass, originalTokens, admittedInfo, false /* canGrantAnother */)
+       return cg.parent.storeReplicatedWorkAdmittedLocked(cg.workClass, originalTokens, admittedInfo, true /* canGrantAnother */)
 }

Release note: None

cockroach-teamcity · 2023-07-06T00:39:41Z

This change is

sumeerbhola

Reviewed 7 of 7 files at r1, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif)

pkg/util/admission/admission.go line 310 at r1 (raw file):

	// outstanding token adjustments. Unlike storeWriteDone, the token
	// adjustments don't result in further admission. It's invoked with the
	// coord.mu held.

It doesn't need further admission since the caller itself is looping (in the callpath from WorkQueue.granted) which originates in a GrantCoordinator loop, yes?

The fast path in WorkQueue.Admit is done by the work wanting to be admitted and is only subtracting tokens, so I don't think there is any need there to grant more. We probably should add a comment clarifying the threading model in a comment -- what threads are involved and who is responsible for granting until no more tokens are available.

pkg/util/admission/admission.go line 316 at r1 (raw file):

	// request) that we want to avoid further grants -- our callstack could be
	// as large the number of waiting requests. But this interface is also
	// invoked by the WorkQueue fast path, where it's ok to have further grants

"where it's ok to have further grants" : see my previous comment -- I don't think this is accurate.

Fixes cockroachdb#105185. Fixes cockroachdb#105613. In cockroachdb#97599 we introduced a non-blocking admission interface for below-raft, replication admission control. When doing so, we unintentionally violated the 'requester' interface -- when 'requester.granted()' is invoked, the granter expects to admit a single queued request. The code layering made it so that after granting one, when doing post-hoc token adjustments, if we observed the granter was exhausted but now no longer was so, we'd try to grant again. This resulted in admitting work recursively, with a callstack as deep as the admit chain. Not only is that undesirable design-wise, it also triggered panics in the granter that wasn't expecting more than one request being admitted. Recursively we were incrementing the grant chain index, which overflowed (it was in int8, so happened readily with long enough admit chains), after which we panic-ed when using a negative index to access an array. We add a test that fails without the changes. The failure can also be triggered by applying the diff below (which reverts to the older, buggy behavior): dev test pkg/kv/kvserver -f TestFlowControlGranterAdmitOneByOne -v --show-logs diff --git i/pkg/util/admission/granter.go w/pkg/util/admission/granter.go index ba42213..7c526fbb3d8 100644 --- i/pkg/util/admission/granter.go +++ w/pkg/util/admission/granter.go @@ -374,7 +374,7 @@ func (cg *kvStoreTokenChildGranter) storeWriteDone( func (cg *kvStoreTokenChildGranter) storeReplicatedWorkAdmittedLocked( originalTokens int64, admittedInfo storeReplicatedWorkAdmittedInfo, ) (additionalTokens int64) { - return cg.parent.storeReplicatedWorkAdmittedLocked(cg.workClass, originalTokens, admittedInfo, false /* canGrantAnother */) + return cg.parent.storeReplicatedWorkAdmittedLocked(cg.workClass, originalTokens, admittedInfo, true /* canGrantAnother */) } Release note: None

irfansharif

TFTR!

bors r+

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @sumeerbhola)

pkg/util/admission/admission.go line 310 at r1 (raw file):

Previously, sumeerbhola wrote…

It doesn't need further admission since the caller itself is looping (in the callpath from WorkQueue.granted) which originates in a GrantCoordinator loop, yes?

The fast path in WorkQueue.Admit is done by the work wanting to be admitted and is only subtracting tokens, so I don't think there is any need there to grant more. We probably should add a comment clarifying the threading model in a comment -- what threads are involved and who is responsible for granting until no more tokens are available.

Expanded the comment.

pkg/util/admission/admission.go line 316 at r1 (raw file):

Previously, sumeerbhola wrote…

"where it's ok to have further grants" : see my previous comment -- I don't think this is accurate.

Removed TODO.

craig · 2023-07-07T16:32:40Z

🕐 Waiting for PR status (GitHub check) to be set, probably by CI. Bors will automatically try to run when all required PR statuses are set.

irfansharif · 2023-07-07T16:33:23Z

bors r+

craig · 2023-07-07T16:33:25Z

Already running a review

craig · 2023-07-07T17:06:44Z

Build succeeded:

Bazel Essential CI (Cockroach)

irfansharif requested review from a team as code owners July 6, 2023 00:39

irfansharif requested a review from sumeerbhola July 6, 2023 00:39

irfansharif mentioned this pull request Jul 6, 2023

kvflowcontrol,admission: productionize replication admission control #98703

Closed

15 tasks

irfansharif changed the title ~~kvflowcontrol: avoid recursive admission chain~~ admission: avoid recursive grant chain Jul 6, 2023

irfansharif force-pushed the 230704.fix-clearrange branch 2 times, most recently from 2b4e092 to 54b3bd3 Compare July 6, 2023 01:33

sumeerbhola approved these changes Jul 6, 2023

View reviewed changes

irfansharif force-pushed the 230704.fix-clearrange branch from 54b3bd3 to 30ffe66 Compare July 7, 2023 16:32

irfansharif commented Jul 7, 2023

View reviewed changes

craig bot merged commit d802aaf into cockroachdb:master Jul 7, 2023

irfansharif deleted the 230704.fix-clearrange branch July 7, 2023 17:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

admission: avoid recursive grant chain#106236

admission: avoid recursive grant chain#106236
craig[bot] merged 1 commit intocockroachdb:masterfrom
irfansharif:230704.fix-clearrange

irfansharif commented Jul 6, 2023

Uh oh!

cockroach-teamcity commented Jul 6, 2023

Uh oh!

sumeerbhola left a comment

Uh oh!

irfansharif left a comment

Uh oh!

craig bot commented Jul 7, 2023

Uh oh!

irfansharif commented Jul 7, 2023

Uh oh!

craig bot commented Jul 7, 2023

Uh oh!

craig bot commented Jul 7, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

irfansharif commented Jul 6, 2023

Uh oh!

cockroach-teamcity commented Jul 6, 2023

Uh oh!

sumeerbhola left a comment

Choose a reason for hiding this comment

Uh oh!

irfansharif left a comment

Choose a reason for hiding this comment

Uh oh!

craig bot commented Jul 7, 2023

Uh oh!

irfansharif commented Jul 7, 2023

Uh oh!

craig bot commented Jul 7, 2023

Uh oh!

craig bot commented Jul 7, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants