Skip to content

admission: fix race conditions involving WorkQueue GC#72535

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
sumeerbhola:work_queue_race
Nov 9, 2021
Merged

admission: fix race conditions involving WorkQueue GC#72535
craig[bot] merged 1 commit intocockroachdb:masterfrom
sumeerbhola:work_queue_race

Conversation

@sumeerbhola
Copy link
Copy Markdown
Collaborator

The periodic reset of tenantInfo.used=0 and the early releasing
of the mutex in Admit can cause two race conditions (when the
WorkQueue is using tokens):

  • A tenantInfo referenced in Admit may have been GC'd.
  • The decrement of used, when work is cancelled, can
    cause the uint64 to overflow.

Both are fixed, and a test is added that fails if the original
code is instrumented with additional invariant checking. The
new code includes the extra invariant checking.

Fixes #72485

Release note: None

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: with some nits. We should backport this.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @stevendanna and @sumeerbhola)


pkg/util/admission/work_queue.go, line 298 at r1 (raw file):

		// where the tenantInfo struct is declared.
		tenant, ok = q.mu.tenants[tenantID]
		if (!ok || prevTenant != tenant) && !q.usesTokens {

[nit] I'd consolidate this block and the third block

  if !q.usesTokens {
    if !ok || prevTenant != tenant { panic }
    if tenant.used == 0 { panic }
    tenant.used--
  } else if !ok {
    tenant = newTenantInfo(tenantID)
    q.mu.tenants[tenantID] = tenant
}

pkg/util/admission/work_queue_test.go, line 307 at r1 (raw file):

			select {
			case <-stopCh:
				done = true

[nit] can just return


pkg/util/admission/work_queue_test.go, line 309 at r1 (raw file):

				done = true
			default:
				q.gcTenantsAndResetTokens()

This will run in a hot loop, no? IMO we should add a short time.Sleep to make sure other code has a bigger chance to get the lock


pkg/util/admission/work_queue_test.go, line 313 at r1 (raw file):

		}
	}()
	time.Sleep(time.Second * 10)

10 seconds is a lot for a random test. Does it (at least sometimes) catch the problem with shorter time (e.g. 1 second)?


pkg/util/admission/work_queue_test.go, line 316 at r1 (raw file):

	close(stopCh)
	q.close()
	fmt.Printf("total: %d, err: %d\n", totalCount, errCount)

t.Log?

Copy link
Copy Markdown
Collaborator

@stevendanna stevendanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @sumeerbhola)

a discussion (no related file):
:lgtm: Thanks for tracking this down. I don't have any nits not already mentioned by Radu.



pkg/util/admission/work_queue.go, line 299 at r1 (raw file):

		tenant, ok = q.mu.tenants[tenantID]
		if (!ok || prevTenant != tenant) && !q.usesTokens {
			panic("prev tenantInfo no longer in map")

note to self: This should be impossible because we incremented tenant.used under lock and tenant.used must be 0 for us to GC (and we aren't using tokens, so no periodic reset). For the same reason, tenant.used == 0 below should be impossible.


pkg/util/admission/work_queue.go, line 357 at r1 (raw file):

					panic("tenant.used is already zero")
				}
				tenant.used--

note to self: This is safe to do not under the lock since we know that len(tenant.waitingWorkHeap) > 0 and thus the tenant we are referencing here can't have been gc'd.


pkg/util/admission/work_queue.go, line 538 at r1 (raw file):

if the request was cancelled

It looks to me like we don't do used-- for the token case if the fast-path acquisition fails, even in the non-cancellation case? Or have I misread?

The periodic reset of tenantInfo.used=0 and the early releasing
of the mutex in Admit can cause two race conditions (when the
WorkQueue is using tokens):
- A tenantInfo referenced in Admit may have been GC'd.
- The decrement of used, when work is cancelled, can
  cause the uint64 to overflow.

Both are fixed, and a test is added that fails if the original
code is instrumented with additional invariant checking. The
new code includes the extra invariant checking.

Fixes cockroachdb#72485

Release note: None
Copy link
Copy Markdown
Collaborator Author

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @RaduBerinde, @stevendanna, and @sumeerbhola)


pkg/util/admission/work_queue.go, line 298 at r1 (raw file):

Previously, RaduBerinde wrote…

[nit] I'd consolidate this block and the third block

  if !q.usesTokens {
    if !ok || prevTenant != tenant { panic }
    if tenant.used == 0 { panic }
    tenant.used--
  } else if !ok {
    tenant = newTenantInfo(tenantID)
    q.mu.tenants[tenantID] = tenant
}

Done


pkg/util/admission/work_queue.go, line 357 at r1 (raw file):

Previously, stevendanna (Steven Danna) wrote…

note to self: This is safe to do not under the lock since we know that len(tenant.waitingWorkHeap) > 0 and thus the tenant we are referencing here can't have been gc'd.

We do reacquire the lock just above this. And yes, because used > 0, because of the earlier increment, the tenant object must be the same.


pkg/util/admission/work_queue.go, line 538 at r1 (raw file):

Previously, stevendanna (Steven Danna) wrote…

if the request was cancelled

It looks to me like we don't do used-- for the token case if the fast-path acquisition fails, even in the non-cancellation case? Or have I misread?

Good catch. I overlooked that when revising through various iterations of the code.
I've fixed it to not decrement if it is already 0. This also introduces a bit of inaccuracy, but it is more contained -- we don't want the optimistic fast path to over count.


pkg/util/admission/work_queue_test.go, line 307 at r1 (raw file):

Previously, RaduBerinde wrote…

[nit] can just return

Done


pkg/util/admission/work_queue_test.go, line 309 at r1 (raw file):

Previously, RaduBerinde wrote…

This will run in a hot loop, no? IMO we should add a short time.Sleep to make sure other code has a bigger chance to get the lock

It needs to squeeze in multiple times to first reset used=0 and then do the gc -- I've added a comment. The other code does manage to run ~9000 times in 10s.


pkg/util/admission/work_queue_test.go, line 313 at r1 (raw file):

Previously, RaduBerinde wrote…

10 seconds is a lot for a random test. Does it (at least sometimes) catch the problem with shorter time (e.g. 1 second)?

Reduced to 1s since it is sufficient.


pkg/util/admission/work_queue_test.go, line 316 at r1 (raw file):

Previously, RaduBerinde wrote…

t.Log?

Done

@sumeerbhola
Copy link
Copy Markdown
Collaborator Author

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Nov 9, 2021

Build succeeded:

@craig craig bot merged commit 8b3d6fc into cockroachdb:master Nov 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

admission: data race in newTenantInfo

4 participants