Skip to content

sql: bugfixes in role membership caching#42998

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
rohany:cclogic-fix-attempt
Dec 5, 2019
Merged

sql: bugfixes in role membership caching#42998
craig[bot] merged 1 commit intocockroachdb:masterfrom
rohany:cclogic-fix-attempt

Conversation

@rohany
Copy link
Copy Markdown
Contributor

@rohany rohany commented Dec 5, 2019

Fixes #42098.

This PR fixes some bugs in role membership caching.

  • The role membership cache was a single global, which meant
    multiple servers in the same cockroach process could pollute
    the cache for each other, causing flaky test failures.
  • The cache updating mechanism seemed to use the incorrect variable
    to update the cache version when a concurrent request caused a cache
    invalidation as well. Although unlikely, this could have lead to incorrect
    results being served.

Release note (bug fix): This change fixes some existing caching issues
surrounding role memberships, where users could sometimes see out of date
role membership information.

@rohany rohany requested a review from mberhault December 5, 2019 18:28
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mberhault and @rohany)


pkg/sql/authorization.go, line 200 at r1 (raw file):

	// If the cache is empty, allocate space for the cache.
	if roleMembersCache.userCache == nil {
		roleMembersCache.userCache = make(map[string]userRoleMembership)

roleMembersCache is shared by all connections on a server, right? This isn't thread safe. Probably not necessary either as the cache is created on invalidation and it is fine to do a lookup on a nil-map.


pkg/sql/authorization.go, line 208 at r1 (raw file):

			Required: true,
			// We need to avoid the table descriptor cache.
			AvoidCached: true,

Is this going to destroy performance? MemberOfWithAdminOption is called by CheckPrivilege which is called on every statement.

Cc @knz who has more knowledge than I do have how descriptor caching works nowadays.

Copy link
Copy Markdown
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cache invalidation mechanism relied on using table descriptor
versions to know when to invalidate the cache, but would not bypass
the table descriptor cache when looking up the table descriptor.

As I mentioned in a comment below, I think we need to use the table descriptor cache for performance.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mberhault and @rohany)

Copy link
Copy Markdown
Contributor Author

@rohany rohany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mberhault, @petermattis, and @rohany)


pkg/sql/authorization.go, line 200 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

roleMembersCache is shared by all connections on a server, right? This isn't thread safe. Probably not necessary either as the cache is created on invalidation and it is fine to do a lookup on a nil-map.

Oh, I didn't know you could do that. Then we don't need this -- I'll remove it.

@rohany
Copy link
Copy Markdown
Contributor Author

rohany commented Dec 5, 2019

If we use the tableDescriptor cache, then won't we run into the problems you were worried about in your comment on the original issue? We don't seem to wait for the new table descriptor version to get propagated around. On that note, we didn't seem to see problems around this when it was using the cache...

@petermattis
Copy link
Copy Markdown
Collaborator

If we use the tableDescriptor cache, then won't we run into the problems you were worried about in your comment on the original issue? We don't seem to wait for the new table descriptor version to get propagated around. On that note, we didn't seem to see problems around this when it was using the cache...

That was a theory I was putting forward. I didn't have any evidence that it was actually a problem. I'd revert this part of the PR, or look for guidance from @lucy-zhang and @knz.

@rohany rohany force-pushed the cclogic-fix-attempt branch from aa275c0 to 7ac31cb Compare December 5, 2019 19:21
Fixes cockroachdb#42098.

This PR fixes some bugs in role membership caching.

* The role membership cache was a single global, which meant
  multiple servers in the same cockroach process could pollute
  the cache for each other, causing flaky test failures.
* The cache updating mechanism seemed to use the incorrect variable
  to update the cache version when a concurrent request caused a cache
  invalidation as well. Although unlikely, this could have lead to incorrect
  results being served.

Release note (bug fix): This change fixes some existing caching issues
surrounding role memberships, where users could sometimes see out of date
role membership information.
@rohany rohany force-pushed the cclogic-fix-attempt branch from 7ac31cb to 362a2d9 Compare December 5, 2019 19:22
@rohany
Copy link
Copy Markdown
Contributor Author

rohany commented Dec 5, 2019

That makes sense -- I'll revert that change for now, and talk to lucy about some of the specifics later. PTAL

Copy link
Copy Markdown
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @mberhault and @rohany)

@knz
Copy link
Copy Markdown
Contributor

knz commented Dec 5, 2019 via email

@petermattis
Copy link
Copy Markdown
Collaborator

Nice. Probably benefits from a backport.

This only affects unit tests which run multiple test clusters concurrently and use GRANTs. This first showed up in #42066 which hasn't been backported. So not currently a problem on 19.2, though I don't mind seeing it backported.

@rohany
Copy link
Copy Markdown
Contributor Author

rohany commented Dec 5, 2019

bors r=petermattis

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Dec 5, 2019

Build failed (retrying...)

craig bot pushed a commit that referenced this pull request Dec 5, 2019
42985: issues: simplify post() r=jordanlewis a=tbg

This prepares the code for allowing the caller more control over the
message body. This makes sense since post() is called both by roachtest
and nightly stress, and it doesn't make sense to use the same message
body for both.

Release note: None

42986: roachprod: don't try to make long dns names r=dt a=dt

long cluster names break DNS. Rather than break DNS for everyone, just skip the clusters that have these long names.
DNS is a nice-to-have in any case, so it is OK to skip it -- roachprod will fallback to IPs.

Release note: none.

42991: settings: explanatory comments around cluster settings r=irfansharif a=irfansharif

Update stale references to changed method signatures and add a bit more
documentation around the usage of IsActive.

Release note: None

42998: sql: bugfixes in role membership caching r=petermattis a=rohany

Fixes #42098.

This PR fixes some bugs in role membership caching.

* The role membership cache was a single global, which meant
  multiple servers in the same cockroach process could pollute
  the cache for each other, causing flaky test failures.
* The cache updating mechanism seemed to use the incorrect variable
  to update the cache version when a concurrent request caused a cache
  invalidation as well. Although unlikely, this could have lead to incorrect
  results being served.

Release note (bug fix): This change fixes some existing caching issues
surrounding role memberships, where users could sometimes see out of date
role membership information.

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Co-authored-by: David Taylor <tinystatemachine@gmail.com>
Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Co-authored-by: Rohan Yadav <rohany@alumni.cmu.edu>
@craig
Copy link
Copy Markdown
Contributor

craig bot commented Dec 5, 2019

Build succeeded

@knz
Copy link
Copy Markdown
Contributor

knz commented Dec 6, 2019

So not currently a problem on 19.2, though I don't mind seeing it backported.

It is a problem in 19.2 which uses multiple in-memory nodes side-by-side to implement cockroach demo --nodes.

I'm glad it got backported for that reason.

@petermattis
Copy link
Copy Markdown
Collaborator

It is a problem in 19.2 which uses multiple in-memory nodes side-by-side to implement cockroach demo --nodes.

Oh, excellent point. I had forgotten about that case.

@rohany
Copy link
Copy Markdown
Contributor Author

rohany commented Dec 6, 2019

I don't think it has an effect on cockroach demo -- I think the reason why it fails in the logic tests is because its acting as a cache for two different tables, while in cockroach demo there is only one version of the role table.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ccl/logictestccl: TestCCLLogic failed under stress

4 participants