Skip to content

sql: fix data race in MemberOfWithAdminOption#98617

Merged
craig[bot] merged 2 commits intocockroachdb:masterfrom
andyyang890:member_fixed_timestamp
Mar 17, 2023
Merged

sql: fix data race in MemberOfWithAdminOption#98617
craig[bot] merged 2 commits intocockroachdb:masterfrom
andyyang890:member_fixed_timestamp

Conversation

@andyyang890
Copy link
Copy Markdown
Collaborator

This patch fixes a race condition in MemberOfWithAdminOption
by using a fresh transaction within the singleflight call.

Fixes #95642
Fixes #96539

Release note: None

@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Mar 14, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@andyyang890
Copy link
Copy Markdown
Collaborator Author

This PR is an implementation of the idea discussed in #98469, but it seems to be hitting an issue where the system.role_members table appears to be empty. Is the modification time perhaps not set for the permanent upgrade that creates the initial row for (admin, root)?

@rafiss
Copy link
Copy Markdown
Collaborator

rafiss commented Mar 15, 2023

table appears to be empty. Is the modification time perhaps not set for the permanent upgrade that creates the initial row for (admin, root)?

that seems likely. if the modification time is not set, then we should be able to safely not change the txn timestamp -- since this means that the system.role_members table is in it's default state.

@ajwerner
Copy link
Copy Markdown
Contributor

Modification time will always be set -- we assert it deeply. The fact that that migration fails to update it and set the version is sad. Maybe one thing we can do is detect version 1 and use a hard-coded set of data, or just keep the transaction timestamp?

@andyyang890
Copy link
Copy Markdown
Collaborator Author

or just keep the transaction timestamp?

Would you mind elaborating a bit more about what the advantages of the original suggestion to set a fixed timestamp were? Are those benefits not strictly required for correctness?

@rafiss
Copy link
Copy Markdown
Collaborator

rafiss commented Mar 15, 2023

If the version is 1, that means there have been no modifications to the table. So using a historical timestamp or the current timestamp would both return the same results.

@andyyang890
Copy link
Copy Markdown
Collaborator Author

If the version is 1, that means there have been no modifications to the table. So using a historical timestamp or the current timestamp would both return the same results.

I think the problem right now is that the permanent upgrade below inserts rows into the system.role_members table without bumping the table descriptor version. I'm guessing it would be unsafe to add code to the permanent upgrade that does the version bumping since it won't have been run on older clusters.

func addRootToAdminRole(ctx context.Context, txn isql.Txn) error {
var upsertStmt string
var upsertVals []interface{}
{
// We query the pg_attribute to determine whether the role_id and member_id
// columns are present because we can't rely on version gates here.
const pgAttributeStmt = `
SELECT * FROM system.pg_catalog.pg_attribute
WHERE attrelid = 'system.public.role_members'::REGCLASS
AND attname IN ('role_id', 'member_id')
LIMIT 1
`
if row, err := txn.QueryRow(ctx, "roleMembersColumnsGet", txn.KV(), pgAttributeStmt); err != nil {
return err
} else if row == nil {
upsertStmt = `
UPSERT INTO system.role_members ("role", "member", "isAdmin") VALUES ($1, $2, true)
`
upsertVals = []interface{}{username.AdminRole, username.RootUser}
} else {
upsertStmt = `
UPSERT INTO system.role_members ("role", "member", "isAdmin", role_id, member_id) VALUES ($1, $2, true, $3, $4)
`
upsertVals = []interface{}{username.AdminRole, username.RootUser, username.AdminRoleID, username.RootUserID}
}
}
_, err := txn.Exec(ctx, "addRootToAdminRole", txn.KV(), upsertStmt, upsertVals...)
return err
}

@rafiss
Copy link
Copy Markdown
Collaborator

rafiss commented Mar 15, 2023

Right, I'm saying that if the version is 1 then we know:

  • the permanent upgrade has run (since the cluster isn't available until permanent upgrades are done)
  • no modifications have been made to the table since the permanent upgrade ran

So we can work around this issue of the upgrade not bumping the version by just using the current timestamp for the txn timestamp.

Or to resurrect this thread #98469 (comment)

You don't want ReadTimestamp() I don't think. I think you want to take that descriptor you read for the role_memberships table and take its ModificationTime().

Maybe this is a case where we can use the outer txn's ReadTimestamp?

@andyyang890 andyyang890 force-pushed the member_fixed_timestamp branch from 4ad8802 to abf67c3 Compare March 15, 2023 18:23
@andyyang890
Copy link
Copy Markdown
Collaborator Author

Ah I see, thanks for the clarification. I've updated the PR to do that!

@andyyang890 andyyang890 marked this pull request as ready for review March 15, 2023 18:36
Copy link
Copy Markdown
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great! this lgtm, but just wanted to discuss my comment from one of the other PRs:

if possible, could we add a more targeted test that catches this bug? it feels kind of lucky to me that TestBackendDialTLS is the one that caught this. i could imagine that test being refactored in the future, and no longer catching a future regression on this.

We could add a test to pkg/sql/user_test.go that surfaces the bug? It's possible that TestGetUserTimeout already would have revealed the bug, but I see that test is skipped under race

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)

@andyyang890 andyyang890 force-pushed the member_fixed_timestamp branch 2 times, most recently from 5ab7028 to f55d945 Compare March 16, 2023 18:59
@andyyang890 andyyang890 requested a review from rafiss March 16, 2023 19:04
Copy link
Copy Markdown
Collaborator Author

@andyyang890 andyyang890 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like TestGetUserTimeout does catch the bug so I've unskipped it under race! I also removed the backport-22.2.x tag since this issue only started happening after the internal executor refactor, which was merged in 23.1.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @rafiss)

This patch fixes a race condition in MemberOfWithAdminOption
by using a fresh transaction within the singleflight call.

Release note: None
This test is being unskipped under race to catch any regressions
for a data race in MemberOfWithAdminOption that was discovered
and subsequently patched.

Release note: None
@andyyang890 andyyang890 force-pushed the member_fixed_timestamp branch from f55d945 to 37e1066 Compare March 16, 2023 21:27
@andyyang890
Copy link
Copy Markdown
Collaborator Author

TFTR!

bors r=rafiss

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Mar 16, 2023

Build failed (retrying...):

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Mar 17, 2023

Build failed (retrying...):

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Mar 17, 2023

Build failed:

@andyyang890
Copy link
Copy Markdown
Collaborator Author

bors retry

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Mar 17, 2023

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ccl/sqlproxyccl: TestBackendDialTLS failed ccl/sqlproxyccl: TestBackendDialTLS failed due to race condition in catalog nstree

4 participants