Skip to content

server: prevent deadlocks in server orchestration#107824

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
knz:20230728-fix-race
Jul 28, 2023
Merged

server: prevent deadlocks in server orchestration#107824
craig[bot] merged 1 commit intocockroachdb:masterfrom
knz:20230728-fix-race

Conversation

@knz
Copy link
Copy Markdown
Contributor

@knz knz commented Jul 28, 2023

Fixes #107564.
Informs #107791.
Supersedes #107666.

The previous fix in this
area (5ca5703) correctly identified the case where createServerEntryLocked() was called concurrently with scanTenantsForRunnableServices(), in which case we ran the risk of immediately tearing down the new server because it hadn't be picked up by getExpectedRunningTenants().

However, the fix was incorrect: it was causing the controller mutex to be held through getExpectedRunningTenants(), which itself can hang. In that case, a cascading failure could result.

This patch changes the fix (and thus continues to solve the original problem) by ensuring we only look at entries to remove that existed prior to the call to getExpectedRunningTenants(). No mutex needs to be held here.

Release note: None
Epic: CRDB-28893

The previous fix in this
area (5ca5703) correctly identified
the case where `createServerEntryLocked()` was called concurrently
with `scanTenantsForRunnableServices()`, in which case we ran the risk
of immediately tearing down the new server because it hadn't be picked
up by `getExpectedRunningTenants()`.

However, the fix was incorrect: it was causing the controller mutex to
be held through `getExpectedRunningTenants()`, which itself can hang.
In that case, a cascading failure could result.

This patch changes the fix (and thus continues to solve the original
problem) by ensuring we only look at entries to remove that existed
prior to the call to `getExpectedRunningTenants()`. No mutex needs to
be held here.

Release note: None
@knz knz requested review from andrewbaptist and lidorcarmel July 28, 2023 19:55
@knz knz requested review from a team as code owners July 28, 2023 19:55
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Jul 28, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@knz knz added backport-23.1.x PAST MAINTENANCE SUPPORT: 23.1 patch releases via ER request only db-cy-23 labels Jul 28, 2023
@knz knz requested a review from stevendanna July 28, 2023 19:59
Copy link
Copy Markdown
Contributor

@lidorcarmel lidorcarmel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@andrewbaptist
Copy link
Copy Markdown

This seems much better, but I had a concern when looking at this related to using the TenantName as the key in the map. First there is a comment on TenantName that says Unlike TenantID it is not necessary for every tenant to have a name.. Second there are Set methods for TenantName. For operations like starting/stopping tenants, it seems safer to key all of this on TenantId. I'm not sure why we don't do that here.

@knz
Copy link
Copy Markdown
Contributor Author

knz commented Jul 28, 2023

For operations like starting/stopping tenants, it seems safer to key all of this on TenantId. I'm not sure why we don't do that here.

Names are immutables while tenants are candidate to be run as a service (so including when they are running).

I agree we should make this code clearer by using IDs for the map in server controller; but this is dependent on another project which is not complete yet.

@knz
Copy link
Copy Markdown
Contributor Author

knz commented Jul 28, 2023

This is the followup issue: #107827

Copy link
Copy Markdown

@andrewbaptist andrewbaptist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Thanks for this change, it is definitely safer and prevents holding the lock while running a query now.

@knz
Copy link
Copy Markdown
Contributor Author

knz commented Jul 28, 2023

TFYRs!

bors r=lidorcarmel,andrewbaptist

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jul 28, 2023

Build succeeded:

@craig craig bot merged commit c4232a1 into cockroachdb:master Jul 28, 2023
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Jul 28, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error setting reviewers, but backport branch blathers/backport-release-23.1-107824 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/107830/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. []

Backport to branch 23.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

craig bot pushed a commit that referenced this pull request Jul 28, 2023
107757: server,testcluster: make more tests work with secondary tenants r=yuzefovich a=knz


Epic: CRDB-18499
Informs #76378.

107831: roachtest: unskip acceptance/node-status r=lidorcarmel a=knz

Fixes #107791.
Test fixed in #107824

Release note: None
Epic: CRDB-28893

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
@knz knz mentioned this pull request Sep 20, 2023
@knz knz deleted the 20230728-fix-race branch September 20, 2023 19:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-23.1.x PAST MAINTENANCE SUPPORT: 23.1 patch releases via ER request only db-cy-23

Projects

None yet

Development

Successfully merging this pull request may close these issues.

roachtest: multitenant-upgrade failed

4 participants