server: prevent deadlocks in server orchestration#107824
server: prevent deadlocks in server orchestration#107824craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
The previous fix in this area (5ca5703) correctly identified the case where `createServerEntryLocked()` was called concurrently with `scanTenantsForRunnableServices()`, in which case we ran the risk of immediately tearing down the new server because it hadn't be picked up by `getExpectedRunningTenants()`. However, the fix was incorrect: it was causing the controller mutex to be held through `getExpectedRunningTenants()`, which itself can hang. In that case, a cascading failure could result. This patch changes the fix (and thus continues to solve the original problem) by ensuring we only look at entries to remove that existed prior to the call to `getExpectedRunningTenants()`. No mutex needs to be held here. Release note: None
|
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
|
This seems much better, but I had a concern when looking at this related to using the TenantName as the key in the map. First there is a comment on |
Names are immutables while tenants are candidate to be run as a service (so including when they are running). I agree we should make this code clearer by using IDs for the map in server controller; but this is dependent on another project which is not complete yet. |
|
This is the followup issue: #107827 |
|
TFYRs! bors r=lidorcarmel,andrewbaptist |
|
Build succeeded: |
|
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error setting reviewers, but backport branch blathers/backport-release-23.1-107824 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/107830/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. [] Backport to branch 23.1.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
107757: server,testcluster: make more tests work with secondary tenants r=yuzefovich a=knz Epic: CRDB-18499 Informs #76378. 107831: roachtest: unskip acceptance/node-status r=lidorcarmel a=knz Fixes #107791. Test fixed in #107824 Release note: None Epic: CRDB-28893 Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Fixes #107564.
Informs #107791.
Supersedes #107666.
The previous fix in this
area (5ca5703) correctly identified the case where
createServerEntryLocked()was called concurrently withscanTenantsForRunnableServices(), in which case we ran the risk of immediately tearing down the new server because it hadn't be picked up bygetExpectedRunningTenants().However, the fix was incorrect: it was causing the controller mutex to be held through
getExpectedRunningTenants(), which itself can hang. In that case, a cascading failure could result.This patch changes the fix (and thus continues to solve the original problem) by ensuring we only look at entries to remove that existed prior to the call to
getExpectedRunningTenants(). No mutex needs to be held here.Release note: None
Epic: CRDB-28893