server: fix the tenant server error handling by knz · Pull Request #100436 · cockroachdb/cockroach

knz · 2023-04-02T18:07:44Z

Needed for #99958.
Fixes #97661.
Fixes #98868.
Epic: CRDB-23559
First commit from #100579.

Prior to this patch, if an error occurred during the initialization or
startup of a secondary tenant server, the initialization would leak
state into the stopper defined for that tenant. Generally, reusing
a stopper across server startup failures is not safe (and is an API
violation).

This patch fixes it by decoupling the intermediate stopper used for
orchestration from the one used per tenant server.

knz · 2023-04-04T11:04:57Z

Encountering this issue while troubleshooting TestServerControllerMultiNodeTenantStartup. Will come back to this later.

knz · 2023-04-04T11:10:39Z

Adding this diff to help with #100578:

@@ -73,6 +74,12 @@ func (u Updater) update(ctx context.Context, useReadLock bool, updateFn UpdateFn
        ctx, sp := tracing.ChildSpan(ctx, "update-job")
        defer sp.Finish()

+       // Disable DistSQL query distribution. This ensures that the job operations do not
+       // require SQL servers to be ready on other nodes.
+       prevMode := u.txn.SessionData().DistSQLMode
+       defer func(prevMode sessiondatapb.DistSQLExecMode) { u.txn.SessionData().DistSQLMode = prevMode }(prevMode)
+       u.txn.SessionData().DistSQLMode = sessiondatapb.DistSQLOff
+
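
The diff above saves the current distribution mode, forces it off, and restores it with a deferred call. A minimal, self-contained sketch of that save-and-restore pattern follows. The names `distMode`, `sessionData`, and `withDistributionOff` are hypothetical stand-ins for illustration only; they are not the CockroachDB types or API.

```go
package main

import "fmt"

// distMode and sessionData are hypothetical stand-ins for the session's
// distribution setting; they are not the CockroachDB types.
type distMode int

const (
	distAuto distMode = iota
	distOff
)

type sessionData struct {
	mode distMode
}

// withDistributionOff runs fn with distribution disabled and restores the
// previous mode afterwards, including when fn returns an error or panics.
func withDistributionOff(sd *sessionData, fn func() error) error {
	prev := sd.mode
	defer func() { sd.mode = prev }()
	sd.mode = distOff
	return fn()
}

func main() {
	sd := &sessionData{mode: distAuto}
	var during distMode
	_ = withDistributionOff(sd, func() error {
		during = sd.mode // observed while the override is active
		return nil
	})
	fmt.Println(during == distOff, sd.mode == distAuto)
}
```

Wrapping the override in a helper is one possible design; the diff itself inlines the same three steps directly in the update function.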

100579: jobs: prevent a deadlock during upgrades r=yuzefovich a=knz

Needed for #100436. Might address #100578 (unsure).

Epic: None

Prior to this patch, if multiple SQL instances were started side-by-side but behind on migrations, they would deadlock on performing their migrations because distsql on each instance would be unable to reach the other instances.

More generally, we're finding it undesirable for the jobs subsystem to operate on system.jobs / job_info using distributed queries.

This patch fixes it by disabling query distribution during job operations.

The testing here is implicit when the test TestServerControllerMultiNodeTenantStartup is stressed: it used to deadlock under stress without this patch.

100764: roachtest: remove old executables before installing ruby r=rafiss a=rafiss

This should prevent errors during installation.

fixes #95004
fixes #100428
backport fixes #100586

Release note: None

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>

This peels the call to "start" from the `newTenantServer` interface and pulls it into the orchestration retry loop. This change also incidentally reveals an earlier misdesign: we are calling `newTenantServer` _then_ `start` in the same retry loop. If `new` succeeds but `start` fails, the next retry will call `newTenantServer` again *with the same stopper*, which will leak closers from the previous call to `new`. Release note: None
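
The leak described above can be avoided by giving each retry attempt its own stopper and stopping it when the attempt fails. The following is a minimal sketch under stated assumptions: `stopper`, `server`, `newServer`, and `startWithRetry` are hypothetical stand-ins written for this illustration, not the CockroachDB `stop.Stopper` or the actual orchestration code.

```go
package main

import (
	"errors"
	"fmt"
)

// stopper is a hypothetical stand-in for a per-server stopper: it only
// records closers and runs them in reverse order on Stop.
type stopper struct {
	closers []func()
}

func (s *stopper) AddCloser(f func()) { s.closers = append(s.closers, f) }

func (s *stopper) Stop() {
	for i := len(s.closers) - 1; i >= 0; i-- {
		s.closers[i]()
	}
	s.closers = nil
}

// server is a hypothetical tenant server whose constructor registers
// one closer on the stopper it is given.
type server struct {
	attempt int
}

func newServer(s *stopper, attempt int, closed *[]int) *server {
	s.AddCloser(func() { *closed = append(*closed, attempt) })
	return &server{attempt: attempt}
}

// start fails until the given attempt number is reached.
func (srv *server) start(succeedAt int) error {
	if srv.attempt < succeedAt {
		return errors.New("start failed")
	}
	return nil
}

// startWithRetry runs "new" then "start" in a retry loop. Each attempt gets
// its own stopper, and a failed attempt is stopped before the next one
// begins, so closers from a failed attempt never accumulate.
func startWithRetry(maxAttempts, succeedAt int, closed *[]int) (*server, *stopper, error) {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		s := &stopper{}
		srv := newServer(s, attempt, closed)
		if err := srv.start(succeedAt); err != nil {
			s.Stop() // release everything registered by this attempt
			continue
		}
		return srv, s, nil
	}
	return nil, nil, errors.New("all attempts failed")
}

func main() {
	var closed []int
	srv, s, err := startWithRetry(5, 3, &closed)
	// Attempts 1 and 2 fail and are cleaned up; attempt 3 succeeds and
	// its stopper holds exactly one closer.
	fmt.Println(srv.attempt, len(s.closers), closed, err) // prints "3 1 [1 2] <nil>"
}
```

Had the loop shared one stopper across attempts, the stopper returned with the successful server would hold three closers instead of one, two of them belonging to servers that never started.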

99958: jobs,server: graceful shutdown for secondary tenant servers r=stevendanna a=knz

Epic: CRDB-23559
Fixes #92523.
All commits but the last are from #100436.

This change ensures that tenant servers managed by the server controller receive a graceful drain request as part of the graceful drain process of the surrounding KV node.

This change, in turn, ensures that SQL clients connected to these secondary tenant servers benefit from the same guarantees (and graceful periods) as clients to the system tenant.

100726: upgrades: use TestingBinaryMinSupportedVersion in tests r=rafiss a=rafiss

As described in #100552, it's important for this API to use TestingBinaryMinSupportedVersion in order to correctly bootstrap on the older version.

informs #100552
Release note: None

100741: contextutil: teach TimeoutError to redact only the operation name r=andreimatei a=andreimatei

Before this patch, the whole message of TimeoutError was redacted in logs. Now, only the operation name is.

Release note: None
Epic: None

100778: norm: update prune cols to match PruneJoinLeftCols/PruneJoinRightCols r=msirek a=msirek

In #90599, adjustments were made to the PruneJoinLeftCols and PruneJoinRightCols normalization rules to avoid pruning columns which might be needed when deriving new predicates based on foreign key constraints for lookup join.

However, this caused a problem where rules might sometimes fire in an infinite loop because the same columns to prune keep getting added as PruneCols in calls to DerivePruneCols.

The logic in prune_cols.opt and DerivePruneCols must be kept in sync to avoid such problems, and this PR brings it back in sync.

Epic: none
Fixes: #100478
Release note: None

100821: cmd/roachtest: adjust disk-stalled roachtests TPS calculation r=itsbilal a=jbowens

Previously, the post-stall TPS calculation included the time that the node was stalled but before the stall triggered the node's exit. During this period, overall TPS drops until the gray failure is converted into a hard failure. This commit adjusts the post-stall TPS calculation to exclude the stalled time when TPS is expected to tank.

Epic: None
Informs: #97705.
Release note: None

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Co-authored-by: Mark Sirek <sirek@cockroachlabs.com>
Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>

knz requested a review from stevendanna April 2, 2023 18:30

knz marked this pull request as ready for review April 2, 2023 18:30

knz requested a review from a team as a code owner April 2, 2023 18:30

knz requested a review from a team April 2, 2023 18:30

knz requested review from a team as code owners April 2, 2023 18:30

knz force-pushed the 20230402-tenant-start branch 2 times, most recently from ef0f50f to 738e5ec Compare April 4, 2023 09:46

knz mentioned this pull request Apr 4, 2023

sql: distsql planning should avoid sending flows to SQL servers in the process of starting up, and in the process of shutting down #100578

Closed

knz force-pushed the 20230402-tenant-start branch from 738e5ec to 4eaffc1 Compare April 4, 2023 11:10

knz requested a review from a team as a code owner April 4, 2023 11:10

This was referenced Apr 4, 2023

jobs: prevent a deadlock during upgrades #100579

Merged

jobs,server: graceful shutdown for secondary tenant servers #99958

Merged

knz force-pushed the 20230402-tenant-start branch from 4eaffc1 to e73eab3 Compare April 4, 2023 13:41

abarganier removed the request for review from a team April 4, 2023 14:36

knz mentioned this pull request Apr 4, 2023

jobs: waitForJobs hangs on server shutdown #100660

Closed

knz force-pushed the 20230402-tenant-start branch from e73eab3 to 927c1c4 Compare April 5, 2023 21:58

knz requested a review from a team as a code owner April 5, 2023 21:58

knz force-pushed the 20230402-tenant-start branch from 927c1c4 to dc5eb63 Compare April 5, 2023 22:00

knz force-pushed the 20230402-tenant-start branch from dc5eb63 to 07dd47f Compare April 6, 2023 12:45

knz requested a review from a team as a code owner April 6, 2023 12:45

knz force-pushed the 20230402-tenant-start branch from 07dd47f to 53f6be2 Compare April 6, 2023 13:51

knz added 2 commits April 6, 2023 16:50

serverccl: clarify the progress inside TestServerControllerMultiNodeTenantStartup

599ae07

Release note: None

serverccl: make TestServerControllerMultiNodeTenantStartup faster

8b7093b

Release note: None

knz added 7 commits April 6, 2023 16:50

server: unexport some functions

aa23f85

Release note: None

server: save a reference to BaseConfig in SQLServerWrapper

14f3897

Release note: None

server: extend the onDemandServer interface

2705bdf

ahead of splitting "new" vs "start" during construction. Release note: None

server: lift reportTenantInfo into *SQLServerWrapper

20f0428

Release note: None

serverccl: simplify TestServerStartupGuardrails

1491929

Prior to this patch, the test was not cleaning up its server stopper reliably at the end of each sub-test. This patch fixes it. Release note: None

knz force-pushed the 20230402-tenant-start branch from 53f6be2 to 1491929 Compare April 6, 2023 14:50

craig bot merged commit 1491929 into cockroachdb:master Apr 6, 2023

knz deleted the 20230402-tenant-start branch April 6, 2023 18:13

This was referenced Apr 10, 2023

release-23.1: improve tenant server start/stop in shared-process multitenancy #101089

Merged

release-23.1.0: improve tenant server start/stop in shared-process multitenancy #101450

Merged

craig[bot] merged 9 commits into cockroachdb:master from knz:20230402-tenant-start
