server: fix the tenant server error handling#100436
Merged
craig[bot] merged 9 commits intocockroachdb:masterfrom Apr 6, 2023
Merged
server: fix the tenant server error handling#100436craig[bot] merged 9 commits intocockroachdb:masterfrom
craig[bot] merged 9 commits intocockroachdb:masterfrom
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
Member
ef0f50f to
738e5ec
Compare
Contributor
Author
|
Encountering this issue while troubleshooting TestServerControllerMultiNodeTenantStartup. Will come back to this later. |
738e5ec to
4eaffc1
Compare
Contributor
Author
|
Adding this diff to help with #100578: @@ -73,6 +74,12 @@ func (u Updater) update(ctx context.Context, useReadLock bool, updateFn UpdateFn
ctx, sp := tracing.ChildSpan(ctx, "update-job")
defer sp.Finish()
+ // Disable DistSQL query distribution. This ensures that the job operations do not
+ // require SQL servers to be ready on other nodes.
+ prevMode := u.txn.SessionData().DistSQLMode
+ defer func(prevMode sessiondatapb.DistSQLExecMode) { u.txn.SessionData().DistSQLMode = prevMode }(prevMode)
+ u.txn.SessionData().DistSQLMode = sessiondatapb.DistSQLOff
+ |
This was referenced Apr 4, 2023
4eaffc1 to
e73eab3
Compare
e73eab3 to
927c1c4
Compare
927c1c4 to
dc5eb63
Compare
dc5eb63 to
07dd47f
Compare
craig bot
pushed a commit
that referenced
this pull request
Apr 6, 2023
100579: jobs: prevent a deadlock during upgrades r=yuzefovich a=knz Needed for #100436. Might address #100578 (unsure) Epic: None Prior to this patch, if multiple SQL instances were started side-by-side but behind on migrations, they would deadlock on performing their migrations because distsql on each instance would be unable to reach other other instances. More generally, we're finding it undesirable for the jobs subsystem to operate on system.jobs / job_info using distributed queries. This patch fixes it by disabling query distribution during job operations. The testing here is implicit when the test TestServerControllerMultiNodeTenantStartup is stressed - this used to deadlock under stress without this patch. 100764: roachtest: remove old executables before installing ruby r=rafiss a=rafiss This should prevent errors during installation. fixes #95004 fixes #100428 backport fixes #100586 Release note: None Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net> Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
07dd47f to
53f6be2
Compare
…enantStartup Release note: None
Release note: None
Release note: None
Release note: None
ahead of splitting "new" vs "start" during construction. Release note: None
Release note: None
This peels the call to "start" from the `newTenantServer` interface and pulls it into the orchestration retry loop. This change also incidentally reveals an earlier misdesign: we are calling `newTenantServer` _then_ `start` in the same retry loop. If `new` succeeds but `start` fails, the next retry will call `newTenantServer` again *with the same stopper*, which will leak closers from the previous call to `new`. Release note: None
Prior to this patch, if an error occurred during the initialization or startup of a secondary tenant server, the initialization would leak state into the stopper defined for that tenant. Generally, reusing a stopper across server startup failures is not safe (and API violation). This patch fixes it by decoupling the intermediate stopper used for orchestration from the one used per tenant server. Release note: None
Prior to this patch, the test was not cleaning up its server stopper reliably at the end of each sub-test. This patch fixes it. Release note: None
53f6be2 to
1491929
Compare
craig bot
pushed a commit
that referenced
this pull request
Apr 6, 2023
99958: jobs,server: graceful shutdown for secondary tenant servers r=stevendanna a=knz Epic: CRDB-23559 Fixes #92523. All commits but the last are from #100436. This change ensures that tenant servers managed by the server controller receive a graceful drain request as part of the graceful drain process of the surrounding KV node. This change, in turn, ensures that SQL clients connected to these secondary tenant servers benefit from the same guarantees (and graceful periods) as clients to the system tenant. 100726: upgrades: use TestingBinaryMinSupportedVersion in tests r=rafiss a=rafiss As described in #100552, it's important for this API to use TestingBinaryMinSupportedVersion in order to correctly bootstrap on the older version. informs #100552 Release note: None 100741: contextutil: teach TimeoutError to redact only the operation name r=andreimatei a=andreimatei Before this patch, the whole message of TimeoutError was redacted in logs. Now, only the operation name is. Release note: None Epic: None 100778: norm: update prune cols to match PruneJoinLeftCols/PruneJoinRightCols r=msirek a=msirek In #90599 adjustments where made to the PruneJoinLeftCols and PruneJoinRightCols normalization rules to avoid pruning columns which might be needed when deriving new predicates based on foreign key constraints for lookup join. However, this caused a problem where rules might sometimes fire in an infinite loop because the same columns to prune keep getting added as PruneCols in calls to DerivePruneCols. The logic in prune_cols.opt and DerivePruneCols must be kept in sync to avoid such problems, and this PR brings it back in sync. Epic: none Fixes: #100478 Release note: None 100821: cmd/roachtest: adjust disk-stalled roachtests TPS calculation r=itsbilal a=jbowens Previously, the post-stall TPS calculation included the time that the node was stalled but before the stall triggered the node's exit. During this period, overall TPS drops until the gray failure is converted into a hard failure. This commit adjusts the post-stall TPS calculation to exclude the stalled time when TPS is expected to tank. Epic: None Informs: #97705. Release note: None Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net> Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com> Co-authored-by: Andrei Matei <andrei@cockroachlabs.com> Co-authored-by: Mark Sirek <sirek@cockroachlabs.com> Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>
This was referenced Apr 10, 2023
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Needed for #99958.
Fixes #97661.
Fixes #98868.
Epic: CRDB-23559
First commit from #100579.
Prior to this patch, if an error occurred during the initialization or
startup of a secondary tenant server, the initialization would leak
state into the stopper defined for that tenant. Generally, reusing
a stopper across server startup failures is not safe (and API
violation).
This patch fixes it by decoupling the intermediate stopper used for
orchestration from the one used per tenant server.