roachprod: never start cockroach processes concurrently by renatolabs · Pull Request #113718 · cockroachdb/cockroach

renatolabs · 2023-11-02T21:49:15Z

This changes startup logic to always start cockroach processes sequentially: this ensures that node IDs match hostnames. Previously, we were starting nodes concurrently for upgrade tests because initialization tasks performed by roachprod required quorum.

Fixes: #113384.

Release note: None

cockroach-teamcity · 2023-11-02T21:49:23Z

This change is

renatolabs · 2023-11-06T14:55:35Z

I verified that node IDs now consistently match hostnames with this patch, including in upgrade tests. Also ran upgrade tests on this branch to validate that they pass, since those were the main tests that explicitly set RoachprodOpts.Sequential = false before:

https://teamcity.cockroachdb.com/viewLog.html?buildId=12494442&buildTypeId=Cockroach_Nightlies_RoachtestNightlyGceBazel

Ready for review.

@erikgrinaker could you sanity check that this fixes what you reported in #113384?

This changes removes the `Sequential` roachprod option (and command line flag) so that we always start cockroach processes sequentially: this ensures that node IDs match hostnames. Previously, we were starting nodes concurrently for upgrade tests because initialization tasks performed by roachprod required quorum. Fixes: cockroachdb#113384. Release note: None

erikgrinaker · 2023-11-06T18:15:47Z

@erikgrinaker could you sanity check that this fixes what you reported in #113384?

Sure, will do some runs after dinner. Thanks!

erikgrinaker · 2023-11-07T09:34:25Z

Confirming that #113222 passed 20/20 runs with this change.

I did see nodes occasionally fail to initialize after restarting, but the error was swallowed after 20 retries. I did not investigate this, since the clusters were nuked. For example:

[mixed-version-test/6_restart-node-2-with-binary-version-v22.2.6] 19:56:11 runner.go:382: ---------- STARTING (6): restart node 2 with binary version v22.2.6 ----------
[mixed-version-test/6_restart-node-2-with-binary-version-v22.2.6] 19:56:11 runner.go:396: binary versions: [1: 22.1, 2: 22.1, 3: 22.1, 4: 22.1]
[mixed-version-test/6_restart-node-2-with-binary-version-v22.2.6] 19:56:11 runner.go:397: cluster versions: [1: 22.1, 2: 22.1, 3: 22.1, 4: 22.1]
[mixed-version-test/6_restart-node-2-with-binary-version-v22.2.6] 19:56:11 clusterupgrade.go:254: restarting node 2 into version v22.2.6
19:56:11 cluster.go:674: test status: stopping nodes :2
local: stopping and waiting
19:56:14 cluster.go:674: test status: stopping nodes :2
local: stopping
19:56:14 cluster.go:2205: running cmd `test -e v22.2.6/cockroach |...` on nodes [:2]; details in run_195614.787626829_n2_test-e-v2226cockroac.log
19:56:14 cluster.go:674: test status: staging binary
[mixed-version-test/6_restart-node-2-with-binary-version-v22.2.6] 19:56:14 staging.go:358: Resolved release url for cockroach version v22.2.6: https://storage.googleapis.com/cockroach-release-artifacts-prod/cockroach-v22.2.6.linux-amd64.tgz
19:56:15 monitor.go:177: Monitor event: n2: cockroach process for system interface died (exit code unknown): expected
19:56:18 cluster.go:674: test status: 
19:56:18 cluster.go:674: test status: starting nodes :2
[mixed-version-test/6_restart-node-2-with-binary-version-v22.2.6] 19:56:19 cockroach.go:380: local: starting nodes
local: executing sql
local: executing sql
19:56:20 monitor.go:177: Monitor event: n2: cockroach process for system interface is running (PID: 22341)
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
[mixed-version-test/6_restart-node-2-with-binary-version-v22.2.6] 19:56:52 cockroach.go:1078: not creating default admin user due to error: COMMAND_PROBLEM: exit status 127
[mixed-version-test/6_restart-node-2-with-binary-version-v22.2.6] 19:56:52 cockroach.go:1093: local: setting cluster settings
[mixed-version-test/6_restart-node-2-with-binary-version-v22.2.6] 19:56:56 runner.go:382: ---------- FINISHED [44.467561885s] (6): restart node 2 with binary version v22.2.6 ----------
[mixed-version-test/7_restart-node-4-with-binary-version-v22.2.6] 19:56:56 runner.go:382: ---------- STARTING (7): restart node 4 with binary version v22.2.6 ----------
[mixed-version-test/7_restart-node-4-with-binary-version-v22.2.6] 19:56:56 runner.go:396: binary versions: [1: 22.1, 2: 22.2, 3: 22.1, 4: 22.1]
[mixed-version-test/7_restart-node-4-with-binary-version-v22.2.6] 19:56:56 runner.go:397: cluster versions: [1: 22.1, 2: 22.1, 3: 22.1, 4: 22.1]
[mixed-version-test/7_restart-node-4-with-binary-version-v22.2.6] 19:56:56 clusterupgrade.go:254: restarting node 4 into version v22.2.6
19:56:56 cluster.go:674: test status: stopping nodes :4
local: stopping and waiting
19:57:01 monitor.go:177: Monitor event: n4: cockroach process for system interface died (exit code unknown): expected
19:57:01 cluster.go:674: test status: stopping nodes :4
local: stopping
19:57:01 cluster.go:2205: running cmd `test -e v22.2.6/cockroach |...` on nodes [:4]; details in run_195701.303442393_n4_test-e-v2226cockroac.log
19:57:01 cluster.go:674: test status: staging binary
[mixed-version-test/7_restart-node-4-with-binary-version-v22.2.6] 19:57:01 staging.go:358: Resolved release url for cockroach version v22.2.6: https://storage.googleapis.com/cockroach-release-artifacts-prod/cockroach-v22.2.6.linux-amd64.tgz
19:57:05 cluster.go:674: test status: 
19:57:05 cluster.go:674: test status: starting nodes :4
[mixed-version-test/7_restart-node-4-with-binary-version-v22.2.6] 19:57:05 cockroach.go:380: local: starting nodes
local: executing sql
local: executing sql
local: executing sql
19:57:07 monitor.go:177: Monitor event: n4: cockroach process for system interface is running (PID: 23314)
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
[mixed-version-test/7_restart-node-4-with-binary-version-v22.2.6] 19:57:37 cockroach.go:1078: not creating default admin user due to error: COMMAND_PROBLEM: exit status 127
[mixed-version-test/7_restart-node-4-with-binary-version-v22.2.6] 19:57:37 cockroach.go:1093: local: setting cluster settings
[mixed-version-test/7_restart-node-4-with-binary-version-v22.2.6] 19:57:41 runner.go:382: ---------- FINISHED [45.83830969s] (7): restart node 4 with binary version v22.2.6 ----------

See inline comments; attempting to run init functions when restarting just adds noise to the logs. In addition, in some situations, it might fail because the `init` node is fixed and there is an assumption on roachprod that the init node will have the binary being used to start another node. This might not be the case in mixedversion tests. Epic: none Release note: None

renatolabs · 2023-11-07T15:11:46Z

Thanks for checking!

I did see nodes occasionally fail to initialize after restarting, but the error was swallowed after 20 retries

Good catch; this happens when the ordering of events is such that n1 (init node) does not have the binary being used to start another node. The issue was not introduced in this PR, but I added a commit here to skip initialization when restarting nodes.

RaduBerinde

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @herkolategan and @srosenberg)

renatolabs · 2023-11-07T18:31:32Z

TFTR!

bors r=RaduBerinde

craig · 2023-11-07T19:26:04Z

Build succeeded:

Bazel Essential CI (Cockroach)

blathers-crl · 2023-11-07T19:26:18Z

Encountered an error creating backports. Some common things that can go wrong:

The backport branch might have already existed.
There was a merge conflict.
The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.

error creating merge commit from 932f39b to blathers/backport-release-23.1-113718: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.}

In cockroachdb#113718, we changed how initialization is done in order to remove the concurrent start of cockroach process. In that change, we also unified the `shouldInit` variable that controls whether the cluster should be initialized with `cockroach init`. However, that particular change was a mistake: it conflated cluster initialization with whether "startup tasks" should run. This commit fixes that issue by separating the two concepts; now one-node clusters should still skip `cockroach init`, but they will run startup tasks as usual. Fixes: cockroachdb#114015. Release note: None

114024: catalog/lease: deflake TestTableCreationPushesTxnsInRecentPast r=fqazi a=fqazi Previously, the TestTableCreationPushesTxnsInRecentPast test could be flaky because the clock uncertainty may not have been sufficient to cause a transaction to get pushed. To address this, this patch modifies the test to inject a small amount of latency to improve reliability. Fixes: #113208 Release note: None 114025: roachprod: fix initialization of 1-node clusters r=RaduBerinde,herkolategan a=renatolabs In #113718, we changed how initialization is done in order to remove the concurrent start of cockroach process. In that change, we also unified the `shouldInit` variable that controls whether the cluster should be initialized with `cockroach init`. However, that particular change was a mistake: it conflated cluster initialization with whether "startup tasks" should run. This commit fixes that issue by separating the two concepts; now one-node clusters should still skip `cockroach init`, but they will run startup tasks as usual. Fixes: #114015. Release note: None 114068: build: build `com_github_azure_azure_sdk_for_go` on `large` pool r=rail a=rickystewart When building with `race`, this will OOM. Using the `large` pool grants it some extra memory. Epic: CRDB-8308 Release note: None Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com> Co-authored-by: Renato Costa <renato@cockroachlabs.com> Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>

In #113718, we changed how initialization is done in order to remove the concurrent start of cockroach process. In that change, we also unified the `shouldInit` variable that controls whether the cluster should be initialized with `cockroach init`. However, that particular change was a mistake: it conflated cluster initialization with whether "startup tasks" should run. This commit fixes that issue by separating the two concepts; now one-node clusters should still skip `cockroach init`, but they will run startup tasks as usual. Fixes: #114015. Release note: None

renatolabs force-pushed the rc/start-nodes-sequentially branch 3 times, most recently from 7f8debb to f06fac7 Compare November 3, 2023 19:39

renatolabs marked this pull request as ready for review November 6, 2023 14:52

renatolabs requested a review from a team as a code owner November 6, 2023 14:52

renatolabs requested review from herkolategan and srosenberg and removed request for a team November 6, 2023 14:52

renatolabs force-pushed the rc/start-nodes-sequentially branch from f06fac7 to 932f39b Compare November 6, 2023 18:13

RaduBerinde approved these changes Nov 7, 2023

View reviewed changes

renatolabs added backport-23.2.x PAST MAINTENANCE SUPPORT: 23.2 patch releases via ER request only backport-23.1.x PAST MAINTENANCE SUPPORT: 23.1 patch releases via ER request only labels Nov 7, 2023

craig bot merged commit b5dd3f1 into cockroachdb:master Nov 7, 2023

blathers-crl bot mentioned this pull request Nov 7, 2023

release-23.2: roachprod: never start cockroach processes concurrently #113963

Merged

renatolabs mentioned this pull request Nov 7, 2023

release-23.1: roachprod: never start cockroach processes concurrently #113964

Merged

yuzefovich mentioned this pull request Nov 8, 2023

roachtest: costfuzz/rand-tables failed #114020

Closed

renatolabs mentioned this pull request Nov 8, 2023

roachprod: fix initialization of 1-node clusters #114025

Merged

blathers-crl bot mentioned this pull request Nov 8, 2023

release-23.1: roachprod: fix initialization of 1-node clusters #114084

Merged

blathers-crl bot mentioned this pull request Nov 8, 2023

release-23.2: roachprod: fix initialization of 1-node clusters #114086

Merged

renatolabs deleted the rc/start-nodes-sequentially branch November 11, 2023 01:06

DarrylWong mentioned this pull request Nov 4, 2024

roachprod: start crdb in parallel on all cluster nodes. #133647

Closed

srosenberg mentioned this pull request Mar 5, 2025

roachprod: potential race in node id allocation #142313

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

roachprod: never start cockroach processes concurrently#113718

roachprod: never start cockroach processes concurrently#113718
craig[bot] merged 2 commits intocockroachdb:masterfrom
renatolabs:rc/start-nodes-sequentially

renatolabs commented Nov 2, 2023 •

edited

Loading

Uh oh!

cockroach-teamcity commented Nov 2, 2023

Uh oh!

renatolabs commented Nov 6, 2023 •

edited

Loading

Uh oh!

erikgrinaker commented Nov 6, 2023

Uh oh!

erikgrinaker commented Nov 7, 2023

Uh oh!

renatolabs commented Nov 7, 2023

Uh oh!

RaduBerinde left a comment

Uh oh!

renatolabs commented Nov 7, 2023

Uh oh!

craig bot commented Nov 7, 2023

Uh oh!

blathers-crl bot commented Nov 7, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

renatolabs commented Nov 2, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cockroach-teamcity commented Nov 2, 2023

Uh oh!

renatolabs commented Nov 6, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

erikgrinaker commented Nov 6, 2023

Uh oh!

erikgrinaker commented Nov 7, 2023

Uh oh!

renatolabs commented Nov 7, 2023

Uh oh!

RaduBerinde left a comment

Choose a reason for hiding this comment

Uh oh!

renatolabs commented Nov 7, 2023

Uh oh!

craig bot commented Nov 7, 2023

Uh oh!

blathers-crl bot commented Nov 7, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

renatolabs commented Nov 2, 2023 •

edited

Loading

renatolabs commented Nov 6, 2023 •

edited

Loading