Skip to content

roachprod: never start cockroach processes concurrently#113718

Merged
craig[bot] merged 2 commits intocockroachdb:masterfrom
renatolabs:rc/start-nodes-sequentially
Nov 7, 2023
Merged

roachprod: never start cockroach processes concurrently#113718
craig[bot] merged 2 commits intocockroachdb:masterfrom
renatolabs:rc/start-nodes-sequentially

Conversation

@renatolabs
Copy link
Copy Markdown

@renatolabs renatolabs commented Nov 2, 2023

This changes startup logic to always start cockroach processes sequentially: this ensures that node IDs match hostnames. Previously, we were starting nodes concurrently for upgrade tests because initialization tasks performed by roachprod required quorum.

Fixes: #113384.

Release note: None

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@renatolabs renatolabs force-pushed the rc/start-nodes-sequentially branch 3 times, most recently from 7f8debb to f06fac7 Compare November 3, 2023 19:39
@renatolabs renatolabs marked this pull request as ready for review November 6, 2023 14:52
@renatolabs renatolabs requested a review from a team as a code owner November 6, 2023 14:52
@renatolabs renatolabs requested review from herkolategan and srosenberg and removed request for a team November 6, 2023 14:52
@renatolabs
Copy link
Copy Markdown
Author

renatolabs commented Nov 6, 2023

I verified that node IDs now consistently match hostnames with this patch, including in upgrade tests. Also ran upgrade tests on this branch to validate that they pass, since those were the main tests that explicitly set RoachprodOpts.Sequential = false before:

https://teamcity.cockroachdb.com/viewLog.html?buildId=12494442&buildTypeId=Cockroach_Nightlies_RoachtestNightlyGceBazel

Ready for review.

@erikgrinaker could you sanity check that this fixes what you reported in #113384?

This changes removes the `Sequential` roachprod option (and command
line flag) so that we always start cockroach processes sequentially:
this ensures that node IDs match hostnames. Previously, we were
starting nodes concurrently for upgrade tests because initialization
tasks performed by roachprod required quorum.

Fixes: cockroachdb#113384.

Release note: None
@renatolabs renatolabs force-pushed the rc/start-nodes-sequentially branch from f06fac7 to 932f39b Compare November 6, 2023 18:13
@erikgrinaker
Copy link
Copy Markdown
Contributor

@erikgrinaker could you sanity check that this fixes what you reported in #113384?

Sure, will do some runs after dinner. Thanks!

@erikgrinaker
Copy link
Copy Markdown
Contributor

Confirming that #113222 passed 20/20 runs with this change.

I did see nodes occasionally fail to initialize after restarting, but the error was swallowed after 20 retries. I did not investigate this, since the clusters were nuked. For example:

[mixed-version-test/6_restart-node-2-with-binary-version-v22.2.6] 19:56:11 runner.go:382: ---------- STARTING (6): restart node 2 with binary version v22.2.6 ----------
[mixed-version-test/6_restart-node-2-with-binary-version-v22.2.6] 19:56:11 runner.go:396: binary versions: [1: 22.1, 2: 22.1, 3: 22.1, 4: 22.1]
[mixed-version-test/6_restart-node-2-with-binary-version-v22.2.6] 19:56:11 runner.go:397: cluster versions: [1: 22.1, 2: 22.1, 3: 22.1, 4: 22.1]
[mixed-version-test/6_restart-node-2-with-binary-version-v22.2.6] 19:56:11 clusterupgrade.go:254: restarting node 2 into version v22.2.6
19:56:11 cluster.go:674: test status: stopping nodes :2
local: stopping and waiting
19:56:14 cluster.go:674: test status: stopping nodes :2
local: stopping
19:56:14 cluster.go:2205: running cmd `test -e v22.2.6/cockroach |...` on nodes [:2]; details in run_195614.787626829_n2_test-e-v2226cockroac.log
19:56:14 cluster.go:674: test status: staging binary
[mixed-version-test/6_restart-node-2-with-binary-version-v22.2.6] 19:56:14 staging.go:358: Resolved release url for cockroach version v22.2.6: https://storage.googleapis.com/cockroach-release-artifacts-prod/cockroach-v22.2.6.linux-amd64.tgz
19:56:15 monitor.go:177: Monitor event: n2: cockroach process for system interface died (exit code unknown): expected
19:56:18 cluster.go:674: test status: 
19:56:18 cluster.go:674: test status: starting nodes :2
[mixed-version-test/6_restart-node-2-with-binary-version-v22.2.6] 19:56:19 cockroach.go:380: local: starting nodes
local: executing sql
local: executing sql
19:56:20 monitor.go:177: Monitor event: n2: cockroach process for system interface is running (PID: 22341)
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
[mixed-version-test/6_restart-node-2-with-binary-version-v22.2.6] 19:56:52 cockroach.go:1078: not creating default admin user due to error: COMMAND_PROBLEM: exit status 127
[mixed-version-test/6_restart-node-2-with-binary-version-v22.2.6] 19:56:52 cockroach.go:1093: local: setting cluster settings
[mixed-version-test/6_restart-node-2-with-binary-version-v22.2.6] 19:56:56 runner.go:382: ---------- FINISHED [44.467561885s] (6): restart node 2 with binary version v22.2.6 ----------
[mixed-version-test/7_restart-node-4-with-binary-version-v22.2.6] 19:56:56 runner.go:382: ---------- STARTING (7): restart node 4 with binary version v22.2.6 ----------
[mixed-version-test/7_restart-node-4-with-binary-version-v22.2.6] 19:56:56 runner.go:396: binary versions: [1: 22.1, 2: 22.2, 3: 22.1, 4: 22.1]
[mixed-version-test/7_restart-node-4-with-binary-version-v22.2.6] 19:56:56 runner.go:397: cluster versions: [1: 22.1, 2: 22.1, 3: 22.1, 4: 22.1]
[mixed-version-test/7_restart-node-4-with-binary-version-v22.2.6] 19:56:56 clusterupgrade.go:254: restarting node 4 into version v22.2.6
19:56:56 cluster.go:674: test status: stopping nodes :4
local: stopping and waiting
19:57:01 monitor.go:177: Monitor event: n4: cockroach process for system interface died (exit code unknown): expected
19:57:01 cluster.go:674: test status: stopping nodes :4
local: stopping
19:57:01 cluster.go:2205: running cmd `test -e v22.2.6/cockroach |...` on nodes [:4]; details in run_195701.303442393_n4_test-e-v2226cockroac.log
19:57:01 cluster.go:674: test status: staging binary
[mixed-version-test/7_restart-node-4-with-binary-version-v22.2.6] 19:57:01 staging.go:358: Resolved release url for cockroach version v22.2.6: https://storage.googleapis.com/cockroach-release-artifacts-prod/cockroach-v22.2.6.linux-amd64.tgz
19:57:05 cluster.go:674: test status: 
19:57:05 cluster.go:674: test status: starting nodes :4
[mixed-version-test/7_restart-node-4-with-binary-version-v22.2.6] 19:57:05 cockroach.go:380: local: starting nodes
local: executing sql
local: executing sql
local: executing sql
19:57:07 monitor.go:177: Monitor event: n4: cockroach process for system interface is running (PID: 23314)
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
local: executing sql
[mixed-version-test/7_restart-node-4-with-binary-version-v22.2.6] 19:57:37 cockroach.go:1078: not creating default admin user due to error: COMMAND_PROBLEM: exit status 127
[mixed-version-test/7_restart-node-4-with-binary-version-v22.2.6] 19:57:37 cockroach.go:1093: local: setting cluster settings
[mixed-version-test/7_restart-node-4-with-binary-version-v22.2.6] 19:57:41 runner.go:382: ---------- FINISHED [45.83830969s] (7): restart node 4 with binary version v22.2.6 ----------

See inline comments; attempting to run init functions when restarting
just adds noise to the logs. In addition, in some situations, it might
fail because the `init` node is fixed and there is an assumption on
roachprod that the init node will have the binary being used to start
another node. This might not be the case in mixedversion tests.

Epic: none

Release note: None
@renatolabs
Copy link
Copy Markdown
Author

Thanks for checking!

I did see nodes occasionally fail to initialize after restarting, but the error was swallowed after 20 retries

Good catch; this happens when the ordering of events is such that n1 (init node) does not have the binary being used to start another node. The issue was not introduced in this PR, but I added a commit here to skip initialization when restarting nodes.

Copy link
Copy Markdown
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @herkolategan and @srosenberg)

@renatolabs renatolabs added backport-23.2.x PAST MAINTENANCE SUPPORT: 23.2 patch releases via ER request only backport-23.1.x PAST MAINTENANCE SUPPORT: 23.1 patch releases via ER request only labels Nov 7, 2023
@renatolabs
Copy link
Copy Markdown
Author

TFTR!

bors r=RaduBerinde

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Nov 7, 2023

Build succeeded:

@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Nov 7, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 932f39b to blathers/backport-release-23.1-113718: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

renatolabs pushed a commit to renatolabs/cockroach that referenced this pull request Nov 8, 2023
In cockroachdb#113718, we changed how initialization is done in order to remove
the concurrent start of cockroach process. In that change, we also
unified the `shouldInit` variable that controls whether the cluster
should be initialized with `cockroach init`. However, that particular
change was a mistake: it conflated cluster initialization with whether
"startup tasks" should run.

This commit fixes that issue by separating the two concepts; now
one-node clusters should still skip `cockroach init`, but they will
run startup tasks as usual.

Fixes: cockroachdb#114015.

Release note: None
craig bot pushed a commit that referenced this pull request Nov 8, 2023
114024: catalog/lease: deflake TestTableCreationPushesTxnsInRecentPast r=fqazi a=fqazi

Previously, the TestTableCreationPushesTxnsInRecentPast test could be flaky because the clock uncertainty may not have been sufficient to cause a transaction to get pushed. To address this, this patch modifies the test to inject a small amount of latency to improve reliability.

Fixes: #113208

Release note: None

114025: roachprod: fix initialization of 1-node clusters r=RaduBerinde,herkolategan a=renatolabs

In #113718, we changed how initialization is done in order to remove the concurrent start of cockroach process. In that change, we also unified the `shouldInit` variable that controls whether the cluster should be initialized with `cockroach init`. However, that particular change was a mistake: it conflated cluster initialization with whether "startup tasks" should run.

This commit fixes that issue by separating the two concepts; now one-node clusters should still skip `cockroach init`, but they will run startup tasks as usual.

Fixes: #114015.

Release note: None

114068: build: build `com_github_azure_azure_sdk_for_go` on `large` pool r=rail a=rickystewart

When building with `race`, this will OOM. Using the `large` pool grants it some extra memory.

Epic: CRDB-8308
Release note: None

Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
Co-authored-by: Renato Costa <renato@cockroachlabs.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this pull request Nov 8, 2023
In #113718, we changed how initialization is done in order to remove
the concurrent start of cockroach process. In that change, we also
unified the `shouldInit` variable that controls whether the cluster
should be initialized with `cockroach init`. However, that particular
change was a mistake: it conflated cluster initialization with whether
"startup tasks" should run.

This commit fixes that issue by separating the two concepts; now
one-node clusters should still skip `cockroach init`, but they will
run startup tasks as usual.

Fixes: #114015.

Release note: None
blathers-crl bot pushed a commit that referenced this pull request Nov 8, 2023
In #113718, we changed how initialization is done in order to remove
the concurrent start of cockroach process. In that change, we also
unified the `shouldInit` variable that controls whether the cluster
should be initialized with `cockroach init`. However, that particular
change was a mistake: it conflated cluster initialization with whether
"startup tasks" should run.

This commit fixes that issue by separating the two concepts; now
one-node clusters should still skip `cockroach init`, but they will
run startup tasks as usual.

Fixes: #114015.

Release note: None
@renatolabs renatolabs deleted the rc/start-nodes-sequentially branch November 11, 2023 01:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-23.1.x PAST MAINTENANCE SUPPORT: 23.1 patch releases via ER request only backport-23.2.x PAST MAINTENANCE SUPPORT: 23.2 patch releases via ER request only

Projects

None yet

Development

Successfully merging this pull request may close these issues.

roachtestutil/mixedversion: inconsistent node IDs

4 participants