roachprod: never start cockroach processes concurrently#113718
roachprod: never start cockroach processes concurrently#113718craig[bot] merged 2 commits intocockroachdb:masterfrom
Conversation
7f8debb to
f06fac7
Compare
|
I verified that node IDs now consistently match hostnames with this patch, including in upgrade tests. Also ran upgrade tests on this branch to validate that they pass, since those were the main tests that explicitly set Ready for review. @erikgrinaker could you sanity check that this fixes what you reported in #113384? |
This changes removes the `Sequential` roachprod option (and command line flag) so that we always start cockroach processes sequentially: this ensures that node IDs match hostnames. Previously, we were starting nodes concurrently for upgrade tests because initialization tasks performed by roachprod required quorum. Fixes: cockroachdb#113384. Release note: None
f06fac7 to
932f39b
Compare
Sure, will do some runs after dinner. Thanks! |
|
Confirming that #113222 passed 20/20 runs with this change. I did see nodes occasionally fail to initialize after restarting, but the error was swallowed after 20 retries. I did not investigate this, since the clusters were nuked. For example: |
See inline comments; attempting to run init functions when restarting just adds noise to the logs. In addition, in some situations, it might fail because the `init` node is fixed and there is an assumption on roachprod that the init node will have the binary being used to start another node. This might not be the case in mixedversion tests. Epic: none Release note: None
|
Thanks for checking!
Good catch; this happens when the ordering of events is such that n1 (init node) does not have the binary being used to start another node. The issue was not introduced in this PR, but I added a commit here to skip initialization when restarting nodes. |
RaduBerinde
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @herkolategan and @srosenberg)
|
TFTR! bors r=RaduBerinde |
|
Build succeeded: |
|
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from 932f39b to blathers/backport-release-23.1-113718: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 23.1.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
In cockroachdb#113718, we changed how initialization is done in order to remove the concurrent start of cockroach process. In that change, we also unified the `shouldInit` variable that controls whether the cluster should be initialized with `cockroach init`. However, that particular change was a mistake: it conflated cluster initialization with whether "startup tasks" should run. This commit fixes that issue by separating the two concepts; now one-node clusters should still skip `cockroach init`, but they will run startup tasks as usual. Fixes: cockroachdb#114015. Release note: None
114024: catalog/lease: deflake TestTableCreationPushesTxnsInRecentPast r=fqazi a=fqazi Previously, the TestTableCreationPushesTxnsInRecentPast test could be flaky because the clock uncertainty may not have been sufficient to cause a transaction to get pushed. To address this, this patch modifies the test to inject a small amount of latency to improve reliability. Fixes: #113208 Release note: None 114025: roachprod: fix initialization of 1-node clusters r=RaduBerinde,herkolategan a=renatolabs In #113718, we changed how initialization is done in order to remove the concurrent start of cockroach process. In that change, we also unified the `shouldInit` variable that controls whether the cluster should be initialized with `cockroach init`. However, that particular change was a mistake: it conflated cluster initialization with whether "startup tasks" should run. This commit fixes that issue by separating the two concepts; now one-node clusters should still skip `cockroach init`, but they will run startup tasks as usual. Fixes: #114015. Release note: None 114068: build: build `com_github_azure_azure_sdk_for_go` on `large` pool r=rail a=rickystewart When building with `race`, this will OOM. Using the `large` pool grants it some extra memory. Epic: CRDB-8308 Release note: None Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com> Co-authored-by: Renato Costa <renato@cockroachlabs.com> Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
In #113718, we changed how initialization is done in order to remove the concurrent start of cockroach process. In that change, we also unified the `shouldInit` variable that controls whether the cluster should be initialized with `cockroach init`. However, that particular change was a mistake: it conflated cluster initialization with whether "startup tasks" should run. This commit fixes that issue by separating the two concepts; now one-node clusters should still skip `cockroach init`, but they will run startup tasks as usual. Fixes: #114015. Release note: None
In #113718, we changed how initialization is done in order to remove the concurrent start of cockroach process. In that change, we also unified the `shouldInit` variable that controls whether the cluster should be initialized with `cockroach init`. However, that particular change was a mistake: it conflated cluster initialization with whether "startup tasks" should run. This commit fixes that issue by separating the two concepts; now one-node clusters should still skip `cockroach init`, but they will run startup tasks as usual. Fixes: #114015. Release note: None
This changes startup logic to always start cockroach processes sequentially: this ensures that node IDs match hostnames. Previously, we were starting nodes concurrently for upgrade tests because initialization tasks performed by roachprod required quorum.
Fixes: #113384.
Release note: None