server: fix timeouts in multi-store tests#112684
server: fix timeouts in multi-store tests#112684craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
I don't yet have the complete story here, but in cockroachdb#112385 we changed this from looking just at initialized engines to all engines. In the case of multi-node, multi-engine tests, this occasionally results in the test timing out with liveness problems. What appears to happen is that in the cases where we fail, have their second and third stores at a lower cluster version resulting in the initial state of both the second and third nodes having a lower cluster version. Additional stores for non-bootstrap node are initialized asyncronously during start up. There may be a better fix here, but I need to become better acquainted with the code before making more serious changes here. Fixes cockroachdb#112676 Release note: None
|
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
RaduBerinde
left a comment
There was a problem hiding this comment.
Nice find, thank you!!
I made this change because it seemed it can't hurt - we should have set the version on all the stores in the beginning of bootstrapCluster but evidently I missed something.
RaduBerinde
left a comment
There was a problem hiding this comment.
Nice find, thank you!!
I made this change because it seemed it can't hurt - we should have set the version on all the stores in the beginning of bootstrapCluster but evidently I missed something.
Ah I think there are other uses of this function that I didn't consider.
|
bors r=RaduBerinde |
|
Build failed (retrying...): |
|
bors r+ |
|
Already running a review |
|
Build succeeded: |
I don't yet have the complete story here, but in #112385 we changed this from looking just at initialized engines to all engines.
In the case of multi-node, multi-engine tests, this occasionally results in the test timing out with liveness problems.
What appears to happen is that in the cases where we fail, nodes have their second and third stores at a lower cluster version resulting in the initial state of both the second and third nodes having a lower cluster version. Additional stores for non-bootstrap node are initialized asyncronously during start up.
There may be a better fix here, but I need to become better acquainted with the code before making more serious changes here.
Fixes #112658
Fixes #112676
Release note: None