roachtest: don't reuse cluster name on creation retry#68103
Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom Jul 28, 2021
Merged
roachtest: don't reuse cluster name on creation retry#68103craig[bot] merged 1 commit intocockroachdb:masterfrom
craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
`roachtest` retries cluster creation up to three times. Previously, it was reusing the name across retries. Now it doesn't. This is simpler to reason about and less buggy. While I wrote up this commit in response to [a failure], I realized the problem was likely in code I removed in this commit, namely the assumption that if roachprod returns "cluster already exists" that the cluster is good enough to use for the test. In the linked issue, creation failed half-way through, but the cluster did exist - just with two instead of four nodes. (The creation faile due to AWS quota exceeded problems, which dev-inf had promptly investigated and fixed). It still makes sense to avoid that class of problem entirely by not reusing the name in the first place. Closes cockroachdb#67906. Explains cockroachdb#67753 (comment). [a failure]: cockroachdb#67906 (comment) Release note: None
Member
otan
approved these changes
Jul 27, 2021
Member
Author
|
bors r=otan |
Contributor
|
Build succeeded: |
tbg
added a commit
to tbg/cockroach
that referenced
this pull request
Jul 28, 2021
As of cockroachdb#68103, existing problems in the cluster creation retry logic are tickled reliably: on retry we are releasing an alloc twice, resulting in a panic. Unfortunately, the alloc is acquired several layers above the retry, so for now the best I can do is to avoid releasing the alloc if we are bound for a retry. The reason the above PR tickles it more reliably is because it accidentally removed a sensible failfast: when the cluster already exists, we shouldn't destroy & try to recreate it again (which would then explode on the double-release anyway). Release note: None
craig bot
pushed a commit
that referenced
this pull request
Jul 29, 2021
68156: roachtest: fix replicagc-changed-peers (again) r=erikgrinaker a=tbg I wrote buggy code. Since it isn't hit in every invocation, it slipped through my local testing. Fixes #68155 Fixes #68162 Release note: None 68157: CODEOWNERS: own catalog package to obs-inf-prs r=erikgrinaker a=tbg Noticed in #67866. Release note: None 68203: roachtest: avoid releasing alloc multiple times r=erikgrinaker a=tbg As of #68103, existing problems in the cluster creation retry logic are tickled reliably: on retry we are releasing an alloc twice, resulting in a panic. Unfortunately, the alloc is acquired several layers above the retry, so for now the best I can do is to avoid releasing the alloc if we are bound for a retry. The reason the above PR tickles it more reliably is because it accidentally removed a sensible failfast: when the cluster already exists, we shouldn't destroy & try to recreate it again (which would then explode on the double-release anyway). Release note: None 68224: roachtest: fix inconsistency r=erikgrinaker a=tbg The way that test was passing the args no longer works. Fixes #64806. Release note: None 68226: roachtest: allow local runs with roachstress r=erikgrinaker a=tbg It's appealing to use roachstress also as a tool to simply run a roachtest, since it avoids having to think much about the flags and incantation. This commit adds a prompt to roachstress about running in local mode. If selected, the binaries are built to target the local architecture, and the builder is not invoked. Release note: None 68227: roachtest: skip acceptance/gossip/restart-node-one r=erikgrinaker a=tbg Touches #68107. Release note: None Co-authored-by: Tobias Grieger <tobias.schottdorf@gmail.com> Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
roachtestretries cluster creation up to three times. Previously, itwas reusing the name across retries. Now it doesn't. This is simpler to
reason about and less buggy.
While I wrote up this commit in response to a failure, I realized the
problem was likely in code I removed in this commit, namely the
assumption that if roachprod returns "cluster already exists" that the
cluster is good enough to use for the test. In the linked issue,
creation failed half-way through, but the cluster did exist - just with
two instead of four nodes. (The creation faile due to AWS quota exceeded
problems, which dev-inf had promptly investigated and fixed).
It still makes sense to avoid that class of problem entirely by
not reusing the name in the first place.
Closes #67906.
Explains #67753 (comment).
Release note: None