Skip to content

roachtest: don't reuse cluster name on creation retry#68103

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
tbg:roachtest-improve-retry
Jul 28, 2021
Merged

roachtest: don't reuse cluster name on creation retry#68103
craig[bot] merged 1 commit intocockroachdb:masterfrom
tbg:roachtest-improve-retry

Conversation

@tbg
Copy link
Copy Markdown
Member

@tbg tbg commented Jul 27, 2021

roachtest retries cluster creation up to three times. Previously, it
was reusing the name across retries. Now it doesn't. This is simpler to
reason about and less buggy.

While I wrote up this commit in response to a failure, I realized the
problem was likely in code I removed in this commit, namely the
assumption that if roachprod returns "cluster already exists" that the
cluster is good enough to use for the test. In the linked issue,
creation failed half-way through, but the cluster did exist - just with
two instead of four nodes. (The creation faile due to AWS quota exceeded
problems, which dev-inf had promptly investigated and fixed).

It still makes sense to avoid that class of problem entirely by
not reusing the name in the first place.

Closes #67906.
Explains #67753 (comment).

Release note: None

`roachtest` retries cluster creation up to three times. Previously, it
was reusing the name across retries. Now it doesn't. This is simpler to
reason about and less buggy.

While I wrote up this commit in response to [a failure], I realized the
problem was likely in code I removed in this commit, namely the
assumption that if roachprod returns "cluster already exists" that the
cluster is good enough to use for the test. In the linked issue,
creation failed half-way through, but the cluster did exist - just with
two instead of four nodes. (The creation faile due to AWS quota exceeded
problems, which dev-inf had promptly investigated and fixed).

It still makes sense to avoid that class of problem entirely by
not reusing the name in the first place.

Closes cockroachdb#67906.
Explains cockroachdb#67753 (comment).

[a failure]: cockroachdb#67906 (comment)

Release note: None
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@tbg tbg requested review from a team and otan and removed request for a team July 27, 2021 13:02
@tbg
Copy link
Copy Markdown
Member Author

tbg commented Jul 28, 2021

bors r=otan

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jul 28, 2021

Build succeeded:

@craig craig bot merged commit cdace50 into cockroachdb:master Jul 28, 2021
tbg added a commit to tbg/cockroach that referenced this pull request Jul 28, 2021
As of cockroachdb#68103, existing problems in the cluster creation retry logic
are tickled reliably: on retry we are releasing an alloc twice,
resulting in a panic. Unfortunately, the alloc is acquired several
layers above the retry, so for now the best I can do is to avoid
releasing the alloc if we are bound for a retry.

The reason the above PR tickles it more reliably is because it
accidentally removed a sensible failfast: when the cluster already
exists, we shouldn't destroy & try to recreate it again (which
would then explode on the double-release anyway).

Release note: None
craig bot pushed a commit that referenced this pull request Jul 29, 2021
68156: roachtest: fix replicagc-changed-peers (again) r=erikgrinaker a=tbg

I wrote buggy code. Since it isn't hit in every invocation, it slipped
through my local testing.

Fixes #68155
Fixes #68162

Release note: None


68157: CODEOWNERS: own catalog package to obs-inf-prs r=erikgrinaker a=tbg

Noticed in #67866.

Release note: None


68203: roachtest: avoid releasing alloc multiple times r=erikgrinaker a=tbg

As of #68103, existing problems in the cluster creation retry logic
are tickled reliably: on retry we are releasing an alloc twice,
resulting in a panic. Unfortunately, the alloc is acquired several
layers above the retry, so for now the best I can do is to avoid
releasing the alloc if we are bound for a retry.

The reason the above PR tickles it more reliably is because it
accidentally removed a sensible failfast: when the cluster already
exists, we shouldn't destroy & try to recreate it again (which
would then explode on the double-release anyway).

Release note: None


68224: roachtest: fix inconsistency r=erikgrinaker a=tbg

The way that test was passing the args no longer works.

Fixes #64806.

Release note: None


68226: roachtest: allow local runs with roachstress r=erikgrinaker a=tbg

It's appealing to use roachstress also as a tool to simply run
a roachtest, since it avoids having to think much about the flags
and incantation.

This commit adds a prompt to roachstress about running in local
mode. If selected, the binaries are built to target the local
architecture, and the builder is not invoked.

Release note: None


68227: roachtest: skip acceptance/gossip/restart-node-one r=erikgrinaker a=tbg

Touches #68107.

Release note: None


Co-authored-by: Tobias Grieger <tobias.schottdorf@gmail.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

roachtest: kv0/enc=false/nodes=3/batch=16 failed

3 participants