Skip to content

roachtest: deflake bank/{node-restart,cluster-recovery}#40997

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
irfansharif:node-restart
Sep 24, 2019
Merged

roachtest: deflake bank/{node-restart,cluster-recovery}#40997
craig[bot] merged 1 commit intocockroachdb:masterfrom
irfansharif:node-restart

Conversation

@irfansharif
Copy link
Copy Markdown
Contributor

Fixes #38785.
Fixes #35326.

Because everything roachprod does, it does through SSH, we're
particularly susceptible to network delays, packet drops, etc. We've
seen this before, or at least pointed to this being a problem before,
over at #37001. Setting timeouts around our calls to roachprod helps to
better surface these kind of errors.

The underlying issue in #38785 and in #35326 is the fact that we're
running roachprod commands that may (reasonably) fail due to connection
issues, and we're unable to retry them safely (the underlying commands
are non-idempotent). Presently we simply fail the entire test, when
really we should be able to retry the commands. This is left unaddressed.

Release justification: Category 1: Non-production code changes
Release note: None

@irfansharif irfansharif requested a review from nvb September 23, 2019 21:10
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@irfansharif irfansharif force-pushed the node-restart branch 2 times, most recently from 914d7e1 to 41a71ae Compare September 24, 2019 15:05
Fixes cockroachdb#38785.
Fixes cockroachdb#35326.

Because everything roachprod does, it does through SSH, we're
particularly susceptible to network delays, packet drops, etc. We've
seen this before, or at least pointed to this being a problem before,
over at cockroachdb#37001. Setting timeouts around our calls to roachprod helps to
better surface these kind of errors.

The underlying issue in cockroachdb#38785 and in cockroachdb#35326 is the fact that we're
running roachprod commands that may (reasonably) fail due to connection
issues, and we're unable to retry them safely (the underlying commands
are non-idempotent). Presently we simply fail the entire test, when
really we should be able to retry the commands. This is left unaddressed.

Release justification: Category 1: Non-production code changes
Release note: None
@irfansharif
Copy link
Copy Markdown
Contributor Author

bors r+

craig bot pushed a commit that referenced this pull request Sep 24, 2019
40997: roachtest: deflake bank/{node-restart,cluster-recovery} r=irfansharif a=irfansharif

Fixes #38785.
Fixes #35326.

Because everything roachprod does, it does through SSH, we're
particularly susceptible to network delays, packet drops, etc. We've
seen this before, or at least pointed to this being a problem before,
over at #37001. Setting timeouts around our calls to roachprod helps to
better surface these kind of errors.

The underlying issue in #38785 and in #35326 is the fact that we're
running roachprod commands that may (reasonably) fail due to connection
issues, and we're unable to retry them safely (the underlying commands
are non-idempotent). Presently we simply fail the entire test, when
really we should be able to retry the commands. This is left unaddressed.

Release justification: Category 1: Non-production code changes
Release note: None

41029: cli: fix the demo licensing code r=rohany a=knz

Fixes #40734.
Fixes #41024.

Release justification: fixes a flaky test, fixes UX of main new feature

Before this patch, there were multiple problems with the code:

- if the license acquisition was disabled by the env var config,
  the error message would not be clear.
- the licensing code would deadlock silently on OSS-only
  builds (because the license failure channel was not written in that
  control branch).
- the error/warning messages would be interleaved on the same line as
  the input line (missing newline at start of message).
- the test code would fail when the license server is not available.
- the set up of the example database and workload would be performed
  asynchronously, with unclear signalling of when the user
  can expect to use them interactively.

After this patch:
- it's possible to override the license acquisition URL with
  COCKROACH_DEMO_LICENSE_URL, this is used in tests.
- setting up the example database, partitioning and workload is done
  before presenting the interactive prompt.
- partitioning the example database, if requested by
  --geo-partitioned-replicas, waits for license acquisition to
  complete (license acquisition remains asynchronous otherwise).
- impossible configurations are reported early(earlier).

For example:

- OSS-only builds:

```
kena@kenax ~/cockroach % ./cockroach demo --geo-partitioned-replicas
*
* ERROR: enterprise features are required for this demo, cannot run from OSS-only binary
*
Failed running "demo"
```

For license acquisition failures:

```
kena@kenax ~/cockroach % ./cockroach demo --geo-partitioned-replicas
error while contacting licensing server:
Get https://192.168.2.170/api/license?clusterid=5548b310-14b7-46de-8c92-30605bfe95c4&kind=demo&version=v19.2: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
*
* ERROR: license acquisition was unsuccessful.
* Note: enterprise features are needed for --geo-partitioned-replicas.
*
Failed running "demo"
```

Additionally, this change fixes test flakiness that arises from an
unavailable license server.

Release note (cli change): To enable uses of `cockroach demo` with
enterprise features in firewalled network environments, it is now
possible to redirect the license acquisition with the environment
variable COCKROACH_DEMO_LICENSE_URL to a replacement server (for
example a suitably configured HTTP proxy).

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
@craig
Copy link
Copy Markdown
Contributor

craig bot commented Sep 24, 2019

Build succeeded

@craig craig bot merged commit 5296a27 into cockroachdb:master Sep 24, 2019
@irfansharif irfansharif deleted the node-restart branch September 24, 2019 18:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

roachtest: acceptance/bank/cluster-recovery failed roachtest: acceptance/bank/node-restart failed

3 participants