roachtest: deflake bank/{node-restart,cluster-recovery}#40997
Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom Sep 24, 2019
Merged
roachtest: deflake bank/{node-restart,cluster-recovery}#40997craig[bot] merged 1 commit intocockroachdb:masterfrom
craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
Member
nvb
approved these changes
Sep 23, 2019
914d7e1 to
41a71ae
Compare
Fixes cockroachdb#38785. Fixes cockroachdb#35326. Because everything roachprod does, it does through SSH, we're particularly susceptible to network delays, packet drops, etc. We've seen this before, or at least pointed to this being a problem before, over at cockroachdb#37001. Setting timeouts around our calls to roachprod helps to better surface these kind of errors. The underlying issue in cockroachdb#38785 and in cockroachdb#35326 is the fact that we're running roachprod commands that may (reasonably) fail due to connection issues, and we're unable to retry them safely (the underlying commands are non-idempotent). Presently we simply fail the entire test, when really we should be able to retry the commands. This is left unaddressed. Release justification: Category 1: Non-production code changes Release note: None
41a71ae to
5296a27
Compare
Contributor
Author
|
bors r+ |
craig bot
pushed a commit
that referenced
this pull request
Sep 24, 2019
40997: roachtest: deflake bank/{node-restart,cluster-recovery} r=irfansharif a=irfansharif
Fixes #38785.
Fixes #35326.
Because everything roachprod does, it does through SSH, we're
particularly susceptible to network delays, packet drops, etc. We've
seen this before, or at least pointed to this being a problem before,
over at #37001. Setting timeouts around our calls to roachprod helps to
better surface these kind of errors.
The underlying issue in #38785 and in #35326 is the fact that we're
running roachprod commands that may (reasonably) fail due to connection
issues, and we're unable to retry them safely (the underlying commands
are non-idempotent). Presently we simply fail the entire test, when
really we should be able to retry the commands. This is left unaddressed.
Release justification: Category 1: Non-production code changes
Release note: None
41029: cli: fix the demo licensing code r=rohany a=knz
Fixes #40734.
Fixes #41024.
Release justification: fixes a flaky test, fixes UX of main new feature
Before this patch, there were multiple problems with the code:
- if the license acquisition was disabled by the env var config,
the error message would not be clear.
- the licensing code would deadlock silently on OSS-only
builds (because the license failure channel was not written in that
control branch).
- the error/warning messages would be interleaved on the same line as
the input line (missing newline at start of message).
- the test code would fail when the license server is not available.
- the set up of the example database and workload would be performed
asynchronously, with unclear signalling of when the user
can expect to use them interactively.
After this patch:
- it's possible to override the license acquisition URL with
COCKROACH_DEMO_LICENSE_URL, this is used in tests.
- setting up the example database, partitioning and workload is done
before presenting the interactive prompt.
- partitioning the example database, if requested by
--geo-partitioned-replicas, waits for license acquisition to
complete (license acquisition remains asynchronous otherwise).
- impossible configurations are reported early(earlier).
For example:
- OSS-only builds:
```
kena@kenax ~/cockroach % ./cockroach demo --geo-partitioned-replicas
*
* ERROR: enterprise features are required for this demo, cannot run from OSS-only binary
*
Failed running "demo"
```
For license acquisition failures:
```
kena@kenax ~/cockroach % ./cockroach demo --geo-partitioned-replicas
error while contacting licensing server:
Get https://192.168.2.170/api/license?clusterid=5548b310-14b7-46de-8c92-30605bfe95c4&kind=demo&version=v19.2: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
*
* ERROR: license acquisition was unsuccessful.
* Note: enterprise features are needed for --geo-partitioned-replicas.
*
Failed running "demo"
```
Additionally, this change fixes test flakiness that arises from an
unavailable license server.
Release note (cli change): To enable uses of `cockroach demo` with
enterprise features in firewalled network environments, it is now
possible to redirect the license acquisition with the environment
variable COCKROACH_DEMO_LICENSE_URL to a replacement server (for
example a suitably configured HTTP proxy).
Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Contributor
Build succeeded |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #38785.
Fixes #35326.
Because everything roachprod does, it does through SSH, we're
particularly susceptible to network delays, packet drops, etc. We've
seen this before, or at least pointed to this being a problem before,
over at #37001. Setting timeouts around our calls to roachprod helps to
better surface these kind of errors.
The underlying issue in #38785 and in #35326 is the fact that we're
running roachprod commands that may (reasonably) fail due to connection
issues, and we're unable to retry them safely (the underlying commands
are non-idempotent). Presently we simply fail the entire test, when
really we should be able to retry the commands. This is left unaddressed.
Release justification: Category 1: Non-production code changes
Release note: None