Skip to content

[DNM] kv: add environment variable to inject retries#54695

Draft
ajwerner wants to merge 1 commit intocockroachdb:masterfrom
ajwerner:ajwerner/inject-random-restart-errors
Draft

[DNM] kv: add environment variable to inject retries#54695
ajwerner wants to merge 1 commit intocockroachdb:masterfrom
ajwerner:ajwerner/inject-random-restart-errors

Conversation

@ajwerner
Copy link
Copy Markdown
Contributor

This patch extends the testserver to inject retry errors in calls to send
batch requests at a rate indicated by an environment variable. This should
help us uncover bugs due to bad handling of transaction retries.

Release note: None

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@ajwerner ajwerner force-pushed the ajwerner/inject-random-restart-errors branch from e0a3c10 to e76fbc3 Compare September 23, 2020 14:16
@ajwerner ajwerner marked this pull request as draft September 23, 2020 14:19
@ajwerner ajwerner force-pushed the ajwerner/inject-random-restart-errors branch from e76fbc3 to 0a68a74 Compare September 23, 2020 15:37
@ajwerner ajwerner changed the title [DNM] testserver: add environment variable to inject retries [DNM] kv: add environment variable to inject retries Sep 23, 2020
ajwerner pushed a commit to ajwerner/cockroach that referenced this pull request Sep 23, 2020
…ptors

Before this patch, if a restart occurred while trying to publish descriptors,
we would short-circuit publishing the descriptors on the subsequent attempt.

The approach was validated by stressing `TestBackupRestoreResume` with the
patch in cockroachdb#54695 which previously
reproduced the failure immediately.

Release note (bug fix): Fixed a bug whereby a transaction restart at the
wrong moment during a restore could leave descriptors offline after the
restore completed successfully.
This patch extends the testserver to inject retry errors in calls to send
batch requests at a rate indicated by an environment variable. This should
help us uncover bugs due to bad handling of transaction retries.

Release note: None
@ajwerner ajwerner force-pushed the ajwerner/inject-random-restart-errors branch from 0a68a74 to a53eb00 Compare September 23, 2020 21:14
craig bot pushed a commit that referenced this pull request Sep 24, 2020
54748: backupccl: fix bug due to transaction restart while publishing descri… r=ajwerner a=ajwerner

…ptors

Before this patch, if a restart occurred while trying to publish descriptors,
we would short-circuit publishing the descriptors on the subsequent attempt.

The approach was validated by stressing `TestBackupRestoreResume` with the
patch in #54695 which previously
reproduced the failure immediately.

Release note (bug fix): Fixed a bug whereby a transaction restart at the
wrong moment during a restore could leave descriptors offline after the
restore completed successfully.

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
ajwerner pushed a commit to ajwerner/cockroach that referenced this pull request Sep 24, 2020
This bug was caught by testing with cockroachdb#54695. Before that change, it would fail
almost immediately, now it does not fail under stress. I'm open to suggestions
on how to more generally test this.

Release note (bug fix): Fixed a rare bug which can lead to index backfills
failing in the face of transaction restarts.
craig bot pushed a commit that referenced this pull request Sep 24, 2020
54261: ui: add transactions page to Admin UI r=dhartunian a=elkmaster

This commit adds the new Transactions Page to the Admin UI.

This page is much like the Statements Page but differs in that
it shows Transaction-level statistics in tabular and detail form.

Every Transaction is able to display its contained Statements
for more detailed analysis.

The page and its components are imported from the
`admin-ui-components` library.

Depends on: cockroachdb/yarn-vendored#38

Release justification: low-risk high impact addition to Admin UI

Release note (admin ui change): add Transactions and Transactions
details pages. These pages allow for viewing stats at the
transaction level.

54749: cli/demo: provision an initial non-`root` "demo" user  r=irfansharif a=knz

Fixes #54557.
Requested by @jseldess 

First 2 commits from   #54741 (can be ignored during review)

Release note (cli change): `cockroach demo` now pre-creates a `demo`
user account with a random password, instead of letting (and
encouraging) the user to use the `root` account directly.
The `demo` account is currently granted the `admin` role.

54755: sql: fix bug whereby backfiller would drop spans on txn restart r=ajwerner a=ajwerner

This bug was caught by testing with #54695. Before that change, it would fail
almost immediately, now it does not fail under stress. I'm open to suggestions
on how to more generally test this.

Release note (bug fix): Fixed a rare bug which can lead to index backfills
failing in the face of transaction restarts.

Co-authored-by: Vlad <carrott9@gmail.com>
Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
ajwerner pushed a commit to ajwerner/cockroach that referenced this pull request Sep 28, 2020
This bug was caught by testing with cockroachdb#54695. Before that change, it would fail
almost immediately, now it does not fail under stress. I'm open to suggestions
on how to more generally test this.

Release note (bug fix): Fixed a rare bug which can lead to index backfills
failing in the face of transaction restarts.
ajwerner pushed a commit to ajwerner/cockroach that referenced this pull request Sep 28, 2020
This bug was caught by testing with cockroachdb#54695. Before that change, it would fail
almost immediately, now it does not fail under stress. I'm open to suggestions
on how to more generally test this.

Release note (bug fix): Fixed a rare bug which can lead to index backfills
failing in the face of transaction restarts.
ajwerner pushed a commit to ajwerner/cockroach that referenced this pull request Sep 29, 2020
This bug was caught by testing with cockroachdb#54695. Before that change, it would fail
almost immediately, now it does not fail under stress. I'm open to suggestions
on how to more generally test this.

Release note (bug fix): Fixed a rare bug which can lead to index backfills
failing in the face of transaction restarts.
ajwerner pushed a commit to ajwerner/cockroach that referenced this pull request Sep 29, 2020
This bug was caught by testing with cockroachdb#54695. Before that change, it would fail
almost immediately, now it does not fail under stress. I'm open to suggestions
on how to more generally test this.

Release note (bug fix): Fixed a rare bug which can lead to index backfills
failing in the face of transaction restarts.
ajwerner pushed a commit to ajwerner/cockroach that referenced this pull request Sep 29, 2020
This bug was caught by testing with cockroachdb#54695. Before that change, it would fail
almost immediately, now it does not fail under stress. I'm open to suggestions
on how to more generally test this.

Release note (bug fix): Fixed a rare bug which can lead to index backfills
failing in the face of transaction restarts.
ajwerner pushed a commit to ajwerner/cockroach that referenced this pull request Sep 29, 2020
…ptors

Before this patch, if a restart occurred while trying to publish descriptors,
we would short-circuit publishing the descriptors on the subsequent attempt.

The approach was validated by stressing `TestBackupRestoreResume` with the
patch in cockroachdb#54695 which previously
reproduced the failure immediately.

Release note (bug fix): Fixed a bug whereby a transaction restart at the
wrong moment during a restore could leave descriptors offline after the
restore completed successfully.
pbardea added a commit to pbardea/cockroach that referenced this pull request Oct 19, 2020
…lishing tables

Before this patch, if a restart occurred while trying to publish
descriptors, we would short-circuit publishing the descriptors on the
subsequent attempt.

The approach was validated by stressing TestBackupRestoreResume with the
patch in cockroachdb#54695 which previously reproduced the failure immediately.

Release note (bug fix): Fixed a bug whereby a transaction restart at the
wrong moment during a restore could leave descriptors offline after the
restore completed successfully.
@pbardea
Copy link
Copy Markdown
Contributor

pbardea commented Oct 30, 2020

I've been playing around with improving bulk jobs' testing around these txn retries and am curious if there's a path forward to getting something like this merged/what can be done to help our testing in this area. I think that a setting like this is useful. Do we have any worries about introducing this setting generally?

I played around with a very early prototype of introducing these kinds of retires specifically when we create new txns in bulk jobs: #56074, but it might be nice to get the more general testing that this provides.

@tbg tbg added the X-noremind Bots won't notify about PRs with X-noremind label May 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

X-noremind Bots won't notify about PRs with X-noremind

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants