Skip to content

kv: split EndTxn into sub-batch on auto-retry after successful refresh#52885

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
nvb:nvanbenschoten/refreshSplitEndTxn
Aug 21, 2020
Merged

kv: split EndTxn into sub-batch on auto-retry after successful refresh#52885
craig[bot] merged 1 commit intocockroachdb:masterfrom
nvb:nvanbenschoten/refreshSplitEndTxn

Conversation

@nvb
Copy link
Copy Markdown
Contributor

@nvb nvb commented Aug 17, 2020

Fixes #51294.

This commit updates the txnSpanRefresher to split off EndTxn requests into their own partial batches on auto-retries after successful refreshes as a means of preventing starvation. This avoids starvation in two ways. First, it helps ensure that we lay down intents if any of the other requests in the batch are writes. Second, it ensures that if any writes are getting pushed due to contention with reads or due to the closed timestamp, they will still succeed and allow the batch to make forward progress. Without this, each retry attempt may get pushed because of writes in the batch and then rejected wholesale when the EndTxn tries to evaluate the pushed batch. When split, the writes will be pushed but succeed, the transaction will be refreshed, and the EndTxn will succeed.

I still need to confirm that this fixes this indefinite stall here, but I suspect that it will.

Release note (bug fix): A change in v20.1 caused a certain class of bulk UPDATEs and DELETE statements to hang indefinitely if run in an implicit transaction. We now break up these statements to avoid starvation and prevent them from hanging indefinitely.

@nvb nvb requested a review from andreimatei August 17, 2020 00:25
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@nvb nvb force-pushed the nvanbenschoten/refreshSplitEndTxn branch from 3c300ae to c373a77 Compare August 17, 2020 01:38
@nvb
Copy link
Copy Markdown
Contributor Author

nvb commented Aug 17, 2020

I still need to confirm that this fixes this indefinite stall here, but I suspect that it will.

I just confirmed that this resolves the hanging implicit txn in https://github.com/cockroachlabs/misc_projects_glenn/tree/master/rw_blockage#implicit-query-hangs--explict-query-works. Without this change, the query hangs indefinitely and I see txn.refresh.success grow without bound. I then kill the query (at 02:04:20), upgrade the binary to include this change, and try again. The implicit txn then completes in 8 seconds.

Screen Shot 2020-08-16 at 10 08 08 PM

cc. @glennfawcett

Copy link
Copy Markdown
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm_strong:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @nvanbenschoten)


pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go, line 380 at r3 (raw file):

	}

	// We've refreshed all of the read spans successfully and bumped

move this comment up


pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go, line 405 at r3 (raw file):

	// Issue a batch up to but not including the EndTxn request.
	etIdx := len(ba.Requests) - 1

put a Eventf around here talking about the split pls

@nvb nvb force-pushed the nvanbenschoten/refreshSplitEndTxn branch from c373a77 to 37b5dd2 Compare August 21, 2020 01:50
Copy link
Copy Markdown
Contributor Author

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei)


pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go, line 380 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

move this comment up

Done.


pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go, line 405 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

put a Eventf around here talking about the split pls

Good idea. Done.

Fixes cockroachdb#51294.
First two commits from cockroachdb#52884.

This commit updates the txnSpanRefresher to split off EndTxn requests
into their own partial batches on auto-retries after successful
refreshes as a means of preventing starvation. This avoids starvation in
two ways. First, it helps ensure that we lay down intents if any of the
other requests in the batch are writes. Second, it ensures that if any
writes are getting pushed due to contention with reads or due to the
closed timestamp, they will still succeed and allow the batch to make
forward progress. Without this, each retry attempt may get pushed
because of writes in the batch and then rejected wholesale when the
EndTxn tries to evaluate the pushed batch. When split, the writes will
be pushed but succeed, the transaction will be refreshed, and the EndTxn
will succeed.

I still need to confirm that this fixes this indefinite stall
[here](https://github.com/cockroachlabs/misc_projects_glenn/tree/master/rw_blockage#implicit-query-hangs--explict-query-works),
but I suspect that it will.

Release note (bug fix): A change in v20.1 caused a certain class of bulk
UPDATEs and DELETE statements to hang indefinitely if run in an implicit
transaction. We now break up these statements to avoid starvation and
prevent them from hanging indefinitely.
@nvb nvb force-pushed the nvanbenschoten/refreshSplitEndTxn branch from 37b5dd2 to cc233c5 Compare August 21, 2020 04:47
@nvb
Copy link
Copy Markdown
Contributor Author

nvb commented Aug 21, 2020

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Aug 21, 2020

Build succeeded:

@craig craig bot merged commit 95a13a8 into cockroachdb:master Aug 21, 2020
@nvb nvb deleted the nvanbenschoten/refreshSplitEndTxn branch August 22, 2020 03:24
nvb added a commit to nvb/cockroach that referenced this pull request Aug 22, 2020
Fixes cockroachdb#53249.

The race was caused by the reintroduction of DeprecatedCanCommitAtHigherTimestamp
in cockroachdb#53220, mixed with the new splitEndTxnAndRetrySend logic introduced in cockroachdb#52885.
craig bot pushed a commit that referenced this pull request Aug 22, 2020
53275: kv: avoid race in txnSpanRefresher.SendLocked r=nvanbenschoten a=nvanbenschoten

Fixes #53249.

The race was caused by the reintroduction of DeprecatedCanCommitAtHigherTimestamp
in #53220, mixed with the new splitEndTxnAndRetrySend logic introduced in #52885.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

kv: closed timestamp can starve read-write txns that take longer than 600ms

3 participants