kv: split EndTxn into sub-batch on auto-retry after successful refresh#52885
Conversation
3c300ae to
c373a77
Compare
I just confirmed that this resolves the hanging implicit txn in https://github.com/cockroachlabs/misc_projects_glenn/tree/master/rw_blockage#implicit-query-hangs--explict-query-works. Without this change, the query hangs indefinitely and I see cc. @glennfawcett |
andreimatei
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @nvanbenschoten)
pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go, line 380 at r3 (raw file):
} // We've refreshed all of the read spans successfully and bumped
move this comment up
pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go, line 405 at r3 (raw file):
// Issue a batch up to but not including the EndTxn request. etIdx := len(ba.Requests) - 1
put a Eventf around here talking about the split pls
c373a77 to
37b5dd2
Compare
nvb
left a comment
There was a problem hiding this comment.
TFTR!
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @andreimatei)
pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go, line 380 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
move this comment up
Done.
pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go, line 405 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
put a Eventf around here talking about the split pls
Good idea. Done.
Fixes cockroachdb#51294. First two commits from cockroachdb#52884. This commit updates the txnSpanRefresher to split off EndTxn requests into their own partial batches on auto-retries after successful refreshes as a means of preventing starvation. This avoids starvation in two ways. First, it helps ensure that we lay down intents if any of the other requests in the batch are writes. Second, it ensures that if any writes are getting pushed due to contention with reads or due to the closed timestamp, they will still succeed and allow the batch to make forward progress. Without this, each retry attempt may get pushed because of writes in the batch and then rejected wholesale when the EndTxn tries to evaluate the pushed batch. When split, the writes will be pushed but succeed, the transaction will be refreshed, and the EndTxn will succeed. I still need to confirm that this fixes this indefinite stall [here](https://github.com/cockroachlabs/misc_projects_glenn/tree/master/rw_blockage#implicit-query-hangs--explict-query-works), but I suspect that it will. Release note (bug fix): A change in v20.1 caused a certain class of bulk UPDATEs and DELETE statements to hang indefinitely if run in an implicit transaction. We now break up these statements to avoid starvation and prevent them from hanging indefinitely.
37b5dd2 to
cc233c5
Compare
|
bors r+ |
|
Build succeeded: |
Fixes cockroachdb#53249. The race was caused by the reintroduction of DeprecatedCanCommitAtHigherTimestamp in cockroachdb#53220, mixed with the new splitEndTxnAndRetrySend logic introduced in cockroachdb#52885.
53275: kv: avoid race in txnSpanRefresher.SendLocked r=nvanbenschoten a=nvanbenschoten Fixes #53249. The race was caused by the reintroduction of DeprecatedCanCommitAtHigherTimestamp in #53220, mixed with the new splitEndTxnAndRetrySend logic introduced in #52885. Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>

Fixes #51294.
This commit updates the txnSpanRefresher to split off EndTxn requests into their own partial batches on auto-retries after successful refreshes as a means of preventing starvation. This avoids starvation in two ways. First, it helps ensure that we lay down intents if any of the other requests in the batch are writes. Second, it ensures that if any writes are getting pushed due to contention with reads or due to the closed timestamp, they will still succeed and allow the batch to make forward progress. Without this, each retry attempt may get pushed because of writes in the batch and then rejected wholesale when the EndTxn tries to evaluate the pushed batch. When split, the writes will be pushed but succeed, the transaction will be refreshed, and the EndTxn will succeed.
I still need to confirm that this fixes this indefinite stall here, but I suspect that it will.
Release note (bug fix): A change in v20.1 caused a certain class of bulk UPDATEs and DELETE statements to hang indefinitely if run in an implicit transaction. We now break up these statements to avoid starvation and prevent them from hanging indefinitely.