kv: support recovery from retry error with partial rollback under RC by nvb · Pull Request #105161 · cockroachdb/cockroach

nvb · 2023-06-20T01:59:48Z

This commit adds support for recovering from a retry error without starting a transaction from the beginning under Read Committed isolation.

The commit does so by first adjusting how Read Committed transactions (or any isolation level that uses a per-statement read snapshot) react to retry errors. Instead of immediately advancing their epochs, they simply adjust their read snapshot to the timestamp of the retry error. The commit then introduces a new TxnMustRestartFromBeginning method on TransactionRetryWithProtoRefreshError, to inform callers whether the transaction's writes were discarded due to the retry error, meaning that it must restart from the beginning. When this method returns false, callers are permitted to employ savepoints to selectively discard the writes from the current statement and then call a new Txn.PrepareForPartialRetry API to escape the retryable error state and continue using the transaction.

Release note: None

cockroach-teamcity · 2023-06-20T01:59:59Z

This change is

cockroach-teamcity · 2023-08-10T15:49:28Z

All committers have signed the CLA.

blathers-crl · 2023-08-10T15:49:51Z

Thank you for updating your pull request.

Before a member of our team reviews your PR, I have some potential action items for you:

We notice you have more than one commit in your PR. We try break logical changes into separate commits, but commits such as "fix typo" or "address review commits" should be squashed into one commit and pushed with --force
Please ensure your git commit message contains a release note.
When CI has completed, please ensure no errors have appeared.

I have added a few people who may be able to assist in reviewing:

@nvanbenschoten (author of kv: support savepoint rollback recovery from retry error under RC #104233)

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.}

rafiss

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)

pkg/kv/sender.go line 183 at r7 (raw file):

	// error that will cause the transaction to be retried.
	//
	// The transaction's epoch is bumped and its timestamp is upgraded to the

is it still true that the epoch is bumped with this interface if this is used in a RC transaction? the test i added in #107044 doesn't work right now since it seems that the epoch stays at 0.

nvb

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @rafiss)

pkg/kv/sender.go line 183 at r7 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

is it still true that the epoch is bumped with this interface if this is used in a RC transaction? the test i added in #107044 doesn't work right now since it seems that the epoch stays at 0.

No, this is no longer true, now that we've made GenerateForcedRetryableErr generate a statement-level retryable error for RC transactions. An epoch bump implies that all of the writes in a transaction have been discarded and that the transaction must restart from the very beginning. Transactions can't retry a single statement to get around these.

rafiss

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)

pkg/kv/sender.go line 183 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

No, this is no longer true, now that we've made GenerateForcedRetryableErr generate a statement-level retryable error for RC transactions. An epoch bump implies that all of the writes in a transaction have been discarded and that the transaction must restart from the very beginning. Transactions can't retry a single statement to get around these.

ah ok, thanks, that's helpful for me to understand.

since inject_retry_errors_enabled is just a testing knob, i can do something else in the SQL layer to limit it to 3 injections at most for RC txns

nvb · 2023-08-11T22:27:55Z

@arulajmani this should now be good for a review. I've rebased away all of the intermediate cleanup that's already been landed on master, and added a few layers of testing to ensure that the savepoint+retry error dance works correctly. Let me know if there's anything you'd like to talk about.

@rafiss feel free to rebase #107044 on this. There shouldn't be much churn going forward.

arulajmani · 2023-08-12T22:56:44Z

@arulajmani this should now be good for a review. I've rebased away all of the intermediate cleanup that's already been landed on master, and added a few layers of testing to ensure that the savepoint+retry error dance works correctly. Let me know if there's anything you'd like to talk about.

I'll give this a look Monday morning and maybe bother you with questions if I have any.

pkg/sql/conn_executor_test.go

pkg/kv/txn_external_test.go

arulajmani

Thanks for walking me through this, made reviewing this a breeze!

Reviewed 1 of 8 files at r6, 18 of 18 files at r10, 17 of 17 files at r11, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten and @rafiss)

pkg/kv/txn.go line 1046 at r11 (raw file):

	if !retryErr.TxnMustRestartFromBeginning() {
		// If the retry error does not require the transaction to restart from

Should we throw in a verbose log in this case?

pkg/kv/kvclient/kvcoord/txn_coord_sender.go line 868 at r11 (raw file):

	tc.mu.storedRetryableErr = retErr

	// If the ID changed, it means we had to start a new transaction and the

nit: If the previous transaction is aborted, it means ...

pkg/kv/kvpb/errors.proto line 479 at r11 (raw file):

  // The ID of the transaction being retried. The client can compare this ID
  // that of the new transaction (below) to check whether the transaction was
  // aborted and will restart from the beginning with a new identity.

This commentary cleanup feels long overdue. Thanks for doing it!

Did we want to preserve some of the old commentary around making sure errors from inner transactions are handled correctly when running nested transactions? On one hand this is pretty niche, but it's also subtle usage we need to guard against. Up to you.

pkg/kv/kvpb/errors.proto line 504 at r11 (raw file):

  // - the same Transaction as before (same ID) with the same epoch as before,
  // but with an incremented read timestamp. The transaction can proceed without
  // the loss of writes, although clients may employ savepoints to selectively

nit: "without the loss of any writes"

pkg/kv/kvnemesis/validator_test.go line 38 at r11 (raw file):

var retryableError = kvpb.NewTransactionRetryWithProtoRefreshError(
	``, uuid.MakeV4(), 0, roachpb.Transaction{})

nit: /* prevTxnEpoch */

pkg/kv/kvpb/data_test.go line 240 at r11 (raw file):

		},
	}
	isolation.RunEachLevel(t, func(t *testing.T, isoLevel isolation.Level) {

I may be missing something, but this isoLevel field looks unused. Did you mean to create the txn inside of this loop and inject it into the test cases?

pkg/kv/kvserver/reports/reporter_test.go line 465 at r11 (raw file):

		var err error
		err = kvpb.NewTransactionRetryWithProtoRefreshError(
			"injected err", uuid.Nil, 0, roachpb.Transaction{})

nit: /* prevTxnEpoch */

…ransaction Also rename `TxnID` to `PrevTxnID`. Pure renames.

Fixes cockroachdb#104233. This commit adds support for recovering from a retry error without starting a transaction from the beginning under Read Committed isolation. The commit does so by first adjusting how Read Committed transactions (or any isolation level that uses a per-statement read snapshot) react to retry errors. Instead of immediately advancing their epochs, they simply adjust their read snapshot to the timestamp of the retry error. The commit then introduces a new `TxnMustRestartFromBeginning` method on `TransactionRetryWithProtoRefreshError`, to inform callers whether the transaction's writes were discarded due to the retry error, meaning that it must restart from the beginning. When this method returns false, callers are permitted to employ savepoints to selectively discard the writes from the current statement and then call a new `Txn.PrepareForPartialRetry` API to escape the retryable error state and continue using the transaction. Release note: None

nvb

TFTR!

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani and @rafiss)

pkg/kv/txn.go line 1046 at r11 (raw file):