sql: rollback transaction eagerly upon error by rafiss · Pull Request #140160 · cockroachdb/cockroach

rafiss · 2025-01-30T22:32:36Z

sql: rollback transaction eagerly upon error

If we have not used any savepoints, it's safe to rollback the
transaction immediately when a non-retryable error is encountered.

This releases locks sooner, which is useful so we avoid needing to wait
until the client sends a ROLLBACK statement.

Release note (bug fix): Transactions that enter the aborted state will
now release locks they are holding immediately, as long as there is no
SAVEPOINT active in the transaction.

sql: make transaction_timeout abort transaction eagerly

Previously, the transaction_timeout setting would only cause the
transaction to abort if the timeout had already expired by the time the
next statement is received by CRDB. Now we create a timer that fires if
the timeout expires while we are waiting for the next statement.

There's no release note, since up until the previous commit, which
changed the behavior so transactions eagerly release locks, there was no
functional difference caused by lazily timing out the transaction.

Release note: None

sql: add test for implicit txn holding locks across retries

Release note: None

fixes #140234

blathers-crl · 2025-01-30T22:32:40Z

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.}

cockroach-teamcity · 2025-01-30T22:32:48Z

This change is

michae2

Looks pretty good!

Some test ideas:

something with cockroach_restart
something with pausable portals
something with implicit txn retry

Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @rafiss)

pkg/sql/conn_executor_savepoints.go line 34 at r1 (raw file):

	ctx context.Context, s *tree.Savepoint, res RestrictedCommandResult,
) (fsm.Event, fsm.EventPayload, error) {
	ex.state.mu.hasSavepoints = true

Would we want to set hasSavepoints to false somewhere in execRelease to allow the eager rollback of a transaction that fails after all created savepoints have been released?

rafiss

something with cockroach_restart

I added a test with savepoints. I don't think cockroach_restart needs any other special testing, but I could be missing something, so just let me know what else you had in mind.

something with pausable portals

I was thinking this would be covered by our metamorphic testing. But I can add a test for a specific scenario if you can think of one.

something with implicit txn retry

Added this test.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @michae2)

pkg/sql/conn_executor_savepoints.go line 34 at r1 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Would we want to set hasSavepoints to false somewhere in execRelease to allow the eager rollback of a transaction that fails after all created savepoints have been released?

done. good point.

michae2

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mgartner)

pkg/sql/conn_fsm.go line 418 at r2 (raw file):

					ts.mu.Lock()
					defer ts.mu.Unlock()
					if !ts.mu.hasSavepoints {

Something occurred to me: if this is a nested connExecutor created by InternalExecutor.initConnEx, sharing a transaction with its parent connExecutor, then will it know about the savepoints in its parent connExecutor?

Maybe we also need to check that this is a top-level connExecutor? Or we need to add savepoint info to InternalExecutor.extraTxnState?

rafiss

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @michae2)

pkg/sql/conn_fsm.go line 418 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Something occurred to me: if this is a nested connExecutor created by InternalExecutor.initConnEx, sharing a transaction with its parent connExecutor, then will it know about the savepoints in its parent connExecutor?

Maybe we also need to check that this is a top-level connExecutor? Or we need to add savepoint info to InternalExecutor.extraTxnState?

When we're running an internal executor delegated from a parent transaction, the connExecutor uses a different state machine. It's further down below:

cockroach/pkg/sql/conn_fsm.go

Line 704 in 5a8f03f

var BoundTxnStateTransitions = fsm.Compile(fsm.Pattern{

The transaction is never rolled back in those state transitions.

michae2

Reviewed 7 of 7 files at r2, 3 of 3 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mgartner)

pkg/sql/conn_fsm.go line 418 at r2 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

When we're running an internal executor delegated from a parent transaction, the connExecutor uses a different state machine. It's further down below:

cockroach/pkg/sql/conn_fsm.go

Line 704 in 5a8f03f

var BoundTxnStateTransitions = fsm.Compile(fsm.Pattern{

The transaction is never rolled back in those state transitions.

Great, thank you!

pkg/sql/conn_executor_test.go line 1905 at r2 (raw file):

		err = conn1.QueryRowContext(ctx, `SELECT v FROM t WHERE k = 4`).Scan(&v)
		require.NoError(t, err)
		require.Equal(t, 400, v)

one idea for a cockroach_restart test is something like:

-- conn 1
INSERT INTO t VALUES (5, 5), (6, 6);
BEGIN;
SAVEPOINT cockroach_restart;
SELECT * FROM t WHERE k = 5;
    -- conn 2 (to break serializability)
    UPDATE t SET v = 50 WHERE k = 5;
    SELECT * FROM t WHERE k = 6;
-- conn 1
UPDATE t SET v = 60 WHERE k = 6;
    -- conn 2 (blocks, to prove locks are still held)
    UPDATE t SET v = v + 600 WHERE k = 6;
-- conn 1 (should fail with RETRY_SERIALIZABLE)
RELEASE SAVEPOINT cockroach_restart;
-- simulate the advanced retry
ROLLBACK TO SAVEPOINT cockroach_restart;
SAVEPOINT cockroach_restart;
SELECT * FROM t WHERE k = 5;
UPDATE t SET v = 61 WHERE k = 6;
RELEASE SAVEPOINT cockroach_restart;
    -- conn 2 (to prove that ordering was correct and locks are now released)
    SELECT * FROM t WHERE k = 6;
-- conn 1
COMMIT;

pkg/sql/conn_executor_test.go line 1916 at r4 (raw file):

// layer, but for now this is the expected behavior.
// See https://github.com/cockroachdb/cockroach/issues/117020.
func TestRetriableErrorDuringTransactionHoldsLocks(t *testing.T) {

Thanks for adding this!

pkg/sql/conn_executor_exec.go line 186 at r3 (raw file):

	ex.startIdleInSessionTimeout()

	if ex.sessionData().IdleInTransactionSessionTimeout > 0 {

If both IdleInTransactionSessionTimeout and txnTimeoutRemaining are > 0, do we need to set both timers? Seems like we only need to set the one that will fire first?

rafiss

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @michae2)

pkg/sql/conn_executor_exec.go line 186 at r3 (raw file):

Previously, michae2 (Michael Erickson) wrote…

If both IdleInTransactionSessionTimeout and txnTimeoutRemaining are > 0, do we need to set both timers? Seems like we only need to set the one that will fire first?

the two timers have slightly different behaviors. as per the commend i added below:

				// The transaction_timeout setting should move the transaction to the
				// aborted state.
				// NOTE: In Postgres, the transaction_timeout causes the entire session
				// to be terminated. We intentionally diverge from that behavior.

however, the IdleInTransactionSessionTimeout timer causes the session to be terminated.

pkg/sql/conn_executor_test.go line 1905 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

one idea for a cockroach_restart test is something like:

-- conn 1
INSERT INTO t VALUES (5, 5), (6, 6);
BEGIN;
SAVEPOINT cockroach_restart;
SELECT * FROM t WHERE k = 5;
    -- conn 2 (to break serializability)
    UPDATE t SET v = 50 WHERE k = 5;
    SELECT * FROM t WHERE k = 6;
-- conn 1
UPDATE t SET v = 60 WHERE k = 6;
    -- conn 2 (blocks, to prove locks are still held)
    UPDATE t SET v = v + 600 WHERE k = 6;
-- conn 1 (should fail with RETRY_SERIALIZABLE)
RELEASE SAVEPOINT cockroach_restart;
-- simulate the advanced retry
ROLLBACK TO SAVEPOINT cockroach_restart;
SAVEPOINT cockroach_restart;
SELECT * FROM t WHERE k = 5;
UPDATE t SET v = 61 WHERE k = 6;
RELEASE SAVEPOINT cockroach_restart;
    -- conn 2 (to prove that ordering was correct and locks are now released)
    SELECT * FROM t WHERE k = 6;
-- conn 1
COMMIT;

nice! let me give this a try

rafiss

thanks for the reviews! this is ready for another look.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @michae2)

michae2

Just one suggestion. This is great!

Reviewed 4 of 4 files at r5, 3 of 3 files at r6, 1 of 1 files at r7, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @mgartner)

pkg/sql/conn_executor_exec.go line 186 at r3 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

the two timers have slightly different behaviors. as per the commend i added below:
				// The transaction_timeout setting should move the transaction to the
				// aborted state.
				// NOTE: In Postgres, the transaction_timeout causes the entire session
				// to be terminated. We intentionally diverge from that behavior.
however, the IdleInTransactionSessionTimeout timer causes the session to be terminated.

Ah, I missed that. Thanks!

pkg/sql/conn_executor_test.go line 1936 at r5 (raw file):

		_, err = conn1.ExecContext(ctx, `RELEASE SAVEPOINT cockroach_restart`)
		require.ErrorContains(t, err, "failed preemptive refresh due to encountered recently written committed value")

nit: Because we're not letting the conn2 update block the entire time, I think it would be a slightly better test to try the conn2 update right after releasing the cockroach_restart savepoint. This would validate that locks are still held after releasing the cockroach_restart savepoint.

If we have not used any savepoints, it's safe to rollback the transaction immediately when a non-retryable error is encountered. This releases locks sooner, which is useful so we avoid needing to wait until the client sends a ROLLBACK statement. Release note (bug fix): Transactions that enter the aborted state will now release locks they are holding immediately, as long as there is no SAVEPOINT active in the transaction.

Previously, the transaction_timeout setting would only cause the transaction to abort if the timeout had already expired by the time the next statement is received by CRDB. Now we create a timer that fires if the timeout expires while we are waiting for the next statement. There's no release note, since up until the previous commit, which changed the behavior so transactions eagerly release locks, there was no functional difference caused by lazily timing out the transaction. Release note: None

Release note: None

rafiss

tftr!

bors r+

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner and @michae2)

pkg/sql/conn_executor_test.go line 1936 at r5 (raw file):

Previously, michae2 (Michael Erickson) wrote…

nit: Because we're not letting the conn2 update block the entire time, I think it would be a slightly better test to try the conn2 update right after releasing the cockroach_restart savepoint. This would validate that locks are still held after releasing the cockroach_restart savepoint.

good point! added the assertion after too

craig · 2025-02-25T22:33:58Z

Build succeeded:

mgartner

Thanks!!!

Reviewed 3 of 7 files at r2, 4 of 4 files at r8, 3 of 3 files at r9, 1 of 1 files at r10, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale)

A comment has been added to explain logic added in cockroachdb#140160. Release note: None

142062: test: add test filter environment variable to cloud unit tests r=jeffswenson a=kev-cao Epic: none Release note: None 142079: sql: add comment for auto-rollback r=mgartner a=mgartner A comment has been added to explain logic added in #140160. Epic: None Release note: None Co-authored-by: Kevin Cao <kevin.cao@cockroachlabs.com> Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>

rafiss requested a review from mgartner January 30, 2025 22:32

rafiss added the do-not-merge bors won't merge a PR with this label. label Jan 30, 2025

rafiss force-pushed the abort-txn branch from 4a94c91 to a34004f Compare January 31, 2025 03:12

mgartner mentioned this pull request Jan 31, 2025

sql: transactions that enter the "aborted" state due to SQL-level errors should relinquish locks #140234

Closed

michae2 reviewed Jan 31, 2025

View reviewed changes

rafiss force-pushed the abort-txn branch from a34004f to 269553d Compare February 10, 2025 18:43

rafiss removed the do-not-merge bors won't merge a PR with this label. label Feb 10, 2025

rafiss marked this pull request as ready for review February 10, 2025 18:43

rafiss commented Feb 10, 2025

View reviewed changes

rafiss requested a review from michae2 February 10, 2025 18:46

rafiss changed the title ~~WIP: sql: rollback transaction eagerly upon error~~ sql: rollback transaction eagerly upon error Feb 10, 2025

rafiss force-pushed the abort-txn branch 3 times, most recently from f6b1e73 to 5d4bcf2 Compare February 11, 2025 06:07

michae2 reviewed Feb 12, 2025

View reviewed changes

rafiss commented Feb 12, 2025

View reviewed changes

michae2 reviewed Feb 13, 2025

View reviewed changes

rafiss commented Feb 19, 2025

View reviewed changes

rafiss force-pushed the abort-txn branch from 5d4bcf2 to a712452 Compare February 19, 2025 21:16

rafiss commented Feb 19, 2025

View reviewed changes

rafiss requested a review from michae2 February 19, 2025 21:17

michae2 approved these changes Feb 24, 2025

View reviewed changes

rafiss added 3 commits February 25, 2025 17:02

sql: add test for implicit txn holding locks across retries

1d7dc32

Release note: None

rafiss force-pushed the abort-txn branch from a712452 to 1d7dc32 Compare February 25, 2025 22:02

rafiss commented Feb 25, 2025

View reviewed changes

craig bot merged commit 19290ae into cockroachdb:master Feb 25, 2025
23 of 24 checks passed

celeste-cockroachdb bot added the target-release-25.2.0 label Feb 25, 2025

rafiss deleted the abort-txn branch February 25, 2025 22:43

mgartner reviewed Feb 27, 2025

View reviewed changes

mgartner added a commit to mgartner/cockroach that referenced this pull request Feb 27, 2025

sql: add comment for auto-rollback

635d4c1

A comment has been added to explain logic added in cockroachdb#140160. Release note: None

mgartner mentioned this pull request Feb 27, 2025

sql: add comment for auto-rollback #142079

Merged

celeste-cockroachdb bot added v25.2.0-prerelease and removed target-release-25.2.0 labels Mar 24, 2025

michae2 mentioned this pull request Aug 1, 2025

sql: add guardrail to prevent txn use after error within internal executor #151255

Open

Conversation

rafiss commented Jan 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!