Skip to content

sql: rollback transaction eagerly upon error#140160

Merged
craig[bot] merged 3 commits intocockroachdb:masterfrom
rafiss:abort-txn
Feb 25, 2025
Merged

sql: rollback transaction eagerly upon error#140160
craig[bot] merged 3 commits intocockroachdb:masterfrom
rafiss:abort-txn

Conversation

@rafiss
Copy link
Copy Markdown
Collaborator

@rafiss rafiss commented Jan 30, 2025

sql: rollback transaction eagerly upon error

If we have not used any savepoints, it's safe to rollback the
transaction immediately when a non-retryable error is encountered.

This releases locks sooner, which is useful so we avoid needing to wait
until the client sends a ROLLBACK statement.

Release note (bug fix): Transactions that enter the aborted state will
now release locks they are holding immediately, as long as there is no
SAVEPOINT active in the transaction.

sql: make transaction_timeout abort transaction eagerly

Previously, the transaction_timeout setting would only cause the
transaction to abort if the timeout had already expired by the time the
next statement is received by CRDB. Now we create a timer that fires if
the timeout expires while we are waiting for the next statement.

There's no release note, since up until the previous commit, which
changed the behavior so transactions eagerly release locks, there was no
functional difference caused by lazily timing out the transaction.

Release note: None

sql: add test for implicit txn holding locks across retries

Release note: None


fixes #140234

@rafiss rafiss requested a review from mgartner January 30, 2025 22:32
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Jan 30, 2025

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@rafiss rafiss added the do-not-merge bors won't merge a PR with this label. label Jan 30, 2025
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Collaborator

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good!

Some test ideas:

  • something with cockroach_restart
  • something with pausable portals
  • something with implicit txn retry

Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @rafiss)


pkg/sql/conn_executor_savepoints.go line 34 at r1 (raw file):

	ctx context.Context, s *tree.Savepoint, res RestrictedCommandResult,
) (fsm.Event, fsm.EventPayload, error) {
	ex.state.mu.hasSavepoints = true

Would we want to set hasSavepoints to false somewhere in execRelease to allow the eager rollback of a transaction that fails after all created savepoints have been released?

@rafiss rafiss removed the do-not-merge bors won't merge a PR with this label. label Feb 10, 2025
@rafiss rafiss marked this pull request as ready for review February 10, 2025 18:43
Copy link
Copy Markdown
Collaborator Author

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

something with cockroach_restart

I added a test with savepoints. I don't think cockroach_restart needs any other special testing, but I could be missing something, so just let me know what else you had in mind.

something with pausable portals

I was thinking this would be covered by our metamorphic testing. But I can add a test for a specific scenario if you can think of one.

something with implicit txn retry

Added this test.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @michae2)


pkg/sql/conn_executor_savepoints.go line 34 at r1 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Would we want to set hasSavepoints to false somewhere in execRelease to allow the eager rollback of a transaction that fails after all created savepoints have been released?

done. good point.

@rafiss rafiss requested a review from michae2 February 10, 2025 18:46
@rafiss rafiss changed the title WIP: sql: rollback transaction eagerly upon error sql: rollback transaction eagerly upon error Feb 10, 2025
@rafiss rafiss force-pushed the abort-txn branch 3 times, most recently from f6b1e73 to 5d4bcf2 Compare February 11, 2025 06:07
Copy link
Copy Markdown
Collaborator

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner)


pkg/sql/conn_fsm.go line 418 at r2 (raw file):

					ts.mu.Lock()
					defer ts.mu.Unlock()
					if !ts.mu.hasSavepoints {

Something occurred to me: if this is a nested connExecutor created by InternalExecutor.initConnEx, sharing a transaction with its parent connExecutor, then will it know about the savepoints in its parent connExecutor?

Maybe we also need to check that this is a top-level connExecutor? Or we need to add savepoint info to InternalExecutor.extraTxnState?

Copy link
Copy Markdown
Collaborator Author

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @michae2)


pkg/sql/conn_fsm.go line 418 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Something occurred to me: if this is a nested connExecutor created by InternalExecutor.initConnEx, sharing a transaction with its parent connExecutor, then will it know about the savepoints in its parent connExecutor?

Maybe we also need to check that this is a top-level connExecutor? Or we need to add savepoint info to InternalExecutor.extraTxnState?

When we're running an internal executor delegated from a parent transaction, the connExecutor uses a different state machine. It's further down below:

var BoundTxnStateTransitions = fsm.Compile(fsm.Pattern{

The transaction is never rolled back in those state transitions.

Copy link
Copy Markdown
Collaborator

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 7 of 7 files at r2, 3 of 3 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner)


pkg/sql/conn_fsm.go line 418 at r2 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

When we're running an internal executor delegated from a parent transaction, the connExecutor uses a different state machine. It's further down below:

var BoundTxnStateTransitions = fsm.Compile(fsm.Pattern{

The transaction is never rolled back in those state transitions.

Great, thank you!


pkg/sql/conn_executor_test.go line 1905 at r2 (raw file):

		err = conn1.QueryRowContext(ctx, `SELECT v FROM t WHERE k = 4`).Scan(&v)
		require.NoError(t, err)
		require.Equal(t, 400, v)

one idea for a cockroach_restart test is something like:

-- conn 1
INSERT INTO t VALUES (5, 5), (6, 6);
BEGIN;
SAVEPOINT cockroach_restart;
SELECT * FROM t WHERE k = 5;
    -- conn 2 (to break serializability)
    UPDATE t SET v = 50 WHERE k = 5;
    SELECT * FROM t WHERE k = 6;
-- conn 1
UPDATE t SET v = 60 WHERE k = 6;
    -- conn 2 (blocks, to prove locks are still held)
    UPDATE t SET v = v + 600 WHERE k = 6;
-- conn 1 (should fail with RETRY_SERIALIZABLE)
RELEASE SAVEPOINT cockroach_restart;
-- simulate the advanced retry
ROLLBACK TO SAVEPOINT cockroach_restart;
SAVEPOINT cockroach_restart;
SELECT * FROM t WHERE k = 5;
UPDATE t SET v = 61 WHERE k = 6;
RELEASE SAVEPOINT cockroach_restart;
    -- conn 2 (to prove that ordering was correct and locks are now released)
    SELECT * FROM t WHERE k = 6;
-- conn 1
COMMIT;

pkg/sql/conn_executor_test.go line 1916 at r4 (raw file):

// layer, but for now this is the expected behavior.
// See https://github.com/cockroachdb/cockroach/issues/117020.
func TestRetriableErrorDuringTransactionHoldsLocks(t *testing.T) {

Thanks for adding this!


pkg/sql/conn_executor_exec.go line 186 at r3 (raw file):

	ex.startIdleInSessionTimeout()

	if ex.sessionData().IdleInTransactionSessionTimeout > 0 {

If both IdleInTransactionSessionTimeout and txnTimeoutRemaining are > 0, do we need to set both timers? Seems like we only need to set the one that will fire first?

Copy link
Copy Markdown
Collaborator Author

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @michae2)


pkg/sql/conn_executor_exec.go line 186 at r3 (raw file):

Previously, michae2 (Michael Erickson) wrote…

If both IdleInTransactionSessionTimeout and txnTimeoutRemaining are > 0, do we need to set both timers? Seems like we only need to set the one that will fire first?

the two timers have slightly different behaviors. as per the commend i added below:

				// The transaction_timeout setting should move the transaction to the
				// aborted state.
				// NOTE: In Postgres, the transaction_timeout causes the entire session
				// to be terminated. We intentionally diverge from that behavior.

however, the IdleInTransactionSessionTimeout timer causes the session to be terminated.


pkg/sql/conn_executor_test.go line 1905 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

one idea for a cockroach_restart test is something like:

-- conn 1
INSERT INTO t VALUES (5, 5), (6, 6);
BEGIN;
SAVEPOINT cockroach_restart;
SELECT * FROM t WHERE k = 5;
    -- conn 2 (to break serializability)
    UPDATE t SET v = 50 WHERE k = 5;
    SELECT * FROM t WHERE k = 6;
-- conn 1
UPDATE t SET v = 60 WHERE k = 6;
    -- conn 2 (blocks, to prove locks are still held)
    UPDATE t SET v = v + 600 WHERE k = 6;
-- conn 1 (should fail with RETRY_SERIALIZABLE)
RELEASE SAVEPOINT cockroach_restart;
-- simulate the advanced retry
ROLLBACK TO SAVEPOINT cockroach_restart;
SAVEPOINT cockroach_restart;
SELECT * FROM t WHERE k = 5;
UPDATE t SET v = 61 WHERE k = 6;
RELEASE SAVEPOINT cockroach_restart;
    -- conn 2 (to prove that ordering was correct and locks are now released)
    SELECT * FROM t WHERE k = 6;
-- conn 1
COMMIT;

nice! let me give this a try

Copy link
Copy Markdown
Collaborator Author

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the reviews! this is ready for another look.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @michae2)

@rafiss rafiss requested a review from michae2 February 19, 2025 21:17
Copy link
Copy Markdown
Collaborator

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Just one suggestion. This is great!

Reviewed 4 of 4 files at r5, 3 of 3 files at r6, 1 of 1 files at r7, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @mgartner)


pkg/sql/conn_executor_exec.go line 186 at r3 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

the two timers have slightly different behaviors. as per the commend i added below:

				// The transaction_timeout setting should move the transaction to the
				// aborted state.
				// NOTE: In Postgres, the transaction_timeout causes the entire session
				// to be terminated. We intentionally diverge from that behavior.

however, the IdleInTransactionSessionTimeout timer causes the session to be terminated.

Ah, I missed that. Thanks!


pkg/sql/conn_executor_test.go line 1936 at r5 (raw file):

		_, err = conn1.ExecContext(ctx, `RELEASE SAVEPOINT cockroach_restart`)
		require.ErrorContains(t, err, "failed preemptive refresh due to encountered recently written committed value")

nit: Because we're not letting the conn2 update block the entire time, I think it would be a slightly better test to try the conn2 update right after releasing the cockroach_restart savepoint. This would validate that locks are still held after releasing the cockroach_restart savepoint.

If we have not used any savepoints, it's safe to rollback the
transaction immediately when a non-retryable error is encountered.

This releases locks sooner, which is useful so we avoid needing to wait
until the client sends a ROLLBACK statement.

Release note (bug fix): Transactions that enter the aborted state will
now release locks they are holding immediately, as long as there is no
SAVEPOINT active in the transaction.
Previously, the transaction_timeout setting would only cause the
transaction to abort if the timeout had already expired by the time the
next statement is received by CRDB. Now we create a timer that fires if
the timeout expires while we are waiting for the next statement.

There's no release note, since up until the previous commit, which
changed the behavior so transactions eagerly release locks, there was no
functional difference caused by lazily timing out the transaction.

Release note: None
Copy link
Copy Markdown
Collaborator Author

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tftr!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner and @michae2)


pkg/sql/conn_executor_test.go line 1936 at r5 (raw file):

Previously, michae2 (Michael Erickson) wrote…

nit: Because we're not letting the conn2 update block the entire time, I think it would be a slightly better test to try the conn2 update right after releasing the cockroach_restart savepoint. This would validate that locks are still held after releasing the cockroach_restart savepoint.

good point! added the assertion after too

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Feb 25, 2025

@craig craig bot merged commit 19290ae into cockroachdb:master Feb 25, 2025
23 of 24 checks passed
@rafiss rafiss deleted the abort-txn branch February 25, 2025 22:43
Copy link
Copy Markdown
Contributor

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Thanks!!!

Reviewed 3 of 7 files at r2, 4 of 4 files at r8, 3 of 3 files at r9, 1 of 1 files at r10, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale)

mgartner added a commit to mgartner/cockroach that referenced this pull request Feb 27, 2025
A comment has been added to explain logic added in cockroachdb#140160.

Release note: None
craig bot pushed a commit that referenced this pull request Feb 27, 2025
142062: test: add test filter environment variable to cloud unit tests r=jeffswenson a=kev-cao

Epic: none

Release note: None

142079: sql: add comment for auto-rollback r=mgartner a=mgartner

A comment has been added to explain logic added in #140160.

Epic: None
Release note: None

Co-authored-by: Kevin Cao <kevin.cao@cockroachlabs.com>
Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

sql: transactions that enter the "aborted" state due to SQL-level errors should relinquish locks

4 participants