Skip to content

sql: increase observability of automatic statement retries#147682

Merged
craig[bot] merged 5 commits intocockroachdb:masterfrom
michae2:retry_obs
Jun 5, 2025
Merged

sql: increase observability of automatic statement retries#147682
craig[bot] merged 5 commits intocockroachdb:masterfrom
michae2:retry_obs

Conversation

@michae2
Copy link
Copy Markdown
Collaborator

@michae2 michae2 commented Jun 3, 2025

In #107044 we added statement-level automatic retries to the conn executor. These retries are used under Read Committed isolation to try to prevent retryable errors from escaping to the application.

This PR adds more observability to statement-level retries, in some cases bringing it up to par with transaction-level retries. See individual commits for details.

Informs: #145377

Release note: None

@michae2 michae2 requested review from a team, rafiss and yuzefovich June 3, 2025 05:20
@michae2 michae2 requested review from a team as code owners June 3, 2025 05:20
@michae2 michae2 requested review from angles-n-daemons and removed request for a team June 3, 2025 05:20
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@michae2
Copy link
Copy Markdown
Collaborator Author

michae2 commented Jun 3, 2025

This will be followed by one more PR with the actual improvements for #145377.

I'm still trying to decide how much of this to backport. Maybe just the last commit?

The fourth commit (fixing dispatchReadCommittedStmtToExecutionEngine for pausable portals) is kind of a half-hearted attempt. I think there are more things broken for pausable portals, but I really didn't want to go down that rabbit hole.

@michae2 michae2 force-pushed the retry_obs branch 2 times, most recently from db37f30 to 0407266 Compare June 3, 2025 05:53
@michae2 michae2 added backport-24.3.x Flags PRs that need to be backported to 24.3 backport-25.1.x backport-25.2.x Flags PRs that need to be backported to 25.2 labels Jun 3, 2025
Copy link
Copy Markdown
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice improvements! :lgtm:

In terms of backports, only the last one would be the safest option, but the whole PR seems pretty safe to me too, and we did backport #142692 to 24.1 which is similar to the third commit. I'll leave to your judgement.

Reviewed 7 of 7 files at r1, 27 of 27 files at r2, 4 of 4 files at r3, 2 of 2 files at r4, 5 of 5 files at r5, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @angles-n-daemons and @rafiss)


-- commits line 45 at r5:
nit: it'd be nice to extend TestRetryFields to test the EXPLAIN ANALYZE addition. Would that be too annoying to set up in a unit test?


pkg/ccl/logictestccl/testdata/logic_test/txn_retry line 1 at r2 (raw file):

# Test that automatic transaction retries and automatic statement retries are

nit: this file is in ccl because of READ COMMITTED? Do we still have the ccl check in place? I imagine that given core deprecation effort we should be able to put new tests in pkg/sql/logictest. (My questions are more to probe in case you have more up-to-date context on this / have looked into this recently.)


pkg/sql/sessionphase/session_phase.go line 182 at r3 (raw file):

// GetStatementRetryLatency returns the time that was spent retrying the
// statement.

nit: perhaps mention explicitly that only applicable to RC isolation.


pkg/sql/sem/builtins/builtins.go line 6067 at r1 (raw file):

	// of the transaction it returns a retryable error. If not, 0 is returned
	// instead of an error.
	// The second version allows one to create an error intended for a transaction

nit: this sentence is stale.


pkg/sql/sem/builtins/builtins.go line 6072 at r1 (raw file):

		tree.FunctionProperties{
			Category:         builtinconstants.CategorySystemInfo,
			DistsqlBlocklist: true, // applicable only on the gateway

nit: we could mark only the second overload as distsql-blocklisted. Up to you.

@michae2 michae2 force-pushed the retry_obs branch 2 times, most recently from b6a58f7 to 7a935c3 Compare June 4, 2025 17:54
michae2 added 2 commits June 4, 2025 21:37
Add an overload of `crdb_internal.force_retry` that continues to throw
retryable errors until the number of retries matches the argument.

This overload is a bit easier to work with than the overload that uses
an interval. Future commits will make use of this overload for testing.

To power this overload, we add the current autoRetryCounter to planner
before every statement execution, which means
`crdb_internal.force_retry` is no longer DistSQL-safe.

Informs: cockroachdb#145377

Release note: None
Under Read Committed isolation the conn executor will automatically
retry statements in addition to transactions. Both kinds of retry were
incrementing the same counter, causing the retry count to be wrong for
all statements following a retried statement in a Read Committed
transaction.

This commit splits autoRetryStmtCounter from autoRetryCounter so that we
can reset the former after each statement. For observability purposes we
use autoRetryStmtCounter + autoRetryCounter as the total number of
retries for the current statement.

(Unfortunately, this calculation is also wrong sometimes. For example,
if a multi-statement transaction mixes both statement retries and
transaction retries, the retry count for the current statement will only
include statement retries from the most recent transaction retry. Still,
using two counters seems like an improvement over a single counter for
most cases.)

Informs: cockroachdb#145377

Release note (ui change): Retry counts for statements executing under
Read Committed isolation are now more accurate.
michae2 added 3 commits June 4, 2025 23:52
Add statement retry count and duration to EXPLAIN ANALYZE output.

Informs: cockroachdb#145377

Release note: None
If we want to preserve retry information then I think we need to save
these fields in the pauseInfo.

Informs: cockroachdb#145377

Release note: None
Add metrics counting the number of automatic retries of transactions and
statements within conn executor.

Informs: cockroachdb#145377

Release note (sql change): Add new metrics `sql.txn.auto_retry.count`
and `sql.statements.auto_retry.count` which count the number of
automatic retries of SQL transactions and statements, respectively,
within the database.

These metrics differ from the related `txn.restarts.*` metrics in that
the latter count retryable errors emitted by the KV layer of the
database, which must be retried, and the former count auto-retry actions
taken by the SQL layer of the database in response to some of those
retryable errors.
Copy link
Copy Markdown
Collaborator Author

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I think for now I will only backport the last commit.

bors r=yuzefovich

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @angles-n-daemons, @rafiss, and @yuzefovich)


-- commits line 45 at r5:

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: it'd be nice to extend TestRetryFields to test the EXPLAIN ANALYZE addition. Would that be too annoying to set up in a unit test?

Thanks for pointing this out! Done.


pkg/sql/sem/builtins/builtins.go line 6067 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: this sentence is stale.

Done.


pkg/sql/sem/builtins/builtins.go line 6072 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: we could mark only the second overload as distsql-blocklisted. Up to you.

Good point, done.


pkg/sql/sessionphase/session_phase.go line 182 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: perhaps mention explicitly that only applicable to RC isolation.

Done.


pkg/ccl/logictestccl/testdata/logic_test/txn_retry line 1 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: this file is in ccl because of READ COMMITTED? Do we still have the ccl check in place? I imagine that given core deprecation effort we should be able to put new tests in pkg/sql/logictest. (My questions are more to probe in case you have more up-to-date context on this / have looked into this recently.)

Yes, that's the reason.

I tried enabling CCL features in sql/logictest with something like #147808 but apparently it isn't quite enough to get Read Committed working. And I think several logic tests actually check that Read Committed is not available... so those would need to be adjusted. I opened #147830 so that we can remember to do this at some point.

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jun 5, 2025

This PR was included in a batch that was canceled, it will be automatically retried

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jun 5, 2025

@craig craig bot merged commit 44563fe into cockroachdb:master Jun 5, 2025
22 checks passed
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Jun 5, 2025

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


💡 Consider backporting to the fork repo instead of the main repo. See instructions for more details.

error creating merge commit from c66548a to blathers/backport-release-24.3-147682: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 24.3.x failed. See errors above.


💡 Consider backporting to the fork repo instead of the main repo. See instructions for more details.

error creating merge commit from c66548a to blathers/backport-release-25.1-147682: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 25.1.x failed. See errors above.


💡 Consider backporting to the fork repo instead of the main repo. See instructions for more details.

error creating merge commit from d2899ca to blathers/backport-release-25.2-147682: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 25.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-24.3.x Flags PRs that need to be backported to 24.3 backport-25.2.x Flags PRs that need to be backported to 25.2 backport-failed v25.3.0-prerelease

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants