sql: increase observability of automatic statement retries#147682
sql: increase observability of automatic statement retries#147682craig[bot] merged 5 commits intocockroachdb:masterfrom
Conversation
|
This will be followed by one more PR with the actual improvements for #145377. I'm still trying to decide how much of this to backport. Maybe just the last commit? The fourth commit (fixing dispatchReadCommittedStmtToExecutionEngine for pausable portals) is kind of a half-hearted attempt. I think there are more things broken for pausable portals, but I really didn't want to go down that rabbit hole. |
db37f30 to
0407266
Compare
yuzefovich
left a comment
There was a problem hiding this comment.
In terms of backports, only the last one would be the safest option, but the whole PR seems pretty safe to me too, and we did backport #142692 to 24.1 which is similar to the third commit. I'll leave to your judgement.
Reviewed 7 of 7 files at r1, 27 of 27 files at r2, 4 of 4 files at r3, 2 of 2 files at r4, 5 of 5 files at r5, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @angles-n-daemons and @rafiss)
-- commits line 45 at r5:
nit: it'd be nice to extend TestRetryFields to test the EXPLAIN ANALYZE addition. Would that be too annoying to set up in a unit test?
pkg/ccl/logictestccl/testdata/logic_test/txn_retry line 1 at r2 (raw file):
# Test that automatic transaction retries and automatic statement retries are
nit: this file is in ccl because of READ COMMITTED? Do we still have the ccl check in place? I imagine that given core deprecation effort we should be able to put new tests in pkg/sql/logictest. (My questions are more to probe in case you have more up-to-date context on this / have looked into this recently.)
pkg/sql/sessionphase/session_phase.go line 182 at r3 (raw file):
// GetStatementRetryLatency returns the time that was spent retrying the // statement.
nit: perhaps mention explicitly that only applicable to RC isolation.
pkg/sql/sem/builtins/builtins.go line 6067 at r1 (raw file):
// of the transaction it returns a retryable error. If not, 0 is returned // instead of an error. // The second version allows one to create an error intended for a transaction
nit: this sentence is stale.
pkg/sql/sem/builtins/builtins.go line 6072 at r1 (raw file):
tree.FunctionProperties{ Category: builtinconstants.CategorySystemInfo, DistsqlBlocklist: true, // applicable only on the gateway
nit: we could mark only the second overload as distsql-blocklisted. Up to you.
b6a58f7 to
7a935c3
Compare
Add an overload of `crdb_internal.force_retry` that continues to throw retryable errors until the number of retries matches the argument. This overload is a bit easier to work with than the overload that uses an interval. Future commits will make use of this overload for testing. To power this overload, we add the current autoRetryCounter to planner before every statement execution, which means `crdb_internal.force_retry` is no longer DistSQL-safe. Informs: cockroachdb#145377 Release note: None
Under Read Committed isolation the conn executor will automatically retry statements in addition to transactions. Both kinds of retry were incrementing the same counter, causing the retry count to be wrong for all statements following a retried statement in a Read Committed transaction. This commit splits autoRetryStmtCounter from autoRetryCounter so that we can reset the former after each statement. For observability purposes we use autoRetryStmtCounter + autoRetryCounter as the total number of retries for the current statement. (Unfortunately, this calculation is also wrong sometimes. For example, if a multi-statement transaction mixes both statement retries and transaction retries, the retry count for the current statement will only include statement retries from the most recent transaction retry. Still, using two counters seems like an improvement over a single counter for most cases.) Informs: cockroachdb#145377 Release note (ui change): Retry counts for statements executing under Read Committed isolation are now more accurate.
Add statement retry count and duration to EXPLAIN ANALYZE output. Informs: cockroachdb#145377 Release note: None
If we want to preserve retry information then I think we need to save these fields in the pauseInfo. Informs: cockroachdb#145377 Release note: None
Add metrics counting the number of automatic retries of transactions and statements within conn executor. Informs: cockroachdb#145377 Release note (sql change): Add new metrics `sql.txn.auto_retry.count` and `sql.statements.auto_retry.count` which count the number of automatic retries of SQL transactions and statements, respectively, within the database. These metrics differ from the related `txn.restarts.*` metrics in that the latter count retryable errors emitted by the KV layer of the database, which must be retried, and the former count auto-retry actions taken by the SQL layer of the database in response to some of those retryable errors.
michae2
left a comment
There was a problem hiding this comment.
Thanks! I think for now I will only backport the last commit.
bors r=yuzefovich
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @angles-n-daemons, @rafiss, and @yuzefovich)
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: it'd be nice to extend
TestRetryFieldsto test the EXPLAIN ANALYZE addition. Would that be too annoying to set up in a unit test?
Thanks for pointing this out! Done.
pkg/sql/sem/builtins/builtins.go line 6067 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: this sentence is stale.
Done.
pkg/sql/sem/builtins/builtins.go line 6072 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: we could mark only the second overload as distsql-blocklisted. Up to you.
Good point, done.
pkg/sql/sessionphase/session_phase.go line 182 at r3 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: perhaps mention explicitly that only applicable to RC isolation.
Done.
pkg/ccl/logictestccl/testdata/logic_test/txn_retry line 1 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: this file is in
cclbecause of READ COMMITTED? Do we still have the ccl check in place? I imagine that given core deprecation effort we should be able to put new tests inpkg/sql/logictest. (My questions are more to probe in case you have more up-to-date context on this / have looked into this recently.)
Yes, that's the reason.
I tried enabling CCL features in sql/logictest with something like #147808 but apparently it isn't quite enough to get Read Committed working. And I think several logic tests actually check that Read Committed is not available... so those would need to be adjusted. I opened #147830 so that we can remember to do this at some point.
|
This PR was included in a batch that was canceled, it will be automatically retried |
|
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. 💡 Consider backporting to the fork repo instead of the main repo. See instructions for more details. error creating merge commit from c66548a to blathers/backport-release-24.3-147682: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 24.3.x failed. See errors above. 💡 Consider backporting to the fork repo instead of the main repo. See instructions for more details. error creating merge commit from c66548a to blathers/backport-release-25.1-147682: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 25.1.x failed. See errors above. 💡 Consider backporting to the fork repo instead of the main repo. See instructions for more details. error creating merge commit from d2899ca to blathers/backport-release-25.2-147682: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 25.2.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
In #107044 we added statement-level automatic retries to the conn executor. These retries are used under Read Committed isolation to try to prevent retryable errors from escaping to the application.
This PR adds more observability to statement-level retries, in some cases bringing it up to par with transaction-level retries. See individual commits for details.
Informs: #145377
Release note: None