txnkv: Add a named error ErrCommitTSLag when waiting for commitWaitUntilTSO timeout#1842
Conversation
|
/retest |
f138e86 to
7025a3a
Compare
|
/retest |
1 similar comment
|
/retest |
97741b2 to
85e6b4c
Compare
|
/retest |
1 similar comment
|
/retest |
03d20f8 to
5b1dba3
Compare
ErrCommitTSLag when waiting commitWaitUntilTSO timeoutErrCommitTSLag when waiting for commitWaitUntilTSO timeout
|
Tip For best results, initiate chat on the files or code changes.
Please let me know what you need! |
|
@coderabbitai please review this PR |
|
✅ Actions performedReview triggered.
|
📝 WalkthroughWalkthroughAdds a named error Changes
Sequence Diagram(s)sequenceDiagram
participant Txn as Transaction
participant Backoff as Backoffer
participant PD as PD/TSO
participant Metrics as Metrics
Txn->>PD: GetTimestampForCommit (first attempt)
PD-->>Txn: startTS
Txn->>Txn: if startTS < commitWaitUntilTSO -> record FirstLagTS, start timer
loop retry until TS >= commitWaitUntilTSO or timeout
Txn->>Backoff: apply BoCommitTSLag backoff
Backoff-->>Txn: delay
Txn->>PD: GetTimestampForCommit (retry)
PD-->>Txn: updated TS
end
alt success
Txn->>Metrics: record WaitHistogram (ok) and AttemptHistogram (ok)
Txn-->>Caller: return TS with LagDetails
else timeout/failure
Txn->>Metrics: record WaitHistogram (err) and AttemptHistogram (err)
Txn-->>Caller: return ErrCommitTSLag
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧹 Recent nitpick comments
📜 Recent review detailsConfiguration used: defaults Review profile: CHILL Plan: Pro 📒 Files selected for processing (10)
💤 Files with no reviewable changes (1)
🚧 Files skipped from review as they are similar to previous changes (4)
🧰 Additional context used🧬 Code graph analysis (2)config/retry/config.go (2)
metrics/shortcuts.go (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
🔇 Additional comments (12)
✏️ Tip: You can disable this entire section by setting Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@integration_tests/option_test.go`:
- Line 216: The assertion is checking the method value
txn.KVTxn.SetCommitWaitUntilTSO (a method expression) which is always non-zero
in Go; remove the line asserting s.NotZero(txn.KVTxn.SetCommitWaitUntilTSO) and
rely on the existing check txn.GetCommitWaitUntilTSO() (already at line ~206) to
verify the commitWaitUntilTSO state.
In `@util/execdetails.go`:
- Line 158: Add a clarifying comment on the CommitDetails type explaining that
the LagDetails (CommitTSLagDetails) field is intentionally excluded from Clone()
and Merge(): it records metrics for a single commit attempt and is freshly
populated by fillCommitTSLagDetails() for each attempt rather than being
accumulated or preserved across clones/merges, so Clone() and Merge()
deliberately do not copy or merge LagDetails to avoid stale or double-counted
metrics.
🧹 Nitpick comments (1)
integration_tests/option_test.go (1)
72-88: The asymmetric lock/unlock pattern is confusing but correct.The locking pattern where unlock happens either in an
elsebranch (line 78) or viadefer(line 78 for the matching case) works correctly but is hard to follow. Consider restructuring for clarity.♻️ Suggested refactor for clarity
func (o *mockCommitTSOracle) GetTimestamp(ctx context.Context, option *oracle.Option) (uint64, error) { if ts := ctx.Value(transaction.CtxInGetTimestampForCommitKey); ts != nil { o.mu.Lock() - if ts != o.mu.startTS { - o.mu.Unlock() - } else { - defer o.mu.Unlock() + defer o.mu.Unlock() + if ts == o.mu.startTS { if len(o.mu.commitTSOffsets) == 0 { panic("empty mock oracle ts set") } offset := o.mu.commitTSOffsets[0] o.mu.commitTSOffsets = o.mu.commitTSOffsets[1:] return o.mu.startTS + offset, nil } } return o.Oracle.GetTimestamp(ctx, option) }
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
config/retry/config.goerror/error.goexamples/txnkv/go.modintegration_tests/option_test.gometrics/metrics.gometrics/shortcuts.gotxnkv/transaction/2pc.gotxnkv/transaction/test_probe.gotxnkv/transaction/txn.goutil/execdetails.go
💤 Files with no reviewable changes (1)
- txnkv/transaction/test_probe.go
🧰 Additional context used
🧬 Code graph analysis (2)
config/retry/config.go (2)
metrics/shortcuts.go (1)
BackoffHistogramEmpty(75-75)error/error.go (1)
ErrCommitTSLag(95-95)
metrics/shortcuts.go (1)
metrics/metrics.go (2)
TiKVTxnLagCommitTSWaitHistogram(130-130)TiKVTxnLagCommitTSAttemptHistogram(131-131)
🔇 Additional comments (15)
examples/txnkv/go.mod (1)
3-3: LGTM - Dependency updates look appropriate.Go toolchain patch bump (1.23.0 → 1.23.12) and indirect dependency updates align with the new commit timestamp lag features introduced in this PR.
Also applies to: 25-25, 34-34
error/error.go (2)
94-95: LGTM - Error declaration follows existing patterns.The new
ErrCommitTSLagsentinel error is well-documented and placed appropriately alongside other TiKV-related errors.
425-429: LGTM - Helper function correctly useserrors.Is.The
IsErrorCommitTSLaghelper follows the established pattern (similar toIsErrorUndetermined) and properly supports wrapped error chains.util/execdetails.go (1)
141-152: LGTM - Well-documented struct for lag metrics.The
CommitTSLagDetailsstruct clearly captures all necessary information for commit timestamp lag tracking with appropriate field comments.metrics/metrics.go (3)
130-131: LGTM - Histogram declarations follow existing patterns.The new histogram vectors are consistently named and typed, matching the established conventions in this file.
952-970: LGTM - Histogram configuration is appropriate.The bucket ranges are well-suited for their respective use cases:
- Wait time: 0.5ms–16s covers typical PD TSO lag scenarios
- Attempt count: 1–32 is reasonable for retry tracking
Both metrics follow naming conventions (
_secondsfor duration,_countfor counts) and use theLblResultlabel consistently.
1075-1076: LGTM - Metrics properly registered.Both new histograms are correctly added to
RegisterMetrics().config/retry/config.go (1)
136-136: LGTM - Backoff config follows established patterns.
BoCommitTSLaguses configuration values (base=2ms, cap=500ms, NoJitter) consistent with similar timestamp-related backoff configs likeBoMaxTsNotSynced. UsingBackoffHistogramEmptyis appropriate since dedicated lag metrics are tracked separately viaTiKVTxnLagCommitTSWaitHistogram.metrics/shortcuts.go (1)
195-199: LGTM!The new metric shortcuts follow the established pattern in this file and correctly initialize the histogram observers with "ok" and "err" labels, consistent with other metrics like
TxnHeartBeatHistogramOK/TxnHeartBeatHistogramError.Also applies to: 361-365
txnkv/transaction/txn.go (3)
224-229: LGTM!The
commitLagWaitStatsstruct is well-designed to capture the lag statistics needed for metrics and diagnostics.
1955-2043: LGTM!The
GetTimestampForCommitimplementation correctly handles the commit TS lag scenario:
- Early return when no lag is detected (line 1966)
- Fast-fail for excessive lag that exceeds timeout (lines 1991-2004)
- Retry loop with backoff for recoverable lag (lines 2009-2025)
- Proper error wrapping with diagnostic information (lines 2027-2040)
- Metrics are recorded via defer for all lag cases
The loop exit condition
ts <= txn.commitWaitUntilTSOcorrectly continues until we get a TS greater than the target, andlastAttemptTSis properly updated after each successful fetch.
2045-2052: LGTM!The helper correctly guards with
stats.waitDuration > 0to only populate lag details when the lag path was actually exercised.txnkv/transaction/2pc.go (2)
1772-1772: LGTM!Correctly populates lag details after the async commit path's timestamp retrieval.
1908-1908: LGTM!Correctly populates lag details after the regular 2PC commit timestamp retrieval.
integration_tests/option_test.go (1)
141-231: Well-structured parameterized test with comprehensive coverage.The test cases cover key scenarios:
- No lag (single attempt success)
- Lag with 1-2 retries and success
- Excessive lag (fail-fast)
- Timeout during retries
The metric collection pattern using
doWithCollectHistSamplesIncis a clean way to verify histogram increments.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
|
@tiancaiamao: adding LGTM is restricted to approvers and reviewers in OWNERS files. DetailsIn response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
5b1dba3 to
4dc5b95
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@integration_tests/option_test.go`:
- Around line 72-88: The comparison in mockCommitTSOracle.GetTimestamp is
comparing an interface{} (ctx.Value(...)) to a uint64, which always fails;
change it to assert the value from
ctx.Value(transaction.CtxInGetTimestampForCommitKey) to (uint64, ok) and only
proceed if ok && assertedValue == o.mu.startTS; update the branch that currently
checks `if ts := ...; ts != nil { ... if ts != o.mu.startTS { ... } else { ... }
}` to use the asserted uint64 value, keep the existing locking/unlocking and
panic/offset logic intact, and fall back to o.Oracle.GetTimestamp when the
assertion fails or the value is nil.
🧹 Nitpick comments (2)
integration_tests/option_test.go (1)
56-63: Consider documenting the mock oracle's expected usage.The
mockCommitTSOraclestruct embedsoracle.Oracleanonymously and stores mock state. A brief doc comment explaining that this mock interceptsGetTimestampcalls for commit-phase testing (whenCtxInGetTimestampForCommitKeyis present) would improve clarity for future maintainers.txnkv/transaction/txn.go (1)
373-377: Minor documentation clarification needed.The doc comment states "If it returns zero, it means it is not set and the default timeout should be used", but
commitWaitUntilTSOTimeoutis initialized totime.SecondinNewTiKVTxn(line 248). This means a newly created transaction will never return zero. Consider updating the documentation to reflect that the default value istime.Second, or clarify the intended semantics.📝 Suggested documentation update
-// GetCommitWaitUntilTSOTimeout returns the timeout to wait for the commit tso reach the expected value. -// If it returns zero, it means it is not set and the default timeout should be used +// GetCommitWaitUntilTSOTimeout returns the timeout to wait for the commit tso reach the expected value. +// The default value is time.Second if not explicitly overridden via SetCommitWaitUntilTSOTimeout. func (txn *KVTxn) GetCommitWaitUntilTSOTimeout() time.Duration {
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
config/retry/config.goerror/error.goexamples/txnkv/go.modintegration_tests/option_test.gometrics/metrics.gometrics/shortcuts.gotxnkv/transaction/2pc.gotxnkv/transaction/test_probe.gotxnkv/transaction/txn.goutil/execdetails.go
💤 Files with no reviewable changes (1)
- txnkv/transaction/test_probe.go
🚧 Files skipped from review as they are similar to previous changes (3)
- error/error.go
- metrics/metrics.go
- examples/txnkv/go.mod
🧰 Additional context used
🧬 Code graph analysis (3)
integration_tests/option_test.go (7)
txnkv/transaction/txn.go (1)
CtxInGetTimestampForCommitKey(1951-1951)tikv/test_probe.go (1)
StoreProbe(54-56)tikv/kv.go (1)
KVStore(124-166)integration_tests/util_test.go (1)
NewTestStore(71-85)util/execdetails.go (3)
CommitDetails(168-193)CommitDetailCtxKey(60-60)CommitTSLagDetails(143-152)metrics/shortcuts.go (4)
LagCommitTSWaitHistogramWithOK(196-196)LagCommitTSAttemptHistogramWithOK(198-198)LagCommitTSWaitHistogramWithError(197-197)LagCommitTSAttemptHistogramWithError(199-199)error/error.go (1)
IsErrorCommitTSLag(427-429)
metrics/shortcuts.go (1)
metrics/metrics.go (2)
TiKVTxnLagCommitTSWaitHistogram(130-130)TiKVTxnLagCommitTSAttemptHistogram(131-131)
txnkv/transaction/txn.go (7)
config/retry/backoff.go (2)
Backoffer(60-76)NewBackoffer(84-90)tikv/backoff.go (2)
Backoffer(45-45)NewBackoffer(61-63)metrics/shortcuts.go (3)
LagCommitTSAttemptHistogramWithError(199-199)LagCommitTSAttemptHistogramWithOK(198-198)LagCommitTSWaitHistogramWithOK(196-196)oracle/oracle.go (1)
GetTimeFromTS(118-121)error/error.go (2)
ErrCommitTSLag(95-95)IsErrorCommitTSLag(427-429)config/retry/config.go (1)
BoCommitTSLag(136-136)util/execdetails.go (1)
CommitTSLagDetails(143-152)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: integration-raw-tikv (v2)
- GitHub Check: integration-local-race
- GitHub Check: integration-raw-tikv (v1ttl)
- GitHub Check: integration-next-gen-tikv
- GitHub Check: integration-tikv
- GitHub Check: race-test
- GitHub Check: integration-local
🔇 Additional comments (11)
txnkv/transaction/2pc.go (1)
1772-1772: LGTM - Lag details properly captured after timestamp fetching.The
fillCommitTSLagDetailscalls are correctly placed immediately after the timestamp fetching operations (GetLatestTsTimeandGetCommitTsTime), ensuring lag metrics are captured at the right points in the commit flow for both async commit/1PC and regular 2PC paths.Also applies to: 1908-1908
config/retry/config.go (1)
136-136: LGTM - Backoff configuration for commit TS lag.The
BoCommitTSLagconfiguration follows the established pattern with appropriate parameters: a short base (2ms) for quick retries, reasonable cap (500ms), andNoJitterconsistent with similar time-sensitive backoffs likeBoMaxTsNotSynced.metrics/shortcuts.go (1)
195-199: LGTM - Metric shortcuts follow established patterns.The new lag-related histogram observer shortcuts are correctly declared and initialized, following the existing conventions for OK/error label variants (e.g.,
TxnHeartBeatHistogramOK/Error).Also applies to: 361-365
util/execdetails.go (2)
141-165: LGTM - NewCommitTSLagDetailsstruct with proper merge semantics.The struct correctly captures lag metrics, and the
Mergemethod appropriately accumulates wait time and backoff count while using the latest timestamp values.One minor note:
other.FirstLagTS <= 0on auint64is equivalent to== 0since unsigned integers cannot be negative. This works correctly but== 0would be clearer.
171-171: LGTM -LagDetailsintegration intoCommitDetails.The
LagDetailsfield is correctly added and properly handled in bothMerge(line 199) andClone(line 260). SinceCommitTSLagDetailscontains only value types, the direct struct assignment inCloneprovides correct copy semantics.Also applies to: 199-199, 260-260
integration_tests/option_test.go (1)
120-232: Well-structured table-driven test covering lag scenarios.The test comprehensively covers:
- No lag (single TSO fetch succeeds)
- Lag with retry success (1 and 2 retries)
- Excessive lag (immediate failure)
- Timeout during retry
The metric increment collection via
doWithCollectHistSamplesIncand lag detail assertions are thorough.txnkv/transaction/txn.go (5)
71-71: LGTM!Import of the
intestpackage for test-time instrumentation is appropriate.
224-229: LGTM!The
commitLagWaitStatsstruct cleanly encapsulates the lag tracking state, with fields that align well withutil.CommitTSLagDetails.
1949-1951: LGTM!The context key pattern using an unexported type with an exported variable is idiomatic Go for context keys, preventing key collisions.
1955-2043: LGTM with minor observation.The lag handling logic is well-structured:
- Early return when no lag detected (line 1966-1968)
- Early error if clock drift exceeds timeout (line 1996-2004)
- Backoff loop with proper timeout budget (line 2007-2025)
- Metrics recorded appropriately in defer with OK/Error distinction
The approach of creating a new
Backofferat line 2007 withmaxSleepas its budget is appropriate, as the lag retry has its own dedicated timeout separate from the original operation's budget.
2045-2053: LGTM!The helper correctly populates lag details only when lag actually occurred (guarded by
stats.firstAttemptTS > 0), with proper field mapping toutil.CommitTSLagDetails.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
…tilTSO` timeout Signed-off-by: Chao Wang <cclcwangchao@hotmail.com>
4dc5b95 to
bdc3c3a
Compare
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cfzjywxk, ekexium, tiancaiamao The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
close #1843
This PR:
ErrCommitTSLagif the PD TSO lags too much compared with the value set bytxn. SetCommitWaitUntilTSOTimeoutIsErrorCommitTSLagto determine the above errortxn_lag_commit_ts_wait_seconds: the total time waiting for the lag commit ts to reach the expected valuetxn_lag_commit_ts_attempt_count: total attempts to get the PD TSO when lag is detected.CommitTsLagWaitTimeforCommitDetails, which is the same astxn_lag_commit_ts_wait_seconds. It can be displayed in TiDB slow log details.Summary by CodeRabbit
New Features
Tests
Chores
✏️ Tip: You can customize this high-level summary in your review settings.