Support adaptive update interval for low resolution ts#1484
Conversation
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
56c7714 to
19c6546
Compare
| // time. | ||
| // WARNING: This method does not guarantee whether the generated timestamp is legal for accessing the data. | ||
| // Neither is it safe to use it for verifying the legality of another calculated timestamp. | ||
| // Be sure to validate the timestamp before using it to access the data. |
There was a problem hiding this comment.
How about deprecating the usages of this interface in the future, or integrate the check of ValidateSnapshotReadTS into it?
It's more robust to return a verifed timestamp.
There was a problem hiding this comment.
I've considered this. It makes sence, but it can't be easily done due to the current usage in TiDB repo.
There was a problem hiding this comment.
Can we modify GetStaleTimestamp to return the latest fetched timestamp, without adding the arrival duration?
There was a problem hiding this comment.
After discussion with @MyonKeminta , the refactor done could be done in another PR as it requires a lot of work.
There was a problem hiding this comment.
Can we modify
GetStaleTimestampto return the latest fetched timestamp, without adding the arrival duration?
What you said is exactly what GetLowResolutionTSO does. If we deprecate the GetStaleTimestamp method, the change can be complictated and difficult considering the current usage in TiDB repo. Simply changing all usages of GetStaleTimestamp to GetLowResolutionTSO may significantly reduce the precision of user specified staleness.
|
|
||
| adaptiveUpdateIntervalState struct { | ||
| // The mutex to avoid racing between updateTS goroutine and SetLowResolutionTimestampUpdateInterval. | ||
| mu sync.Mutex |
There was a problem hiding this comment.
How about moving the to be protected fields into the mu like
mu {
var1
var2
}
There was a problem hiding this comment.
But it's actually not protected. There are also some accesses without locking.🤔
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
oracle/oracles/pd.go
Outdated
| o.adaptiveLastTSUpdateInterval.Store(int64(configuredInterval)) | ||
| } | ||
| if o.adaptiveUpdateIntervalState.state != adaptiveUpdateTSIntervalStateUnadjustable { | ||
| logutil.Logger(context.Background()).Info("update low resolution ts interval is not being adaptive because the configured interval is too short", |
There was a problem hiding this comment.
| logutil.Logger(context.Background()).Info("update low resolution ts interval is not being adaptive because the configured interval is too short", | |
| logutil.BgLogger().Info("update low resolution ts interval is not being adaptive because the configured interval is too short", |
oracle/oracles/pd.go
Outdated
| prevAdaptiveInterval := currentAdaptiveInterval | ||
| currentAdaptiveInterval = max(requiredStaleness-adaptiveUpdateTSIntervalShrinkingPreserve, minAllowedAdaptiveUpdateTSInterval) | ||
| o.adaptiveLastTSUpdateInterval.Store(int64(currentAdaptiveInterval)) | ||
| logutil.Logger(context.Background()).Info("shrink low resolution ts update interval immediately", |
oracle/oracles/pd.go
Outdated
| zap.Duration("requestedStaleness", requiredStaleness), | ||
| zap.Duration("prevAdaptiveUpdateInterval", prevAdaptiveInterval), |
There was a problem hiding this comment.
why those log field names don't match the field name...
oracle/oracles/pd.go
Outdated
|
|
||
| func (o *pdOracle) getCurrentTSForValidation(ctx context.Context, opt *oracle.Option) (uint64, error) { | ||
| ch := o.tsForValidation.DoChan(opt.TxnScope, func() (interface{}, error) { | ||
| //metrics.ValidateReadTSFromPDCount.Inc() |
There was a problem hiding this comment.
Ah, there should be a metrics here.
oracle/oracles/pd.go
Outdated
| // current ctx. | ||
| res, err := o.GetTimestamp(context.Background(), opt) | ||
| // After finishing the current call, allow the next call to trigger fetching a new TS. | ||
| o.tsForValidation.Forget(opt.TxnScope) |
There was a problem hiding this comment.
No it's not needed. I just realized that I misunderstood the usage of singleflight.
| // If the call that triggers the execution of this function is canceled by the context, other calls that are | ||
| // waiting for reusing the same result should not be canceled. So pass context.Background() instead of the | ||
| // current ctx. | ||
| res, err := o.GetTimestamp(context.Background(), opt) |
There was a problem hiding this comment.
| res, err := o.GetTimestamp(context.Background(), opt) | |
| res, err := o.GetTimestamp(ctx, opt) |
There was a problem hiding this comment.
Using context.Background() here is expected, as said in the above comments.
| estimatedCurrentTS, err := o.getStaleTimestamp(opt.TxnScope, 0) | ||
| if err != nil { | ||
| logutil.Logger(ctx).Warn("failed to estimate current ts by getSlateTimestamp for auto-adjusting update low resolution ts interval", | ||
| zap.Error(err), zap.Uint64("readTS", readTS), zap.String("txnScope", opt.TxnScope)) | ||
| } else { | ||
| o.adjustUpdateLowResolutionTSIntervalWithRequestedStaleness(readTS, estimatedCurrentTS, time.Now()) | ||
| } |
There was a problem hiding this comment.
I'm sorry, I don't understand why we need those logic?
There was a problem hiding this comment.
We estimate the gap between the current time and the readTS, to see if the gap is shorter than the update interval, so that we need to shrink the update interval.
| // time. | ||
| // WARNING: This method does not guarantee whether the generated timestamp is legal for accessing the data. | ||
| // Neither is it safe to use it for verifying the legality of another calculated timestamp. | ||
| // Be sure to validate the timestamp before using it to access the data. |
There was a problem hiding this comment.
Can we modify GetStaleTimestamp to return the latest fetched timestamp, without adding the arrival duration?
oracle/oracles/pd.go
Outdated
|
|
||
| configuredInterval := time.Duration(o.lastTSUpdateInterval.Load()) | ||
| currentAdaptiveInterval := time.Duration(o.adaptiveLastTSUpdateInterval.Load()) | ||
| if configuredInterval <= minAllowedAdaptiveUpdateTSInterval { |
There was a problem hiding this comment.
If configuredInterval <= minAllowedAdaptiveUpdateTSInterval is true, then adaptiveLastTSUpdateInterval can be set to configuredInterval. The variable minAllowedAdaptiveUpdateTSInterval doesn't behave like its name.
How about defining a minUpdateTSInterval and both lastTSUpdateInterval and adaptiveLastTSUpdateInterval should less than minUpdateTSInterval.
There was a problem hiding this comment.
The name minAllowedAdaptiveUpdateTSInterval expected to limit the adaptive update ts interval, to avoid it automatically performs the update too frequently. If the user configures a short interval intentionally, we adopt the user's choice.
There was a problem hiding this comment.
By the way I tried to refine the code of the nextUpdateInterval function, which extracted each branches into small closures and unified the log printing. Please see if it's better than the previous one.
| const ( | ||
| // minAllowedAdaptiveUpdateTSInterval is the lower bound of the adaptive update ts interval for avoiding an abnormal | ||
| // read operation causing the update interval to be too short. | ||
| minAllowedAdaptiveUpdateTSInterval = 500 * time.Millisecond |
There was a problem hiding this comment.
The minimal value of tidb variable tidb_low_resolution_tso_update_interval is 10ms, why minAllowedAdaptiveUpdateTSInterval is 500ms?
There was a problem hiding this comment.
To avoid the adaptive update interval mechanism automatically and silently chooses a too low interval can causing too high frequency of getting ts from PD. But we do not reject it if the user configures it explicitly and the low update interval is just what the user expects.
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
| nextState := func(checkFuncs ...func() (adaptiveUpdateTSIntervalState, time.Duration)) time.Duration { | ||
| for _, f := range checkFuncs { | ||
| state, newInterval := f() | ||
| if state == none { |
There was a problem hiding this comment.
Should it be returned here if the state is adaptiveUpdateTSIntervalStateUnadjustable instead of continue processing?
There was a problem hiding this comment.
If state is not none, it should return at line 454
| latestTS, err := o.GetLowResolutionTimestamp(ctx, opt) | ||
| // If we fail to get latestTS or the readTS exceeds it, get a timestamp from PD to double-check. | ||
| // But we don't need to strictly fetch the latest TS. So if there are already concurrent calls to this function | ||
| // loading the latest TS, we can just reuse the same result to avoid too many concurrent GetTS calls. |
There was a problem hiding this comment.
The current implementation is enough I think. This is just a minor suggestion.
Use singleflight to reduce the GetTS calls is reasonable, but it's possible that a valid ts is invalid in this case.
- thread 1 validate current_ts and send a get ts request, PD allocates its latest TSO and the response is slow.
- thread 2 validate current_ts and the singleflight will reuse the thread 1's TSO.
- The GetTS request of thread 1's is responded.
Then thread 2 can get a stale TSO from PD which increases the failure possibility.
Maybe an enhancement is to resend a singleflight GetTS request if readTS > latestTS to recheck the timestamp in this case.
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cfzjywxk, crazycs520 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
[LGTM Timeline notifier]Timeline:
|
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
|
(Note: further cherry picks needs to carry #1502 at the same time) |
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: you06 <you1474600@gmail.com> Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com> Co-authored-by: you06 <you1474600@gmail.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com> Signed-off-by: ekexium <eke@fastmail.com>
…) (#1531) Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com> Signed-off-by: ekexium <eke@fastmail.com> Co-authored-by: MyonKeminta <9948422+MyonKeminta@users.noreply.github.com>
Ref: pingcap/tidb#56809