VReplication/VTAdmin: Clean up VReplication lag related metrics (#18802)
Conversation
The vreplication lag metrics are liveness metrics and not what a `VTAdmin` user is interested in when trying to understand whether a vreplication workflow is caught up, lagging, etc. The liveness metric can be 0 while the workflow is really lagging by hours, which is reflected in the vreplication transaction lag, computed by comparing the timestamp of the transaction from the source with the timestamp of when we executed it in the vstream. h/t to Claude Code for doing all of the vtadmin work!

Signed-off-by: Matt Lord <mattalord@gmail.com>
Codecov Report

```
@@            Coverage Diff             @@
##             main   #18802      +/-   ##
==========================================
+ Coverage   69.68%   69.70%    +0.02%
==========================================
  Files        1605     1607        +2
  Lines      214492   214562       +70
==========================================
+ Hits       149467   149561       +94
+ Misses      65025    65001       -24
```
```go
func (g *Gauges) Stop() {
	g.cancel()
}
```
nit: should we wait here for track() to return? using a channel?
I don't think so. track() will return when the context is cancelled here:

```go
case <-g.ctx.Done():
	return
```
```go
// Create a Gauges with 3 samples, sampling every 1 second.
g := NewGauges(3, 1*time.Second)
defer g.Stop()
```
nit: I know it doesn't matter here as it's a very short test, but if we don't want any snapshots from track, should we call g.Stop() without a defer here?
I don't think so. It's fine if we get a non-manual snapshot (5 second interval is used everywhere) -- we only check that we have at least as many as we do manually.
Description
VTAdmin Workflow Details View
The `max_v_replication_lag` value returned in the workflow's `show` `vtctldclient` command output (e.g. `MoveTables`) is a liveness metric that reflects how long ago we last processed an event from the source across all of the streams, and is not what a `VTAdmin` user is ultimately interested in when trying to understand whether a vreplication workflow is caught up, lagging, etc. The liveness value can be 0 while the workflow is really lagging by hours, which is reflected in the returned `max_v_replication_transaction_lag` value, built by comparing the timestamp of the last transaction that we applied from the source with the current timestamp on the target across all of the streams. See the test case in the issue for a demonstration of this.

On this PR branch you can see the max vreplication transaction lag (across all streams) value returned from the `workflow show` command now used in the workflow detail view when using the manual test in #18804:



VReplication Timings Loss
In #13824 we started closing the binlogplayer stats aggressively everywhere. That was generally correct, but for workflows we should only close them when the controller is deleted. There are various scenarios where the controller and/or binlogplayer stats are re-used. Once you close the binlogplayer stats you cannot re-start the goroutines which operate the timings. So e.g. if you updated a workflow to change the state from `Copying` to `Running` (which is what happens automatically when you create a new workflow and it finishes catching up) you would no longer have any stats for `VReplicationLags` or `VReplicationQPS`. You can actually see that if you run the manual test in #18804 and look at the metrics: they are forever empty. So we change how these stats are managed, and we only close the binlogplayer stats when we know that they will no longer be used (primarily when we remove the controller).
VReplicationLag* Metrics
This metric was really measuring how long the target `vplayer` was lagging behind the source `vstreamer`, as it was comparing the time the event was created on the `vstreamer` with the current time on the `vplayer` when it was processed. That means that whenever the stream is healthy and running, the lag will always be very low, whether the workflow is not actually lagging or is lagging by hours or even days. This is wrong. We were already calculating the "transaction lag" as described above when we estimated the lag due to being throttled; now we also calculate the lag as "transaction lag" when we're processing transactions.

But still... the `VReplicationQPS` and `VReplicationLag` metric values were exactly the same! That was because we were using a timeseries of Rates, and we were updating the lag value on every event, which was a batched transaction. This means that the rate of lag values being added was equal to the rate of batched transactions. So we move `VReplicationLag` to use a timeseries of Gauges, which is what we really want here for the lag: a sampling of the actual lag values added rather than the rate at which they were added. In a manual test that ensures workflow lag, we can now see that the lag metrics all line up (see the tail end of the `VReplicationLag` list).

And the lag metrics line up with what we see for the `max_v_replication_transaction_lag` output in the `vtctldclient MoveTables show` command while the test runs. And because of these changes, the `VTAdmin` stream lag graph looks correct now too:



**Important**
The `VReplicationLag` metric is what `VTAdmin` uses for its workflow stream lag graph, so fixing this also addresses unexpected/incorrect info reported there.

**Note**
The `VReplicationLag` metric isn't even documented here: https://vitess.io/docs/reference/vreplication/metrics/

Docs PR: vitessio/website#2013
Related Issue(s)
Checklist
AI Disclosure
h/t to Claude Code for doing all of the vtadmin code changes!