
VReplication/VTAdmin: Clean up VReplication lag related metrics#18802

Merged
mattlord merged 13 commits into main from vtadmin_use_vrepl_trx_lag
Oct 27, 2025

Conversation

@mattlord (Member) commented Oct 24, 2025

Description

VTAdmin Workflow Details View

The max_v_replication_lag value returned in the workflow show vtctldclient command output (e.g. for MoveTables) is a liveness metric: it reflects how long ago we last processed an event from the source across all of the streams. That is not what a VTAdmin user is ultimately interested in when trying to understand whether a vreplication workflow is caught up or lagging. The liveness value can be 0 while the workflow is actually lagging by hours. The real lag is reflected in the returned max_v_replication_transaction_lag value, which is built by comparing the timestamp of the last transaction we applied from the source with the current timestamp on the target, across all of the streams. See the test case in the issue for a demonstration of this.

On this PR branch, the max vreplication transaction lag (across all streams) value returned from the workflow show command is now used in the workflow detail view, as seen when running the manual test in #18804:
Screenshot 2025-10-25 at 12 31 58 AM

VReplication Timings Loss

In #13824 we started closing the binlogplayer stats aggressively everywhere. That was generally correct, but for workflows we should only close them when the controller is deleted. There are various scenarios where the controller and/or binlogplayer stats are re-used, and once you close the binlogplayer stats you cannot restart the goroutines that operate the timings. So, for example, if you updated a workflow to change its state from Copying to Running (which happens automatically when a newly created workflow finishes catching up) you would no longer have any stats for VReplicationLags or VReplicationQPS. You can see that if you run the manual test in #18804 and look at the metrics:

curl -s http://localhost:15$(vtctldclient GetTablets --keyspace customer --tablet-type primary --shard "0" | awk '{print $1}' | cut -d- -f2 | bc)/debug/vars | grep -E 'VReplicationLag|VReplicationQPS'

They are forever empty. So we change how these stats are managed: we now only close the binlogplayer stats when we know they will no longer be used (primarily when we remove the controller).

VReplicationLag* Metrics

This metric was really measuring how long the target vplayer was lagging behind the source vstreamer, as it compared the time the event was created on the vstreamer with the current time on the vplayer when the event was processed. That means that whenever the stream is healthy and running, the reported lag will always be very low, whether the workflow is actually caught up or lagging by hours or even days. This is wrong. We were already calculating the "transaction lag" as described above when estimating the lag while throttled; now we also calculate the lag as "transaction lag" when we're processing transactions.

But still... the VReplicationQPS and VReplicationLag metric values were exactly the same! For example:

"VReplicationLag": {"1":[110.5,93,109.25,85.4],"All":[110.5,93,109.25,85.4]},
"VReplicationQPS": {"All":[110.5,93,109.25,85.4],"BatchTransaction":[110.5,93,109.25,85.4],"Query":[0,0,0,0],"Transaction":[0,0,0,0]},

That was because we were using a timeseries of Rates and updating the lag value on every event, each of which was a batched transaction. This means the rate of lag values being added was equal to the rate of batched transactions. So we move VReplicationLag to use a timeseries of Gauges, which is what we really want for the lag: a sampling of the actual lag values rather than the rate at which they were added. In a manual test that ensures workflow lag, we can now see that the lag metrics all line up (see the tail end of the VReplicationLag list):

❯ curl -s http://localhost:15$(vtctldclient GetTablets --keyspace customer --tablet-type primary --shard "0" | awk '{print $1}' | cut -d- -f2 | bc)/debug/vars | grep -E 'VReplicationLag|VReplicationQPS'
"VReplicationLag": {"1":[9223372036854776000,188,124,127,132,136,140,145,149,154,158,162,167,171,176,180,184,189,193,198,202,206,211,215,220,225,229,232,237,241,246,251,254,259,264,268,271,277,281,285,290,293,297,302,306,312,316,320,323,329,333,337,341,345,349,354,358,362,367,370,375,379,384,389,393,396,401,406,409,415,418,422,427,432,435,440,444,448,452,458,461,465,470,475,478,482,487,492,495,500,505,509,513,517,522,525,530,535,538,542,546,552,556,560,564,568,573,578,581,585,591,594,598,604,607,611,617,620,625,628,634,638,642,647,651,654,659,663,668,672,676,680,684,688,693,698,702,707,712,716,721,725,729,734,738,743,748,753,757,761,766,770,775,779,784,788,793,798,802,807,811,816,820,825,829,835,838,843,848,853,856,862,865,871,875,880,885,889,894,898,903]},
"VReplicationLagSeconds": {"commerce.0.commerce2customer.1": 904},
"VReplicationLagSecondsMax": 904,
"VReplicationLagSecondsTotal": 904,
"VReplicationQPS": {"All":[3.4,2.2,2.8,3.5,3.75,3.75,2.6,3.5,3.5,2.6,3,3.5,3.75,2.6,2.8,3,3.5,2.8,3.5,3.25,2.8,3.5,3.5,3.75,3.2,4.25,3.75,3.2,3.75,3.2,3.2,3.75,4.25,3.2,3.2,4,3.2,4,3.2,4,3.2,3.2,3,4,4.25,4.5,4.25,3,3.4,4.25,3.2,4,3.4,4,4.25,3,3.75,4.25,3,4,3.2,4,4.25,3.2,2.8,3.2,4,4.25,4.25,3.2,3,3.75,3.4,4,3.4,3,3.4,4.25,4.25,3.2,4,4.25,3.2,3,3.2,4,4.5,3.4,4.25,3.4,3.4,3.2,4.25,4.25,3.2,4.25,3.4,4,4.25,3.2,3.2,3.4,4,3.75,2.8,4,3.4,3.2,2.8,4.25,3.2,3.4,3.4,3,4,3.4,3.75,3.4,2,3.5,4.5,3.2,3.6,4.25,3.75,3.4,4,4.5,3.8,3,3.2,2.8,2.2,1.8,2.2,2.2,2.25,1.8,2,2.2,2.2,2,2.5,2.25,2,2.25,2.2,2,2,2.2,1.8,2.25,2.25,2.4,1.6,2.75,2.5,1.8,2,2.5,2.75,1.8,2.75,1.8,2.75,2.25,2.2,2,2.75,2,2.2,1.6,3,1.8,3,1.8,2.5,2.4,2.5,2.2],"BatchTransaction":[0.6,2.2,2.8,3.5,3.75,3.75,2.6,3.5,3.5,2.6,3,3.5,3.75,2.6,2.8,3,3.5,2.8,3.5,3.25,2.8,3.5,3.5,3.75,3.2,4.25,3.75,3.2,3.75,3.2,3.2,3.75,4.25,3.2,3.2,4,3.2,4,3.2,4,3.2,3.2,3,4,4.25,4.5,4.25,3,3.4,4.25,3.2,4,3.4,4,4.25,3,3.75,4.25,3,4,3.2,4,4.25,3.2,2.8,3.2,4,4.25,4.25,3.2,3,3.75,3.4,4,3.4,3,3.4,4.25,4.25,3.2,4,4.25,3.2,3,3.2,4,4.5,3.4,4.25,3.4,3.4,3.2,4.25,4.25,3.2,4.25,3.4,4,4.25,3.2,3.2,3.4,4,3.75,2.8,4,3.4,3.2,2.8,4.25,3.2,3.4,3.4,3,4,3.4,3.75,3.4,2,3.5,4.5,3.2,3.6,4.25,3.75,3.4,4,4.5,3.8,3,3.2,2.8,2.2,1.8,2.2,2.2,2.25,1.8,2,2.2,2.2,2,2.5,2.25,2,2.25,2.2,2,2,2.2,1.8,2.25,2.25,2.4,1.6,2.75,2.5,1.8,2,2.5,2.75,1.8,2.75,1.8,2.75,2.25,2.2,2,2.75,2,2.2,1.6,3,1.8,3,1.8,2.5,2.4,2.5,2.2],"Query":[2.8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"Transaction":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]},

These lag metrics also line up with the max_v_replication_transaction_lag output in the vtctldclient MoveTables show command while the test runs. And because of these changes, the VTAdmin stream lag graph now looks correct too:
Screenshot 2025-10-26 at 5 08 19 PM

Important

The VReplicationLag metric is what VTAdmin uses for its workflow stream lag graph. So fixing this also addresses unexpected/incorrect info reported there.

Note

The VReplicationLag metric isn't even documented here: https://vitess.io/docs/reference/vreplication/metrics/
Docs PR: vitessio/website#2013

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • New or modified tests passed consistently locally and on CI
  • Documentation was added: Add VReplicationLag metric to docs website#2013

AI Disclosure

h/t to Claude Code for doing all of the vtadmin code changes!

The vreplication lag metrics are liveness metrics and not what a
VTAdmin user is interested in when trying to understand if a
vreplication workflow is caught up, lagging, etc. The liveness metric
can be 0 while the workflow is really lagging by hours, which is
reflected in the vreplication transaction lag, which compares the
timestamp of the transaction from the source with the timestamp for
when we executed it in the vstream.

h/t to Claude Code for doing all of the vtadmin work!

Signed-off-by: Matt Lord <mattalord@gmail.com>
@vitess-bot (Contributor) commented Oct 24, 2025

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot vitess-bot bot added NeedsBackportReason If backport labels have been applied to a PR, a justification is required NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsWebsiteDocsUpdate What it says labels Oct 24, 2025
@github-actions github-actions bot added this to the v24.0.0 milestone Oct 24, 2025
@mattlord mattlord added Component: VReplication Component: VTAdmin VTadmin interface Type: Enhancement Logical improvement (somewhere between a bug and feature) and removed NeedsBackportReason If backport labels have been applied to a PR, a justification is required NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work labels Oct 24, 2025
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord force-pushed the vtadmin_use_vrepl_trx_lag branch from 392462a to a8edcbc Compare October 24, 2025 19:21
@codecov bot commented Oct 24, 2025

Codecov Report

❌ Patch coverage is 94.21488% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.70%. Comparing base (b1709cf) to head (ad165d6).
⚠️ Report is 10 commits behind head on main.

Files with missing lines Patch % Lines
go/vt/vttablet/tabletmanager/vreplication/stats.go 33.33% 4 Missing ⚠️
go/stats/gauges.go 96.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #18802      +/-   ##
==========================================
+ Coverage   69.68%   69.70%   +0.02%     
==========================================
  Files        1605     1607       +2     
  Lines      214492   214562      +70     
==========================================
+ Hits       149467   149561      +94     
+ Misses      65025    65001      -24     


@mattlord mattlord removed the NeedsIssue A linked issue is missing for this Pull Request label Oct 24, 2025
@mattlord mattlord marked this pull request as ready for review October 24, 2025 21:41
@mattlord mattlord marked this pull request as draft October 25, 2025 04:47
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord changed the title VTAdmin: Use vreplication transaction lag VTAdmin: Use max vreplication transaction lag in details view Oct 25, 2025
@mattlord mattlord force-pushed the vtadmin_use_vrepl_trx_lag branch from 04eb60a to a95b5db Compare October 26, 2025 16:24
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord force-pushed the vtadmin_use_vrepl_trx_lag branch from a95b5db to e7e2029 Compare October 26, 2025 16:27
@mattlord mattlord changed the title VTAdmin: Use max vreplication transaction lag in details view VReplication/VTAdmin: Clean up vreplication stats timings Oct 26, 2025
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord force-pushed the vtadmin_use_vrepl_trx_lag branch from 3f38378 to 01befd8 Compare October 26, 2025 17:46
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord changed the title VReplication/VTAdmin: Clean up vreplication stats timings VReplication/VTAdmin: Clean up VReplication lag related metrics Oct 26, 2025
@mattlord mattlord marked this pull request as ready for review October 26, 2025 21:22
@beingnoble03 (Member) left a comment


LGTM!

Comment on lines +81 to +83
func (g *Gauges) Stop() {
g.cancel()
}
Member


nit: should we wait here for track() to return? using a channel?

Member Author

@mattlord mattlord Oct 27, 2025


I don't think so. track() will return when the context is cancelled here:

case <-g.ctx.Done():
    return


// Create a Gauges with 3 samples, sampling every 1 second.
g := NewGauges(3, 1*time.Second)
defer g.Stop()
Member


nit: I know it doesn't matter here as it's a very short test, but if we don't want any snapshots from track(), should we call g.Stop() without a defer here?

Member Author


I don't think so. It's fine if we get a non-manual snapshot (a 5 second interval is used everywhere); we only check that we have at least as many snapshots as we took manually.

Member


oh yes right!

Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord merged commit 7666b8e into main Oct 27, 2025
106 of 108 checks passed
@mattlord mattlord deleted the vtadmin_use_vrepl_trx_lag branch October 27, 2025 20:22

Labels

Component: VReplication Component: VTAdmin VTadmin interface Type: Enhancement Logical improvement (somewhere between a bug and feature)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug Report: VTAdmin workflow detail page uses max_v_replication_lag value which can be misleading

3 participants