Feature Request: change vtbackup_duration_by_phase to binary vtbackup_duration #12972
Closed
Labels
Component: Backup and Restore
Type: Enhancement (logical improvement, somewhere between a bug and feature)
Type: Feature
Feature Description
After using vtbackup_duration_by_phase for a few weeks in production, I can confidently say that it is pretty awkward to use.
I recommend changing this metric to vtbackup_phase, a binary-valued gauge similar to K8s metrics like kube_pod_status_phase. Here's an example of what these metrics could look like:
# HELP vtbackup_phase Active phase.
# TYPE vtbackup_phase gauge
vtbackup_phase{phase="CatchUpReplication"} 0
vtbackup_phase{phase="InitialBackup"} 0
vtbackup_phase{phase="RestoreLastBackup"} 0
vtbackup_phase{phase="TakeNewBackup"} 1
At any given moment, only one phase would be active. In order to calculate how long a phase has been active, you could do something like this:
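sum_over_time(vtbackup_phase{phase="TakeNewBackup"}[<range>]) * <interval>
Where <range> is a lookback window long enough to cover the whole phase (sum_over_time takes a range vector) and <interval> is the number of seconds between data points (the scrape interval).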
Use Case(s)
Some issues that the proposed change would resolve:
- vtbackup currently doesn't report which phase is active. It only reports a phase's duration once that phase completes, so there's no way to tell which phase vtbackup is currently in unless you know enough about the program's internals to infer its state from other metrics and logs.
- If vtbackup exits before completing a phase, it won't report the time it spent in that phase.
- After completing the last phase (TakeNewBackup), vtbackup exits almost immediately. There might be only a few seconds between vtbackup first reporting that phase and vtbackup exiting, which may not give the metric collector (e.g. Prometheus) a chance to scrape the metric. This necessitates awkward workarounds like --keep-alive-timeout to keep vtbackup alive long enough for the collector to do at least one scrape.