Conversation
mjudeikis-bot
left a comment
There was a problem hiding this comment.
The root cause analysis is spot on — Set(1) per reconcile will always write 1, not accumulate. Consolidating into the MetricsCollector is the right fix. Overall LGTM with a couple of observations below.
One gap worth addressing: updateCacheServerCounts has all CacheServer objects available but doesn't call recordConditionStatuses, so ConditionStatus will never be populated for CacheServer objects. CacheServer's reconciler also never called RecordObjectMetrics (pre-existing gap), so this isn't a regression — but since this PR is cleaning up metrics comprehensively, it'd be nice to add it here rather than leave it as a follow-up.
| @@ -54,7 +55,16 @@ func (mc *MetricsCollector) Start(ctx context.Context) { | |||
| } | |||
There was a problem hiding this comment.
Minor: ConditionStatus.Reset() followed immediately by re-population creates a brief window during a Prometheus scrape where condition metrics temporarily disappear. For a 30s-interval collector this is usually acceptable, but operators with high scrape frequency + alerting rules on ConditionStatus could see transient alert fires. Just worth noting as a known tradeoff — the alternative (build a map first, then atomically Reset()+repopulate) is significantly more complex for minimal real-world gain.
There was a problem hiding this comment.
Ideally the collector should just be guarded by a lock, but that's beyond the scope of this PR. You can create a new ticket for this good first issue, though.
| phaseCounts[phase] = make(map[string]int) | ||
| } | ||
| phaseCounts[phase][fp.Namespace]++ | ||
|
|
There was a problem hiding this comment.
CacheServer objects are listed here but recordConditionStatuses is not called. Since ConditionStatus.Reset() was already called at the top of updateObjectCounts, CacheServer condition metrics will be absent from every scrape. Could add:
recordConditionStatuses(CacheServerResourceType, cs.Name, cs.Namespace, cs.Status.Conditions)...inside the loop here to bring CacheServer in line with the other resource types.
There was a problem hiding this comment.
CacheServers currently have no conditions in their status subresource.
|
LGTM label has been added. DetailsGit tree hash: 5d9fd6d3f73468a50c13c268577979f2f350a445 |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mjudeikis The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Summary
After every reconciliation, we call RecordObjectMetrics, which would set the *Count metric for the given (namespace,phase) combination to 1. So after reconciling 5 objects, the metric is 1, not 5.
At the same time, the MetricsCollector regularly updates the *Count metrics with the correct values. However it races with the reconcilers, and even if locking were used, I do not understand why the reconcilers would reset the *Count metrics every single time to 1.
Hence this PR removes the updating of the *Count metrics in each reconciliation, and makes the code a bit more readable.
I then also went ahead and removed the last bits of the RecordObjectMetrics function because when an object gets deleted, the metrics would remain forever (well, until the operator is restarted). The MetricsCollector has the right approach (fetch all objects on a schedule, update all the counts at once), so I simply moved the status handling over there. It already has a list of all objects with their full bodies, so why not.
What Type of PR Is This?
/kind bug
Release Notes