Skip to content

Fix metrics#159

Merged
kcp-ci-bot merged 2 commits intokcp-dev:mainfrom
xrstf:fix-metrics
Feb 25, 2026
Merged

Fix metrics#159
kcp-ci-bot merged 2 commits intokcp-dev:mainfrom
xrstf:fix-metrics

Conversation

@xrstf
Copy link
Copy Markdown
Contributor

@xrstf xrstf commented Feb 23, 2026

Summary

After every reconciliation, we call RecordObjectMetrics, which would set the *Count metric for the given (namespace,phase) combination to 1. So after reconciling 5 objects, the metric is 1, not 5.

At the same time, the MetricsCollector regularly updates the *Count metrics with the correct values. However it races with the reconcilers, and even if locking were used, I do not understand why the reconcilers would reset the *Count metrics every single time to 1.

Hence this PR removes the updating of the *Count metrics in each reconciliation, and makes the code a bit more readable.

I then also went ahead and removed the last bits of the RecordObjectMetrics function because when an object gets deleted, the metrics would remain forever (well, until the operator is restarted). The MetricsCollector has the right approach (fetch all objects on a schedule, update all the counts at once), so I simply moved the status handling over there. It already has a list of all objects with their full bodies, so why not.

What Type of PR Is This?

/kind bug

Release Notes

Fix stale and incorrect metric values.

On-behalf-of: @SAP christoph.mewes@sap.com
@kcp-ci-bot kcp-ci-bot added kind/bug Categorizes issue or PR as related to a bug. release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has signed the DCO. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 23, 2026
Copy link
Copy Markdown

@mjudeikis-bot mjudeikis-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The root cause analysis is spot on — Set(1) per reconcile will always write 1, not accumulate. Consolidating into the MetricsCollector is the right fix. Overall LGTM with a couple of observations below.

One gap worth addressing: updateCacheServerCounts has all CacheServer objects available but doesn't call recordConditionStatuses, so ConditionStatus will never be populated for CacheServer objects. CacheServer's reconciler also never called RecordObjectMetrics (pre-existing gap), so this isn't a regression — but since this PR is cleaning up metrics comprehensively, it'd be nice to add it here rather than leave it as a follow-up.

@@ -54,7 +55,16 @@ func (mc *MetricsCollector) Start(ctx context.Context) {
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: ConditionStatus.Reset() followed immediately by re-population creates a brief window during a Prometheus scrape where condition metrics temporarily disappear. For a 30s-interval collector this is usually acceptable, but operators with high scrape frequency + alerting rules on ConditionStatus could see transient alert fires. Just worth noting as a known tradeoff — the alternative (build a map first, then atomically Reset()+repopulate) is significantly more complex for minimal real-world gain.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally the collector should just be guarded by a lock, but that's beyond the scope of this PR. You can create a new ticket for this good first issue, though.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mjudeikis-bot create follow-up issue please

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — created #160 to track adding a sync.RWMutex to MetricsCollector to prevent transient scrape gaps during Reset() + repopulate. Marked as good first issue per @xrstf suggestion.

phaseCounts[phase] = make(map[string]int)
}
phaseCounts[phase][fp.Namespace]++

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CacheServer objects are listed here but recordConditionStatuses is not called. Since ConditionStatus.Reset() was already called at the top of updateObjectCounts, CacheServer condition metrics will be absent from every scrape. Could add:

recordConditionStatuses(CacheServerResourceType, cs.Name, cs.Namespace, cs.Status.Conditions)

...inside the loop here to bring CacheServer in line with the other resource types.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CacheServers currently have no conditions in their status subresource.

Copy link
Copy Markdown
Contributor

@mjudeikis mjudeikis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@kcp-ci-bot kcp-ci-bot added the lgtm Indicates that a PR is ready to be merged. label Feb 25, 2026
@kcp-ci-bot
Copy link
Copy Markdown
Contributor

LGTM label has been added.

DetailsGit tree hash: 5d9fd6d3f73468a50c13c268577979f2f350a445

@kcp-ci-bot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mjudeikis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kcp-ci-bot kcp-ci-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 25, 2026
@kcp-ci-bot kcp-ci-bot merged commit af6faf0 into kcp-dev:main Feb 25, 2026
12 checks passed
@xrstf xrstf deleted the fix-metrics branch February 25, 2026 13:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has signed the DCO. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants