Fix metrics by xrstf · Pull Request #159 · kcp-dev/kcp-operator

xrstf · 2026-02-23T13:40:08Z

Summary

After every reconciliation, we call RecordObjectMetrics, which would set the *Count metric for the given (namespace,phase) combination to 1. So after reconciling 5 objects, the metric is 1, not 5.
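To make the symptom concrete, here is a minimal sketch of the old behaviour. It does not use the operator's real code: `gauge`, `labels` and `recordPerReconcile` are hypothetical stand-ins for a labelled gauge and for what RecordObjectMetrics did to the *Count metric, written against the standard library only.

```go
package main

import "fmt"

// labels identifies one series of a hypothetical *Count gauge.
type labels struct {
	Namespace string
	Phase     string
}

// gauge is a stand-in for a labelled gauge: Set overwrites, it never adds.
type gauge map[labels]float64

func (g gauge) Set(l labels, v float64) { g[l] = v }

// recordPerReconcile mimics the removed behaviour (an assumption based on the
// PR description): every reconciliation sets the series to 1.
func recordPerReconcile(g gauge, namespace, phase string) {
	g.Set(labels{namespace, phase}, 1)
}

// countAfterReconciles reconciles n distinct objects that share one
// (namespace, phase) pair and returns the resulting gauge value.
func countAfterReconciles(n int) float64 {
	g := gauge{}
	for i := 0; i < n; i++ {
		recordPerReconcile(g, "default", "Running")
	}
	return g[labels{"default", "Running"}]
}

func main() {
	fmt.Println(countAfterReconciles(5)) // prints 1, although 5 objects were reconciled
}
```

Because a gauge's Set overwrites the previous value, the number of reconciled objects never shows up in the result.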

At the same time, the MetricsCollector regularly updates the *Count metrics with the correct values. However, it races with the reconcilers, and even if locking were used, I do not understand why the reconcilers would reset the *Count metrics to 1 on every run.

Hence this PR removes the updating of the *Count metrics in each reconciliation, and makes the code a bit more readable.

I then also went ahead and removed the last bits of the RecordObjectMetrics function because when an object gets deleted, the metrics would remain forever (well, until the operator is restarted). The MetricsCollector has the right approach (fetch all objects on a schedule, update all the counts at once), so I simply moved the status handling over there. It already has a list of all objects with their full bodies, so why not.
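The scheduled approach can be sketched roughly as follows. This is an illustration under assumptions, not the collector's actual code: the `object` type and `countByPhase` are hypothetical, and only the `phaseCounts[phase][namespace]++` pattern is taken from the diff in this PR. Since every pass rebuilds the counts from the current list of objects, deleted objects drop out on the next pass.

```go
package main

import "fmt"

// object is a hypothetical stand-in for a listed resource with a status phase.
type object struct {
	Namespace string
	Phase     string
}

// countByPhase tallies objects per phase and namespace from a full listing.
// Nothing is carried over from earlier passes, so stale entries cannot survive.
func countByPhase(objs []object) map[string]map[string]int {
	phaseCounts := make(map[string]map[string]int)
	for _, o := range objs {
		if phaseCounts[o.Phase] == nil {
			phaseCounts[o.Phase] = make(map[string]int)
		}
		phaseCounts[o.Phase][o.Namespace]++
	}
	return phaseCounts
}

func main() {
	// First pass: three objects exist.
	before := countByPhase([]object{
		{"team-a", "Running"},
		{"team-a", "Running"},
		{"team-b", "Failed"},
	})
	// Second pass: two of them have been deleted in the meantime.
	after := countByPhase([]object{
		{"team-a", "Running"},
	})
	fmt.Println(before["Running"]["team-a"], after["Running"]["team-a"], len(after["Failed"])) // prints 2 1 0
}
```

In the real collector the computed counts would then be written to the gauges after a reset; that step is left out here because it depends on the metrics library.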

What Type of PR Is This?

/kind bug

Release Notes

Fix stale and incorrect metric values.

On-behalf-of: @SAP christoph.mewes@sap.com

mjudeikis-bot

The root cause analysis is spot on — Set(1) per reconcile will always write 1, not accumulate. Consolidating into the MetricsCollector is the right fix. Overall LGTM with a couple of observations below.

One gap worth addressing: updateCacheServerCounts has all CacheServer objects available but doesn't call recordConditionStatuses, so ConditionStatus will never be populated for CacheServer objects. CacheServer's reconciler also never called RecordObjectMetrics (pre-existing gap), so this isn't a regression — but since this PR is cleaning up metrics comprehensively, it'd be nice to add it here rather than leave it as a follow-up.

mjudeikis-bot · 2026-02-23T13:54:10Z

internal/metrics/collector.go

@@ -54,7 +55,16 @@ func (mc *MetricsCollector) Start(ctx context.Context) {
 	}


Minor: ConditionStatus.Reset() followed immediately by re-population creates a brief window during a Prometheus scrape where condition metrics temporarily disappear. For a 30s-interval collector this is usually acceptable, but operators with high scrape frequency + alerting rules on ConditionStatus could see transient alert fires. Just worth noting as a known tradeoff — the alternative (build a map first, then atomically Reset()+repopulate) is significantly more complex for minimal real-world gain.

Ideally the collector should just be guarded by a lock, but that's beyond the scope of this PR. It would make a good first issue, though, so feel free to create a new ticket for it.

@mjudeikis-bot create follow-up issue please

Done — created #160 to track adding a sync.RWMutex to MetricsCollector to prevent transient scrape gaps during Reset() + repopulate. Marked as good first issue per @xrstf suggestion.
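For whoever picks up the follow-up, the general idea of avoiding the gap could look something like the sketch below. It is only an illustration with standard library types: `store`, `conditionKey`, `Replace` and `Snapshot` are hypothetical names, the label values are made up, and how this would be wired into the Prometheus client (for example via a custom collector) is deliberately not shown. The writer builds the complete next set of values first and swaps it in under the write lock, so a reader holding the read lock sees either the old or the new set, never an empty one.

```go
package main

import (
	"fmt"
	"sync"
)

// conditionKey is a hypothetical label set for one condition series.
type conditionKey struct {
	Resource, Name, Namespace, Condition string
}

// store holds the current metric values behind an RWMutex.
type store struct {
	mu     sync.RWMutex
	values map[conditionKey]float64
}

// Replace swaps in a fully built set of values in one locked step,
// instead of resetting first and repopulating afterwards.
func (s *store) Replace(next map[conditionKey]float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.values = next
}

// Snapshot returns a copy of the current values, as a scrape would read them.
func (s *store) Snapshot() map[conditionKey]float64 {
	s.mu.RLock()
	defer s.mu.RUnlock()
	out := make(map[conditionKey]float64, len(s.values))
	for k, v := range s.values {
		out[k] = v
	}
	return out
}

// neverEmpty runs a writer and a reader concurrently and reports whether
// every snapshot contained exactly the one expected series.
func neverEmpty(rounds int) bool {
	key := conditionKey{"FrontProxy", "fp", "default", "Available"}
	s := &store{values: map[conditionKey]float64{key: 1}}

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for i := 0; i < rounds; i++ {
			s.Replace(map[conditionKey]float64{key: 1})
		}
	}()

	ok := true
	for i := 0; i < rounds; i++ {
		if len(s.Snapshot()) != 1 {
			ok = false
		}
	}
	wg.Wait()
	return ok
}

func main() {
	fmt.Println(neverEmpty(1000)) // prints true
}
```

The trade-off is that the full set of values is held in memory twice for a short time during each update.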

mjudeikis-bot · 2026-02-23T13:54:11Z

internal/metrics/collector.go

 			phaseCounts[phase] = make(map[string]int)
 		}
 		phaseCounts[phase][fp.Namespace]++
+


CacheServer objects are listed here but recordConditionStatuses is not called. Since ConditionStatus.Reset() was already called at the top of updateObjectCounts, CacheServer condition metrics will be absent from every scrape. Could add:

recordConditionStatuses(CacheServerResourceType, cs.Name, cs.Namespace, cs.Status.Conditions)

...inside the loop here to bring CacheServer in line with the other resource types.

CacheServers currently have no conditions in their status subresource.

mjudeikis

/lgtm
/approve

kcp-ci-bot · 2026-02-25T09:18:24Z

LGTM label has been added.

Git tree hash: 5d9fd6d3f73468a50c13c268577979f2f350a445

kcp-ci-bot · 2026-02-25T09:18:24Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mjudeikis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [mjudeikis]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

xrstf added 2 commits February 23, 2026 14:23

fix race condition in metrics

40fc929

On-behalf-of: @SAP christoph.mewes@sap.com

fix condition metrics never expiring when objects get deleted

a75ffc9

On-behalf-of: @SAP christoph.mewes@sap.com

mjudeikis-bot reviewed Feb 23, 2026

mjudeikis-bot mentioned this pull request Feb 23, 2026

metrics: guard MetricsCollector with a mutex to prevent transient scrape gaps #160

Open

mjudeikis approved these changes Feb 25, 2026

kcp-ci-bot assigned mjudeikis Feb 25, 2026

kcp-ci-bot added the lgtm Indicates that a PR is ready to be merged. label Feb 25, 2026

kcp-ci-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 25, 2026

kcp-ci-bot merged commit af6faf0 into kcp-dev:main Feb 25, 2026
12 checks passed

xrstf deleted the fix-metrics branch February 25, 2026 13:03

xrstf added this to Platform Mesh & Kube Projects - Backlog Feb 25, 2026

xrstf moved this to Done in Platform Mesh & Kube Projects - Backlog Feb 25, 2026

ghdrope mentioned this pull request Mar 28, 2026

160 guard metrics collector with mutex #186

Open

kcp-ci-bot merged 2 commits into kcp-dev:main from xrstf:fix-metrics
