Skip to content

feat: decouple metrics scraping from metrics computation#8003

Merged
leonardoce merged 25 commits intomainfrom
dev/jsilvela-metrics
Oct 28, 2025
Merged

feat: decouple metrics scraping from metrics computation#8003
leonardoce merged 25 commits intomainfrom
dev/jsilvela-metrics

Conversation

@jsilvela
Copy link
Collaborator

@jsilvela jsilvela commented Jul 9, 2025

Closes #4962

@jsilvela jsilvela requested a review from a team as a code owner July 9, 2025 15:24
@jsilvela jsilvela marked this pull request as draft July 9, 2025 15:25
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Jul 9, 2025
@cnpg-bot cnpg-bot added backport-requested ◀️ This pull request should be backported to all supported releases release-1.22 release-1.25 release-1.26 labels Jul 9, 2025
@github-actions
Copy link
Contributor

github-actions bot commented Jul 9, 2025

❗ By default, the pull request is configured to backport to all release branches.

  • To stop backporting this pr, remove the label: backport-requested ◀️ or add the label 'do not backport'
  • To stop backporting this pr to a certain release branch, remove the specific branch label: release-x.y

@dosubot dosubot bot added the enhancement 🪄 New feature or request label Jul 9, 2025
@jsilvela jsilvela force-pushed the dev/jsilvela-metrics branch 2 times, most recently from 7f9da77 to 196f54f Compare July 10, 2025 13:49
@jsilvela
Copy link
Collaborator Author

/test depth=push test_level=4 feature_type=basic

@github-actions
Copy link
Contributor

@jsilvela, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/16196975122

@jsilvela jsilvela marked this pull request as ready for review July 10, 2025 13:52
for name, userQuery := range q.userQueries {
queryLogger := log.WithValues("query", name)
collector := QueryCollector{
queryRunner := QueryRunner{
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, way too many things in the monitoring code are called a collector.
Computing the next value of a metric is not "collecting".
If I get the OK from maintainers, would like to do a lot of renaming. Perhaps a separate PR would be better though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100% agree on the changes; the metrics code was hard to follow. I'd support it on this PR or a separate one... My gut is that a separate PR might be easier to review [rename only, no logic changes].

Copy link
Collaborator Author

@jsilvela jsilvela Jul 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My gut is that a separate PR might be easier to review [rename only, no logic changes].

Yep, my thoughts too

@cnpg-bot cnpg-bot added the ok to merge 👌 This PR can be merged label Jul 10, 2025
@jsilvela
Copy link
Collaborator Author

/test depth=push test_level=4 feature_type=observability

@github-actions
Copy link
Contributor

@jsilvela, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/16197557109

@jsilvela
Copy link
Collaborator Author

Gah, metrics E2E failing. Manually taking out the OK-to-merge label (from smoke tests)

@jsilvela jsilvela removed the ok to merge 👌 This PR can be merged label Jul 10, 2025
Copy link
Contributor

@jmealo jmealo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant to leave this as a PR review not a comment. Ooops.

for name, userQuery := range q.userQueries {
queryLogger := log.WithValues("query", name)
collector := QueryCollector{
queryRunner := QueryRunner{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100% agree on the changes; the metrics code was hard to follow. I'd support it on this PR or a separate one... My gut is that a separate PR might be easier to review [rename only, no logic changes].

Copy link
Contributor

@jmealo jmealo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall!
When I was troubleshooting some issues with CNPG, I received help from some Gophers in the Go slack who hinted that the project may have some race conditions in patterns used throughout. This was months ago and I don't remember the specifics. All of th Go I've done is just trying to fix bugs in CNPG; so I could be way off-base here.

Possible Race Condition

I think there may be a race condition between Update() and Collect(). The Update() method writes to computedMetrics while Collect() reads
from it. It looks like Update() is called from a ticker goroutine (every 30s) and Collect() is called by Prometheus HTTP handler (meaning they can execute concurrently without synchronization)? 🤔

I think something like this could work?

  type QueriesCollector struct {
      // ... existing fields ...
      computedMetrics []prometheus.Metric
      mu              sync.RWMutex  // Add this
  }

Then use q.mu.Lock() in Update() and q.mu.RLock() in Collect().

Alternatively:
Instead of a mutex, we could possibly use an atomic pointer to swap the entire metrics slice atomically? (I would lean towards this)

  type QueriesCollector struct {
      // ... existing fields ...
      computedMetrics atomic.Pointer[[]prometheus.Metric]
  }

This would eliminate lock contention between Update and Collect operations.

Error handling

Since the user queries are user supplied, would it make sense to add more context to errors in createMetricsFromUserQueries()? For example: when instance.IsPrimary() fails, it might be helpful to wrap the error with more context about what operation was being attempted.

@jsilvela
Copy link
Collaborator Author

@jmealo you're right about the race condition. While it wouldn't lead to corruption/destruction, we could call Collect at a point where Update was in progress, and we could lose metrics during collection.
The mutex is right, I think

@jsilvela
Copy link
Collaborator Author

@jmealo thinking about the lock more, I think we might not need it. Instead of clearing the computedMetrics field at the beginning of Update, and progressively appending to it, we can build a slice with everything, and then do q.computedMetrics = generatedMetrics in one fell swoop.

@jmealo
Copy link
Contributor

jmealo commented Jul 10, 2025

@jmealo thinking about the lock more, I think we might not need it. Instead of clearing the computedMetrics field at the beginning of Update, and progressively appending to it, we can build a slice with everything, and then do q.computedMetrics = generatedMetrics in one fell swoop.

This looks right to me. 👍

@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Jul 11, 2025
Signed-off-by: Jaime Silvela <jaime.silvela@mailfence.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
@leonardoce leonardoce force-pushed the dev/jsilvela-metrics branch from 3f9cabf to 23ccdb9 Compare October 28, 2025 08:34
jsilvela and others added 17 commits October 28, 2025 11:04
Co-authored-by: Jeff Mealo <jmealo@protonmail.com>
Signed-off-by: Jaime Silvela <jaime.silvela@mailfence.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Jaime Silvela <jaime.silvela@mailfence.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Jaime Silvela <jaime.silvela@mailfence.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Jaime Silvela <jaime.silvela@mailfence.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Jaime Silvela <jaime.silvela@mailfence.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Jaime Silvela <jaime.silvela@mailfence.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Jaime Silvela <jaime.silvela@mailfence.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Jaime Silvela <jaime.silvela@mailfence.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Jaime Silvela <jaime.silvela@mailfence.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
@leonardoce leonardoce force-pushed the dev/jsilvela-metrics branch from 23ccdb9 to cb0436c Compare October 28, 2025 10:05
Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
@leonardoce
Copy link
Contributor

/ok-to-merge

@leonardoce leonardoce added do not backport This PR must not be backported - it will be in the next minor release and removed backport-requested ◀️ This pull request should be backported to all supported releases release-1.25 release-1.26 release-1.27 labels Oct 28, 2025
@leonardoce leonardoce merged commit fde2400 into main Oct 28, 2025
44 checks passed
@leonardoce leonardoce deleted the dev/jsilvela-metrics branch October 28, 2025 10:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do not backport This PR must not be backported - it will be in the next minor release enhancement 🪄 New feature or request lgtm This PR has been approved by a maintainer ok to merge 👌 This PR can be merged size:XL This PR changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: decouple metrics scraping from metrics recomputation

7 participants