feat: decouple metrics scraping from metrics computation #8003
leonardoce merged 25 commits into main from
Conversation
❗ By default, the pull request is configured to backport to all release branches.
7f9da77 to 196f54f
/test depth=push test_level=4 feature_type=basic

@jsilvela, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/16196975122
```diff
 for name, userQuery := range q.userQueries {
 	queryLogger := log.WithValues("query", name)
-	collector := QueryCollector{
+	queryRunner := QueryRunner{
```
OK, way too many things in the monitoring code are called a collector.
Computing the next value of a metric is not "collecting".
If I get the OK from maintainers, I would like to do a lot of renaming. Perhaps a separate PR would be better, though.
100% agree on the changes; the metrics code was hard to follow. I'd support it on this PR or a separate one... My gut is that a separate PR might be easier to review [rename only, no logic changes].
> My gut is that a separate PR might be easier to review [rename only, no logic changes].
Yep, my thoughts too
/test depth=push test_level=4 feature_type=observability

@jsilvela, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/16197557109
Gah, metrics E2E failing. Manually taking out the OK-to-merge label (from smoke tests)
jmealo left a comment
Looks good overall!
When I was troubleshooting some issues with CNPG, I received help from some Gophers in the Go Slack who hinted that the project may have some race conditions in patterns used throughout. This was months ago and I don't remember the specifics. All of the Go I've done is just trying to fix bugs in CNPG, so I could be way off-base here.
Possible Race Condition
I think there may be a race condition between Update() and Collect(). The Update() method writes to computedMetrics while Collect() reads from it. It looks like Update() is called from a ticker goroutine (every 30s) and Collect() is called by the Prometheus HTTP handler, meaning they can execute concurrently without synchronization? 🤔
I think something like this could work?
```go
type QueriesCollector struct {
	// ... existing fields ...
	computedMetrics []prometheus.Metric
	mu              sync.RWMutex // Add this
}
```

Then use `q.mu.Lock()` in `Update()` and `q.mu.RLock()` in `Collect()`.
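To make the locking discipline concrete, here is a minimal, self-contained sketch of the RWMutex approach. It uses a plain `metric` struct as a stand-in for `prometheus.Metric`, and the `queriesCollector` type here is a simplified hypothetical, not the PR's actual implementation; run it with `go run -race` to check for data races.

```go
package main

import (
	"fmt"
	"sync"
)

// metric is a simplified stand-in for prometheus.Metric.
type metric struct {
	name  string
	value float64
}

// queriesCollector mirrors the shape discussed above: Update writes
// computedMetrics, Collect reads it, and the RWMutex keeps the two
// from racing.
type queriesCollector struct {
	mu              sync.RWMutex
	computedMetrics []metric
}

// Update replaces the computed metrics under the write lock
// (in the PR this would run in the ticker goroutine).
func (q *queriesCollector) Update(ms []metric) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.computedMetrics = ms
}

// Collect copies out the metrics under the read lock
// (in the PR this would run in the Prometheus HTTP handler).
func (q *queriesCollector) Collect() []metric {
	q.mu.RLock()
	defer q.mu.RUnlock()
	out := make([]metric, len(q.computedMetrics))
	copy(out, q.computedMetrics)
	return out
}

func main() {
	q := &queriesCollector{}
	var wg sync.WaitGroup
	// Hammer Update and Collect concurrently, as the ticker and
	// the HTTP handler would.
	for i := 0; i < 100; i++ {
		wg.Add(2)
		go func(i int) {
			defer wg.Done()
			q.Update([]metric{{name: "pg_up", value: float64(i)}})
		}(i)
		go func() {
			defer wg.Done()
			_ = q.Collect()
		}()
	}
	wg.Wait()
	fmt.Println(len(q.Collect()))
}
```

Collect returns a copy so callers never hold a reference into the slice that the next Update replaces.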
Alternatively:
Instead of a mutex, we could possibly use an atomic pointer to swap the entire metrics slice atomically? (I would lean towards this)
```go
type QueriesCollector struct {
	// ... existing fields ...
	computedMetrics atomic.Pointer[[]prometheus.Metric]
}
```

This would eliminate lock contention between `Update` and `Collect` operations.
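A minimal sketch of the atomic-pointer variant, again with a plain `metric` struct standing in for `prometheus.Metric` (requires Go 1.19+ for the generic `atomic.Pointer`; the type and method bodies are illustrative, not the PR's code):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// metric is a simplified stand-in for prometheus.Metric.
type metric struct {
	name  string
	value float64
}

// queriesCollector publishes each new snapshot with a single atomic
// pointer swap instead of a mutex.
type queriesCollector struct {
	computedMetrics atomic.Pointer[[]metric]
}

// Update builds a fresh slice and stores it atomically; readers can
// never observe a partially written snapshot.
func (q *queriesCollector) Update(ms []metric) {
	q.computedMetrics.Store(&ms)
}

// Collect loads whichever snapshot was most recently published.
func (q *queriesCollector) Collect() []metric {
	p := q.computedMetrics.Load()
	if p == nil {
		return nil
	}
	return *p
}

func main() {
	q := &queriesCollector{}
	q.Update([]metric{{name: "pg_up", value: 1}})
	fmt.Println(len(q.Collect()))
}
```

The key constraint is that Update must never mutate a slice it has already published; it always builds a new one and swaps the pointer.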
Error handling
Since the user queries are user-supplied, would it make sense to add more context to errors in createMetricsFromUserQueries()? For example: when instance.IsPrimary() fails, it might be helpful to wrap the error with more context about what operation was being attempted.
@jmealo you're right about the race condition. While it wouldn't lead to corruption/destruction, we could call Collect at a point where Update was in progress, and we could lose metrics during collection.

@jmealo thinking about the lock more, I think we might not need it. Instead of clearing the

This looks right to me. 👍
Signed-off-by: Jaime Silvela <jaime.silvela@mailfence.com> Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
3f9cabf to 23ccdb9
Co-authored-by: Jeff Mealo <jmealo@protonmail.com> Signed-off-by: Jaime Silvela <jaime.silvela@mailfence.com> Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
23ccdb9 to cb0436c
Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
/ok-to-merge
Closes #4962