Interleaved metrics & duplicate HELP/TYPE for ra metrics in Prometheus #15600
Description
Describe the bug
Hi RabbitMQ Team,
In RabbitMQ 4.2.4, I see Khepri emitting metrics such as these two:
rabbitmq_detailed_raft_commit_index{module="rabbit_khepri",ra_system="coordination"} 100.0
rabbitmq_raft_commit_latency_seconds{module="rabbit_khepri",ra_system="coordination"} 0.002
The first one comes from a call to /detailed, the second from a call to /per-object.
There is a crucial difference between them (which I could not trace in the code): /detailed returns all samples of a metric family in a single block, while /per-object interleaves them.
The Prometheus (and, in turn, OpenMetrics) specs say:
From Prometheus:
Only one TYPE line may exist for a given metric name.
Only one HELP line may exist for any given metric name.
From OpenMetrics:
There MUST NOT be more than one of each type of metadata line for a MetricFamily.
Metrics MUST NOT be interleaved.
Both issues could likely be fixed in one sweep, as I think the duplicated HELP/TYPE lines are a consequence of the metrics being interleaved in the /per-object endpoint.
The metrics at issue are: rabbitmq_raft_commit_index, rabbitmq_raft_commit_latency_seconds, rabbitmq_raft_last_applied, rabbitmq_raft_last_written_index, rabbitmq_raft_num_segments, rabbitmq_raft_snapshot_index, rabbitmq_raft_term, which are coincidentally (?) the same as #9665
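The duplicate metadata lines can be spotted directly in a saved scrape; a minimal sketch (metrics.txt is a hypothetical file holding the /per-object response):

```shell
# Flag any metric family whose "# TYPE" line appears more than once in a
# scrape -- the Prometheus exposition format allows at most one per family.
# After `uniq -c`, the family name is the 4th field of the counted line.
grep '^# TYPE' metrics.txt | sort | uniq -c | awk '$1 > 1 {print $4}'
```

Against the /per-object output this prints the affected family names, one per line.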
Reproduction steps
- Install a RabbitMQ cluster
- Run locally:
curl https://127.0.0.1:15691/metrics/per-object?family=queue_metrics\&family=ra_metrics --insecure 2>/dev/null | grep TYPE | sort | uniq -c | sort | tail -10
- Notice the 6 different metrics with a duplicate TYPE (and, in turn, HELP)
Expected behavior
No TYPE/HELP should be duplicated, and metrics should not be interleaved
Additional context
Each "set" also contains metrics that are themselves not duplicated, so this issue is distinct from the one mentioned above:
curl https://127.0.0.1:15691/metrics/per-object?family=queue_metrics\&family=ra_metrics --insecure 2>/dev/null|grep -n rabbitmq_raft_commit_index
507:# TYPE rabbitmq_raft_commit_index counter
508:# HELP rabbitmq_raft_commit_index The current commit index.
509:rabbitmq_raft_commit_index{module="rabbit_khepri",ra_system="coordination"} 126.0
667:# TYPE rabbitmq_raft_commit_index counter
668:# HELP rabbitmq_raft_commit_index The current commit index.
669:rabbitmq_raft_commit_index{queue="quorum-1",vhost="/"} 5897.0
The correct behaviour, from /detailed:
curl https://127.0.0.1:15691/metrics/detailed?family=queue_metrics\&family=ra_metrics --insecure 2>/dev/null|grep -n rabbitmq_detailed_raft_commit_index
56:# TYPE rabbitmq_detailed_raft_commit_index counter
57:# HELP rabbitmq_detailed_raft_commit_index The current commit index.
58:rabbitmq_detailed_raft_commit_index{module="rabbit_khepri",ra_system="coordination"} 126.0
59:rabbitmq_detailed_raft_commit_index{queue="quorum-1",vhost="/"} 6947.0
As far as I can tell, return_per_object_metrics has no impact on the issue whatsoever; the rest of the configuration is fairly innocuous - user, clustering, and TLS settings.
Note: Prometheus itself does not complain on ingestion, despite the output being out of spec, but Telegraf, on the other hand, does, for example:
decoding response failed: text format parsing error in line 704: second TYPE line for metric name "rabbitmq_raft_commit_index", or TYPE reported after samples
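Telegraf's failure mode (a TYPE line appearing after samples for the same family) can be checked for with a small script; a sketch under the same assumption that metrics.txt holds the saved scrape:

```shell
# Print a warning for any "# TYPE" line that appears after samples for the
# same family were already emitted -- the condition Telegraf rejects.
# On a TYPE line the family name is the 3rd field; on a sample line it is
# everything before the first "{" or space.
awk '/^# TYPE/ { if (seen[$3]) print "late TYPE for " $3; next }
     /^[a-z]/  { split($0, f, /[{ ]/); seen[f[1]] = 1 }' metrics.txt
```

Run against the /per-object output, this flags the same families that Telegraf's parser errors on.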
Narrowing it down, the issue is only observed in RabbitMQ 4.2.x (I have only tested 4.2.4; I may be able to test others later, but I would assume 4.2.0+ is affected). 4.1 is unaffected (tested with Khepri enabled), as those metrics do not exist there and the closest one, rabbitmq_raft_log_commit_index, exports happily.