Interleaved metrics & duplicate HELP/TYPE for ra metrics in Prometheus #15600

@elo-magnier-7s

Describe the bug

Hi RabbitMQ Team,

In RabbitMQ 4.2.4, I have Khepri emitting metrics like these:
rabbitmq_detailed_raft_commit_index{module="rabbit_khepri",ra_system="coordination"} 100.0
rabbitmq_raft_commit_latency_seconds{module="rabbit_khepri",ra_system="coordination"} 0.002

The first one is from a call to /detailed, the second one from a call to /per-object.

There is a crucial difference between them (whose cause I could not find in the code): /detailed returns all samples of a metric in a single block, while /per-object interleaves them with other metrics.

The Prometheus (and in turn OpenMetrics) specs say:
From Prometheus:

Only one TYPE line may exist for a given metric name.
Only one HELP line may exist for any given metric name.

From OpenMetrics:

There MUST NOT be more than one of each type of metadata line for a MetricFamily.
Metrics MUST NOT be interleaved.

Both might be fixable in one go, as I think the duplicates happen because the metrics are interleaved in the /per-object endpoint.
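
To make the two rules quoted above concrete, here is a minimal Python sketch that scans an exposition payload for repeated TYPE/HELP lines and for families that reappear after another family has started. It is not taken from RabbitMQ, Prometheus, or Telegraf. The function name `check_exposition` and the `example_other_metric` family are made up for illustration, and the sketch treats the metric name as the family name, so histogram and summary suffixes such as `_bucket` or `_sum` would need extra handling.

```python
import re

# "# TYPE <name> <type>" or "# HELP <name> <text>"
META_RE = re.compile(r"^#\s+(TYPE|HELP)\s+(\S+)")
# Metric name at the start of a sample line (simplification: name == family).
SAMPLE_RE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)")

# First and last blocks mirror the /per-object output in this report;
# example_other_metric is a hypothetical family placed between them.
EXAMPLE = """\
# TYPE rabbitmq_raft_commit_index counter
# HELP rabbitmq_raft_commit_index The current commit index.
rabbitmq_raft_commit_index{module="rabbit_khepri",ra_system="coordination"} 126.0
# TYPE example_other_metric gauge
example_other_metric 1.0
# TYPE rabbitmq_raft_commit_index counter
# HELP rabbitmq_raft_commit_index The current commit index.
rabbitmq_raft_commit_index{queue="quorum-1",vhost="/"} 5897.0
"""


def check_exposition(text):
    """Return a list of problems: duplicate metadata and interleaved families."""
    problems = []
    seen_meta = set()  # (kind, name) pairs already encountered
    closed = set()     # families whose block has already ended
    current = None     # family whose block is currently open
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        meta = META_RE.match(line)
        if meta:
            kind, name = meta.groups()
            if (kind, name) in seen_meta:
                problems.append(f"line {lineno}: duplicate {kind} for {name}")
            seen_meta.add((kind, name))
        elif line.startswith("#"):
            continue  # other comment lines carry no family information
        else:
            sample = SAMPLE_RE.match(line)
            if not sample:
                continue
            name = sample.group(1)
        if name != current:
            # A family that reopens after being closed is interleaved.
            if name in closed:
                problems.append(f"line {lineno}: family {name} is interleaved")
            if current is not None:
                closed.add(current)
            current = name
    return problems


if __name__ == "__main__":
    for problem in check_exposition(EXAMPLE):
        print(problem)
```

Run against a saved copy of the /per-object payload, a checker along these lines should flag the same families that a strict parser rejects.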

The metrics at issue are: rabbitmq_raft_commit_index, rabbitmq_raft_commit_latency_seconds, rabbitmq_raft_last_applied, rabbitmq_raft_last_written_index, rabbitmq_raft_num_segments, rabbitmq_raft_snapshot_index, rabbitmq_raft_term, which are coincidentally (?) the same as in #9665.

Reproduction steps

  1. Install a RabbitMQ cluster
  2. Run the following locally:
curl https://127.0.0.1:15691/metrics/per-object?family=queue_metrics\&family=ra_metrics --insecure 2>/dev/null | grep TYPE | sort | uniq -c | sort | tail -10
  3. Notice the 6 different metrics with a duplicate TYPE (and in turn, HELP)

Expected behavior

No TYPE/HELP line should be duplicated, and metrics should not be interleaved.

Additional context

The samples present in each "set" are themselves not duplicated, only the TYPE/HELP lines are, so the issue is different from the above-mentioned #9665:

curl https://127.0.0.1:15691/metrics/per-object?family=queue_metrics\&family=ra_metrics --insecure 2>/dev/null|grep -n rabbitmq_raft_commit_index
507:# TYPE rabbitmq_raft_commit_index counter
508:# HELP rabbitmq_raft_commit_index The current commit index.
509:rabbitmq_raft_commit_index{module="rabbit_khepri",ra_system="coordination"} 126.0
667:# TYPE rabbitmq_raft_commit_index counter
668:# HELP rabbitmq_raft_commit_index The current commit index.
669:rabbitmq_raft_commit_index{queue="quorum-1",vhost="/"} 5897.0

The correct behaviour, from /detailed:

curl https://127.0.0.1:15691/metrics/detailed?family=queue_metrics\&family=ra_metrics --insecure 2>/dev/null|grep -n rabbitmq_detailed_raft_commit_index
56:# TYPE rabbitmq_detailed_raft_commit_index counter
57:# HELP rabbitmq_detailed_raft_commit_index The current commit index.
58:rabbitmq_detailed_raft_commit_index{module="rabbit_khepri",ra_system="coordination"} 126.0
59:rabbitmq_detailed_raft_commit_index{queue="quorum-1",vhost="/"} 6947.0
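
The /detailed output above shows the shape the /per-object output would have if samples of the same metric were merged under a single TYPE/HELP pair. As a rough illustration of that regrouping, here is a Python sketch of a client-side post-processing step. It is not how RabbitMQ builds its output, and `regroup` and `example_other_metric` are hypothetical names. It keeps the first TYPE and HELP line seen for each name, drops other comment lines, and treats the metric name as the family name, which is a simplification for histograms and summaries.

```python
import re

# "# TYPE <name> <type>" or "# HELP <name> <text>"
META_RE = re.compile(r"^#\s+(TYPE|HELP)\s+(\S+)")
# Metric name at the start of a sample line (simplification: name == family).
SAMPLE_RE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)")

# First and last blocks mirror the /per-object output in this report;
# example_other_metric is a hypothetical family placed between them.
EXAMPLE = """\
# TYPE rabbitmq_raft_commit_index counter
# HELP rabbitmq_raft_commit_index The current commit index.
rabbitmq_raft_commit_index{module="rabbit_khepri",ra_system="coordination"} 126.0
# TYPE example_other_metric gauge
example_other_metric 1.0
# TYPE rabbitmq_raft_commit_index counter
# HELP rabbitmq_raft_commit_index The current commit index.
rabbitmq_raft_commit_index{queue="quorum-1",vhost="/"} 5897.0
"""


def regroup(text):
    """Merge interleaved blocks so each name appears once with one TYPE/HELP."""
    meta = {}     # name -> {"TYPE": line, "HELP": line}; first occurrence wins
    samples = {}  # name -> sample lines in their original order
    order = []    # names in order of first appearance
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        found = META_RE.match(line)
        if found:
            kind, name = found.groups()
            meta.setdefault(name, {}).setdefault(kind, line)
        elif line.startswith("#"):
            continue  # other comments are dropped in this sketch
        else:
            sample = SAMPLE_RE.match(line)
            if not sample:
                continue
            name = sample.group(1)
            samples.setdefault(name, []).append(line)
        if name not in order:
            order.append(name)
    out = []
    for name in order:
        # Emit TYPE before HELP, matching the ordering seen in the output above.
        for kind in ("TYPE", "HELP"):
            if kind in meta.get(name, {}):
                out.append(meta[name][kind])
        out.extend(samples.get(name, []))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(regroup(EXAMPLE), end="")
```

With the example input, both rabbitmq_raft_commit_index samples end up under one TYPE/HELP pair, as in the /detailed output.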

As far as I can tell, return_per_object_metrics does not impact the issue whatsoever. The rest of the configuration is pretty innocuous: user, clustering, and TLS config.

Note: Prometheus itself doesn't complain on ingestion, despite the output being out of spec, but Telegraf does, for example:

decoding response failed: text format parsing error in line 704: second TYPE line for metric name "rabbitmq_raft_commit_index", or TYPE reported after samples

Narrowing it down, this is only observed in RabbitMQ 4.2.x (I've only tested 4.2.4; I might be able to test others later, but I would assume 4.2.0+). 4.1 is unaffected (tested with Khepri enabled), as those metrics don't exist there, and the closest one, rabbitmq_raft_log_commit_index, is exported without issue.
