Per-object Prometheus metrics: avoid duplicate HELP, TYPE metadata lines by michaelklishin · Pull Request #15610 · rabbitmq/rabbitmq-server

michaelklishin · 2026-03-02T18:28:16Z

I am not sure this is the optimal approach, but it is the best place and manner of addressing this that I could find without affecting aggregated metrics.

Raft metrics can and do come from different Ra systems, namely Khepri and quorum queues. We need to format them as a "single" metric to avoid duplicate HELP, TYPE metadata lines.

Since quorum queues have dozens of metrics, we filter out a set of Raft-related ones specifically that combine well with the Raft metrics from Khepri.

Closes #15600.
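
A minimal sketch of the idea described above, written in Python rather than the plugin's Erlang and not taken from the actual collector: samples that share a metric name are grouped first, whichever source produced them, so the TYPE and HELP lines are written once per name. The function name `render_families` and the tuple layout are assumptions made for illustration only.

```python
from collections import OrderedDict


def render_families(samples):
    """Render (name, help, type, labels, value) tuples as Prometheus text.

    Samples sharing a metric name are grouped so that the TYPE and HELP
    lines are written once per name, regardless of which source they came from.
    """
    families = OrderedDict()
    for name, help_text, metric_type, labels, value in samples:
        family = families.setdefault(
            name, {"help": help_text, "type": metric_type, "rows": []}
        )
        family["rows"].append((labels, value))

    lines = []
    for name, family in families.items():
        lines.append(f"# TYPE {name} {family['type']}")
        lines.append(f"# HELP {name} {family['help']}")
        for labels, value in family["rows"]:
            label_text = ",".join(f'{k}="{v}"' for k, v in labels.items())
            lines.append(f"{name}{{{label_text}}} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical input: one Khepri sample and one quorum queue sample.
    print(render_families([
        ("rabbitmq_raft_commit_index", "The current commit index.", "counter",
         {"module": "rabbit_khepri", "ra_system": "coordination"}, 46.0),
        ("rabbitmq_raft_commit_index", "The current commit index.", "counter",
         {"queue": "qq", "vhost": "/"}, 5.0),
    ]), end="")
```

Grouping before rendering keeps the label sets of the two sources untouched; only the metadata lines are shared.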

deadtrickster · 2026-03-04T09:43:27Z

there is also https://github.com/rabbitmq/khepri/blob/main/src/khepri_cluster.erl#L326 khepri ra system for khepri? its metrics are not merged IIUC?

does_belong_to_quorum_queue(#{queue := _}) -> true;
does_belong_to_quorum_queue(_) -> false.

will this work with other queue types?

Previously a missing metric was ignored.

mkuratczyk

made the test more strict and added the same validation to the aggregated endpoint to avoid similar problems, but the main fix looks good to me.

mkuratczyk · 2026-03-04T11:22:31Z

@elo-magnier-7s can you confirm this solves #15600 for you?

mkuratczyk · 2026-03-04T11:28:14Z

khepri ra system for khepri?

Khepri metrics are present. As part of RabbitMQ, khepri runs in the coordination Ra system.

will this work with other queue types?

Ra-based queue types other than QQs are not accounted for here. I assume they will either have their own collector or we'll make sure this collector handles them correctly as part of work on those queue types.

We've played with the idea of splitting Ra metrics per Ra system on the Ra level (currently they are all under that ra group). I think to keep some sanity here, we may need to do that. Otherwise the ra group is a mix of Khepri metrics, QQ metrics and potentially other queue type metrics. And based on the endpoint we may want to return all or some, aggregated or not, and so on. It's becoming hard to make sure that we correctly return everything we want, as the issue fixed by this PR shows.

deadtrickster · 2026-03-04T11:37:47Z

Khepri metrics are present. As part of RabbitMQ, khepri runs in the coordination Ra system.

streams also? I posted a link to the khepri, from that link it looks like it has its own system. What am I missing?

mkuratczyk · 2026-03-04T11:43:58Z

streams also?

Good point - I don't think any osiris counters are currently returned. Something to look into for sure, but not directly related to this issue.

I posted a link to the khepri, from that link it looks like it has its own system. What am I missing?

You linked to Khepri directly, which is developed as a standalone project. When embedded in RabbitMQ, the correct place to look is https://github.com/rabbitmq/rabbitmq-server/blob/main/deps/rabbit/src/rabbit_khepri.erl#L279

deadtrickster · 2026-03-04T12:00:32Z

I mean, is it OK, visible, and understood that both streams and Khepri use the coordination system? https://github.com/rabbitmq/rabbitmq-server/blob/main/deps/rabbit/src/rabbit_stream_coordinator.hrl#L7

Even if only from metrics PoV.

mkuratczyk · 2026-03-04T12:35:44Z

streams don't use the Ra system as such, only the stream coordinator does. So metrics of the coordination system cover Khepri and Stream Coordinator's activity.

deadtrickster · 2026-03-04T13:37:42Z

We've played with the idea of splitting Ra metrics per Ra system on the Ra level (currently they are all under that ra group). I think to keep some sanity here, we may need to do that.

like having a label system="system_name"?

mkuratczyk · 2026-03-04T13:54:57Z

We've played with the idea of splitting Ra metrics per Ra system on the Ra level (currently they are all under that ra group). I think to keep some sanity here, we may need to do that.

like having a label system="system_name"?

ra_system label is already present. What I meant is that currently anything that uses Ra will have its counters in the ra seshat group: https://github.com/rabbitmq/ra/blob/main/src/ra_counters.erl. Seshat maintains counters for each group in a dedicated ETS table. But we currently throw all counters into a single group and therefore into a single table, which could be large, because there can be many queues and each has a lot of metrics. And to return some per-cluster metrics (as opposed to per-object) we still need to scan through the large ETS table with lots of per-object metrics.

Anyway, there's room for improvement in terms of clarity and hopefully performance as well.
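
To make the cost described above concrete, here is a rough illustration that uses plain Python dictionaries as a stand-in for ETS tables. Nothing here is Seshat's or Ra's API: `per_system_view`, the key layout, and the `quorum_queues` group name are assumptions made for the example.

```python
def per_system_view(flat_table):
    """Split one flat counter table into one table per group.

    `flat_table` maps (group, object_id) to a dict of counters.
    """
    grouped = {}
    for (group, object_id), counters in flat_table.items():
        grouped.setdefault(group, {})[object_id] = counters
    return grouped


# A single table holding everything, as in the current single-group layout.
flat = {
    ("coordination", "rabbit_khepri"): {"commit_index": 46},
    ("coordination", "rabbit_stream_coordinator"): {"commit_index": 7},
    ("quorum_queues", "qq"): {"commit_index": 5},  # one entry per queue
}

# With a single table, answering a question about one group means
# visiting every entry, including all the per-queue ones.
scanned = {obj: c for (group, obj), c in flat.items() if group == "coordination"}

# With one table per group, the same answer is a direct lookup.
direct = per_system_view(flat)["coordination"]
assert scanned == direct
```

With many queues the scan grows with the number of per-object entries, while the per-group lookup does not.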

elo-magnier-7s · 2026-03-04T15:18:13Z

Hi everyone,
Just done with the testing: it works.

Before (make run-broker from `main`)

[root@ip-172-31-15-12 rabbitmq1]# curl http://127.0.0.1:15692/metrics/per-object?family=queue_metrics\&family=ra_metrics 2>/dev/null|grep -n rabbitmq_raft_commit_index
496:# TYPE rabbitmq_raft_commit_index counter
497:# HELP rabbitmq_raft_commit_index The current commit index.
498:rabbitmq_raft_commit_index{module="rabbit_khepri",ra_system="coordination"} 46.0
499:rabbitmq_raft_commit_index{module="rabbit_stream_coordinator",ra_system="coordination"} 7.0
703:# TYPE rabbitmq_raft_commit_index counter
704:# HELP rabbitmq_raft_commit_index The current commit index.
705:rabbitmq_raft_commit_index{queue="qq",vhost="/"} 5.0

[root@ip-172-31-15-12 rabbitmq1]# telegraf --config telegraf.conf
[…]
2026-03-04T13:44:14Z D! [agent] Starting service inputs
2026-03-04T13:44:20Z E! [inputs.prometheus] Error in plugin: error reading metrics for "http://127.0.0.1:15692/metrics/per-object?family=queue_metrics&family=ra_metrics": decoding response failed: text format parsing error in line 696: second TYPE line for metric name "rabbitmq_raft_commit_index", or TYPE reported after samples
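
The error above reflects a rule of the Prometheus text format: a metric name may have only one TYPE line, and it has to come before that metric's samples. A rough Python sketch of such a check follows. It is not Telegraf's parser, `find_metadata_violations` is a hypothetical helper, and it compares sample names exactly, so it ignores the suffixed series (`_bucket`, `_sum`, `_count`) that histograms and summaries produce.

```python
def find_metadata_violations(text):
    """Return (line_number, message) pairs for repeated or late metadata."""
    seen = {"TYPE": set(), "HELP": set()}
    sampled = set()  # metric names that already have at least one sample
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            parts = line.split(None, 3)
            if len(parts) < 3 or parts[1] not in seen:
                continue  # an ordinary comment, not TYPE or HELP metadata
            kind, name = parts[1], parts[2]
            if name in seen[kind]:
                problems.append((number, f"second {kind} line for {name}"))
            elif kind == "TYPE" and name in sampled:
                problems.append((number, f"TYPE for {name} reported after samples"))
            seen[kind].add(name)
            continue
        # A sample line: the metric name ends at '{' or at the first space.
        name = line.split("{", 1)[0].split(None, 1)[0]
        sampled.add(name)
    return problems
```

Run against the three-line-per-source output shown earlier in this comment, a check like this would flag the second TYPE and HELP lines for `rabbitmq_raft_commit_index`; line numbers are relative to the text passed in.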

After (make run-broker from `rabbitmq-server-15600`)

[root@ip-172-31-15-12 rabbitmq1]# curl http://127.0.0.1:15692/metrics/per-object?family=queue_metrics\&family=ra_metrics 2>/dev/null|grep -n rabbitmq_raft_commit_index
495:# TYPE rabbitmq_raft_commit_index counter
496:# HELP rabbitmq_raft_commit_index The current commit index.
497:rabbitmq_raft_commit_index{module="rabbit_khepri",ra_system="coordination"} 80.0
498:rabbitmq_raft_commit_index{module="rabbit_stream_coordinator",ra_system="coordination"} 13.0
499:rabbitmq_raft_commit_index{queue="qq",vhost="/"} 5.0

[root@ip-172-31-15-12 rabbitmq1]# telegraf --config telegraf.conf
[…]
2026-03-04T13:46:20Z D! [outputs.file] Wrote batch of 737 metrics in 17.799713ms

Probably unrelated

I encountered a weird issue when going from the main to the rabbitmq-server-15600 branch. Weirdly enough, it doesn't happen going the other way around, from rabbitmq-server-15600 back to main.
I see this at startup, then at use

2026-03-04 14:24:10.808198+00:00 [debug] <0.515.0> Repaired quorum queue 'qq1' in vhost '/' amqqueue record

2026-03-04 15:10:48.334188+00:00 [error] <0.2255.0>  {{bad_generator,#{'rabbit@ip-172-31-15-12' => <<"2F_QQ12BHWI68BI3YG">>}},
2026-03-04 15:10:48.334188+00:00 [error] <0.2255.0>  [{rabbit_quorum_queue,'-init/1-lc$^0/1-0-',2,
2026-03-04 15:10:48.334188+00:00 [error] <0.2255.0>                        [{file,"rabbit_quorum_queue.erl"},{line,228}]},
2026-03-04 15:10:48.334188+00:00 [error] <0.2255.0>   {rabbit_quorum_queue,init,1,[{file,"rabbit_quorum_queue.erl"},{line,228}]},
2026-03-04 15:10:48.334188+00:00 [error] <0.2255.0>   {rabbit_queue_type,get_ctx_with,3,
2026-03-04 15:10:48.334188+00:00 [error] <0.2255.0>                      [{file,"rabbit_queue_type.erl"},{line,775}]},
2026-03-04 15:10:48.334188+00:00 [error] <0.2255.0>   {rabbit_queue_type,consume,3,[{file,"rabbit_queue_type.erl"},{line,507}]},
2026-03-04 15:10:48.334188+00:00 [error] <0.2255.0>   {rabbit_channel,'-basic_consume/8-fun-0-',10,
2026-03-04 15:10:48.334188+00:00 [error] <0.2255.0>                   [{file,"rabbit_channel.erl"},{line,1718}]},
2026-03-04 15:10:48.334188+00:00 [error] <0.2255.0>   {rabbit_misc,with_exit_handler,2,[{file,"rabbit_misc.erl"},{line,469}]},
2026-03-04 15:10:48.334188+00:00 [error] <0.2255.0>   {rabbit_channel,basic_consume,8,[{file,"rabbit_channel.erl"},{line,1715}]},
2026-03-04 15:10:48.334188+00:00 [error] <0.2255.0>   {rabbit_channel,handle_cast,2,[{file,"rabbit_channel.erl"},{line,605}]}]}

Then my declared quorum queue crashes completely on use (while the stream is OK) and makes the management return 500 for the queues page until it is deleted. delete_queue doesn't work on it either; I had to force delete it through an eval. After that, everything is happy!
But I'm going to assume that's entirely unrelated and expected: I've gone back and forth between the versions a few times, and I've seen ra UId changes and PID not existing messages for that queue in the logs.

Conclusion

So unless someone thinks that last point could be related to changes in this branch, we're all good I believe, both per-object & detailed endpoints ingest correctly into Telegraf now.

Thanks everyone! :)

michaelklishin · 2026-03-04T17:57:28Z

@elo-magnier-7s main has a new feature flag that depends on rabbitmq_4.3.0 (track_qq_members_uids) while this branch does not.

A node started from main will store QQ members (replicas) as a map, e.g. #{nodes => #{'rabbit@ip-172-31-15-12' => <<"2F_QQ12BHWI68BI3YG">>}}.

So branch switching can run into "downgrading-like" scenarios.
For this and other reasons, during development you need to wipe out the data directory when switching branches, otherwise nodes sometimes can fail to start.

All regular contributors to RabbitMQ eventually develop this habit of wiping rabbitmq-test-instances under $TEMP, reaping Erlang processes on the host every so often (this is less relevant with the peer module in main these days).

elo-magnier-7s · 2026-03-04T19:30:01Z

Thanks for the patch & the explanation @michaelklishin, I appreciate it!

I figured playing it fast and loose was what was biting me in the butt, but not how or why. As I didn't expect any difference in code, I thought it was a compilation issue and that you'd be able to tell in a second.

Cheers!

michaelklishin added this to the 4.3.0 milestone Mar 2, 2026

michaelklishin added the backport-v4.2.x label Mar 2, 2026

michaelklishin requested review from ikavgo and mkuratczyk March 3, 2026 18:53

mkuratczyk added 2 commits March 4, 2026 10:46

tests: fail if a metric is missing

e34e5f0

Previously a missing metric was ignored.

Test for duplicates in the aggregated endpoint

f26cde6

mkuratczyk approved these changes Mar 4, 2026

michaelklishin merged commit d4297be into main Mar 4, 2026
182 checks passed

michaelklishin deleted the rabbitmq-server-15600 branch March 4, 2026 17:42

mergify bot mentioned this pull request Mar 4, 2026

Per-object Prometheus metrics: avoid duplicate HELP, TYPE metadata lines (backport #15610) #15632

Merged

rabbitmq locked as resolved and limited conversation to collaborators Mar 4, 2026

rabbitmq unlocked this conversation Mar 4, 2026

michaelklishin added a commit that referenced this pull request Mar 4, 2026

Merge pull request #15632 from rabbitmq/mergify/bp/v4.2.x/pr-15610

1ae3696

Per-object Prometheus metrics: avoid duplicate HELP, TYPE metadata lines (backport #15610)
