Skip to content

feat(notifier): independent alertmanager sendloops#16355

Merged
SuperQ merged 3 commits intoprometheus:mainfrom
siavashs:notifier-queue-per-am
Jan 20, 2026
Merged

feat(notifier): independent alertmanager sendloops#16355
SuperQ merged 3 commits intoprometheus:mainfrom
siavashs:notifier-queue-per-am

Conversation

@siavashs
Copy link
Contributor

@siavashs siavashs commented Mar 31, 2025

Independent Alertmanager sendloops avoid issues with queue overflowing when one or more Alertmanager instances are unavailable which could result in lost alert notifications.
The sendloops are managed per AlertmanagerSet which are dynamically added/removed with service discovery or configuration reload.

The following metrics now include an extra dimention for alertmanager label:

  • prometheus_notifications_dropped_total
  • prometheus_notifications_queue_capacity
  • prometheus_notifications_queue_length

Which issue(s) does the PR fix:

Closes #7676

This PR also includes the test from #14099

Does this PR introduce a user-facing change?

[ENHANCEMENT] independent alertmanager sendloops
[CHANGE] add `alertmanager` dimention to following metrics: `prometheus_notifications_dropped_total`, `prometheus_notifications_queue_capacity`, `prometheus_notifications_queue_length`

Signed-off-by: Siavash Safi siavash@cloudflare.com
Signed-off-by: machine424 ayoubmrini424@gmail.com

@siavashs siavashs force-pushed the notifier-queue-per-am branch 3 times, most recently from 299dfc8 to f1ea687 Compare March 31, 2025 07:39
Copy link
Contributor

@OGKevin OGKevin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@siavashs siavashs force-pushed the notifier-queue-per-am branch from f1ea687 to 2969971 Compare April 1, 2025 00:30
@krajorama krajorama self-requested a review April 1, 2025 07:12
Copy link
Member

@krajorama krajorama left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, awesome start! I've done my first pass on the code and gave a couple of comments.

In general I'm not against the refactoring part, but I'm not sure about other reviewers, so I'm looking for some feedback on that. I'd also prefer to have the refactoring done in a separate PR up front, to get it out of the way and make this PR smaller. Let's see what other say on that first.

@siavashs
Copy link
Contributor Author

siavashs commented Apr 1, 2025

Hi, awesome start! I've done my first pass on the code and gave a couple of comments.

In general I'm not against the refactoring part, but I'm not sure about other reviewers, so I'm looking for some feedback on that. I'd also prefer to have the refactoring done in a separate PR up front, to get it out of the way and make this PR smaller. Let's see what other say on that first.

Hi, thanks for your feedback.

The refactoring commit is kept separate, so we can easily do another PR for that one first.
I can submit that later today.

@krajorama
Copy link
Member

Hi, awesome start! I've done my first pass on the code and gave a couple of comments.
In general I'm not against the refactoring part, but I'm not sure about other reviewers, so I'm looking for some feedback on that. I'd also prefer to have the refactoring done in a separate PR up front, to get it out of the way and make this PR smaller. Let's see what other say on that first.

Hi, thanks for your feedback.

The refactoring commit is kept separate, so we can easily do another PR for that one first. I can submit that later today.

Hi, we discussed this with @grobinson-grafana and yes we'd like to have the refactoring in a separate PR for easy review and merge. Thanks in advance!

@siavashs
Copy link
Contributor Author

siavashs commented Apr 3, 2025

Hi, awesome start! I've done my first pass on the code and gave a couple of comments.
In general I'm not against the refactoring part, but I'm not sure about other reviewers, so I'm looking for some feedback on that. I'd also prefer to have the refactoring done in a separate PR up front, to get it out of the way and make this PR smaller. Let's see what other say on that first.

Hi, thanks for your feedback.
The refactoring commit is kept separate, so we can easily do another PR for that one first. I can submit that later today.

Hi, we discussed this with @grobinson-grafana and yes we'd like to have the refactoring in a separate PR for easy review and merge. Thanks in advance!

Created #16398

@siavashs siavashs force-pushed the notifier-queue-per-am branch 6 times, most recently from e6e57c0 to 1605a50 Compare April 8, 2025 19:42
@siavashs siavashs requested a review from krajorama April 8, 2025 19:42
@krajorama
Copy link
Member

hi @siavashs , will you have time to rebase this PR and continue? thanks

@siavashs siavashs force-pushed the notifier-queue-per-am branch from 1605a50 to 410c4f3 Compare April 22, 2025 07:55
@siavashs siavashs marked this pull request as ready for review April 28, 2025 08:31
@siavashs siavashs force-pushed the notifier-queue-per-am branch 3 times, most recently from 70395be to 5438ece Compare May 1, 2025 15:21
@siavashs siavashs requested a review from krajorama May 1, 2025 15:37
@siavashs siavashs force-pushed the notifier-queue-per-am branch from 5438ece to 62c7f17 Compare May 2, 2025 09:59
@siavashs
Copy link
Contributor Author

siavashs commented May 2, 2025

@krajorama this is ready for another review.

@krajorama krajorama self-assigned this Nov 20, 2025
Copy link
Member

@krajorama krajorama left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not gone into it super deep again. I was ok with the general way of working already.

I've made a few comments on the tests, nothing blocking you really.

@siavashs siavashs force-pushed the notifier-queue-per-am branch from 7fb7183 to c00afa2 Compare November 22, 2025 12:45
@siavashs
Copy link
Contributor Author

I have not gone into it super deep again. I was ok with the general way of working already.

I've made a few comments on the tests, nothing blocking you really.

Thanks for the suggestions @krajorama
I applied them all.

@siavashs siavashs force-pushed the notifier-queue-per-am branch 2 times, most recently from 694f668 to 9189c9a Compare November 22, 2025 13:08
@siavashs
Copy link
Contributor Author

Not sure why random unrelated tests are failing on Windows!

Copy link
Member

@machine424 machine424 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again, Siavash, for working on this, and thanks to the other reviewers as well.

This is not an easy change, but we all agree we're getting close to the finish line.

I’ve left some comments.

Because we're touching a battle tested logic, I’ll take another look with fresh eyes tomorrow.
Also I'll try to offload any changes that are not required for the refactor.

@siavashs
Copy link
Contributor Author

I’ve left some comments.

Because we're touching a battle tested logic, I’ll take another look with fresh eyes tomorrow. Also I'll try to offload any changes that are not required for the refactor.

Thanks for the review Ayoub, I think I addressed all the new comments.

@siavashs
Copy link
Contributor Author

siavashs commented Nov 25, 2025

CI tests are failing but I cannot reproduce the failure locally.
Managed to reproduce the error using -count=50.

@siavashs
Copy link
Contributor Author

Found and fixed a race condition on drain when updating metrics.

Copy link
Member

@machine424 machine424 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some final ones.

krajorama and others added 2 commits November 28, 2025 12:16
Ref: prometheus#7676

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
@siavashs
Copy link
Contributor Author

Github is having a bad day, random stuff are failing.

@siavashs
Copy link
Contributor Author

All checks are green again 🎉

@siavashs
Copy link
Contributor Author

Thanks for the last review @machine424
Let me know how to proceed with the last few items remaining.

Copy link
Member

@machine424 machine424 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

once #16355 (comment) is addressed.

let's give this a try, thanks again for working on it!

@siavashs
Copy link
Contributor Author

Thank you @machine424
I resolved that comment.

@siavashs
Copy link
Contributor Author

Hmm, I made a small change and now random unrelated CI checks are failing 🤔

Independent Alertmanager sendloops avoid issues with queue overflowing
when one or more Alertmanager instances are unavailable which could
result in lost alert notifications.
The sendloops are managed per AlertmanagerSet which are dynamically
added/removed with service discovery or configuration reload.

The following metrics now include an extra dimention for alertmanager label:
- prometheus_notifications_dropped_total
- prometheus_notifications_queue_capacity
- prometheus_notifications_queue_length

This change also includes the test from prometheus#14099

Closes prometheus#7676

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
@siavashs
Copy link
Contributor Author

All green again, let's merge this thing 😄

@machine424
Copy link
Member

I was on PTO and just returned today, thanks for taking care of this.

@machine424
Copy link
Member

seems TestHandlerSendAll is flaky onboth linux https://github.com/prometheus/prometheus/actions/runs/21166419284/job/60872149084 and windows https://github.com/prometheus/prometheus/actions/runs/21166419284/job/60872149078

maybe it only flakes on the CI, we can add some debug logs and make it run there

@siavashs
Copy link
Contributor Author

seems TestHandlerSendAll is flaky onboth linux https://github.com/prometheus/prometheus/actions/runs/21166419284/job/60872149084 and windows https://github.com/prometheus/prometheus/actions/runs/21166419284/job/60872149078

maybe it only flakes on the CI, we can add some debug logs and make it run there

I could not reproduce this locally. Can we bump the CI machine to a bigger one?

@machine424
Copy link
Member

let's try to fix the test/or the code for the current CI first.
You can open a PR with some fmt.Println() around the check to see what's going on and try to reproduce the failure.

(I can look into this if you want when I have some time)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Notification queue fills with single down AM instance

7 participants