feat(notifier): independent alertmanager sendloops#16355
feat(notifier): independent alertmanager sendloops#16355SuperQ merged 3 commits intoprometheus:mainfrom
Conversation
299dfc8 to
f1ea687
Compare
f1ea687 to
2969971
Compare
krajorama
left a comment
There was a problem hiding this comment.
Hi, awesome start! I've done my first pass on the code and gave a couple of comments.
In general I'm not against the refactoring part, but I'm not sure about other reviewers, so I'm looking for some feedback on that. I'd also prefer to have the refactoring done in a separate PR up front, to get it out of the way and make this PR smaller. Let's see what other say on that first.
Hi, thanks for your feedback. The refactoring commit is kept separate, so we can easily do another PR for that one first. |
Hi, we discussed this with @grobinson-grafana and yes we'd like to have the refactoring in a separate PR for easy review and merge. Thanks in advance! |
Created #16398 |
e6e57c0 to
1605a50
Compare
|
hi @siavashs , will you have time to rebase this PR and continue? thanks |
1605a50 to
410c4f3
Compare
70395be to
5438ece
Compare
5438ece to
62c7f17
Compare
|
@krajorama this is ready for another review. |
krajorama
left a comment
There was a problem hiding this comment.
I have not gone into it super deep again. I was ok with the general way of working already.
I've made a few comments on the tests, nothing blocking you really.
7fb7183 to
c00afa2
Compare
Thanks for the suggestions @krajorama |
694f668 to
9189c9a
Compare
|
Not sure why random unrelated tests are failing on Windows! |
machine424
left a comment
There was a problem hiding this comment.
Thanks again, Siavash, for working on this, and thanks to the other reviewers as well.
This is not an easy change, but we all agree we're getting close to the finish line.
I’ve left some comments.
Because we're touching a battle tested logic, I’ll take another look with fresh eyes tomorrow.
Also I'll try to offload any changes that are not required for the refactor.
Thanks for the review Ayoub, I think I addressed all the new comments. |
|
|
|
Found and fixed a race condition on drain when updating metrics. |
Ref: prometheus#7676 Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com> Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
|
Github is having a bad day, random stuff are failing. |
|
All checks are green again 🎉 |
|
Thanks for the last review @machine424 |
machine424
left a comment
There was a problem hiding this comment.
once #16355 (comment) is addressed.
let's give this a try, thanks again for working on it!
|
Thank you @machine424 |
|
Hmm, I made a small change and now random unrelated CI checks are failing 🤔 |
Independent Alertmanager sendloops avoid issues with queue overflowing when one or more Alertmanager instances are unavailable which could result in lost alert notifications. The sendloops are managed per AlertmanagerSet which are dynamically added/removed with service discovery or configuration reload. The following metrics now include an extra dimention for alertmanager label: - prometheus_notifications_dropped_total - prometheus_notifications_queue_capacity - prometheus_notifications_queue_length This change also includes the test from prometheus#14099 Closes prometheus#7676 Signed-off-by: machine424 <ayoubmrini424@gmail.com> Signed-off-by: Siavash Safi <siavash@cloudflare.com>
|
All green again, let's merge this thing 😄 |
|
I was on PTO and just returned today, thanks for taking care of this. |
|
seems maybe it only flakes on the CI, we can add some debug logs and make it run there |
I could not reproduce this locally. Can we bump the CI machine to a bigger one? |
|
let's try to fix the test/or the code for the current CI first. (I can look into this if you want when I have some time) |
Independent Alertmanager sendloops avoid issues with queue overflowing when one or more Alertmanager instances are unavailable which could result in lost alert notifications.
The sendloops are managed per AlertmanagerSet which are dynamically added/removed with service discovery or configuration reload.
The following metrics now include an extra dimention for alertmanager label:
prometheus_notifications_dropped_totalprometheus_notifications_queue_capacityprometheus_notifications_queue_lengthWhich issue(s) does the PR fix:
Closes #7676
This PR also includes the test from #14099
Does this PR introduce a user-facing change?
Signed-off-by: Siavash Safi siavash@cloudflare.com
Signed-off-by: machine424 ayoubmrini424@gmail.com