notifier: stop queue filling due to single failed AM by krajorama · Pull Request #14099 · prometheus/prometheus

krajorama · 2024-05-14T15:15:00Z

WIP

Adds a unit test to emulate the problem of throughput dropping in #7676

Solution ideas (not implemented yet):

Put failed Alertmanagers into quarantine for some time. This would preserve the throughput much better. Possibly use exponential back-off to determine the next time we try to contact the Alertmanager. Reset the timer if no alive Alertmanagers are left.
Separate queues?
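
To make the quarantine idea concrete, here is a minimal sketch, not code from this PR. The `quarantine` type and its methods are hypothetical names and are not part of the Prometheus notifier. It assumes the back-off doubles per consecutive failure up to a cap, and that "reset the timer" means clearing the retry deadlines while keeping the failure counts.

```go
package main

import (
	"fmt"
	"time"
)

// quarantine tracks, per Alertmanager URL, when it may be contacted again.
// Hypothetical helper for illustration; not part of the Prometheus notifier.
type quarantine struct {
	base, max time.Duration
	failures  map[string]int       // consecutive failures per Alertmanager
	until     map[string]time.Time // earliest next attempt per Alertmanager
}

func newQuarantine(base, max time.Duration) *quarantine {
	return &quarantine{
		base:     base,
		max:      max,
		failures: map[string]int{},
		until:    map[string]time.Time{},
	}
}

// backoff returns base doubled once per failure after the first, capped at max.
func (q *quarantine) backoff(failures int) time.Duration {
	d := q.base
	for i := 1; i < failures && d < q.max; i++ {
		d *= 2
	}
	if d > q.max {
		d = q.max
	}
	return d
}

// recordFailure quarantines am until now plus the current back-off.
func (q *quarantine) recordFailure(am string, now time.Time) {
	q.failures[am]++
	q.until[am] = now.Add(q.backoff(q.failures[am]))
}

// recordSuccess clears all quarantine state for am.
func (q *quarantine) recordSuccess(am string) {
	delete(q.failures, am)
	delete(q.until, am)
}

// eligible returns the Alertmanagers that may be contacted at time now.
// If all of them are quarantined, the deadlines are cleared and all are returned.
func (q *quarantine) eligible(all []string, now time.Time) []string {
	var alive []string
	for _, am := range all {
		if !now.Before(q.until[am]) {
			alive = append(alive, am)
		}
	}
	if len(alive) == 0 {
		for _, am := range all {
			delete(q.until, am)
		}
		return append([]string(nil), all...)
	}
	return alive
}

func main() {
	q := newQuarantine(time.Second, time.Minute)
	ams := []string{"am-1", "am-2"}
	now := time.Now()

	q.recordFailure("am-1", now)
	q.recordFailure("am-1", now)
	fmt.Println(q.eligible(ams, now))                    // am-1 is skipped
	fmt.Println(q.eligible(ams, now.Add(2*time.Second))) // am-1 is retried
}
```

The sketch is single-threaded; a real notifier would need locking around the maps.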

Fixes: #7676

Ref: prometheus#7676 Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

nielsole · 2024-05-15T09:05:21Z

Nice work on the unit test.

In principle I'd prefer the second option, separate queues. This would allow completely independent processing and overflowing of queues.
Unfortunately, when I looked into this, it seemed like quite a large lift, as a lot of the functions like sendAll and the exported metrics assume that a single queue is worked off in lockstep. We would need to split all metrics by Alertmanager instance, which is imo the right thing to do, but might be considered a breaking change.
On the bright side, having separate queues might accidentally also fix #13676
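
A rough sketch of what separate queues could look like, under stated assumptions: the `boundedQueue` and `fanout` types are hypothetical and not taken from the Prometheus code base, each queue drops its oldest alert when full, and concurrency is left out for brevity. It only shows that a stuck Alertmanager overflows its own queue and leaves the healthy one untouched.

```go
package main

import "fmt"

type alert string

// boundedQueue holds at most capacity alerts and drops the oldest when full.
// Hypothetical type for illustration only.
type boundedQueue struct {
	capacity int
	items    []alert
	dropped  int // would back a per-Alertmanager "dropped" metric
}

func (q *boundedQueue) push(a alert) {
	if q.capacity <= 0 {
		q.dropped++
		return
	}
	if len(q.items) == q.capacity {
		q.items = q.items[1:]
		q.dropped++
	}
	q.items = append(q.items, a)
}

// popBatch removes and returns up to n of the oldest alerts.
func (q *boundedQueue) popBatch(n int) []alert {
	if n > len(q.items) {
		n = len(q.items)
	}
	batch := append([]alert(nil), q.items[:n]...)
	q.items = q.items[n:]
	return batch
}

// fanout keeps one independent queue per Alertmanager.
type fanout struct {
	queues map[string]*boundedQueue
}

func newFanout(capacity int, ams ...string) *fanout {
	f := &fanout{queues: map[string]*boundedQueue{}}
	for _, am := range ams {
		f.queues[am] = &boundedQueue{capacity: capacity}
	}
	return f
}

// send enqueues the alert for every Alertmanager independently.
func (f *fanout) send(a alert) {
	for _, q := range f.queues {
		q.push(a)
	}
}

func main() {
	f := newFanout(2, "am-ok", "am-down")
	for _, a := range []alert{"a1", "a2", "a3"} {
		f.send(a)
		f.queues["am-ok"].popBatch(10) // the healthy instance keeps draining
	}
	for am, q := range f.queues {
		fmt.Printf("%s: queued=%d dropped=%d\n", am, len(q.items), q.dropped)
	}
}
```

The per-queue `dropped` counter is where the metrics question shows up: each counter naturally belongs to one Alertmanager instance.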

nielsole · 2024-05-15T09:59:37Z

An alternative third solution with a shared queue I was considering is a ring buffer, with a goroutine for every Alertmanager that holds a pointer into the ring buffer. Whenever the insert operation into the ring buffer catches up with an Alertmanager's pointer, it would move that pointer forward, effectively dropping the oldest alert for that Alertmanager.
This may allow us to keep the existing metrics, thus keeping backward compatibility. But it doesn't feel like idiomatic Go.
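
For illustration, a minimal single-threaded sketch of the ring buffer idea. The `ring` type is hypothetical; it assumes cursors are monotonically increasing sequence numbers and leaves out the per-Alertmanager goroutines and the locking they would need.

```go
package main

import "fmt"

// ring is a shared fixed-size buffer with one read cursor per Alertmanager.
// Hypothetical type for illustration only.
type ring struct {
	buf     []string
	head    int            // sequence number of the next write
	cursors map[string]int // sequence number each Alertmanager reads next
	dropped map[string]int // alerts skipped per Alertmanager
}

// newRing creates a ring of the given size (assumed > 0).
func newRing(size int, ams ...string) *ring {
	r := &ring{
		buf:     make([]string, size),
		cursors: map[string]int{},
		dropped: map[string]int{},
	}
	for _, am := range ams {
		r.cursors[am] = 0
	}
	return r
}

// insert writes an alert. Any cursor that still points at the slot about to
// be overwritten is moved forward, which drops that reader's oldest alert.
func (r *ring) insert(a string) {
	for am, c := range r.cursors {
		if r.head-c == len(r.buf) {
			r.cursors[am] = c + 1
			r.dropped[am]++
		}
	}
	r.buf[r.head%len(r.buf)] = a
	r.head++
}

// next returns the next unread alert for am, or false if it is caught up.
func (r *ring) next(am string) (string, bool) {
	c, ok := r.cursors[am]
	if !ok || c == r.head {
		return "", false
	}
	r.cursors[am] = c + 1
	return r.buf[c%len(r.buf)], true
}

func main() {
	r := newRing(2, "fast", "slow")
	for _, a := range []string{"a1", "a2", "a3"} {
		r.insert(a)
		r.next("fast") // the fast reader keeps up
	}
	fmt.Println("dropped for slow:", r.dropped["slow"])
	for a, ok := r.next("slow"); ok; a, ok = r.next("slow") {
		fmt.Println("slow reads", a)
	}
}
```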

machine424 · 2024-05-16T13:32:40Z

Thanks for this! (I always appreciate creative tests)
We do trim the queue before each sendAll iteration in nextBatch. If we couldn't send the alerts to any AM, we increase a "dropped alerts" metric, but the alerts are already gone from the queue. I see the unit test you added doesn't take nextBatch into account.
Also, given the current implementation, I don't see why we set the timeout to 1y and expect it not to hang. (I don't think the timeout was set to 1y in #7676)

In an ideal world, SD would be used for those AMs instead of static config, so the faulty ones are excluded and the notifier doesn't have to worry about that. I'm afraid a "quarantine logic" would look like re-implementing the SD logic...

bboreham · 2025-01-07T11:51:11Z

Hello from the bug-scrub! @krajorama do you think you will come back to this?

krajorama · 2025-01-07T14:42:11Z

Probably not :( Bigger project than what I have bandwidth for.

Independent Alertmanager queues avoid issues with queue overflowing when one or more Alertmanager instances are unavailable which could result in lost alert notifications. The queues are managed per AlertmanagerSet which are dynamically added/removed with service discovery or configuration reload. This change also includes the test from prometheus#14099 Closes prometheus#7676 Signed-off-by: Siavash Safi <siavash@cloudflare.com>

Independent Alertmanager queues avoid issues with queue overflowing when one or more Alertmanager instances are unavailable which could result in lost alert notifications. The buffered queues are managed per AlertmanagerSet which are dynamically added/removed with service discovery or configuration reload. This change also includes the test from prometheus#14099 Closes prometheus#7676 Signed-off-by: Siavash Safi <siavash@cloudflare.com>

Independent Alertmanager sendloops avoid issues with queue overflowing when one or more Alertmanager instances are unavailable which could result in lost alert notifications. The sendloops are managed per AlertmanagerSet which are dynamically added/removed with service discovery or configuration reload.

The following metrics now include an extra dimension for alertmanager label:
- prometheus_notifications_dropped_total
- prometheus_notifications_queue_capacity
- prometheus_notifications_queue_length

This change also includes the test from prometheus#14099

Closes prometheus#7676

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>

* notifier: unit test for dropping throughput on stuck AM

Ref: prometheus#7676

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>

* chore(notifier): remove year from copyrights

Signed-off-by: Siavash Safi <siavash@cloudflare.com>

* feat(notifier): independent alertmanager sendloops

Independent Alertmanager sendloops avoid issues with queue overflowing when one or more Alertmanager instances are unavailable which could result in lost alert notifications. The sendloops are managed per AlertmanagerSet which are dynamically added/removed with service discovery or configuration reload.

The following metrics now include an extra dimension for alertmanager label:
- prometheus_notifications_dropped_total
- prometheus_notifications_queue_capacity
- prometheus_notifications_queue_length

This change also includes the test from prometheus#14099

Closes prometheus#7676

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>

---------

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Co-authored-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

notifier: unit test for dropping throughput on stuck AM

d13d7b2

Ref: prometheus#7676 Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>

krajorama mentioned this pull request May 14, 2024

Notification queue fills with single down AM instance #7676

Closed

github-actions bot added the stale label Sep 2, 2024

github-actions bot removed the stale label Jan 12, 2025

machine424 mentioned this pull request Feb 6, 2025

Support Alertmanager healthchecks #15985

Closed

siavashs mentioned this pull request Mar 31, 2025

feat(notifier): independent alertmanager sendloops #16355

Merged

krajorama wants to merge 1 commit into prometheus:main from krajorama:fix-notifier-stuck-on-failed-am
