notifier: stop queue filling due to single failed AM #14099

Closed
krajorama wants to merge 1 commit into prometheus:main from krajorama:fix-notifier-stuck-on-failed-am

@krajorama
Member

WIP

Adds a unit test that emulates the throughput drop described in #7676.

Solution ideas (not implemented yet):

  1. Put failed alertmanagers into quarantine for some time. This would preserve the throughput much better. Possibly use exponential back-off to determine the next time we try to contact the alertmanager. Reset the timer if no alive alertmanagers are left.
  2. Separate queues?

Fixes: #7676

Ref: prometheus#7676

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
@nielsole
nielsole commented May 15, 2024

Nice work on the unit test.

In principle I'd prefer the 2nd option, separate queues. This would allow completely independent processing and overflowing of queues.
Unfortunately, when I looked into this, it seemed like quite a large lift, as a lot of the functions like sendAll and the exported metrics assume that a single queue is worked off in lockstep. We would need to expose all metrics separately per alertmanager instance, which is IMO the right thing to do, but might be considered a breaking change.
On the bright side, having separate queues might also happen to fix #13676.

@nielsole

An alternative 3rd solution with a shared queue that I was considering is a ring buffer, with a goroutine for every alertmanager that holds a pointer into the ring buffer. Whenever the insert operation into the ring buffer catches up with an alertmanager's pointer, it would move that pointer forward, effectively dropping the oldest alert.
This may allow us to keep the existing metrics, thus keeping backward compatibility. But it doesn't feel like idiomatic Go.

@machine424
Member

Thanks for this! (I always appreciate creative tests)
We do trim the queue before each sendAll iteration in nextBatch. If we couldn't send the alerts to any AM, we increase a "dropped alerts" metric, but the alerts are already gone from the queue. I see the unit test you added doesn't take nextBatch into account.
Also, given the current implementation, I don't see why we set the timeout to 1y and expect it not to hang. (I don't think the timeout was set to 1y in #7676.)

In an ideal world, SD would be used for those AMs instead of static config, so the faulty ones are excluded and the notifier doesn't have to worry about that. I'm afraid a "quarantine logic" would look like re-implementing the SD logic...

github-actions bot added the stale label Sep 2, 2024
@bboreham
Member

bboreham commented Jan 7, 2025

Hello from the bug-scrub! @krajorama do you think you will come back to this?

@krajorama
Member Author

> Hello from the bug-scrub! @krajorama do you think you will come back to this?

Probably not :( Bigger project than what I have bandwidth for.

github-actions bot removed the stale label Jan 12, 2025
siavashs added a commit to siavashs/prometheus that referenced this pull request Mar 31, 2025
Independent Alertmanager queues avoid issues with queue overflowing
when one or more Alertmanager instances are unavailable which could
result in lost alert notifications.
The queues are managed per AlertmanagerSet which are dynamically
added/removed with service discovery or configuration reload.

This change also includes the test from prometheus#14099

Closes prometheus#7676

Signed-off-by: Siavash Safi <siavash@cloudflare.com>
siavashs added a commit to siavashs/prometheus that referenced this pull request Apr 8, 2025
Independent Alertmanager queues avoid issues with queue overflowing when
one or more Alertmanager instances are unavailable which could result in
lost alert notifications.
The buffered queues are managed per AlertmanagerSet which are dynamically
added/removed with service discovery or configuration reload.

This change also includes the test from prometheus#14099

Closes prometheus#7676

Signed-off-by: Siavash Safi <siavash@cloudflare.com>
siavashs added a commit to siavashs/prometheus that referenced this pull request Aug 29, 2025
Independent Alertmanager sendloops avoid issues with queue overflowing
when one or more Alertmanager instances are unavailable which could
result in lost alert notifications.
The sendloops are managed per AlertmanagerSet which are dynamically
added/removed with service discovery or configuration reload.

The following metrics now include an extra dimension for alertmanager label:
- prometheus_notifications_dropped_total
- prometheus_notifications_queue_capacity
- prometheus_notifications_queue_length

This change also includes the test from prometheus#14099

Closes prometheus#7676

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
siavashs added a commit to siavashs/prometheus that referenced this pull request Nov 11, 2025
Independent Alertmanager sendloops avoid issues with queue overflowing
when one or more Alertmanager instances are unavailable which could
result in lost alert notifications.
The sendloops are managed per AlertmanagerSet which are dynamically
added/removed with service discovery or configuration reload.

The following metrics now include an extra dimention for alertmanager label:
- prometheus_notifications_dropped_total
- prometheus_notifications_queue_capacity
- prometheus_notifications_queue_length

This change also includes the test from prometheus#14099

Closes prometheus#7676

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
siavashs added a commit to siavashs/prometheus that referenced this pull request Nov 11, 2025
Independent Alertmanager sendloops avoid issues with queue overflowing
when one or more Alertmanager instances are unavailable which could
result in lost alert notifications.
The sendloops are managed per AlertmanagerSet which are dynamically
added/removed with service discovery or configuration reload.

The following metrics now include an extra dimention for alertmanager label:
- prometheus_notifications_dropped_total
- prometheus_notifications_queue_capacity
- prometheus_notifications_queue_length

This change also includes the test from prometheus#14099

Closes prometheus#7676

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
siavashs added a commit to siavashs/prometheus that referenced this pull request Nov 11, 2025
Independent Alertmanager sendloops avoid issues with queue overflowing
when one or more Alertmanager instances are unavailable which could
result in lost alert notifications.
The sendloops are managed per AlertmanagerSet which are dynamically
added/removed with service discovery or configuration reload.

The following metrics now include an extra dimention for alertmanager label:
- prometheus_notifications_dropped_total
- prometheus_notifications_queue_capacity
- prometheus_notifications_queue_length

This change also includes the test from prometheus#14099

Closes prometheus#7676

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
siavashs added a commit to siavashs/prometheus that referenced this pull request Nov 11, 2025
Independent Alertmanager sendloops avoid issues with queue overflowing
when one or more Alertmanager instances are unavailable which could
result in lost alert notifications.
The sendloops are managed per AlertmanagerSet which are dynamically
added/removed with service discovery or configuration reload.

The following metrics now include an extra dimention for alertmanager label:
- prometheus_notifications_dropped_total
- prometheus_notifications_queue_capacity
- prometheus_notifications_queue_length

This change also includes the test from prometheus#14099

Closes prometheus#7676

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
siavashs added a commit to siavashs/prometheus that referenced this pull request Nov 11, 2025
Independent Alertmanager sendloops avoid issues with queue overflowing
when one or more Alertmanager instances are unavailable which could
result in lost alert notifications.
The sendloops are managed per AlertmanagerSet which are dynamically
added/removed with service discovery or configuration reload.

The following metrics now include an extra dimention for alertmanager label:
- prometheus_notifications_dropped_total
- prometheus_notifications_queue_capacity
- prometheus_notifications_queue_length

This change also includes the test from prometheus#14099

Closes prometheus#7676

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
siavashs added a commit to siavashs/prometheus that referenced this pull request Nov 11, 2025
Independent Alertmanager sendloops avoid issues with queue overflowing
when one or more Alertmanager instances are unavailable which could
result in lost alert notifications.
The sendloops are managed per AlertmanagerSet which are dynamically
added/removed with service discovery or configuration reload.

The following metrics now include an extra dimention for alertmanager label:
- prometheus_notifications_dropped_total
- prometheus_notifications_queue_capacity
- prometheus_notifications_queue_length

This change also includes the test from prometheus#14099

Closes prometheus#7676

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
siavashs added a commit to siavashs/prometheus that referenced this pull request Nov 12, 2025
Independent Alertmanager sendloops avoid issues with queue overflowing
when one or more Alertmanager instances are unavailable which could
result in lost alert notifications.
The sendloops are managed per AlertmanagerSet which are dynamically
added/removed with service discovery or configuration reload.

The following metrics now include an extra dimention for alertmanager label:
- prometheus_notifications_dropped_total
- prometheus_notifications_queue_capacity
- prometheus_notifications_queue_length

This change also includes the test from prometheus#14099

Closes prometheus#7676

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
siavashs added a commit to siavashs/prometheus that referenced this pull request Nov 12, 2025
Independent Alertmanager sendloops avoid issues with queue overflowing
when one or more Alertmanager instances are unavailable which could
result in lost alert notifications.
The sendloops are managed per AlertmanagerSet which are dynamically
added/removed with service discovery or configuration reload.

The following metrics now include an extra dimention for alertmanager label:
- prometheus_notifications_dropped_total
- prometheus_notifications_queue_capacity
- prometheus_notifications_queue_length

This change also includes the test from prometheus#14099

Closes prometheus#7676

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
siavashs added a commit to siavashs/prometheus that referenced this pull request Nov 22, 2025
Independent Alertmanager sendloops avoid issues with queue overflowing
when one or more Alertmanager instances are unavailable which could
result in lost alert notifications.
The sendloops are managed per AlertmanagerSet which are dynamically
added/removed with service discovery or configuration reload.

The following metrics now include an extra dimention for alertmanager label:
- prometheus_notifications_dropped_total
- prometheus_notifications_queue_capacity
- prometheus_notifications_queue_length

This change also includes the test from prometheus#14099

Closes prometheus#7676

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
siavashs added a commit to siavashs/prometheus that referenced this pull request Nov 22, 2025
Independent Alertmanager sendloops avoid issues with queue overflowing
when one or more Alertmanager instances are unavailable which could
result in lost alert notifications.
The sendloops are managed per AlertmanagerSet which are dynamically
added/removed with service discovery or configuration reload.

The following metrics now include an extra dimention for alertmanager label:
- prometheus_notifications_dropped_total
- prometheus_notifications_queue_capacity
- prometheus_notifications_queue_length

This change also includes the test from prometheus#14099

Closes prometheus#7676

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
siavashs added a commit to siavashs/prometheus that referenced this pull request Nov 22, 2025
Independent Alertmanager sendloops avoid issues with queue overflowing
when one or more Alertmanager instances are unavailable which could
result in lost alert notifications.
The sendloops are managed per AlertmanagerSet which are dynamically
added/removed with service discovery or configuration reload.

The following metrics now include an extra dimention for alertmanager label:
- prometheus_notifications_dropped_total
- prometheus_notifications_queue_capacity
- prometheus_notifications_queue_length

This change also includes the test from prometheus#14099

Closes prometheus#7676

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
siavashs added a commit to siavashs/prometheus that referenced this pull request Nov 22, 2025
Independent Alertmanager sendloops avoid issues with queue overflowing
when one or more Alertmanager instances are unavailable which could
result in lost alert notifications.
The sendloops are managed per AlertmanagerSet which are dynamically
added/removed with service discovery or configuration reload.

The following metrics now include an extra dimention for alertmanager label:
- prometheus_notifications_dropped_total
- prometheus_notifications_queue_capacity
- prometheus_notifications_queue_length

This change also includes the test from prometheus#14099

Closes prometheus#7676

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
siavashs added a commit to siavashs/prometheus that referenced this pull request Nov 22, 2025
Independent Alertmanager sendloops avoid issues with queue overflowing
when one or more Alertmanager instances are unavailable which could
result in lost alert notifications.
The sendloops are managed per AlertmanagerSet which are dynamically
added/removed with service discovery or configuration reload.

The following metrics now include an extra dimention for alertmanager label:
- prometheus_notifications_dropped_total
- prometheus_notifications_queue_capacity
- prometheus_notifications_queue_length

This change also includes the test from prometheus#14099

Closes prometheus#7676

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
siavashs added a commit to siavashs/prometheus that referenced this pull request Nov 22, 2025
Independent Alertmanager sendloops avoid issues with queue overflowing
when one or more Alertmanager instances are unavailable which could
result in lost alert notifications.
The sendloops are managed per AlertmanagerSet which are dynamically
added/removed with service discovery or configuration reload.

The following metrics now include an extra dimention for alertmanager label:
- prometheus_notifications_dropped_total
- prometheus_notifications_queue_capacity
- prometheus_notifications_queue_length

This change also includes the test from prometheus#14099

Closes prometheus#7676

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
siavashs added a commit to siavashs/prometheus that referenced this pull request Nov 22, 2025
Independent Alertmanager sendloops avoid issues with queue overflowing
when one or more Alertmanager instances are unavailable which could
result in lost alert notifications.
The sendloops are managed per AlertmanagerSet which are dynamically
added/removed with service discovery or configuration reload.

The following metrics now include an extra dimention for alertmanager label:
- prometheus_notifications_dropped_total
- prometheus_notifications_queue_capacity
- prometheus_notifications_queue_length

This change also includes the test from prometheus#14099

Closes prometheus#7676

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
siavashs added a commit to siavashs/prometheus that referenced this pull request Nov 25, 2025
Independent Alertmanager sendloops avoid issues with queue overflowing
when one or more Alertmanager instances are unavailable which could
result in lost alert notifications.
The sendloops are managed per AlertmanagerSet which are dynamically
added/removed with service discovery or configuration reload.

The following metrics now include an extra dimention for alertmanager label:
- prometheus_notifications_dropped_total
- prometheus_notifications_queue_capacity
- prometheus_notifications_queue_length

This change also includes the test from prometheus#14099

Closes prometheus#7676

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
siavashs added a commit to siavashs/prometheus that referenced this pull request Nov 25, 2025
Independent Alertmanager sendloops avoid issues with queue overflowing
when one or more Alertmanager instances are unavailable which could
result in lost alert notifications.
The sendloops are managed per AlertmanagerSet which are dynamically
added/removed with service discovery or configuration reload.

The following metrics now include an extra dimention for alertmanager label:
- prometheus_notifications_dropped_total
- prometheus_notifications_queue_capacity
- prometheus_notifications_queue_length

This change also includes the test from prometheus#14099

Closes prometheus#7676

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
siavashs added a commit to siavashs/prometheus that referenced this pull request Nov 25, 2025
Independent Alertmanager sendloops avoid issues with queue overflowing
when one or more Alertmanager instances are unavailable which could
result in lost alert notifications.
The sendloops are managed per AlertmanagerSet which are dynamically
added/removed with service discovery or configuration reload.

The following metrics now include an extra dimention for alertmanager label:
- prometheus_notifications_dropped_total
- prometheus_notifications_queue_capacity
- prometheus_notifications_queue_length

This change also includes the test from prometheus#14099

Closes prometheus#7676

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
siavashs added a commit to siavashs/prometheus that referenced this pull request Nov 28, 2025
Independent Alertmanager sendloops avoid issues with queue overflowing
when one or more Alertmanager instances are unavailable which could
result in lost alert notifications.
The sendloops are managed per AlertmanagerSet which are dynamically
added/removed with service discovery or configuration reload.

The following metrics now include an extra dimention for alertmanager label:
- prometheus_notifications_dropped_total
- prometheus_notifications_queue_capacity
- prometheus_notifications_queue_length

This change also includes the test from prometheus#14099

Closes prometheus#7676

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
siavashs added a commit to siavashs/prometheus that referenced this pull request Nov 28, 2025
Independent Alertmanager sendloops avoid issues with queue overflowing
when one or more Alertmanager instances are unavailable which could
result in lost alert notifications.
The sendloops are managed per AlertmanagerSet which are dynamically
added/removed with service discovery or configuration reload.

The following metrics now include an extra dimention for alertmanager label:
- prometheus_notifications_dropped_total
- prometheus_notifications_queue_capacity
- prometheus_notifications_queue_length

This change also includes the test from prometheus#14099

Closes prometheus#7676

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
siavashs added a commit to siavashs/prometheus that referenced this pull request Nov 28, 2025
Independent Alertmanager sendloops avoid issues with queue overflowing
when one or more Alertmanager instances are unavailable which could
result in lost alert notifications.
The sendloops are managed per AlertmanagerSet which are dynamically
added/removed with service discovery or configuration reload.

The following metrics now include an extra dimention for alertmanager label:
- prometheus_notifications_dropped_total
- prometheus_notifications_queue_capacity
- prometheus_notifications_queue_length

This change also includes the test from prometheus#14099

Closes prometheus#7676

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
siavashs added a commit to siavashs/prometheus that referenced this pull request Nov 28, 2025
Independent Alertmanager sendloops avoid issues with queue overflowing
when one or more Alertmanager instances are unavailable which could
result in lost alert notifications.
The sendloops are managed per AlertmanagerSet which are dynamically
added/removed with service discovery or configuration reload.

The following metrics now include an extra dimention for alertmanager label:
- prometheus_notifications_dropped_total
- prometheus_notifications_queue_capacity
- prometheus_notifications_queue_length

This change also includes the test from prometheus#14099

Closes prometheus#7676

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
siavashs added a commit to siavashs/prometheus that referenced this pull request Dec 2, 2025
siavashs added a commit to siavashs/prometheus that referenced this pull request Jan 12, 2026
siavashs added a commit to siavashs/prometheus that referenced this pull request Jan 13, 2026
Naman-B-Parlecha pushed a commit to Naman-B-Parlecha/prometheus that referenced this pull request Jan 20, 2026
* notifier: unit test for dropping throughput on stuck AM

Ref: prometheus#7676

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>

* chore(notifier): remove year from copyrights

Signed-off-by: Siavash Safi <siavash@cloudflare.com>

* feat(notifier): independent alertmanager sendloops

Independent Alertmanager sendloops avoid issues with queue overflowing
when one or more Alertmanager instances are unavailable which could
result in lost alert notifications.
The sendloops are managed per AlertmanagerSet which are dynamically
added/removed with service discovery or configuration reload.

The following metrics now include an extra dimension for the alertmanager label:
- prometheus_notifications_dropped_total
- prometheus_notifications_queue_capacity
- prometheus_notifications_queue_length

This change also includes the test from prometheus#14099

Closes prometheus#7676

Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>

---------

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: machine424 <ayoubmrini424@gmail.com>
Co-authored-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>