Skip to content

feat: add histogram metric for notification_latency_seconds#16637

Merged
beorn7 merged 3 commits intoprometheus:mainfrom
OGKevin:khellemun/push/pkokwmrnlowq
Nov 11, 2025
Merged

feat: add histogram metric for notification_latency_seconds#16637
beorn7 merged 3 commits intoprometheus:mainfrom
OGKevin:khellemun/push/pkokwmrnlowq

Conversation

@OGKevin
Copy link
Contributor

@OGKevin OGKevin commented May 27, 2025

This metric can be used to create alerting based on how many notifications finish or do not finish within a certain amount of time.

Change-Id: afbf3d8ceb3994c7d6220389353cff92


The reason this metric is needed is to build SLOs around notification delivery for our Prometheus setup. With the current set of metrics, it’s not possible to answer the question: how many attempts completed within X seconds?

This metric would allow us to monitor, for example, whether 99% of attempts complete within 10 seconds.

As far as I know, a histogram is required to support this use case, utilizing the _count and _bucket metrics to answer such questions.

I didn’t observe any negative performance impact when running Prometheus with this patch applied.

[ENHANCEMENT] metrics: Add a histogram `prometheus_notifications_latency_histogram_seconds` to complement the existing summary `prometheus_notifications_latency_seconds`.

@krajorama
Copy link
Member

Hello from the bug scrub!

@beorn7 will take a look and comment.
We agree that this would be a good improvement, with additional comments incoming.

@beorn7
Copy link
Member

beorn7 commented Oct 14, 2025

Here my thoughts:

Now that we have native histograms available, we should probably add native buckets to most of the existing histograms, and ultimately replace most of the summaries by native histograms.

notification_latency_seconds is certainly a good example for the latter. One might ask for a general strategy how to add native histograms everywhere where it makes sense, but we can as well start here.

Following considerations:

  • Initially, we cannot really replace the summaries with the histograms because many users would probably like to continue to use the summaries for a while (at least for a smooth transition). So we should probably have histograms and summaries in parallel for a while, and maybe remove the summaries in v4.
  • However, we cannot have a summary and a histogram in parallel with the same name. So a newly introduced histogram needs a new name. We should have a good idea for a naming schema for that and apply it consistently.
  • We want to encourage native histogram usage. While we might still add some classic buckets for convenience, let's not use too many of them so that the size of the exposition doesn't blow up.

I'll comment on the PR with these considerations in mind.

@beorn7
Copy link
Member

beorn7 commented Oct 30, 2025

@OGKevin are you planning to come back to this? Very little seems to be needed to merge this.

@OGKevin
Copy link
Contributor Author

OGKevin commented Nov 3, 2025

It's on my list for this week 👍🏾

This metric can be used to create alerting based on how many
notifications finish or do not finish within a certain amount of time.

Change-Id: afbf3d8ceb3994c7d6220389353cff92
Signed-Off-By: Kevin Hellemun <17928966+OGKevin@users.noreply.github.com>
Copilot AI review requested due to automatic review settings November 5, 2025 08:15
@OGKevin OGKevin force-pushed the khellemun/push/pkokwmrnlowq branch from 5b9b4e6 to 502c9cd Compare November 5, 2025 08:15
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds native histogram support for alert notification latency metrics in Prometheus. Previously, only a summary metric was used to track latency; now both a summary and a histogram are collected to provide more flexible query options.

  • Introduced a new latencyHistogram metric alongside the existing summary
  • Updated all metric observation and cleanup code to handle both metrics
  • Configured the histogram with native histogram parameters for efficient storage

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
notifier/metric.go Added latencyHistogram field to alertMetrics struct and configured native histogram metric with bucket and reset parameters
notifier/manager.go Updated sendAll to observe duration in both latency metrics (summary and histogram)
notifier/alertmanagerset.go Updated cleanup logic to delete label values from both latency metrics when alertmanagers are removed

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@beorn7
Copy link
Member

beorn7 commented Nov 9, 2025

You haven't provided write access for the reviewer to your branch. Please apply my suggestions yourself so that we can merge this.

Co-authored-by: Björn Rabenstein <github@rabenste.in>
Signed-off-by: Kevin Hellemun <17928966+OGKevin@users.noreply.github.com>
@beorn7
Copy link
Member

beorn7 commented Nov 9, 2025

File needs formatting. 😁

Signed-off-by: Kevin Hellemun <17928966+OGKevin@users.noreply.github.com>
@OGKevin OGKevin force-pushed the khellemun/push/pkokwmrnlowq branch from 67de63f to 7b04cd7 Compare November 10, 2025 08:31
@OGKevin
Copy link
Contributor Author

OGKevin commented Nov 10, 2025

Tests pass locally.

Copy link
Member

@beorn7 beorn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test was just flaky. I re-ran it, and now everything works.

Thank you very much.

@beorn7 beorn7 merged commit 33082be into prometheus:main Nov 11, 2025
45 of 46 checks passed
@OGKevin OGKevin deleted the khellemun/push/pkokwmrnlowq branch November 11, 2025 14:21
renovate bot added a commit to sdwilsh/ansible-playbooks that referenced this pull request Jan 10, 2026
##### [\`v3.9.0\`](https://github.com/prometheus/prometheus/releases/tag/v3.9.0)

#### Note for users of Native Histograms

In version 3.9, Native Histograms is no longer experimental, and the feature flag `native-histogram` has no effect.  You must now turn on
the config setting `scrape_native_histograms` to collect Native Histogram samples from exporters.

#### Changelog

- \[CHANGE] Native Histograms are no longer experimental! Make the `native-histogram` feature flag a no-op. Use `scrape_native_histograms` config option instead. [#17528](prometheus/prometheus#17528)
- \[CHANGE] API: Add maximum limit of 10,000 sets of statistics to TSDB status endpoint. [#17647](prometheus/prometheus#17647)
- \[FEATURE] API: Add /api/v1/features for clients to understand which features are supported. [#17427](prometheus/prometheus#17427)
- \[FEATURE] Promtool: Add `start_timestamp` field for unit tests. [#17636](prometheus/prometheus#17636)
- \[FEATURE] Promtool: Add `--format seriesjson` option to `tsdb dump` to output just series labels in JSON format. [#13409](prometheus/prometheus#13409)
- \[FEATURE] Add `--storage.tsdb.delay-compact-file.path` flag for better interoperability with Thanos. [#17435](prometheus/prometheus#17435)
- \[FEATURE] UI: Add an option on the query drop-down menu to duplicate that query panel. [#17714](prometheus/prometheus#17714)
- \[ENHANCEMENT]: TSDB: add flag `--storage.tsdb.block-reload-interval` to configure TSDB Block Reload Interval. [#16728](prometheus/prometheus#16728)
- \[ENHANCEMENT] UI: Add graph option to start the chart's Y axis at zero. [#17565](prometheus/prometheus#17565)
- \[ENHANCEMENT] Scraping: Classic protobuf format no longer requires the unit in the metric name. [#16834](prometheus/prometheus#16834)
- \[ENHANCEMENT] PromQL, Rules, SD, Scraping: Add native histograms to complement existing summaries. [#17374](prometheus/prometheus#17374)
- \[ENHANCEMENT] Notifications: Add a histogram `prometheus_notifications_latency_histogram_seconds` to complement the existing summary. [#16637](prometheus/prometheus#16637)
- \[ENHANCEMENT] Remote-write: Add custom scope support for AzureAD authentication. [#17483](prometheus/prometheus#17483)
- \[ENHANCEMENT] SD: add a `config` label with job name for most `prometheus_sd_refresh` metrics. [#17138](prometheus/prometheus#17138)
- \[ENHANCEMENT] TSDB: New histogram `prometheus_tsdb_sample_ooo_delta`, the distribution of out-of-order samples in seconds. Collected for all samples, accepted or not. [#17477](prometheus/prometheus#17477)
- \[ENHANCEMENT] Remote-read: Validate histograms received via remote-read. [#17561](prometheus/prometheus#17561)
- \[PERF] TSDB: Small optimizations to postings index. [#17439](prometheus/prometheus#17439)
- \[PERF] Scraping: Speed up relabelling of series. [#17530](prometheus/prometheus#17530)
- \[PERF] PromQL: Small optimisations in binary operators. [#17524](prometheus/prometheus#17524), [#17519](prometheus/prometheus#17519).
- \[BUGFIX] UI: PromQL autocomplete now shows the correct type and HELP text for OpenMetrics counters whose samples end in `_total`. [#17682](prometheus/prometheus#17682)
- \[BUGFIX] UI: Fixed codemirror-promql incorrectly showing label completion suggestions after the closing curly brace of a vector selector. [#17602](prometheus/prometheus#17602)
- \[BUGFIX] UI: Query editor no longer suggests a duration unit if one is already present after a number. [#17605](prometheus/prometheus#17605)
- \[BUGFIX] PromQL: Fix some "vector cannot contain metrics with the same labelset" errors when experimental delayed name removal is enabled. [#17678](prometheus/prometheus#17678)
- \[BUGFIX] PromQL: Fix possible corruption of PromQL text if the query had an empty `ignoring()` and non-empty grouping. [#17643](prometheus/prometheus#17643)
- \[BUGFIX] PromQL: Fix resets/changes to return empty results for anchored selectors when all samples are outside the range. [#17479](prometheus/prometheus#17479)
- \[BUGFIX] PromQL: Check more consistently for many-to-one matching in filter binary operators. [#17668](prometheus/prometheus#17668)
- \[BUGFIX] PromQL: Fix collision in unary negation with non-overlapping series. [#17708](prometheus/prometheus#17708)
- \[BUGFIX] PromQL: Fix collision in label\_join and label\_replace with non-overlapping series. [#17703](prometheus/prometheus#17703)
- \[BUGFIX] PromQL: Fix bug with inconsistent results for queries with OR expression when experimental delayed name removal is enabled. [#17161](prometheus/prometheus#17161)
- \[BUGFIX] PromQL: Ensure that `rate`/`increase`/`delta` of histograms results in a gauge histogram. [#17608](prometheus/prometheus#17608)
- \[BUGFIX] PromQL: Do not panic while iterating over invalid histograms. [#17559](prometheus/prometheus#17559)
- \[BUGFIX] TSDB: Reject chunk files whose encoded chunk length overflows int. [#17533](prometheus/prometheus#17533)
- \[BUGFIX] TSDB: Do not panic during resolution reduction of invalid histograms. [#17561](prometheus/prometheus#17561)
- \[BUGFIX] Remote-write Receive: Avoid duplicate labels when experimental type-and-unit-label feature is enabled. [#17546](prometheus/prometheus#17546)
- \[BUGFIX] OTLP Receiver: Only write metadata to disk when experimental metadata-wal-records feature is enabled. [#17472](prometheus/prometheus#17472)
renovate bot added a commit to sdwilsh/ansible-playbooks that referenced this pull request Jan 10, 2026
##### [\`v3.9.1\`](https://github.com/prometheus/prometheus/releases/tag/v3.9.1)

- \[BUGFIX] Agent: fix crash shortly after startup from invalid type of object. [#17802](prometheus/prometheus#17802)
- \[BUGFIX] Scraping: fix relabel keep/drop not working. [#17807](prometheus/prometheus#17807)

---
##### [\`v3.9.0\`](https://github.com/prometheus/prometheus/releases/tag/v3.9.0)

#### Note for users of Native Histograms

In version 3.9, Native Histograms is no longer experimental, and the feature flag `native-histogram` has no effect.  You must now turn on
the config setting `scrape_native_histograms` to collect Native Histogram samples from exporters.

#### Changelog

- \[CHANGE] Native Histograms are no longer experimental! Make the `native-histogram` feature flag a no-op. Use `scrape_native_histograms` config option instead. [#17528](prometheus/prometheus#17528)
- \[CHANGE] API: Add maximum limit of 10,000 sets of statistics to TSDB status endpoint. [#17647](prometheus/prometheus#17647)
- \[FEATURE] API: Add /api/v1/features for clients to understand which features are supported. [#17427](prometheus/prometheus#17427)
- \[FEATURE] Promtool: Add `start_timestamp` field for unit tests. [#17636](prometheus/prometheus#17636)
- \[FEATURE] Promtool: Add `--format seriesjson` option to `tsdb dump` to output just series labels in JSON format. [#13409](prometheus/prometheus#13409)
- \[FEATURE] Add `--storage.tsdb.delay-compact-file.path` flag for better interoperability with Thanos. [#17435](prometheus/prometheus#17435)
- \[FEATURE] UI: Add an option on the query drop-down menu to duplicate that query panel. [#17714](prometheus/prometheus#17714)
- \[ENHANCEMENT]: TSDB: add flag `--storage.tsdb.block-reload-interval` to configure TSDB Block Reload Interval. [#16728](prometheus/prometheus#16728)
- \[ENHANCEMENT] UI: Add graph option to start the chart's Y axis at zero. [#17565](prometheus/prometheus#17565)
- \[ENHANCEMENT] Scraping: Classic protobuf format no longer requires the unit in the metric name. [#16834](prometheus/prometheus#16834)
- \[ENHANCEMENT] PromQL, Rules, SD, Scraping: Add native histograms to complement existing summaries. [#17374](prometheus/prometheus#17374)
- \[ENHANCEMENT] Notifications: Add a histogram `prometheus_notifications_latency_histogram_seconds` to complement the existing summary. [#16637](prometheus/prometheus#16637)
- \[ENHANCEMENT] Remote-write: Add custom scope support for AzureAD authentication. [#17483](prometheus/prometheus#17483)
- \[ENHANCEMENT] SD: add a `config` label with job name for most `prometheus_sd_refresh` metrics. [#17138](prometheus/prometheus#17138)
- \[ENHANCEMENT] TSDB: New histogram `prometheus_tsdb_sample_ooo_delta`, the distribution of out-of-order samples in seconds. Collected for all samples, accepted or not. [#17477](prometheus/prometheus#17477)
- \[ENHANCEMENT] Remote-read: Validate histograms received via remote-read. [#17561](prometheus/prometheus#17561)
- \[PERF] TSDB: Small optimizations to postings index. [#17439](prometheus/prometheus#17439)
- \[PERF] Scraping: Speed up relabelling of series. [#17530](prometheus/prometheus#17530)
- \[PERF] PromQL: Small optimisations in binary operators. [#17524](prometheus/prometheus#17524), [#17519](prometheus/prometheus#17519).
- \[BUGFIX] UI: PromQL autocomplete now shows the correct type and HELP text for OpenMetrics counters whose samples end in `_total`. [#17682](prometheus/prometheus#17682)
- \[BUGFIX] UI: Fixed codemirror-promql incorrectly showing label completion suggestions after the closing curly brace of a vector selector. [#17602](prometheus/prometheus#17602)
- \[BUGFIX] UI: Query editor no longer suggests a duration unit if one is already present after a number. [#17605](prometheus/prometheus#17605)
- \[BUGFIX] PromQL: Fix some "vector cannot contain metrics with the same labelset" errors when experimental delayed name removal is enabled. [#17678](prometheus/prometheus#17678)
- \[BUGFIX] PromQL: Fix possible corruption of PromQL text if the query had an empty `ignoring()` and non-empty grouping. [#17643](prometheus/prometheus#17643)
- \[BUGFIX] PromQL: Fix resets/changes to return empty results for anchored selectors when all samples are outside the range. [#17479](prometheus/prometheus#17479)
- \[BUGFIX] PromQL: Check more consistently for many-to-one matching in filter binary operators. [#17668](prometheus/prometheus#17668)
- \[BUGFIX] PromQL: Fix collision in unary negation with non-overlapping series. [#17708](prometheus/prometheus#17708)
- \[BUGFIX] PromQL: Fix collision in label\_join and label\_replace with non-overlapping series. [#17703](prometheus/prometheus#17703)
- \[BUGFIX] PromQL: Fix bug with inconsistent results for queries with OR expression when experimental delayed name removal is enabled. [#17161](prometheus/prometheus#17161)
- \[BUGFIX] PromQL: Ensure that `rate`/`increase`/`delta` of histograms results in a gauge histogram. [#17608](prometheus/prometheus#17608)
- \[BUGFIX] PromQL: Do not panic while iterating over invalid histograms. [#17559](prometheus/prometheus#17559)
- \[BUGFIX] TSDB: Reject chunk files whose encoded chunk length overflows int. [#17533](prometheus/prometheus#17533)
- \[BUGFIX] TSDB: Do not panic during resolution reduction of invalid histograms. [#17561](prometheus/prometheus#17561)
- \[BUGFIX] Remote-write Receive: Avoid duplicate labels when experimental type-and-unit-label feature is enabled. [#17546](prometheus/prometheus#17546)
- \[BUGFIX] OTLP Receiver: Only write metadata to disk when experimental metadata-wal-records feature is enabled. [#17472](prometheus/prometheus#17472)
renovate bot added a commit to sdwilsh/ansible-playbooks that referenced this pull request Jan 10, 2026
##### [\`v3.9.1\`](https://github.com/prometheus/prometheus/releases/tag/v3.9.1)

- \[BUGFIX] Agent: fix crash shortly after startup from invalid type of object. [#17802](prometheus/prometheus#17802)
- \[BUGFIX] Scraping: fix relabel keep/drop not working. [#17807](prometheus/prometheus#17807)

---
##### [\`v3.9.0\`](https://github.com/prometheus/prometheus/releases/tag/v3.9.0)

#### Note for users of Native Histograms

In version 3.9, Native Histograms is no longer experimental, and the feature flag `native-histogram` has no effect.  You must now turn on
the config setting `scrape_native_histograms` to collect Native Histogram samples from exporters.

#### Changelog

- \[CHANGE] Native Histograms are no longer experimental! Make the `native-histogram` feature flag a no-op. Use `scrape_native_histograms` config option instead. [#17528](prometheus/prometheus#17528)
- \[CHANGE] API: Add maximum limit of 10,000 sets of statistics to TSDB status endpoint. [#17647](prometheus/prometheus#17647)
- \[FEATURE] API: Add /api/v1/features for clients to understand which features are supported. [#17427](prometheus/prometheus#17427)
- \[FEATURE] Promtool: Add `start_timestamp` field for unit tests. [#17636](prometheus/prometheus#17636)
- \[FEATURE] Promtool: Add `--format seriesjson` option to `tsdb dump` to output just series labels in JSON format. [#13409](prometheus/prometheus#13409)
- \[FEATURE] Add `--storage.tsdb.delay-compact-file.path` flag for better interoperability with Thanos. [#17435](prometheus/prometheus#17435)
- \[FEATURE] UI: Add an option on the query drop-down menu to duplicate that query panel. [#17714](prometheus/prometheus#17714)
- \[ENHANCEMENT]: TSDB: add flag `--storage.tsdb.block-reload-interval` to configure TSDB Block Reload Interval. [#16728](prometheus/prometheus#16728)
- \[ENHANCEMENT] UI: Add graph option to start the chart's Y axis at zero. [#17565](prometheus/prometheus#17565)
- \[ENHANCEMENT] Scraping: Classic protobuf format no longer requires the unit in the metric name. [#16834](prometheus/prometheus#16834)
- \[ENHANCEMENT] PromQL, Rules, SD, Scraping: Add native histograms to complement existing summaries. [#17374](prometheus/prometheus#17374)
- \[ENHANCEMENT] Notifications: Add a histogram `prometheus_notifications_latency_histogram_seconds` to complement the existing summary. [#16637](prometheus/prometheus#16637)
- \[ENHANCEMENT] Remote-write: Add custom scope support for AzureAD authentication. [#17483](prometheus/prometheus#17483)
- \[ENHANCEMENT] SD: add a `config` label with job name for most `prometheus_sd_refresh` metrics. [#17138](prometheus/prometheus#17138)
- \[ENHANCEMENT] TSDB: New histogram `prometheus_tsdb_sample_ooo_delta`, the distribution of out-of-order samples in seconds. Collected for all samples, accepted or not. [#17477](prometheus/prometheus#17477)
- \[ENHANCEMENT] Remote-read: Validate histograms received via remote-read. [#17561](prometheus/prometheus#17561)
- \[PERF] TSDB: Small optimizations to postings index. [#17439](prometheus/prometheus#17439)
- \[PERF] Scraping: Speed up relabelling of series. [#17530](prometheus/prometheus#17530)
- \[PERF] PromQL: Small optimisations in binary operators. [#17524](prometheus/prometheus#17524), [#17519](prometheus/prometheus#17519).
- \[BUGFIX] UI: PromQL autocomplete now shows the correct type and HELP text for OpenMetrics counters whose samples end in `_total`. [#17682](prometheus/prometheus#17682)
- \[BUGFIX] UI: Fixed codemirror-promql incorrectly showing label completion suggestions after the closing curly brace of a vector selector. [#17602](prometheus/prometheus#17602)
- \[BUGFIX] UI: Query editor no longer suggests a duration unit if one is already present after a number. [#17605](prometheus/prometheus#17605)
- \[BUGFIX] PromQL: Fix some "vector cannot contain metrics with the same labelset" errors when experimental delayed name removal is enabled. [#17678](prometheus/prometheus#17678)
- \[BUGFIX] PromQL: Fix possible corruption of PromQL text if the query had an empty `ignoring()` and non-empty grouping. [#17643](prometheus/prometheus#17643)
- \[BUGFIX] PromQL: Fix resets/changes to return empty results for anchored selectors when all samples are outside the range. [#17479](prometheus/prometheus#17479)
- \[BUGFIX] PromQL: Check more consistently for many-to-one matching in filter binary operators. [#17668](prometheus/prometheus#17668)
- \[BUGFIX] PromQL: Fix collision in unary negation with non-overlapping series. [#17708](prometheus/prometheus#17708)
- \[BUGFIX] PromQL: Fix collision in label\_join and label\_replace with non-overlapping series. [#17703](prometheus/prometheus#17703)
- \[BUGFIX] PromQL: Fix bug with inconsistent results for queries with OR expression when experimental delayed name removal is enabled. [#17161](prometheus/prometheus#17161)
- \[BUGFIX] PromQL: Ensure that `rate`/`increase`/`delta` of histograms results in a gauge histogram. [#17608](prometheus/prometheus#17608)
- \[BUGFIX] PromQL: Do not panic while iterating over invalid histograms. [#17559](prometheus/prometheus#17559)
- \[BUGFIX] TSDB: Reject chunk files whose encoded chunk length overflows int. [#17533](prometheus/prometheus#17533)
- \[BUGFIX] TSDB: Do not panic during resolution reduction of invalid histograms. [#17561](prometheus/prometheus#17561)
- \[BUGFIX] Remote-write Receive: Avoid duplicate labels when experimental type-and-unit-label feature is enabled. [#17546](prometheus/prometheus#17546)
- \[BUGFIX] OTLP Receiver: Only write metadata to disk when experimental metadata-wal-records feature is enabled. [#17472](prometheus/prometheus#17472)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants