feat(metrics): add config label to refresh metrics #17138
Merged
bboreham merged 1 commit into prometheus:main on Nov 13, 2025
hi @bboreham @machine424, when you get a moment do you mind taking a look at this PR? tagging you both as you helped in this (closed) PR 😄 #17069
Adds a `config` label (similar to `prometheus_sd_discovered_targets`) to refresh metrics to help identify the source of refresh issues or performance stats. In particular for HTTP SD, it can be common to have multiple disparate HTTP SD sources that should be identified and not lumped together. For example, if one HTTP SD service has failures, that should be evident in its own time series, separate from other HTTP SD sources.

`config` seemed more appropriate than `endpoint` as a general standard for `prometheus_sd` metrics.

Docs were also updated for HTTP SD to point at the new refresh metrics rather than the older metrics.

Signed-off-by: Will Bollock <wbollock@linode.com>
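To illustrate why the extra label helps, here is a minimal, standard-library-only sketch. It is not the Prometheus implementation: `failureCounter`, `labelKey` and the job names are hypothetical stand-ins for a counter such as `prometheus_sd_refresh_failures_total` labelled by `mechanism` and `config`.

```go
package main

import "fmt"

// labelKey identifies one time series by its label values.
// The label names mirror the PR; the type itself is hypothetical.
type labelKey struct {
	mechanism string // service discovery mechanism, e.g. "http"
	config    string // scrape job name
}

// failureCounter is a toy stand-in for a labelled failure counter.
type failureCounter struct {
	series map[labelKey]int
}

func newFailureCounter() *failureCounter {
	return &failureCounter{series: make(map[labelKey]int)}
}

// Inc records one refresh failure for the given mechanism and config.
func (c *failureCounter) Inc(mechanism, config string) {
	c.series[labelKey{mechanism: mechanism, config: config}]++
}

// Get returns the failure count of a single series (0 if it was never incremented).
func (c *failureCounter) Get(mechanism, config string) int {
	return c.series[labelKey{mechanism: mechanism, config: config}]
}

// TotalFor sums failures over every config of one mechanism. This is all
// that a counter without the config label would be able to report.
func (c *failureCounter) TotalFor(mechanism string) int {
	total := 0
	for k, v := range c.series {
		if k.mechanism == mechanism {
			total += v
		}
	}
	return total
}

func main() {
	c := newFailureCounter()
	// Two hypothetical HTTP SD scrape jobs; only one of them is failing often.
	c.Inc("http", "inventory-api")
	c.Inc("http", "inventory-api")
	c.Inc("http", "legacy-cmdb")

	fmt.Println("http total:", c.TotalFor("http"))
	fmt.Println("inventory-api:", c.Get("http", "inventory-api"))
	fmt.Println("legacy-cmdb:", c.Get("http", "legacy-cmdb"))
}
```

With only the per-mechanism total, the two sources are indistinguishable; with the per-config series, the failing source stands out.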
wbollock added a commit to wbollock/prometheus that referenced this pull request on Nov 13, 2025

Fixes a problem introduced when the merge of prometheus#17138 didn't take into account another merged PR:

```
discovery/aws/aws.go:218:54: too many arguments in call to NewEC2Discovery
	have (*EC2SDConfig, *slog.Logger, *ec2Metrics)
	want (*EC2SDConfig, discovery.DiscovererOptions)
discovery/aws/aws.go:222:66: too many arguments in call to NewLightsailDiscovery
	have (*LightsailSDConfig, *slog.Logger, *lightsailMetrics)
	want (*LightsailSDConfig, discovery.DiscovererOptions)
```
wbollock added a commit to wbollock/prometheus that referenced this pull request on Nov 13, 2025

ECS was a new service discovery tool added after this PR was merged: prometheus#17138

Aligns the style of passing a single "opts" to it, as almost all the other service discovery engines now do.
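The "single opts" style mentioned above can be sketched roughly as follows. This is an illustration only: `sdOptions`, `sdMetrics`, `ec2Config` and `newEC2` are hypothetical names, and the fields of the real `discovery.DiscovererOptions` are not reproduced here (the `SetName` field in particular is an assumption).

```go
package main

import (
	"errors"
	"fmt"
	"log/slog"
)

// sdMetrics is a hypothetical placeholder for a discoverer's metrics handle.
type sdMetrics struct{ name string }

// sdOptions bundles the dependencies a discoverer needs into one value, so a
// constructor signature does not change whenever a new dependency is added.
type sdOptions struct {
	Logger  *slog.Logger
	Metrics *sdMetrics
	SetName string // assumed field carrying the scrape job name
}

// ec2Config is a hypothetical, heavily simplified SD config.
type ec2Config struct{ Region string }

type ec2Discovery struct {
	cfg     *ec2Config
	logger  *slog.Logger
	metrics *sdMetrics
	setName string
}

// newEC2 takes the config plus a single options value instead of a growing
// list of positional arguments such as (cfg, logger, metrics).
func newEC2(cfg *ec2Config, opts sdOptions) (*ec2Discovery, error) {
	if opts.Metrics == nil {
		return nil, errors.New("metrics must be set")
	}
	logger := opts.Logger
	if logger == nil {
		logger = slog.Default() // fall back to the process-wide logger
	}
	return &ec2Discovery{cfg: cfg, logger: logger, metrics: opts.Metrics, setName: opts.SetName}, nil
}

func main() {
	d, err := newEC2(&ec2Config{Region: "us-east-1"}, sdOptions{
		Metrics: &sdMetrics{name: "ec2"},
		SetName: "ec2-prod",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(d.setName, d.cfg.Region)
}
```

Callers that still pass the old positional arguments fail to compile with a "too many arguments" error, which makes a missed call site easy to find.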
aknuds1 pushed a commit that referenced this pull request on Nov 16, 2025

* fix: aws discovery test fix

  Fixes a problem introduced when the merge of #17138 didn't take into account another merged PR:

  ```
  discovery/aws/aws.go:218:54: too many arguments in call to NewEC2Discovery
  	have (*EC2SDConfig, *slog.Logger, *ec2Metrics)
  	want (*EC2SDConfig, discovery.DiscovererOptions)
  discovery/aws/aws.go:222:66: too many arguments in call to NewLightsailDiscovery
  	have (*LightsailSDConfig, *slog.Logger, *lightsailMetrics)
  	want (*LightsailSDConfig, discovery.DiscovererOptions)
  ```

  Signed-off-by: Will Bollock <wbollock@linode.com>

* fix: align ecs style

  ECS was a new service discovery tool added after this PR was merged: #17138

  Aligns the style of passing a single "opts" to it, as almost all the other service discovery engines now do.

  Signed-off-by: Will Bollock <wbollock@linode.com>

---------

Signed-off-by: Will Bollock <wbollock@linode.com>
wbollock added a commit to wbollock/prometheus that referenced this pull request on Nov 21, 2025

Building off config-specific Prometheus refresh metrics from an earlier PR (prometheus#17138), this deletes refresh metrics like `prometheus_sd_refresh_duration_seconds` and `prometheus_sd_refresh_failures_total` when the underlying scrape job configuration is removed on reload. This reduces unneeded cardinality from scrape-job-specific metrics while still preserving metrics that indicate the overall health of a service discovery engine.

For example, `prometheus_sd_refresh_failures_total{config="linode-servers",mechanism="linode"} 1` will no longer be exported by Prometheus when the `linode-servers` scrape job for the Linode service provider is removed. The generic, service-discovery-specific `prometheus_sd_linode_failures_total` metric will persist, however.
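The cleanup described above can be modelled with a small standard-library-only sketch. It is a toy model under stated assumptions, not the code from that commit: `refreshFailures`, `seriesKey` and `Reload` are hypothetical names, and the two maps stand in for the per-config refresh counter and the generic per-mechanism counter.

```go
package main

import "fmt"

// seriesKey identifies one per-config series by its label values.
type seriesKey struct{ mechanism, config string }

// refreshFailures is a hypothetical model of two kinds of counters:
// per-config series (labelled by mechanism and config) and generic
// per-mechanism series that carry no config label.
type refreshFailures struct {
	perConfig    map[seriesKey]int
	perMechanism map[string]int
}

func newRefreshFailures() *refreshFailures {
	return &refreshFailures{
		perConfig:    make(map[seriesKey]int),
		perMechanism: make(map[string]int),
	}
}

// Inc records a failure in both the per-config and the generic counter.
func (m *refreshFailures) Inc(mechanism, config string) {
	m.perConfig[seriesKey{mechanism: mechanism, config: config}]++
	m.perMechanism[mechanism]++
}

// Reload drops every per-config series whose scrape job is no longer in the
// active set and returns how many series were removed. Generic
// per-mechanism counters are left untouched.
func (m *refreshFailures) Reload(active map[string]bool) int {
	removed := 0
	for k := range m.perConfig {
		if !active[k.config] {
			delete(m.perConfig, k) // deleting while ranging over a map is safe in Go
			removed++
		}
	}
	return removed
}

func main() {
	m := newRefreshFailures()
	m.Inc("linode", "linode-servers")
	m.Inc("http", "inventory-api")

	// After a reload, only the (hypothetical) inventory-api job remains configured.
	removed := m.Reload(map[string]bool{"inventory-api": true})
	fmt.Println("removed series:", removed)
	fmt.Println("per-config series left:", len(m.perConfig))
	fmt.Println("generic linode failures:", m.perMechanism["linode"])
}
```

The design choice modelled here is that only series keyed by a removed job disappear, so health signals for the mechanism as a whole survive a reload.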
renovate bot added a commit to sdwilsh/ansible-playbooks that referenced this pull request on Jan 10, 2026

##### [\`v3.9.1\`](https://github.com/prometheus/prometheus/releases/tag/v3.9.1)

- \[BUGFIX] Agent: fix crash shortly after startup from invalid type of object. [#17802](prometheus/prometheus#17802)
- \[BUGFIX] Scraping: fix relabel keep/drop not working. [#17807](prometheus/prometheus#17807)

---

##### [\`v3.9.0\`](https://github.com/prometheus/prometheus/releases/tag/v3.9.0)

#### Note for users of Native Histograms

In version 3.9, Native Histograms is no longer experimental, and the feature flag `native-histogram` has no effect. You must now turn on the config setting `scrape_native_histograms` to collect Native Histogram samples from exporters.

#### Changelog

- \[CHANGE] Native Histograms are no longer experimental! Make the `native-histogram` feature flag a no-op. Use `scrape_native_histograms` config option instead. [#17528](prometheus/prometheus#17528)
- \[CHANGE] API: Add maximum limit of 10,000 sets of statistics to TSDB status endpoint. [#17647](prometheus/prometheus#17647)
- \[FEATURE] API: Add /api/v1/features for clients to understand which features are supported. [#17427](prometheus/prometheus#17427)
- \[FEATURE] Promtool: Add `start_timestamp` field for unit tests. [#17636](prometheus/prometheus#17636)
- \[FEATURE] Promtool: Add `--format seriesjson` option to `tsdb dump` to output just series labels in JSON format. [#13409](prometheus/prometheus#13409)
- \[FEATURE] Add `--storage.tsdb.delay-compact-file.path` flag for better interoperability with Thanos. [#17435](prometheus/prometheus#17435)
- \[FEATURE] UI: Add an option on the query drop-down menu to duplicate that query panel. [#17714](prometheus/prometheus#17714)
- \[ENHANCEMENT]: TSDB: add flag `--storage.tsdb.block-reload-interval` to configure TSDB Block Reload Interval. [#16728](prometheus/prometheus#16728)
- \[ENHANCEMENT] UI: Add graph option to start the chart's Y axis at zero. [#17565](prometheus/prometheus#17565)
- \[ENHANCEMENT] Scraping: Classic protobuf format no longer requires the unit in the metric name. [#16834](prometheus/prometheus#16834)
- \[ENHANCEMENT] PromQL, Rules, SD, Scraping: Add native histograms to complement existing summaries. [#17374](prometheus/prometheus#17374)
- \[ENHANCEMENT] Notifications: Add a histogram `prometheus_notifications_latency_histogram_seconds` to complement the existing summary. [#16637](prometheus/prometheus#16637)
- \[ENHANCEMENT] Remote-write: Add custom scope support for AzureAD authentication. [#17483](prometheus/prometheus#17483)
- \[ENHANCEMENT] SD: add a `config` label with job name for most `prometheus_sd_refresh` metrics. [#17138](prometheus/prometheus#17138)
- \[ENHANCEMENT] TSDB: New histogram `prometheus_tsdb_sample_ooo_delta`, the distribution of out-of-order samples in seconds. Collected for all samples, accepted or not. [#17477](prometheus/prometheus#17477)
- \[ENHANCEMENT] Remote-read: Validate histograms received via remote-read. [#17561](prometheus/prometheus#17561)
- \[PERF] TSDB: Small optimizations to postings index. [#17439](prometheus/prometheus#17439)
- \[PERF] Scraping: Speed up relabelling of series. [#17530](prometheus/prometheus#17530)
- \[PERF] PromQL: Small optimisations in binary operators. [#17524](prometheus/prometheus#17524), [#17519](prometheus/prometheus#17519).
- \[BUGFIX] UI: PromQL autocomplete now shows the correct type and HELP text for OpenMetrics counters whose samples end in `_total`. [#17682](prometheus/prometheus#17682)
- \[BUGFIX] UI: Fixed codemirror-promql incorrectly showing label completion suggestions after the closing curly brace of a vector selector. [#17602](prometheus/prometheus#17602)
- \[BUGFIX] UI: Query editor no longer suggests a duration unit if one is already present after a number. [#17605](prometheus/prometheus#17605)
- \[BUGFIX] PromQL: Fix some "vector cannot contain metrics with the same labelset" errors when experimental delayed name removal is enabled. [#17678](prometheus/prometheus#17678)
- \[BUGFIX] PromQL: Fix possible corruption of PromQL text if the query had an empty `ignoring()` and non-empty grouping. [#17643](prometheus/prometheus#17643)
- \[BUGFIX] PromQL: Fix resets/changes to return empty results for anchored selectors when all samples are outside the range. [#17479](prometheus/prometheus#17479)
- \[BUGFIX] PromQL: Check more consistently for many-to-one matching in filter binary operators. [#17668](prometheus/prometheus#17668)
- \[BUGFIX] PromQL: Fix collision in unary negation with non-overlapping series. [#17708](prometheus/prometheus#17708)
- \[BUGFIX] PromQL: Fix collision in label\_join and label\_replace with non-overlapping series. [#17703](prometheus/prometheus#17703)
- \[BUGFIX] PromQL: Fix bug with inconsistent results for queries with OR expression when experimental delayed name removal is enabled. [#17161](prometheus/prometheus#17161)
- \[BUGFIX] PromQL: Ensure that `rate`/`increase`/`delta` of histograms results in a gauge histogram. [#17608](prometheus/prometheus#17608)
- \[BUGFIX] PromQL: Do not panic while iterating over invalid histograms. [#17559](prometheus/prometheus#17559)
- \[BUGFIX] TSDB: Reject chunk files whose encoded chunk length overflows int. [#17533](prometheus/prometheus#17533)
- \[BUGFIX] TSDB: Do not panic during resolution reduction of invalid histograms. [#17561](prometheus/prometheus#17561)
- \[BUGFIX] Remote-write Receive: Avoid duplicate labels when experimental type-and-unit-label feature is enabled. [#17546](prometheus/prometheus#17546)
- \[BUGFIX] OTLP Receiver: Only write metadata to disk when experimental metadata-wal-records feature is enabled. [#17472](prometheus/prometheus#17472)
Adds a `config` label (similar to `prometheus_sd_discovered_targets`) to refresh metrics (`prometheus_sd_refresh_duration_seconds` and `prometheus_sd_refresh_failures_total`) to help identify the source of refresh issues or performance stats. In particular for HTTP SD, it can be common to have multiple disparate HTTP SD sources that should be identified and not lumped together. For example, if one HTTP SD service has failures, that should be evident in its own time series, separate from other HTTP SD sources.

The same arguments could be made for other service discovery providers. You may have jobs with entirely different settings, such as different API tokens or configurations, that should be separated from each other. Or even testing scrape jobs that shouldn't have the same urgency of failure as production scrape jobs.

`config` seemed more appropriate than `endpoint` as a general standard for `prometheus_sd` metrics.

Docs were also updated for HTTP SD to point at the new refresh metrics rather than the older metrics.

🗒️ Note: this will add roughly 6 new time series per job. I believe the extra cardinality is worth it as the data is a lot more useful. We could also compromise by deleting the legacy per-service discovery metrics that are duplicated by `prometheus_sd_refresh*` metrics, which would help a little bit.

Example:
For this config:
There's also maybe something to be improved here with unregistering scrape job metrics for jobs that no longer exist. These will still be retained after a scrape job is gone on a reload (edit: this is also the current behavior, unrelated to this PR):