Skip to content

feat(metrics): add config label to refresh metrics#17138

Merged
bboreham merged 1 commit intoprometheus:mainfrom
wbollock:feat/prometheus_refresh_config_label
Nov 13, 2025
Merged

feat(metrics): add config label to refresh metrics#17138
bboreham merged 1 commit intoprometheus:mainfrom
wbollock:feat/prometheus_refresh_config_label

Conversation

@wbollock
Copy link
Contributor

@wbollock wbollock commented Sep 3, 2025

Adds a config label (similar to prometheus_sd_discovered_targets) to refresh metrics (prometheus_sd_refresh_duration_seconds and prometheus_sd_refresh_failures_total) to help identify the source of refresh issues or performance stats. In particular for HTTP SD, it can be common to have multiple disparate HTTP SD sources that should be identified and not lumped together. For example if one HTTP SD service has failures, that should be evident in its own time series separate from other HTTP SD sources.

The same arguments could be made for other service discovery providers. You may have jobs with entirely different settings - different API tokens or configurations that should be separated from each other. Or even testing scrape jobs that shouldn't have the same urgency of failure as production scrape jobs.

config seemed more appropriate than endpoint as a general standard for prometheus_sd metrics.

Docs were also updated for HTTP SD to point at the new refresh metrics rather than the older metrics.

⚠️ I did not update kuma, file_sd, kubernetes, zookeeper, or consul as they don't use the new refresh metrics entirely and require more work to integrate into those metrics.

  • add the config label for each service discovery endpoint
  • tests
[ENHANCEMENT] adds a `config` label indicating specific job names for most `prometheus_sd_refresh_duration_seconds` and `prometheus_sd_refresh_failures_total` metrics

🗒️ Note: this will be roughly around ~6 new time series per job. I believe the extra cardinality is worth it as the data is a lot more useful. We could also compromise by deleting the legacy per-service discovery metrics that are duplicated by prometheus_sd_refresh* metrics that would help a little bit.

Example:

prometheus_sd_refresh_duration_seconds{config="linode-nodes",mechanism="linode",quantile="0.5"} 0.199821177
prometheus_sd_refresh_duration_seconds{config="linode-nodes",mechanism="linode",quantile="0.9"} 0.199821177
prometheus_sd_refresh_duration_seconds{config="linode-nodes",mechanism="linode",quantile="0.99"} 0.199821177
prometheus_sd_refresh_duration_seconds_sum{config="linode-nodes",mechanism="linode"} 0.199821177
prometheus_sd_refresh_duration_seconds_count{config="linode-nodes",mechanism="linode"} 1

For this config:

<snip>
  - job_name: 'linode-nodes'
    linode_sd_configs:
      - authorization:
          credentials: "xxx"
        region: "us-east"

There's also maybe something to be improved here with unregistering scrape job metrics for jobs that no longer exist. These will still be retained after a scrape job is gone on a reload (edit: this is current behavior too unrelated to this PR):

net_conntrack_dialer_conn_attempted_total{dialer_name="aws-lightsail"} 0
net_conntrack_dialer_conn_closed_total{dialer_name="aws-lightsail"} 0
net_conntrack_dialer_conn_established_total{dialer_name="aws-lightsail"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="aws-lightsail",reason="refused"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="aws-lightsail",reason="resolution"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="aws-lightsail",reason="timeout"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="aws-lightsail",reason="unknown"} 0
prometheus_sd_discovered_targets{config="aws-lightsail",name="scrape"} 0
prometheus_sd_refresh_duration_seconds{config="aws-lightsail",mechanism="ec2",quantile="0.5"} 0.057529291
prometheus_sd_refresh_duration_seconds{config="aws-lightsail",mechanism="ec2",quantile="0.9"} 0.061284775
prometheus_sd_refresh_duration_seconds{config="aws-lightsail",mechanism="ec2",quantile="0.99"} 0.061284775
prometheus_sd_refresh_duration_seconds_sum{config="aws-lightsail",mechanism="ec2"} 0.118814066
prometheus_sd_refresh_duration_seconds_count{config="aws-lightsail",mechanism="ec2"} 2
prometheus_sd_refresh_failures_total{config="aws-lightsail",mechanism="ec2"} 2

@wbollock wbollock force-pushed the feat/prometheus_refresh_config_label branch 3 times, most recently from 00d8bf0 to be2feea Compare September 5, 2025 16:08
@wbollock wbollock marked this pull request as ready for review September 5, 2025 16:33
@wbollock wbollock changed the title [WIP] feat(metrics): add config label to refresh metrics feat(metrics): add config label to refresh metrics Sep 5, 2025
@wbollock
Copy link
Contributor Author

hi @bboreham @machine424, when you get a moment do you mind taking a look at this PR? tagging you both as you helped in this (closed) PR 😄 #17069

Copy link
Member

@bboreham bboreham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial thoughts.

@wbollock wbollock force-pushed the feat/prometheus_refresh_config_label branch 3 times, most recently from 88b9df6 to 6c26d8e Compare October 1, 2025 21:34
@wbollock wbollock requested a review from bboreham October 2, 2025 14:38
@wbollock wbollock force-pushed the feat/prometheus_refresh_config_label branch 3 times, most recently from 3837ded to 16d5eb3 Compare October 8, 2025 16:38
Adds a `config` label (similar to `prometheus_sd_discovered_targets`) to
refresh metrics to help identify the source of refresh issues or
performance stats. In particular for HTTP SD, it can be common to have
multiple disparate HTTP SD sources that should be identified and not
lumped together. For example if one HTTP SD service has failures, that
should be evident in its own time series seperate from other HTTP SD
sources.

`config` seemed more appropriate than `endpoint` as a general standard
for `prometheus_sd` metrics.

Docs were also updated for HTTP SD to point at the new refresh metrics
rather than the older metrics.

Signed-off-by: Will Bollock <wbollock@linode.com>
@wbollock wbollock force-pushed the feat/prometheus_refresh_config_label branch from 16d5eb3 to e894a22 Compare October 14, 2025 15:36
Copy link
Member

@bboreham bboreham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok to proceed, thanks.

@bboreham bboreham merged commit e02a65b into prometheus:main Nov 13, 2025
28 checks passed
wbollock added a commit to wbollock/prometheus that referenced this pull request Nov 13, 2025
Fixes a problem introduced after the merge of this prometheus#17138

PR didn't take into account another merged PR!

```
discovery/aws/aws.go:218:54: too many arguments in call to NewEC2Discovery
    have (*EC2SDConfig, *slog.Logger, *ec2Metrics)
    want (*EC2SDConfig, discovery.DiscovererOptions)
discovery/aws/aws.go:222:66: too many arguments in call to NewLightsailDiscovery
    have (*LightsailSDConfig, *slog.Logger, *lightsailMetrics)
    want (*LightsailSDConfig, discovery.DiscovererOptions)
```
wbollock added a commit to wbollock/prometheus that referenced this pull request Nov 13, 2025
ECS was a new service discovery tool added after this PR was merged: prometheus#17138

Aligns the style of passing a single "opts" to it like almost all the other
service discovery engines now use
wbollock added a commit to wbollock/prometheus that referenced this pull request Nov 13, 2025
Fixes a problem introduced after the merge of this prometheus#17138

PR didn't take into account another merged PR!

```
discovery/aws/aws.go:218:54: too many arguments in call to NewEC2Discovery
    have (*EC2SDConfig, *slog.Logger, *ec2Metrics)
    want (*EC2SDConfig, discovery.DiscovererOptions)
discovery/aws/aws.go:222:66: too many arguments in call to NewLightsailDiscovery
    have (*LightsailSDConfig, *slog.Logger, *lightsailMetrics)
    want (*LightsailSDConfig, discovery.DiscovererOptions)
```

Signed-off-by: Will Bollock <wbollock@linode.com>
wbollock added a commit to wbollock/prometheus that referenced this pull request Nov 13, 2025
ECS was a new service discovery tool added after this PR was merged: prometheus#17138

Aligns the style of passing a single "opts" to it like almost all the other
service discovery engines now use

Signed-off-by: Will Bollock <wbollock@linode.com>
wbollock added a commit to wbollock/prometheus that referenced this pull request Nov 13, 2025
Fixes a problem introduced after the merge of this prometheus#17138

PR didn't take into account another merged PR!

```
discovery/aws/aws.go:218:54: too many arguments in call to NewEC2Discovery
    have (*EC2SDConfig, *slog.Logger, *ec2Metrics)
    want (*EC2SDConfig, discovery.DiscovererOptions)
discovery/aws/aws.go:222:66: too many arguments in call to NewLightsailDiscovery
    have (*LightsailSDConfig, *slog.Logger, *lightsailMetrics)
    want (*LightsailSDConfig, discovery.DiscovererOptions)
```

Signed-off-by: Will Bollock <wbollock@linode.com>
wbollock added a commit to wbollock/prometheus that referenced this pull request Nov 13, 2025
ECS was a new service discovery tool added after this PR was merged: prometheus#17138

Aligns the style of passing a single "opts" to it like almost all the other
service discovery engines now use

Signed-off-by: Will Bollock <wbollock@linode.com>
aknuds1 pushed a commit that referenced this pull request Nov 16, 2025
* fix: aws discovery test fix

Fixes a problem introduced after the merge of this #17138

PR didn't take into account another merged PR!

```
discovery/aws/aws.go:218:54: too many arguments in call to NewEC2Discovery
    have (*EC2SDConfig, *slog.Logger, *ec2Metrics)
    want (*EC2SDConfig, discovery.DiscovererOptions)
discovery/aws/aws.go:222:66: too many arguments in call to NewLightsailDiscovery
    have (*LightsailSDConfig, *slog.Logger, *lightsailMetrics)
    want (*LightsailSDConfig, discovery.DiscovererOptions)
```

Signed-off-by: Will Bollock <wbollock@linode.com>

* fix: align ecs style

ECS was a new service discovery tool added after this PR was merged: #17138

Aligns the style of passing a single "opts" to it like almost all the other
service discovery engines now use

Signed-off-by: Will Bollock <wbollock@linode.com>

---------

Signed-off-by: Will Bollock <wbollock@linode.com>
@wbollock wbollock deleted the feat/prometheus_refresh_config_label branch November 17, 2025 17:26
wbollock added a commit to wbollock/prometheus that referenced this pull request Nov 21, 2025
Building off config-specific Prometheus refresh metrics from an earlier
PR (prometheus#17138), this deletes
refresh metrics like `prometheus_sd_refresh_duration_seconds` and
`prometheus_sd_refresh_failures_total` when the underlying scrape job
configuration is removed on reload. This reduces un-needed cardinality
from scrape job specific metrics while still preserving metrics that
indicate overall health of a service discovery engine.

For example,
`prometheus_sd_refresh_failures_total{config="linode-servers",mechanism="linode"} 1`
will no longer be exported by Prometheus when the `linode-servers`
scrape job for the Linode service provider is removed. The generic,
service discovery specific `prometheus_sd_linode_failures_total` metric
will persist however.
wbollock added a commit to wbollock/prometheus that referenced this pull request Nov 25, 2025
Building off config-specific Prometheus refresh metrics from an earlier
PR (prometheus#17138), this deletes
refresh metrics like `prometheus_sd_refresh_duration_seconds` and
`prometheus_sd_refresh_failures_total` when the underlying scrape job
configuration is removed on reload. This reduces un-needed cardinality
from scrape job specific metrics while still preserving metrics that
indicate overall health of a service discovery engine.

For example,
`prometheus_sd_refresh_failures_total{config="linode-servers",mechanism="linode"} 1`
will no longer be exported by Prometheus when the `linode-servers`
scrape job for the Linode service provider is removed. The generic,
service discovery specific `prometheus_sd_linode_failures_total` metric
will persist however.
wbollock added a commit to wbollock/prometheus that referenced this pull request Nov 25, 2025
Building off config-specific Prometheus refresh metrics from an earlier
PR (prometheus#17138), this deletes
refresh metrics like `prometheus_sd_refresh_duration_seconds` and
`prometheus_sd_refresh_failures_total` when the underlying scrape job
configuration is removed on reload. This reduces un-needed cardinality
from scrape job specific metrics while still preserving metrics that
indicate overall health of a service discovery engine.

For example,
`prometheus_sd_refresh_failures_total{config="linode-servers",mechanism="linode"} 1`
will no longer be exported by Prometheus when the `linode-servers`
scrape job for the Linode service provider is removed. The generic,
service discovery specific `prometheus_sd_linode_failures_total` metric
will persist however.
wbollock added a commit to wbollock/prometheus that referenced this pull request Nov 25, 2025
Building off config-specific Prometheus refresh metrics from an earlier
PR (prometheus#17138), this deletes
refresh metrics like `prometheus_sd_refresh_duration_seconds` and
`prometheus_sd_refresh_failures_total` when the underlying scrape job
configuration is removed on reload. This reduces un-needed cardinality
from scrape job specific metrics while still preserving metrics that
indicate overall health of a service discovery engine.

For example,
`prometheus_sd_refresh_failures_total{config="linode-servers",mechanism="linode"} 1`
will no longer be exported by Prometheus when the `linode-servers`
scrape job for the Linode service provider is removed. The generic,
service discovery specific `prometheus_sd_linode_failures_total` metric
will persist however.

Signed-off-by: Will Bollock <wbollock@linode.com>
wbollock added a commit to wbollock/prometheus that referenced this pull request Nov 25, 2025
Building off config-specific Prometheus refresh metrics from an earlier
PR (prometheus#17138), this deletes
refresh metrics like `prometheus_sd_refresh_duration_seconds` and
`prometheus_sd_refresh_failures_total` when the underlying scrape job
configuration is removed on reload. This reduces un-needed cardinality
from scrape job specific metrics while still preserving metrics that
indicate overall health of a service discovery engine.

For example,
`prometheus_sd_refresh_failures_total{config="linode-servers",mechanism="linode"} 1`
will no longer be exported by Prometheus when the `linode-servers`
scrape job for the Linode service provider is removed. The generic,
service discovery specific `prometheus_sd_linode_failures_total` metric
will persist however.

Signed-off-by: Will Bollock <wbollock@linode.com>
renovate bot added a commit to sdwilsh/ansible-playbooks that referenced this pull request Jan 10, 2026
##### [\`v3.9.0\`](https://github.com/prometheus/prometheus/releases/tag/v3.9.0)

#### Note for users of Native Histograms

In version 3.9, Native Histograms is no longer experimental, and the feature flag `native-histogram` has no effect.  You must now turn on
the config setting `scrape_native_histograms` to collect Native Histogram samples from exporters.

#### Changelog

- \[CHANGE] Native Histograms are no longer experimental! Make the `native-histogram` feature flag a no-op. Use `scrape_native_histograms` config option instead. [#17528](prometheus/prometheus#17528)
- \[CHANGE] API: Add maximum limit of 10,000 sets of statistics to TSDB status endpoint. [#17647](prometheus/prometheus#17647)
- \[FEATURE] API: Add /api/v1/features for clients to understand which features are supported. [#17427](prometheus/prometheus#17427)
- \[FEATURE] Promtool: Add `start_timestamp` field for unit tests. [#17636](prometheus/prometheus#17636)
- \[FEATURE] Promtool: Add `--format seriesjson` option to `tsdb dump` to output just series labels in JSON format. [#13409](prometheus/prometheus#13409)
- \[FEATURE] Add `--storage.tsdb.delay-compact-file.path` flag for better interoperability with Thanos. [#17435](prometheus/prometheus#17435)
- \[FEATURE] UI: Add an option on the query drop-down menu to duplicate that query panel. [#17714](prometheus/prometheus#17714)
- \[ENHANCEMENT]: TSDB: add flag `--storage.tsdb.block-reload-interval` to configure TSDB Block Reload Interval. [#16728](prometheus/prometheus#16728)
- \[ENHANCEMENT] UI: Add graph option to start the chart's Y axis at zero. [#17565](prometheus/prometheus#17565)
- \[ENHANCEMENT] Scraping: Classic protobuf format no longer requires the unit in the metric name. [#16834](prometheus/prometheus#16834)
- \[ENHANCEMENT] PromQL, Rules, SD, Scraping: Add native histograms to complement existing summaries. [#17374](prometheus/prometheus#17374)
- \[ENHANCEMENT] Notifications: Add a histogram `prometheus_notifications_latency_histogram_seconds` to complement the existing summary. [#16637](prometheus/prometheus#16637)
- \[ENHANCEMENT] Remote-write: Add custom scope support for AzureAD authentication. [#17483](prometheus/prometheus#17483)
- \[ENHANCEMENT] SD: add a `config` label with job name for most `prometheus_sd_refresh` metrics. [#17138](prometheus/prometheus#17138)
- \[ENHANCEMENT] TSDB: New histogram `prometheus_tsdb_sample_ooo_delta`, the distribution of out-of-order samples in seconds. Collected for all samples, accepted or not. [#17477](prometheus/prometheus#17477)
- \[ENHANCEMENT] Remote-read: Validate histograms received via remote-read. [#17561](prometheus/prometheus#17561)
- \[PERF] TSDB: Small optimizations to postings index. [#17439](prometheus/prometheus#17439)
- \[PERF] Scraping: Speed up relabelling of series. [#17530](prometheus/prometheus#17530)
- \[PERF] PromQL: Small optimisations in binary operators. [#17524](prometheus/prometheus#17524), [#17519](prometheus/prometheus#17519).
- \[BUGFIX] UI: PromQL autocomplete now shows the correct type and HELP text for OpenMetrics counters whose samples end in `_total`. [#17682](prometheus/prometheus#17682)
- \[BUGFIX] UI: Fixed codemirror-promql incorrectly showing label completion suggestions after the closing curly brace of a vector selector. [#17602](prometheus/prometheus#17602)
- \[BUGFIX] UI: Query editor no longer suggests a duration unit if one is already present after a number. [#17605](prometheus/prometheus#17605)
- \[BUGFIX] PromQL: Fix some "vector cannot contain metrics with the same labelset" errors when experimental delayed name removal is enabled. [#17678](prometheus/prometheus#17678)
- \[BUGFIX] PromQL: Fix possible corruption of PromQL text if the query had an empty `ignoring()` and non-empty grouping. [#17643](prometheus/prometheus#17643)
- \[BUGFIX] PromQL: Fix resets/changes to return empty results for anchored selectors when all samples are outside the range. [#17479](prometheus/prometheus#17479)
- \[BUGFIX] PromQL: Check more consistently for many-to-one matching in filter binary operators. [#17668](prometheus/prometheus#17668)
- \[BUGFIX] PromQL: Fix collision in unary negation with non-overlapping series. [#17708](prometheus/prometheus#17708)
- \[BUGFIX] PromQL: Fix collision in label\_join and label\_replace with non-overlapping series. [#17703](prometheus/prometheus#17703)
- \[BUGFIX] PromQL: Fix bug with inconsistent results for queries with OR expression when experimental delayed name removal is enabled. [#17161](prometheus/prometheus#17161)
- \[BUGFIX] PromQL: Ensure that `rate`/`increase`/`delta` of histograms results in a gauge histogram. [#17608](prometheus/prometheus#17608)
- \[BUGFIX] PromQL: Do not panic while iterating over invalid histograms. [#17559](prometheus/prometheus#17559)
- \[BUGFIX] TSDB: Reject chunk files whose encoded chunk length overflows int. [#17533](prometheus/prometheus#17533)
- \[BUGFIX] TSDB: Do not panic during resolution reduction of invalid histograms. [#17561](prometheus/prometheus#17561)
- \[BUGFIX] Remote-write Receive: Avoid duplicate labels when experimental type-and-unit-label feature is enabled. [#17546](prometheus/prometheus#17546)
- \[BUGFIX] OTLP Receiver: Only write metadata to disk when experimental metadata-wal-records feature is enabled. [#17472](prometheus/prometheus#17472)
renovate bot added a commit to sdwilsh/ansible-playbooks that referenced this pull request Jan 10, 2026
##### [\`v3.9.1\`](https://github.com/prometheus/prometheus/releases/tag/v3.9.1)

- \[BUGFIX] Agent: fix crash shortly after startup from invalid type of object. [#17802](prometheus/prometheus#17802)
- \[BUGFIX] Scraping: fix relabel keep/drop not working. [#17807](prometheus/prometheus#17807)

---
##### [\`v3.9.0\`](https://github.com/prometheus/prometheus/releases/tag/v3.9.0)

#### Note for users of Native Histograms

In version 3.9, Native Histograms is no longer experimental, and the feature flag `native-histogram` has no effect.  You must now turn on
the config setting `scrape_native_histograms` to collect Native Histogram samples from exporters.

#### Changelog

- \[CHANGE] Native Histograms are no longer experimental! Make the `native-histogram` feature flag a no-op. Use `scrape_native_histograms` config option instead. [#17528](prometheus/prometheus#17528)
- \[CHANGE] API: Add maximum limit of 10,000 sets of statistics to TSDB status endpoint. [#17647](prometheus/prometheus#17647)
- \[FEATURE] API: Add /api/v1/features for clients to understand which features are supported. [#17427](prometheus/prometheus#17427)
- \[FEATURE] Promtool: Add `start_timestamp` field for unit tests. [#17636](prometheus/prometheus#17636)
- \[FEATURE] Promtool: Add `--format seriesjson` option to `tsdb dump` to output just series labels in JSON format. [#13409](prometheus/prometheus#13409)
- \[FEATURE] Add `--storage.tsdb.delay-compact-file.path` flag for better interoperability with Thanos. [#17435](prometheus/prometheus#17435)
- \[FEATURE] UI: Add an option on the query drop-down menu to duplicate that query panel. [#17714](prometheus/prometheus#17714)
- \[ENHANCEMENT]: TSDB: add flag `--storage.tsdb.block-reload-interval` to configure TSDB Block Reload Interval. [#16728](prometheus/prometheus#16728)
- \[ENHANCEMENT] UI: Add graph option to start the chart's Y axis at zero. [#17565](prometheus/prometheus#17565)
- \[ENHANCEMENT] Scraping: Classic protobuf format no longer requires the unit in the metric name. [#16834](prometheus/prometheus#16834)
- \[ENHANCEMENT] PromQL, Rules, SD, Scraping: Add native histograms to complement existing summaries. [#17374](prometheus/prometheus#17374)
- \[ENHANCEMENT] Notifications: Add a histogram `prometheus_notifications_latency_histogram_seconds` to complement the existing summary. [#16637](prometheus/prometheus#16637)
- \[ENHANCEMENT] Remote-write: Add custom scope support for AzureAD authentication. [#17483](prometheus/prometheus#17483)
- \[ENHANCEMENT] SD: add a `config` label with job name for most `prometheus_sd_refresh` metrics. [#17138](prometheus/prometheus#17138)
- \[ENHANCEMENT] TSDB: New histogram `prometheus_tsdb_sample_ooo_delta`, the distribution of out-of-order samples in seconds. Collected for all samples, accepted or not. [#17477](prometheus/prometheus#17477)
- \[ENHANCEMENT] Remote-read: Validate histograms received via remote-read. [#17561](prometheus/prometheus#17561)
- \[PERF] TSDB: Small optimizations to postings index. [#17439](prometheus/prometheus#17439)
- \[PERF] Scraping: Speed up relabelling of series. [#17530](prometheus/prometheus#17530)
- \[PERF] PromQL: Small optimisations in binary operators. [#17524](prometheus/prometheus#17524), [#17519](prometheus/prometheus#17519).
- \[BUGFIX] UI: PromQL autocomplete now shows the correct type and HELP text for OpenMetrics counters whose samples end in `_total`. [#17682](prometheus/prometheus#17682)
- \[BUGFIX] UI: Fixed codemirror-promql incorrectly showing label completion suggestions after the closing curly brace of a vector selector. [#17602](prometheus/prometheus#17602)
- \[BUGFIX] UI: Query editor no longer suggests a duration unit if one is already present after a number. [#17605](prometheus/prometheus#17605)
- \[BUGFIX] PromQL: Fix some "vector cannot contain metrics with the same labelset" errors when experimental delayed name removal is enabled. [#17678](prometheus/prometheus#17678)
- \[BUGFIX] PromQL: Fix possible corruption of PromQL text if the query had an empty `ignoring()` and non-empty grouping. [#17643](prometheus/prometheus#17643)
- \[BUGFIX] PromQL: Fix resets/changes to return empty results for anchored selectors when all samples are outside the range. [#17479](prometheus/prometheus#17479)
- \[BUGFIX] PromQL: Check more consistently for many-to-one matching in filter binary operators. [#17668](prometheus/prometheus#17668)
- \[BUGFIX] PromQL: Fix collision in unary negation with non-overlapping series. [#17708](prometheus/prometheus#17708)
- \[BUGFIX] PromQL: Fix collision in label\_join and label\_replace with non-overlapping series. [#17703](prometheus/prometheus#17703)
- \[BUGFIX] PromQL: Fix bug with inconsistent results for queries with OR expression when experimental delayed name removal is enabled. [#17161](prometheus/prometheus#17161)
- \[BUGFIX] PromQL: Ensure that `rate`/`increase`/`delta` of histograms results in a gauge histogram. [#17608](prometheus/prometheus#17608)
- \[BUGFIX] PromQL: Do not panic while iterating over invalid histograms. [#17559](prometheus/prometheus#17559)
- \[BUGFIX] TSDB: Reject chunk files whose encoded chunk length overflows int. [#17533](prometheus/prometheus#17533)
- \[BUGFIX] TSDB: Do not panic during resolution reduction of invalid histograms. [#17561](prometheus/prometheus#17561)
- \[BUGFIX] Remote-write Receive: Avoid duplicate labels when experimental type-and-unit-label feature is enabled. [#17546](prometheus/prometheus#17546)
- \[BUGFIX] OTLP Receiver: Only write metadata to disk when experimental metadata-wal-records feature is enabled. [#17472](prometheus/prometheus#17472)
renovate bot added a commit to sdwilsh/ansible-playbooks that referenced this pull request Jan 10, 2026
##### [\`v3.9.1\`](https://github.com/prometheus/prometheus/releases/tag/v3.9.1)

- \[BUGFIX] Agent: fix crash shortly after startup from invalid type of object. [#17802](prometheus/prometheus#17802)
- \[BUGFIX] Scraping: fix relabel keep/drop not working. [#17807](prometheus/prometheus#17807)

---
##### [\`v3.9.0\`](https://github.com/prometheus/prometheus/releases/tag/v3.9.0)

#### Note for users of Native Histograms

In version 3.9, Native Histograms is no longer experimental, and the feature flag `native-histogram` has no effect.  You must now turn on
the config setting `scrape_native_histograms` to collect Native Histogram samples from exporters.

#### Changelog

- \[CHANGE] Native Histograms are no longer experimental! Make the `native-histogram` feature flag a no-op. Use `scrape_native_histograms` config option instead. [#17528](prometheus/prometheus#17528)
- \[CHANGE] API: Add maximum limit of 10,000 sets of statistics to TSDB status endpoint. [#17647](prometheus/prometheus#17647)
- \[FEATURE] API: Add /api/v1/features for clients to understand which features are supported. [#17427](prometheus/prometheus#17427)
- \[FEATURE] Promtool: Add `start_timestamp` field for unit tests. [#17636](prometheus/prometheus#17636)
- \[FEATURE] Promtool: Add `--format seriesjson` option to `tsdb dump` to output just series labels in JSON format. [#13409](prometheus/prometheus#13409)
- \[FEATURE] Add `--storage.tsdb.delay-compact-file.path` flag for better interoperability with Thanos. [#17435](prometheus/prometheus#17435)
- \[FEATURE] UI: Add an option on the query drop-down menu to duplicate that query panel. [#17714](prometheus/prometheus#17714)
- \[ENHANCEMENT]: TSDB: add flag `--storage.tsdb.block-reload-interval` to configure TSDB Block Reload Interval. [#16728](prometheus/prometheus#16728)
- \[ENHANCEMENT] UI: Add graph option to start the chart's Y axis at zero. [#17565](prometheus/prometheus#17565)
- \[ENHANCEMENT] Scraping: Classic protobuf format no longer requires the unit in the metric name. [#16834](prometheus/prometheus#16834)
- \[ENHANCEMENT] PromQL, Rules, SD, Scraping: Add native histograms to complement existing summaries. [#17374](prometheus/prometheus#17374)
- \[ENHANCEMENT] Notifications: Add a histogram `prometheus_notifications_latency_histogram_seconds` to complement the existing summary. [#16637](prometheus/prometheus#16637)
- \[ENHANCEMENT] Remote-write: Add custom scope support for AzureAD authentication. [#17483](prometheus/prometheus#17483)
- \[ENHANCEMENT] SD: add a `config` label with job name for most `prometheus_sd_refresh` metrics. [#17138](prometheus/prometheus#17138)
- \[ENHANCEMENT] TSDB: New histogram `prometheus_tsdb_sample_ooo_delta`, the distribution of out-of-order samples in seconds. Collected for all samples, accepted or not. [#17477](prometheus/prometheus#17477)
- \[ENHANCEMENT] Remote-read: Validate histograms received via remote-read. [#17561](prometheus/prometheus#17561)
- \[PERF] TSDB: Small optimizations to postings index. [#17439](prometheus/prometheus#17439)
- \[PERF] Scraping: Speed up relabelling of series. [#17530](prometheus/prometheus#17530)
- \[PERF] PromQL: Small optimisations in binary operators. [#17524](prometheus/prometheus#17524), [#17519](prometheus/prometheus#17519).
- \[BUGFIX] UI: PromQL autocomplete now shows the correct type and HELP text for OpenMetrics counters whose samples end in `_total`. [#17682](prometheus/prometheus#17682)
- \[BUGFIX] UI: Fixed codemirror-promql incorrectly showing label completion suggestions after the closing curly brace of a vector selector. [#17602](prometheus/prometheus#17602)
- \[BUGFIX] UI: Query editor no longer suggests a duration unit if one is already present after a number. [#17605](prometheus/prometheus#17605)
- \[BUGFIX] PromQL: Fix some "vector cannot contain metrics with the same labelset" errors when experimental delayed name removal is enabled. [#17678](prometheus/prometheus#17678)
- \[BUGFIX] PromQL: Fix possible corruption of PromQL text if the query had an empty `ignoring()` and non-empty grouping. [#17643](prometheus/prometheus#17643)
- \[BUGFIX] PromQL: Fix resets/changes to return empty results for anchored selectors when all samples are outside the range. [#17479](prometheus/prometheus#17479)
- \[BUGFIX] PromQL: Check more consistently for many-to-one matching in filter binary operators. [#17668](prometheus/prometheus#17668)
- \[BUGFIX] PromQL: Fix collision in unary negation with non-overlapping series. [#17708](prometheus/prometheus#17708)
- \[BUGFIX] PromQL: Fix collision in label\_join and label\_replace with non-overlapping series. [#17703](prometheus/prometheus#17703)
- \[BUGFIX] PromQL: Fix bug with inconsistent results for queries with OR expression when experimental delayed name removal is enabled. [#17161](prometheus/prometheus#17161)
- \[BUGFIX] PromQL: Ensure that `rate`/`increase`/`delta` of histograms results in a gauge histogram. [#17608](prometheus/prometheus#17608)
- \[BUGFIX] PromQL: Do not panic while iterating over invalid histograms. [#17559](prometheus/prometheus#17559)
- \[BUGFIX] TSDB: Reject chunk files whose encoded chunk length overflows int. [#17533](prometheus/prometheus#17533)
- \[BUGFIX] TSDB: Do not panic during resolution reduction of invalid histograms. [#17561](prometheus/prometheus#17561)
- \[BUGFIX] Remote-write Receive: Avoid duplicate labels when experimental type-and-unit-label feature is enabled. [#17546](prometheus/prometheus#17546)
- \[BUGFIX] OTLP Receiver: Only write metadata to disk when experimental metadata-wal-records feature is enabled. [#17472](prometheus/prometheus#17472)
wbollock added a commit to wbollock/prometheus that referenced this pull request Feb 6, 2026
Building off config-specific Prometheus refresh metrics from an earlier
PR (prometheus#17138), this deletes
refresh metrics like `prometheus_sd_refresh_duration_seconds` and
`prometheus_sd_refresh_failures_total` when the underlying scrape job
configuration is removed on reload. This reduces un-needed cardinality
from scrape job specific metrics while still preserving metrics that
indicate overall health of a service discovery engine.

For example,
`prometheus_sd_refresh_failures_total{config="linode-servers",mechanism="linode"} 1`
will no longer be exported by Prometheus when the `linode-servers`
scrape job for the Linode service provider is removed. The generic,
service discovery specific `prometheus_sd_linode_failures_total` metric
will persist however.

Signed-off-by: Will Bollock <wbollock@linode.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants