feat(metrics): add config label to refresh metrics by wbollock · Pull Request #17138 · prometheus/prometheus

wbollock · 2025-09-03T14:02:50Z

Adds a config label (similar to prometheus_sd_discovered_targets) to the refresh metrics (prometheus_sd_refresh_duration_seconds and prometheus_sd_refresh_failures_total) to help identify the source of refresh issues or performance stats. For HTTP SD in particular, it is common to have multiple disparate HTTP SD sources that should be identified individually rather than lumped together. For example, if one HTTP SD service has failures, that should be evident in its own time series, separate from other HTTP SD sources.

The same arguments could be made for other service discovery providers. You may have jobs with entirely different settings, such as different API tokens or configurations, that should be separated from each other. You may even have test scrape jobs whose failures are less urgent than those of production scrape jobs.

config seemed more appropriate than endpoint as a general standard for prometheus_sd metrics.

Docs were also updated for HTTP SD to point at the new refresh metrics rather than the older metrics.

⚠️ I did not update kuma, file_sd, kubernetes, zookeeper, or consul as they don't use the new refresh metrics entirely and require more work to integrate into those metrics.

add the config label for each service discovery endpoint
tests

[ENHANCEMENT] adds a `config` label indicating specific job names for most `prometheus_sd_refresh_duration_seconds` and `prometheus_sd_refresh_failures_total` metrics

🗒️ Note: this adds roughly 6 new time series per job. I believe the extra cardinality is worth it, as the data is a lot more useful. We could also compromise by deleting the legacy per-service-discovery metrics that are duplicated by the prometheus_sd_refresh* metrics, which would help a little.
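The figure of about six series per job can be read off the `aws-lightsail` example output in this description: the duration summary exposes three quantiles plus `_sum` and `_count`, and there is one failures counter. A minimal sketch of that arithmetic follows; the function name and the 50-job figure are hypothetical and only illustrate the estimate.

```go
package main

import "fmt"

// refreshSeriesPerConfig estimates how many time series one SD config adds
// once the refresh metrics carry a config label.
// Assumption: the duration metric is a summary (one series per quantile plus
// _sum and _count) and there is a single failures counter per config.
func refreshSeriesPerConfig(quantiles int) int {
	const sumAndCount = 2     // prometheus_sd_refresh_duration_seconds_sum/_count
	const failuresCounter = 1 // prometheus_sd_refresh_failures_total
	return quantiles + sumAndCount + failuresCounter
}

func main() {
	perConfig := refreshSeriesPerConfig(3) // quantiles 0.5, 0.9 and 0.99
	fmt.Println("series per config:", perConfig)
	// Hypothetical deployment size, used only to show how the cost scales.
	fmt.Println("series for 50 configs:", 50*perConfig)
}
```

The cost grows linearly with the number of scrape jobs that use a refresh-based service discovery mechanism.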

Example:

prometheus_sd_refresh_duration_seconds{config="linode-nodes",mechanism="linode",quantile="0.5"} 0.199821177
prometheus_sd_refresh_duration_seconds{config="linode-nodes",mechanism="linode",quantile="0.9"} 0.199821177
prometheus_sd_refresh_duration_seconds{config="linode-nodes",mechanism="linode",quantile="0.99"} 0.199821177
prometheus_sd_refresh_duration_seconds_sum{config="linode-nodes",mechanism="linode"} 0.199821177
prometheus_sd_refresh_duration_seconds_count{config="linode-nodes",mechanism="linode"} 1

For this config:

<snip>
  - job_name: 'linode-nodes'
    linode_sd_configs:
      - authorization:
          credentials: "xxx"
        region: "us-east"

There may also be room for improvement here in unregistering metrics for scrape jobs that no longer exist. These series are still retained after a scrape job is removed on reload (edit: this is the current behavior too, unrelated to this PR):

net_conntrack_dialer_conn_attempted_total{dialer_name="aws-lightsail"} 0
net_conntrack_dialer_conn_closed_total{dialer_name="aws-lightsail"} 0
net_conntrack_dialer_conn_established_total{dialer_name="aws-lightsail"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="aws-lightsail",reason="refused"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="aws-lightsail",reason="resolution"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="aws-lightsail",reason="timeout"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="aws-lightsail",reason="unknown"} 0
prometheus_sd_discovered_targets{config="aws-lightsail",name="scrape"} 0
prometheus_sd_refresh_duration_seconds{config="aws-lightsail",mechanism="ec2",quantile="0.5"} 0.057529291
prometheus_sd_refresh_duration_seconds{config="aws-lightsail",mechanism="ec2",quantile="0.9"} 0.061284775
prometheus_sd_refresh_duration_seconds{config="aws-lightsail",mechanism="ec2",quantile="0.99"} 0.061284775
prometheus_sd_refresh_duration_seconds_sum{config="aws-lightsail",mechanism="ec2"} 0.118814066
prometheus_sd_refresh_duration_seconds_count{config="aws-lightsail",mechanism="ec2"} 2
prometheus_sd_refresh_failures_total{config="aws-lightsail",mechanism="ec2"} 2

wbollock · 2025-09-16T21:10:36Z

hi @bboreham @machine424, when you get a moment do you mind taking a look at this PR? tagging you both as you helped in this (closed) PR 😄 #17069

bboreham

Some initial thoughts.

discovery/moby/services_test.go

discovery/aws/ec2.go

discovery/metrics_refresh.go

discovery/refresh/refresh.go

discovery/refresh/refresh_test.go

discovery/discovery.go

Adds a `config` label (similar to `prometheus_sd_discovered_targets`) to refresh metrics to help identify the source of refresh issues or performance stats. In particular for HTTP SD, it can be common to have multiple disparate HTTP SD sources that should be identified and not lumped together. For example, if one HTTP SD service has failures, that should be evident in its own time series, separate from other HTTP SD sources. `config` seemed more appropriate than `endpoint` as a general standard for `prometheus_sd` metrics. Docs were also updated for HTTP SD to point at the new refresh metrics rather than the older metrics. Signed-off-by: Will Bollock <wbollock@linode.com>

bboreham

Ok to proceed, thanks.

discovery/http/http_test.go

Fixes a problem introduced when PR prometheus#17138 was merged without taking into account another merged PR:

```
discovery/aws/aws.go:218:54: too many arguments in call to NewEC2Discovery
	have (*EC2SDConfig, *slog.Logger, *ec2Metrics)
	want (*EC2SDConfig, discovery.DiscovererOptions)
discovery/aws/aws.go:222:66: too many arguments in call to NewLightsailDiscovery
	have (*LightsailSDConfig, *slog.Logger, *lightsailMetrics)
	want (*LightsailSDConfig, discovery.DiscovererOptions)
```

ECS is a new service discovery mechanism that was added after PR prometheus#17138 was merged. This change aligns it with the style of passing a single "opts" argument, which almost all the other service discovery engines now use.
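The "single opts" style can be sketched as a constructor that takes the config plus one options value instead of several positional arguments. Every type, field and function name below is a hypothetical placeholder chosen for illustration; this is not the Prometheus `discovery` package API.

```go
package main

import (
	"errors"
	"fmt"
	"log/slog"
	"os"
)

// discovererOptions is a hypothetical, simplified options struct.
// The field names are assumptions for this sketch, not the Prometheus API.
type discovererOptions struct {
	Logger  *slog.Logger
	SetName string // assumed to carry the value used for the config label
}

// ecsSDConfig and ecsDiscovery are illustrative placeholders.
type ecsSDConfig struct {
	Region string
}

type ecsDiscovery struct {
	region  string
	setName string
	logger  *slog.Logger
}

// newECSDiscovery takes the config plus a single options value, so adding a
// new option later does not change every constructor signature.
func newECSDiscovery(cfg *ecsSDConfig, opts discovererOptions) (*ecsDiscovery, error) {
	if cfg == nil {
		return nil, errors.New("ecs: nil config")
	}
	logger := opts.Logger
	if logger == nil {
		logger = slog.Default() // fall back to the process-wide logger
	}
	return &ecsDiscovery{region: cfg.Region, setName: opts.SetName, logger: logger}, nil
}

func main() {
	opts := discovererOptions{
		Logger:  slog.New(slog.NewTextHandler(os.Stderr, nil)),
		SetName: "ecs-nodes",
	}
	d, err := newECSDiscovery(&ecsSDConfig{Region: "us-east-1"}, opts)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	d.logger.Info("discovery created", "config", d.setName, "region", d.region)
}
```

With this shape, a change such as threading a job name through to the metrics only touches the options struct, which is the kind of signature mismatch the compiler errors above reflect.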

Building on the config-specific Prometheus refresh metrics from an earlier PR (prometheus#17138), this deletes refresh metrics like `prometheus_sd_refresh_duration_seconds` and `prometheus_sd_refresh_failures_total` when the underlying scrape job configuration is removed on reload. This reduces unneeded cardinality from scrape-job-specific metrics while still preserving metrics that indicate the overall health of a service discovery engine. For example, `prometheus_sd_refresh_failures_total{config="linode-servers",mechanism="linode"} 1` will no longer be exported by Prometheus when the `linode-servers` scrape job for the Linode service provider is removed. The generic, service-discovery-specific `prometheus_sd_linode_failures_total` metric, however, will persist.
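The cleanup described here amounts to dropping every series whose `config` label names a scrape job that is absent after the reload. Below is a dependency-free sketch using a plain map in place of a real metric vector; the types and the `pruneRemovedConfigs` helper are hypothetical, and the `inventory` job is made up for the example. In client_golang the analogous step would be a label-based delete on the vector (for example `DeleteLabelValues`), though whether the PR does exactly that is not stated here.

```go
package main

import (
	"fmt"
	"sort"
)

// seriesKey identifies one refresh-metric series by its two labels.
type seriesKey struct {
	Mechanism string
	Config    string
}

// refreshFailures is a hypothetical stand-in for a labelled counter vector.
type refreshFailures map[seriesKey]float64

// pruneRemovedConfigs deletes every series whose config label is not among
// the scrape jobs still present after a reload. It returns the removed
// config names in sorted order.
func (m refreshFailures) pruneRemovedConfigs(active map[string]bool) []string {
	var removed []string
	for k := range m {
		if !active[k.Config] {
			delete(m, k) // deleting during range is safe for Go maps
			removed = append(removed, k.Config)
		}
	}
	sort.Strings(removed)
	return removed
}

func main() {
	failures := refreshFailures{
		{Mechanism: "linode", Config: "linode-servers"}: 1,
		{Mechanism: "http", Config: "inventory"}:        0,
	}
	// After the reload only the hypothetical "inventory" job remains.
	removed := failures.pruneRemovedConfigs(map[string]bool{"inventory": true})
	fmt.Println("removed:", removed, "remaining series:", len(failures))
}
```

Metrics without a `config` label, such as the per-mechanism failure counters, are untouched by this kind of pruning because they are not keyed by scrape job.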

##### [\`v3.9.0\`](https://github.com/prometheus/prometheus/releases/tag/v3.9.0)

#### Note for users of Native Histograms

In version 3.9, Native Histograms is no longer experimental, and the feature flag `native-histogram` has no effect. You must now turn on the config setting `scrape_native_histograms` to collect Native Histogram samples from exporters.

#### Changelog

- \[CHANGE] Native Histograms are no longer experimental! Make the `native-histogram` feature flag a no-op. Use `scrape_native_histograms` config option instead. [#17528](prometheus/prometheus#17528)
- \[CHANGE] API: Add maximum limit of 10,000 sets of statistics to TSDB status endpoint. [#17647](prometheus/prometheus#17647)
- \[FEATURE] API: Add /api/v1/features for clients to understand which features are supported. [#17427](prometheus/prometheus#17427)
- \[FEATURE] Promtool: Add `start_timestamp` field for unit tests. [#17636](prometheus/prometheus#17636)
- \[FEATURE] Promtool: Add `--format seriesjson` option to `tsdb dump` to output just series labels in JSON format. [#13409](prometheus/prometheus#13409)
- \[FEATURE] Add `--storage.tsdb.delay-compact-file.path` flag for better interoperability with Thanos. [#17435](prometheus/prometheus#17435)
- \[FEATURE] UI: Add an option on the query drop-down menu to duplicate that query panel. [#17714](prometheus/prometheus#17714)
- \[ENHANCEMENT]: TSDB: add flag `--storage.tsdb.block-reload-interval` to configure TSDB Block Reload Interval. [#16728](prometheus/prometheus#16728)
- \[ENHANCEMENT] UI: Add graph option to start the chart's Y axis at zero. [#17565](prometheus/prometheus#17565)
- \[ENHANCEMENT] Scraping: Classic protobuf format no longer requires the unit in the metric name. [#16834](prometheus/prometheus#16834)
- \[ENHANCEMENT] PromQL, Rules, SD, Scraping: Add native histograms to complement existing summaries. [#17374](prometheus/prometheus#17374)
- \[ENHANCEMENT] Notifications: Add a histogram `prometheus_notifications_latency_histogram_seconds` to complement the existing summary. [#16637](prometheus/prometheus#16637)
- \[ENHANCEMENT] Remote-write: Add custom scope support for AzureAD authentication. [#17483](prometheus/prometheus#17483)
- \[ENHANCEMENT] SD: add a `config` label with job name for most `prometheus_sd_refresh` metrics. [#17138](prometheus/prometheus#17138)
- \[ENHANCEMENT] TSDB: New histogram `prometheus_tsdb_sample_ooo_delta`, the distribution of out-of-order samples in seconds. Collected for all samples, accepted or not. [#17477](prometheus/prometheus#17477)
- \[ENHANCEMENT] Remote-read: Validate histograms received via remote-read. [#17561](prometheus/prometheus#17561)
- \[PERF] TSDB: Small optimizations to postings index. [#17439](prometheus/prometheus#17439)
- \[PERF] Scraping: Speed up relabelling of series. [#17530](prometheus/prometheus#17530)
- \[PERF] PromQL: Small optimisations in binary operators. [#17524](prometheus/prometheus#17524), [#17519](prometheus/prometheus#17519).
- \[BUGFIX] UI: PromQL autocomplete now shows the correct type and HELP text for OpenMetrics counters whose samples end in `_total`. [#17682](prometheus/prometheus#17682)
- \[BUGFIX] UI: Fixed codemirror-promql incorrectly showing label completion suggestions after the closing curly brace of a vector selector. [#17602](prometheus/prometheus#17602)
- \[BUGFIX] UI: Query editor no longer suggests a duration unit if one is already present after a number. [#17605](prometheus/prometheus#17605)
- \[BUGFIX] PromQL: Fix some "vector cannot contain metrics with the same labelset" errors when experimental delayed name removal is enabled. [#17678](prometheus/prometheus#17678)
- \[BUGFIX] PromQL: Fix possible corruption of PromQL text if the query had an empty `ignoring()` and non-empty grouping. [#17643](prometheus/prometheus#17643)
- \[BUGFIX] PromQL: Fix resets/changes to return empty results for anchored selectors when all samples are outside the range. [#17479](prometheus/prometheus#17479)
- \[BUGFIX] PromQL: Check more consistently for many-to-one matching in filter binary operators. [#17668](prometheus/prometheus#17668)
- \[BUGFIX] PromQL: Fix collision in unary negation with non-overlapping series. [#17708](prometheus/prometheus#17708)
- \[BUGFIX] PromQL: Fix collision in label\_join and label\_replace with non-overlapping series. [#17703](prometheus/prometheus#17703)
- \[BUGFIX] PromQL: Fix bug with inconsistent results for queries with OR expression when experimental delayed name removal is enabled. [#17161](prometheus/prometheus#17161)
- \[BUGFIX] PromQL: Ensure that `rate`/`increase`/`delta` of histograms results in a gauge histogram. [#17608](prometheus/prometheus#17608)
- \[BUGFIX] PromQL: Do not panic while iterating over invalid histograms. [#17559](prometheus/prometheus#17559)
- \[BUGFIX] TSDB: Reject chunk files whose encoded chunk length overflows int. [#17533](prometheus/prometheus#17533)
- \[BUGFIX] TSDB: Do not panic during resolution reduction of invalid histograms. [#17561](prometheus/prometheus#17561)
- \[BUGFIX] Remote-write Receive: Avoid duplicate labels when experimental type-and-unit-label feature is enabled. [#17546](prometheus/prometheus#17546)
- \[BUGFIX] OTLP Receiver: Only write metadata to disk when experimental metadata-wal-records feature is enabled. [#17472](prometheus/prometheus#17472)

##### [\`v3.9.1\`](https://github.com/prometheus/prometheus/releases/tag/v3.9.1)

- \[BUGFIX] Agent: fix crash shortly after startup from invalid type of object. [#17802](prometheus/prometheus#17802)
- \[BUGFIX] Scraping: fix relabel keep/drop not working. [#17807](prometheus/prometheus#17807)

wbollock mentioned this pull request Sep 3, 2025

feat(metrics): add prometheus_sd_http_requests_total metric #17069

Closed

wbollock force-pushed the feat/prometheus_refresh_config_label branch 3 times, most recently from 00d8bf0 to be2feea, September 5, 2025 16:08

wbollock marked this pull request as ready for review September 5, 2025 16:33

wbollock changed the title ~~[WIP] feat(metrics): add config label to refresh metrics~~ feat(metrics): add config label to refresh metrics Sep 5, 2025

bboreham reviewed Sep 30, 2025

wbollock force-pushed the feat/prometheus_refresh_config_label branch 3 times, most recently from 88b9df6 to 6c26d8e, October 1, 2025 21:34

wbollock requested a review from bboreham October 2, 2025 14:38

wbollock force-pushed the feat/prometheus_refresh_config_label branch 3 times, most recently from 3837ded to 16d5eb3, October 8, 2025 16:38

wbollock force-pushed the feat/prometheus_refresh_config_label branch from 16d5eb3 to e894a22, October 14, 2025 15:36

bboreham approved these changes Nov 11, 2025

bboreham merged commit e02a65b into prometheus:main Nov 13, 2025
28 checks passed

wbollock mentioned this pull request Nov 13, 2025

fix(discovery): aws discovery test fix #17527

Merged

bboreham mentioned this pull request Nov 13, 2025

discovery: fix constructor arguments in aws discovery #17526

Merged

wbollock deleted the feat/prometheus_refresh_config_label branch November 17, 2025 17:26

wbollock mentioned this pull request Nov 25, 2025

fix(discovery): delete expired refresh metrics on reload #17614

Open
