feat(metrics): add prometheus_sd_http_requests_total metric by wbollock · Pull Request #17069 · prometheus/prometheus

wbollock · 2025-08-21T13:26:47Z

When attempting to make an SLO for prometheus_sd_http_failures_total, I noticed there wasn't an accompanying requests_total metric. A total is very helpful when forming SLOs, since it gives you the entire context of the failure rate with:

1 - (error / total)

Some service discovery providers already follow this pattern like prometheus_sd_dns_lookups_total, but others also only have failures and could be improved too. I can add in more missing metrics to service discovery engines if this pattern is okay.
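The `1 - (error / total)` ratio above can be sketched in Go as follows. This is only an illustration of the arithmetic: the function name `availability` and the sample counter increases are hypothetical and are not part of the PR.

```go
package main

import "fmt"

// availability returns 1 - (failures / total), the success ratio over a window.
// The second return value is false when total is not positive, because the
// ratio is undefined when no requests were made.
func availability(failures, total float64) (float64, bool) {
	if total <= 0 {
		return 0, false
	}
	return 1 - failures/total, true
}

func main() {
	// Hypothetical counter increases over one SLO window.
	failures, total := 3.0, 120.0
	if ratio, ok := availability(failures, total); ok {
		fmt.Printf("availability: %.4f\n", ratio)
	} else {
		fmt.Println("no requests in window")
	}
}
```

Without the total, the failure counter alone cannot distinguish 3 failures out of 120 requests from 3 failures out of 3.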

Which issue(s) does the PR fix:

N/A

Does this PR introduce a user-facing change?

[FEATURE] Adding prometheus_sd_http_requests_total time series

Signed-off-by: Will Bollock <wbollock@linode.com>

wbollock · 2025-08-26T13:41:59Z

cc: @bboreham, @machine424, @roidelapluie, @tjhop

I'm happy to audit all the service discovery metrics that only have errors and add a corresponding total metric if you want, too. It might be easier for this PR to have a smaller scope, though.

bboreham · 2025-09-02T12:00:12Z

Hi, thanks for this.

Can you comment on whether prometheus_sd_refresh_duration_seconds_count would have the same value?

machine424 · 2025-09-02T15:39:47Z

Thanks for this.
I agree with Bryan: all the SD providers (except for k8s, zk and static IIRC) are wrapped by the refresh discoverer

prometheus/discovery/refresh/refresh.go

Line 45 in 11c4915

func NewDiscovery(opts Options) *Discovery {

which should provide a prometheus_sd_refresh_failures_total and a prometheus_sd_refresh_duration_seconds for each of them.

For the http SD, a refresh should be equivalent to a request to the http API. For other providers, that should not be the case. But, this should not prevent us from handling errors at a higher level, specifically at the refresh level.

Regarding prometheus_sd_http_failures_total: it was introduced in https://github.com/prometheus/prometheus/pull/10372/files and seems to track every failure in http.refresh(). I think that makes it a duplicate of prometheus_sd_refresh_failures_total (refresh was already around at that time). We'll need to confirm that and then remove or deprecate it to avoid confusion. Or maybe I'm missing something.
It seems we have added such duplicates in other providers as well (for example, azure has the same failuresCount that tracks every error).

wbollock · 2025-09-02T16:03:11Z

I also agree with Bryan! 😂 I was able to make an SLO with prometheus_sd_refresh_failures_total and prometheus_sd_refresh_duration_seconds_count thanks to the mechanism label. This was just a UX/me issue and lack of consistency between provider metrics as far as I can tell.

I could open a new PR to try to remove prometheus_sd_http_failures_total in favor of the generic prometheus_sd_refresh* metrics instead?

edit: could also just be a docs update to point more people at prometheus_sd_refresh* metrics?
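The SLO described in this comment could be expressed as a PromQL query over the two refresh metrics, filtered by the mechanism label. The Go helper below only assembles such a query string. The function name `sdAvailabilityQuery`, the label value `http`, and the `5m` window are assumptions for illustration; the thread does not show the exact query that was used.

```go
package main

import "fmt"

// sdAvailabilityQuery builds a PromQL expression of the form
// 1 - (failures / total) from the generic refresh metrics, restricted to
// one service discovery mechanism and one rate window.
func sdAvailabilityQuery(mechanism, window string) string {
	selector := fmt.Sprintf("{mechanism=%q}", mechanism)
	return fmt.Sprintf(
		"1 - (sum(rate(prometheus_sd_refresh_failures_total%s[%s])) / sum(rate(prometheus_sd_refresh_duration_seconds_count%s[%s])))",
		selector, window, selector, window,
	)
}

func main() {
	// Assumed label value and window; adjust both to the deployment.
	fmt.Println(sdAvailabilityQuery("http", "5m"))
}
```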

machine424 · 2025-09-03T13:47:23Z

I think we can do both: give more visibility in the docs to the prometheus_sd_refresh* metrics and also remove the per-SD duplicates (we may need to clean up some alerts and dashboards), especially if they catch exactly the same thing as the refresh metrics.

In the release notes we can guide users to move to refresh metrics.

Contributions are welcome, of course.

wbollock · 2025-09-03T14:03:27Z

Sounds good! I have the docs changes in this semi-related draft PR at least: #17138

Thanks all, apologies for the confusion with the old metrics

wbollock force-pushed the feat/add_sd_http_total_metric branch 2 times, most recently from 87d7b02 to 0361068 on August 21, 2025 13:35

wbollock force-pushed the feat/add_sd_http_total_metric branch from 04801ab to 5dc97b2 on August 21, 2025 13:46

wbollock closed this Sep 3, 2025

wbollock mentioned this pull request Sep 16, 2025

feat(metrics): add config label to refresh metrics #17138 (Merged)

wbollock wants to merge 1 commit into prometheus:main from wbollock:feat/add_sd_http_total_metric
