Skip to content

feat(metrics): add prometheus_sd_http_requests_total metric#17069

Closed
wbollock wants to merge 1 commit intoprometheus:mainfrom
wbollock:feat/add_sd_http_total_metric
Closed

feat(metrics): add prometheus_sd_http_requests_total metric#17069
wbollock wants to merge 1 commit intoprometheus:mainfrom
wbollock:feat/add_sd_http_total_metric

Conversation

@wbollock
Copy link
Contributor

When attempting to make an SLO for prometheus_sd_http_failures_total I noticed there wasn't an accompanying requests_total metric. This is very helpful to form SLOs so you can get the entire context of the failure rate with:

1 - (error / total)

Some service discovery providers already follow this pattern like prometheus_sd_dns_lookups_total, but others also only have failures and could be improved too. I can add in more missing metrics to service discovery engines if this pattern is okay.

Which issue(s) does the PR fix:

N/A

Does this PR introduce a user-facing change?

[FEATURE] Adding prometheus_sd_http_requests_total time series

@wbollock wbollock force-pushed the feat/add_sd_http_total_metric branch 2 times, most recently from 87d7b02 to 0361068 Compare August 21, 2025 13:35
When attempting to make an SLO for prometheus_sd_http_failures_total I
noticed there wasn't an accompanying `requests_total` metric. This is
very helpful to form SLOs so you can get the entire context of the
failure rate with:

`1 - (error / total)`

Some service discovery providers already follow this pattern like
`prometheus_sd_dns_lookups_total`, but others also only have failures
and could be improved too. I can add in more if this pattern is okay.

Signed-off-by: Will Bollock <wbollock@linode.com>
@wbollock wbollock force-pushed the feat/add_sd_http_total_metric branch from 04801ab to 5dc97b2 Compare August 21, 2025 13:46
@wbollock
Copy link
Contributor Author

wbollock commented Aug 26, 2025

cc: @bboreham, @machine424, @roidelapluie, @tjhop

I'm happy to audit all the service discovery metrics that only have errors and add a corresponding total metric if you want, too. might be easier for this PR to have a smaller scope though

@bboreham
Copy link
Member

bboreham commented Sep 2, 2025

Hi, thanks for this.

Can you comment on whether prometheus_sd_refresh_duration_seconds_count would have the same value?

@machine424
Copy link
Member

machine424 commented Sep 2, 2025

Thanks for this.
I agree with Bryan, all the SD providers (except for k8s, zk and static IIRC) are wrapped by the refresh discoverer

func NewDiscovery(opts Options) *Discovery {
which should provide a prometheus_sd_refresh_failures_total and a prometheus_sd_refresh_duration_seconds for each of them.

For the http SD, a refresh should be equivalent to a request to the http API. For other providers, that should not be the case. But, this should not prevent us from handling errors at a higher level, specifically at the refresh level.

Regarding prometheus_sd_http_failures_total, that was introduced in https://github.com/prometheus/prometheus/pull/10372/files which seems to track every failures in http.refresh(), I think that's a duplicate of prometheus_sd_refresh_failures_total (refresh was around at that time), we'll need to confirm that and get rid of it/deprecate it to avoid confusion. Or maybe I'm missing sth.
It seems we add such duplicates in other providers (I see we have the same failuresCount that track every error for azure e.g.) as well.

@wbollock
Copy link
Contributor Author

wbollock commented Sep 2, 2025

I also agree with Bryan! 😂 I was able to make an SLO with prometheus_sd_refresh_failures_total and prometheus_sd_refresh_duration_seconds_count thanks to the mechanism label. This was just a UX/me issue and lack of consistency between provider metrics as far as I can tell.

I could open a new PR to try to remove prometheus_sd_http_requests_total in favor of the generic prometheus_sd_refresh* metrics instead?

edit: could also just be a docs update to point more people at prometheus_sd_refresh* metrics?

@machine424
Copy link
Member

I think we can do both; give more visibility in the docs to the prometheus_sd_refresh* metrics and also remove the per SD duplicates (we may need to clean some alerts and dashboards), especially if they catch exactly the same thing as the refresh metrics.

In the release notes we can guide users to move to refresh metrics.

Contributions are welcome, of course.

@wbollock
Copy link
Contributor Author

wbollock commented Sep 3, 2025

Sounds good! I have the docs changes in this semi-related draft PR at least: #17138

Thanks all, apologies for the confusion with the old metrics

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants