feat(metrics): add prometheus_sd_http_requests_total metric#17069
feat(metrics): add prometheus_sd_http_requests_total metric#17069wbollock wants to merge 1 commit intoprometheus:mainfrom
Conversation
87d7b02 to
0361068
Compare
When attempting to make an SLO for prometheus_sd_http_failures_total I noticed there wasn't an accompanying `requests_total` metric. This is very helpful to form SLOs so you can get the entire context of the failure rate with: `1 - (error / total)` Some service discovery providers already follow this pattern like `prometheus_sd_dns_lookups_total`, but others also only have failures and could be improved too. I can add in more if this pattern is okay. Signed-off-by: Will Bollock <wbollock@linode.com>
04801ab to
5dc97b2
Compare
|
cc: @bboreham, @machine424, @roidelapluie, @tjhop I'm happy to audit all the service discovery metrics that only have errors and add a corresponding total metric if you want, too. might be easier for this PR to have a smaller scope though |
|
Hi, thanks for this. Can you comment on whether |
|
Thanks for this. prometheus/discovery/refresh/refresh.go Line 45 in 11c4915 prometheus_sd_refresh_failures_total and a prometheus_sd_refresh_duration_seconds for each of them.
For the http SD, a refresh should be equivalent to a request to the http API. For other providers, that should not be the case. But, this should not prevent us from handling errors at a higher level, specifically at the refresh level. Regarding prometheus_sd_http_failures_total, that was introduced in https://github.com/prometheus/prometheus/pull/10372/files which seems to track every failures in |
|
I also agree with Bryan! 😂 I was able to make an SLO with I could open a new PR to try to remove edit: could also just be a docs update to point more people at |
|
I think we can do both; give more visibility in the docs to the In the release notes we can guide users to move to refresh metrics. Contributions are welcome, of course. |
|
Sounds good! I have the docs changes in this semi-related draft PR at least: #17138 Thanks all, apologies for the confusion with the old metrics |
When attempting to make an SLO for prometheus_sd_http_failures_total I noticed there wasn't an accompanying
requests_totalmetric. This is very helpful to form SLOs so you can get the entire context of the failure rate with:1 - (error / total)Some service discovery providers already follow this pattern like
prometheus_sd_dns_lookups_total, but others also only have failures and could be improved too. I can add in more missing metrics to service discovery engines if this pattern is okay.Which issue(s) does the PR fix:
N/A
Does this PR introduce a user-facing change?