scraping/alerting: Fix issues with duplicate metric registration #13316
ptodev wants to merge 1 commit into prometheus:main
Conversation
```go
	}
}
typ := cfg.Name()
reg := prometheus.WrapRegistererWith(prometheus.Labels{"sd_type": m.name, "job_name": setName}, m.registerer)
```
I'm not sure if those label names are the most appropriate?
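As context for the labels being reviewed: wrapping a registerer with extra constant labels changes the identity of every series registered through it, which is what prevents two SD instances from colliding. Below is a stdlib-only toy sketch of that idea; the `registry` and `seriesID` helpers are invented for illustration and are not the `client_golang` implementation, and the metric/label names are assumptions.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// registry tracks which series identities have been registered,
// standing in for a prometheus.Registry.
type registry struct{ seen map[string]bool }

func newRegistry() *registry { return &registry{seen: map[string]bool{}} }

// seriesID builds a canonical identity from a metric name plus its
// constant labels; wrapping a registerer effectively adds such labels.
func seriesID(name string, labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := []string{name}
	for _, k := range keys {
		parts = append(parts, k+"="+labels[k])
	}
	return strings.Join(parts, ",")
}

// register fails on a repeated identity, mirroring the duplicate
// collector registration error from prometheus.Registerer.
func (r *registry) register(name string, labels map[string]string) error {
	id := seriesID(name, labels)
	if r.seen[id] {
		return fmt.Errorf("duplicate metrics collector registration attempted: %s", id)
	}
	r.seen[id] = true
	return nil
}

func main() {
	r := newRegistry()
	// Two SD instances registering the same bare metric name collide...
	fmt.Println(r.register("prometheus_sd_refresh_failures_total", nil))
	fmt.Println(r.register("prometheus_sd_refresh_failures_total", nil))
	// ...but succeed when each registration carries distinguishing labels.
	fmt.Println(r.register("prometheus_sd_refresh_failures_total",
		map[string]string{"sd_type": "consul", "job_name": "job-a"}))
	fmt.Println(r.register("prometheus_sd_refresh_failures_total",
		map[string]string{"sd_type": "consul", "job_name": "job-b"}))
}
```

The trade-off discussed below is that each distinct label set is a distinct series, so partitioning by job multiplies series count.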
Signed-off-by: Paulin Todev <paulin.todev@gmail.com>
Force-pushed d87419c to 7660e75
Thanks for catching this. The most appropriate label name would probably be just […]. However, I'm not sure whether it is desirable to partition all the SD metrics by job. Maybe it's even an improvement, but maybe it's not useful at all. I would like to hear more opinions about this. @roidelapluie @juliusv (and anyone who feels qualified) WDYT?
This is not an acceptable solution, as it will create a lot of cardinality.
The goal of #13023 was to allow users who import the Prometheus SD code as a library to pass their own registry. If Prometheus is to do this while still having only one metric series for every SD instance, then I am not sure how to do it cleanly. Do you want to revert #13023 while we discuss a solution? Or would you rather we find a solution straight away? One solution could be to have an […]
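To make the library use-case concrete, here is a minimal, hypothetical sketch of an SD constructor that accepts a caller-owned registerer instead of touching a global registry. `Registerer`, `mapRegistry`, `Discovery`, and the metric name are all invented for illustration and are not the real Prometheus API.

```go
package main

import "fmt"

// Registerer is a tiny stand-in for prometheus.Registerer.
type Registerer interface {
	Register(name string) error
}

// mapRegistry is a toy registry that rejects duplicate names.
type mapRegistry struct{ seen map[string]bool }

func (m *mapRegistry) Register(name string) error {
	if m.seen[name] {
		return fmt.Errorf("duplicate: %s", name)
	}
	m.seen[name] = true
	return nil
}

// Discovery is a hypothetical SD instance whose metrics go to a
// caller-supplied registerer, as the #13023 library use-case wants.
type Discovery struct{ reg Registerer }

func NewDiscovery(reg Registerer) (*Discovery, error) {
	if err := reg.Register("sd_refresh_duration_seconds"); err != nil {
		return nil, err
	}
	return &Discovery{reg: reg}, nil
}

func main() {
	// Each library consumer owns its own registry, so two consumers
	// can each create the same SD type without colliding.
	for _, consumer := range []string{"consumer-a", "consumer-b"} {
		reg := &mapRegistry{seen: map[string]bool{}}
		if _, err := NewDiscovery(reg); err != nil {
			fmt.Println(consumer, "error:", err)
			continue
		}
		fmt.Println(consumer, "created SD with its own registry")
	}
}
```

The tension in the thread is that this works per consumer, but within a single Prometheus process two instances of the same SD type still share one registry and therefore collide.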
@roidelapluie thanks for your response.

@ptodev I guess that clarifies things: if cardinality increase is a problem, that's on top of the problem of changing the labels of long-established metrics. I would say we should try not to have separate metrics per scrape job. In other words: whatever changes we make in the code behind the scenes, the metrics shouldn't change.

So far, we acted under the assumption that there would be one SD instance per kind of SD, not per job, but that's wrong, see above. So the conclusion is we truly have to share metrics between SD instances. For sharing, we either have to manage them outside of the SD instance (as you described in your last comment), or we go back and reconsider using the […].

A problem with the latter approach is that you cannot really deregister metrics anymore (but de-registration is probably a problem with shared metrics anyway, and maybe we should just not de-register at all). The former approach might be a bit cleaner in general, but then we are back to managing the whole lifecycle of the metrics outside of the SD instances, which is particularly annoying if you use the SD code as a library and have to remember to always initialize the correct set of metrics for each SD type.
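The "manage them outside of the SD instance" option can be sketched as follows, with a plain integer counter standing in for real Prometheus metrics: the manager creates one metrics set per SD type and every instance of that type shares it, so nothing is registered twice and the series are not partitioned by job. The names `sdMetrics` and `sdInstance` are hypothetical, not the actual Prometheus types.

```go
package main

import "fmt"

// sdMetrics holds the metrics for one SD type. It is created once,
// outside the SD instances, and shared by every instance of that type.
type sdMetrics struct {
	refreshFailures int // stand-in for a real counter
}

// sdInstance is one SD instance for one scrape job; it holds a
// pointer to the shared metrics rather than owning its own copy.
type sdInstance struct {
	job     string
	metrics *sdMetrics
}

func main() {
	// The manager creates one metrics set per SD type up front...
	consulMetrics := &sdMetrics{}
	// ...and every scrape job's instance shares it.
	a := &sdInstance{job: "job-a", metrics: consulMetrics}
	b := &sdInstance{job: "job-b", metrics: consulMetrics}
	a.metrics.refreshFailures++
	b.metrics.refreshFailures++
	// Both increments land on the same shared counter.
	fmt.Println("shared failure count:", consulMetrics.refreshFailures)
}
```

This is the lifecycle-management burden mentioned above: whoever creates `consulMetrics` must do so before any instance starts, and library users must remember to do the same for each SD type they use.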
Thank you for the feedback! Let's move the discussion to #13375, which is a different attempt at fixing this issue.
This PR makes each service discovery (SD) mechanism report its own debug metric series. Currently, on the latest stable Prometheus release, if a Prometheus process runs more than one instance of the same SD mechanism, it is not possible to distinguish between the metrics of the different instances.
Fixes #13312
Unfortunately, #13023 introduced a bug whereby Prometheus crashes if two scrape jobs use the same service discovery mechanism.
When I was working on #13023, I was working under the assumption that Prometheus can't instantiate more than one instance of the same SD. Therefore I unfortunately missed the most obvious test case: the one where more than one scrape job uses a particular SD. I also didn't test Alertmanager SDs. This led to the bug described in #13312.
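The crash mode can be simulated with a stdlib-only toy: a must-register helper that panics on duplicates, in the way `prometheus.MustRegister` panics when two collectors describe the same series. The helper, the `newSD` function, and the metric name below are illustrative, not Prometheus code.

```go
package main

import "fmt"

// registered stands in for the set of collectors a prometheus.Registry holds.
var registered = map[string]bool{}

// mustRegister panics on duplicates, like prometheus.MustRegister does
// when two collectors describe the same series.
func mustRegister(name string) {
	if registered[name] {
		panic("duplicate metrics collector registration attempted: " + name)
	}
	registered[name] = true
}

// newSD stands in for instantiating one SD mechanism for one scrape job;
// before the fix, each instance registered its own copy of the same metric.
func newSD(job string) {
	mustRegister("prometheus_sd_consul_rpc_failures_total")
	fmt.Println("SD created for", job)
}

func main() {
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("crash:", r)
		}
	}()
	newSD("job-a") // first job with this SD type: fine
	newSD("job-b") // second job with the same SD type: panics
}
```

The first job registers cleanly; the second job's registration of the same series triggers the panic, which is the process crash reported in #13312.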
I tested this bugfix locally using the config file below.
Prometheus config
I tested both with `--enable-feature=new-service-discovery-manager` and without. I also forced several config reloads using the `pkill -SIGHUP -i prometheus` command, to make sure Prometheus doesn't crash and that no errors are logged.

After the change I see metrics such as these on http://localhost:9090/metrics:

cc @bwplotka @beorn7