Prefer to not use the global registry #12565
Conversation
Hi @roidelapluie and @LeviHarrison 👋 Since you are the default maintainers, would you mind kindly taking a look at the PR, please?
On the face of it, no. There are existing tests checking metrics.
I didn't understand this question.
That sounds OK to me.
I guess this can only happen if the Prometheus code is used as a library (because Prometheus itself would never use multiple instances of the same SD). In the former case, I think panicking is even better than silently sharing the same metrics. (So it's a "breaking change" only for those that use the Prometheus code as a library? It would indeed be good if you could clarify what you mean by "breaking change" here.)
I think it's even better than before. Seeing tons of metrics for all the SD methods that I'm never going to use is annoying. Note that we explicitly allow changes in the exported metrics without bumping the major version. See https://prometheus.io/docs/prometheus/latest/stability/#api-stability-guarantees (“Things considered unstable […]: The metrics in the /metrics endpoint of Prometheus itself”). I think this is an incremental improvement, and the same idea could be pushed further incrementally until we don't use the global registry anymore directly. @roidelapluie we didn't really get to this during the bug scrub. If you have any objections, please make them known. Otherwise, I think @ptodev should finalize this PR. The aforementioned further increments can be done later at will.
This clarification is correct - it is only a "breaking change" for those that use the Prometheus code as a library. For those folks, it won't be possible to use multiple instances of the same SD component with the global registry anymore, because the SD components no longer register their metrics using an…
beorn7 left a comment
I like it. I have already done a detailed review and found only some stylistic nits.
@roidelapluie could we get your general opinion on this? I think this is in line with what we generally want (avoiding the global registry, for one because that is generally better in complex code bases, but also to make life easier for those using our packages as libraries).
Thank you for the thorough review, @beorn7! I pushed a commit to address your comments. I also pushed a few other changes:
- An update to `discovery/README.md`, which mentions the `DiscovererOptions` interface.
- Passing `prometheus.DefaultRegisterer` as a registerer in `DiscovererOptions` in a few other places: `discovery/legacymanager/manager.go`, `discovery/manager.go`
The main things that worried me about this PR are:
- Situations where an SD might get recreated, in which case the metrics would get registered again to the same registry, causing a failure.
- Places where we are not setting a registerer in `DiscovererOptions`. In that case the registerer would be `nil`, and the metrics registration would fail.
To test those concerns I also ported the file_sd mechanism and used it to test locally, with these configs:
Prometheus config:

```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    file_sd_configs:
      - files:
          - "/Users/paulintodev/Desktop/targets.json"
```

File SD config:

```json
[
  {
    "targets": [ "localhost:9090" ],
    "labels": {
      "paulin_prom": "paulin_label_value"
    }
  }
]
```

Prometheus ran fine, with no errors in the logs. The metrics for the consul, azure, and dns SDs were missing, as expected. The scrape metrics and the file_sd ones were getting populated. I ran it for a few minutes and I don't have a reason to think something is going wrong.
I didn't test promtool.
If that ever happens legitimately, then we'd need to come up with a way to un-register. Or use this pattern (but that is a bit frowned upon). For now, I would expect that this case doesn't happen.
A pattern used at some places is to allow a…
And BTW, I chatted with @roidelapluie. Generally, he likes what you are doing here. So let's brush up the final things here (like the…
with a non-default registerer. Signed-off-by: Paulin Todev <paulin.todev@gmail.com>
Signed-off-by: Paulin Todev <paulin.todev@gmail.com>
for its metrics. Signed-off-by: Paulin Todev <paulin.todev@gmail.com>
and unregistration for SD components. The metrics for the Discovery Manager are no longer global and are now registered when the Manager is created, instead of in an init(). The Prometheus Registry which the main function uses is now defined in only one place.
Unfortunately, I just realised there is a situation where we could indeed register the same metric twice. It would happen during a config reload; I see this error when trying a reload locally:

I think a nice way to solve this is to have the `discovery.Manager` manage metric registration and unregistration. I pushed a commit which implements this for Consul SD. Would you mind taking a look and letting me know if this would be ok, please?

The bad thing about unregistration is that metrics will be cleared each time there is a config reload (e.g. counters would start from 0). However, I think this is the expected behaviour: there are other places in the code which delete series with certain labels when the config is reloaded (e.g. the `stop()` function for the scrape pool).
The first time I worked on this PR, I did actually implement it this way - if a nil registerer was passed to an SD, then it was ignored with no error. However, later I changed it to always requiring a registerer to be present. I like this approach more even in tests, because it helped me catch certain errors. Would you mind if we continue treating a nil registerer as an error, please? I just realised the current implementation will probably panic if a nil `Registerer` is passed to an SD. I could change it to returning an error instead of panicking?
`discovery/discovery.go` (outdated):

```go
// A slice of prometheus collectors which need to be registered and unregistered:
//   - Registration should happen prior to calling Run().
//   - Unregistration should happen after Run() returns.
GetCollectors() []prometheus.Collector
```
@beorn7 Do you think this new function is a good approach? If it looks good to you, I could implement it for all SD mechanisms.
If the manager handles registration, then it doesn't even need to pass a registerer to the SD. I personally think it would help reduce boilerplate and unintentional errors.
Another approach would be to have each SD manage its own registration and unregistration inside the Run() function, but I'm afraid this would be too manual and lead to errors.
This would be an entirely new pattern (AFAIK). It feels weird to me that a `Discoverer` has to "hand out" internals, like the `prometheus.Collector` instances it owns, to the caller for some housekeeping.
I would rather complete the Discoverer life cycle management by having it register all the Collectors on its own (as initially done in this PR) and then, before returning from the Run method, unregister them all again. In this way, we would also get rid of all the metrics related to an SD if that SD gets completely removed by a config change (rather than just restarted).
(Disclaimer: I'm not very familiar with the SD code. My idea above might not work.)
> This would be an entirely new pattern (AFAIK). It feels weird to me that a `Discoverer` has to "hand out" internals, like the `prometheus.Collector` instances it owns, to the caller for some housekeeping.
I agree, it does look weird.
> I would rather complete the `Discoverer` life cycle management by having it register all the Collectors on its own (as initially done in this PR) and then, before returning from the `Run` method, unregister them all again. In this way, we would also get rid of all the metrics related to an SD if that SD gets completely removed by a config change (rather than just restarted).
I'd be happy to let the SDs manage their own registration, but would you mind if we do it a bit differently than before? Instead of registering the metrics as part of the `NewDiscoverer()` call, could we pass a registerer via the `Run()` function? This way we wouldn't "leak" a metric in case `NewDiscoverer()` is called but `Run()` never gets called.
This sounds good to me (given my limited insight ;) ). I'd say go ahead with that idea and see what breaks and if any of the devs with more SD experience intervene.
So yes, I think the `Registerer` should be passed with the `Run` method, so that you don't need a separate `Destroy` method (as I have seen in the latest commit, but maybe that was added before you wrote the above comment).
Let me know when I should commence reviewing.
Apologies, I meant to write a GitHub comment last week about the Destroy method, but didn't get around to it. I think we have a choice between three approaches:
1. Register metrics when creating a new SD, and unregister them by calling `Destroy()` from the Manager:
   - This is what I did in the latest commit - e30ce0f12758
   - If a metric registration fails, it fails during the creation of the SD, which to me seems more correct than failing during `Run()`. Metric registration feels like something which should be part of SD creation (the `NewDiscoverer()` method). Otherwise it would be possible to "create" multiple of the same SDs and only realise that they fail to run after calling `Run()` on the ones after the 1st (since all but the 1st will fail to register their metrics).
   - This means the SD Manager can unregister metrics by calling `Destroy()` even if the `Run()` function was never run.
   - A downside to this approach is that it's easy for the Manager to miss calling `Destroy()` in some situations. E.g. what if the creation of an SD failed, but metrics were still registered? Then the Manager would still have to call `Destroy()`.
2. Register metrics during `Run()`, and unregister them before `Run()` returns:
   - The nice thing is that the SD Manager doesn't have to call `Destroy()`.
   - The bad thing is we only find out the registration failed after calling `Run()`. It feels a bit weird to me that the creation of an SD was successful, only to fail due to metric registration.
3. Register metrics during creation of the SD, and unregister them during `Run()`:
   - Again, the nice thing is the Manager doesn't have to call `Destroy()`.
   - `Run()` should not be used for cleaning up resources acquired outside of `Run()`, IMO. So I'd really prefer not to go with this approach.
Approach 1 seems the cleanest to me, but I'm happy to implement any approach that you think is reasonable. The big question is: which function do you think should fail if metric registration fails - `NewDiscoverer()` or `Run()`?
BTW I am planning to close this PR and instead will open separate PRs for different parts of the codebase - one PR for the scrape system, and another for the SDs. But it would be nice if we agree on an approach for the SDs prior to the new SD PR.
I like (2) best. This is under the assumption that calling the `New...` function usually happens in close connection with calling the `Run` method, so that it actually doesn't make a big difference which method fails due to metrics collisions.
And even if calling `Run` happens much later than calling `New...`, there is the question of what you will do if it fails. I assume in most cases you are probably going to exit the program anyway, which again means it doesn't make a big difference where the error is returned.
All good points. First the simple answers:
I'll comment on the double registration problem as a follow-up to the separate comment above.
@beorn7, thank you so much for the help with this PR! It is a very big one, so I am going to close it and will instead open two PRs in its place - one for the scrape package, and another for the SD. The PR for the scrape package is open for review - please feel free to take a look at #12958. I hope that by having the changes in two PRs, they will be easier to review and easier to test.
Projects which import Prometheus sometimes need to use their own registerers. This PR is an example implementation which would allow that to happen.
What I have not yet done:
A few unknowns:
Closes #12549