
Prefer to not use the global registry#12565

Closed
ptodev wants to merge 6 commits into prometheus:main from ptodev:prefer-to-not-register-metrics-globally

Conversation

@ptodev
Contributor

@ptodev ptodev commented Jul 14, 2023

Projects which import Prometheus sometimes need to use their own registerers. This PR is an example implementation which would allow that to happen.

What I have not yet done:

  • I haven't tested this locally.
  • I haven't implemented this for all service discovery components. But if you confirm that the approach makes sense, I could do it for the rest of them too.

A few unknowns:

  • What is a good way of testing this change?
  • Do I need to add any unit tests?
  • Is it ok that SD components will panic if more than one tries to use the default registerer? This is a breaking change.
  • Is it ok that some metrics won't be visible now if components aren't set up in the config? E.g. if an SD component is not used, there won't be metrics for it. This is a breaking change.

Closes #12549
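As a sketch of the idea: instead of registering SD metrics on the global registry in an init(), each component registers on a registerer chosen by the caller. The types and the metric name below are dependency-free stand-ins, not the real client_golang API (the actual code would use prometheus.Registerer):

```go
package main

import "fmt"

// Registerer is a stand-in for prometheus.Registerer, kept minimal so
// the sketch has no external dependencies.
type Registerer interface {
	Register(name string) error
}

// registry mimics a prometheus.Registry's duplicate detection.
type registry struct{ seen map[string]bool }

func (r *registry) Register(name string) error {
	if r.seen[name] {
		return fmt.Errorf("duplicate metrics collector registration attempted: %s", name)
	}
	r.seen[name] = true
	return nil
}

// NewFileSD registers its metrics on the registerer chosen by the
// caller, instead of on the global registry in an init().
func NewFileSD(reg Registerer) error {
	return reg.Register("prometheus_sd_file_read_errors_total")
}

func main() {
	reg := &registry{seen: map[string]bool{}}
	fmt.Println(NewFileSD(reg)) // <nil>: first instance registers fine

	// A second instance on the same registry fails loudly instead of
	// silently sharing the same metrics.
	fmt.Println(NewFileSD(reg) != nil) // true
}
```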

@ptodev ptodev marked this pull request as draft July 14, 2023 15:35
@ptodev ptodev changed the title [WIP] Prefer to not use the global registerer [WIP] Prefer to not use the global registry Jul 14, 2023
@ptodev ptodev marked this pull request as ready for review July 14, 2023 15:40
@ptodev ptodev marked this pull request as draft July 14, 2023 15:41
@ptodev
Contributor Author

ptodev commented Jul 14, 2023

Hi @roidelapluie and @LeviHarrison 👋 Since you are the default maintainers, would you mind kindly taking a look at the PR please?

@ptodev ptodev force-pushed the prefer-to-not-register-metrics-globally branch from 19aee4d to 301ee8e Compare August 2, 2023 13:43
@bboreham
Member

bboreham commented Aug 2, 2023

Do I need to add any unit tests?

On the face of it, no. There are existing tests checking metrics.

Is it ok that SD components will panic if more than one tries to use the default registerer? This is a breaking change.

I didn't understand this question.

Is it ok that some metrics won't be visible now if components aren't setup in the config? E.g. if a SD component is not used, there won't be metrics for it. This is a breaking change.

That sounds OK to me.

@beorn7
Member

beorn7 commented Aug 8, 2023

Is it ok that SD components will panic if more than one tries to use the default registerer? This is a breaking change.

I didn't understand this question.

I guess this can only happen if the Prometheus code is used as a library (because Prometheus itself would never use multiple instances of the same SD). In that case, I think panicking is even better than silently sharing the same metrics. (So it's a "breaking change" only for those that use Prometheus code as a library? Would be good indeed if you could clarify what you mean by "breaking change" here.)

Is it ok that some metrics won't be visible now if components aren't setup in the config? E.g. if a SD component is not used, there won't be metrics for it. This is a breaking change.

That sounds OK to me.

I think it's even better than before. Seeing tons of metrics for all the SD methods that I'm never going to use is annoying. Note that we explicitly allow changes in the exported metrics without bumping the major version. See https://prometheus.io/docs/prometheus/latest/stability/#api-stability-guarantees (“Things considered unstable […]: The metrics in the /metrics endpoint of Prometheus itself”).

I think this is an incremental improvement, and the same idea could be pushed further incrementally until we don't use the global registry anymore directly.

@roidelapluie we didn't really get to this during the bug scrub. If you have any objections, please make them known. Otherwise, I think @ptodev should finalize this PR. The aforementioned further increments can be done later at will.

@ptodev ptodev force-pushed the prefer-to-not-register-metrics-globally branch 5 times, most recently from 75459c4 to 8e2466d Compare August 22, 2023 08:45
@ptodev ptodev marked this pull request as ready for review August 23, 2023 08:41
@ptodev ptodev requested a review from dgl as a code owner August 23, 2023 08:41
@ptodev
Contributor Author

ptodev commented Aug 23, 2023

Is it ok that SD components will panic if more than one tries to use the default registerer? This is a breaking change.

I didn't understand this question.

I guess this can only happen if the Prometheus code is used as a library (because Prometheus itself would never use multiple instances of the same SD). In that case, I think panicking is even better than silently sharing the same metrics. (So it's a "breaking change" only for those that use Prometheus code as a library? Would be good indeed if you could clarify what you mean by "breaking change" here.)

This clarification is correct - it is only a "breaking change" for those that use Prometheus code as a library. For those folks, it won't be possible to use multiple instances of the same SD component with the global registry anymore, because the SD components no longer register their metrics using an init() function. As far as I can tell there is no advantage to using global metrics and an init() function, but I want to highlight this here since it's technically a breaking change.

@ptodev
Contributor Author

ptodev commented Aug 23, 2023

Thank you for the feedback, @bboreham and @beorn7 ! Since there are no objections to the idea of moving away from the global registry, I polished the PR and it is now ready for review.

@beorn7 beorn7 changed the title [WIP] Prefer to not use the global registry Prefer to not use the global registry Aug 23, 2023
Member

@beorn7 beorn7 left a comment


I like it. I have already done a detailed review and found only some stylistic nits.

@roidelapluie could we get your general opinion on this? I think this is in line with what we generally want (avoiding the global registry, for one because that is generally better in complex code bases, but also to make life easier for those using our packages as libraries).

@ptodev ptodev force-pushed the prefer-to-not-register-metrics-globally branch 3 times, most recently from c3d2477 to 66c6c68 Compare September 1, 2023 14:11
Contributor Author

@ptodev ptodev left a comment


Thank you for the thorough review, @beorn7 ! I pushed a commit to address your comments. I also pushed a few other changes:

  • An update to discovery/README.md, which mentions the DiscovererOptions interface.
  • Passing prometheus.DefaultRegisterer as a registerer in DiscovererOptions in a few other places:
    • discovery/legacymanager/manager.go
    • discovery/manager.go

The main things that worried me about this PR are:

  • Situations where an SD might get recreated, in which case the metrics will get registered again to the same registry, causing a failure.
  • Places where we are not setting a registerer in DiscovererOptions. In that case the registerer would be nil, and the metrics registration would fail.

To test those concerns I also ported the file_sd mechanism and used it to test locally, with these configs:

Prometheus config:

```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'

    file_sd_configs:
      - files:
          - "/Users/paulintodev/Desktop/targets.json"
```

File SD config:

```json
[
  {
    "targets": [ "localhost:9090" ],
    "labels": {
      "paulin_prom": "paulin_label_value"
    }
  }
]
```

Prometheus ran fine, with no errors in the logs. The metrics for consul, azure, dns SDs were missing as expected. The scrape metrics and the file_sd ones were getting populated. I ran it for a few minutes and I don't have a reason to think something is going wrong.

I didn't test promtool.

@beorn7
Member

beorn7 commented Sep 5, 2023

Situations where an SD might get recreated, in which case the metrics will get registered again to the same registry, causing a failure.

If that ever happens legitimately, then we would need to come up with a way to unregister. Or use this pattern (though that is a bit frowned upon). For now, I would expect that this case doesn't happen.

Places where we are not setting a registerer in DiscovererOptions. In that case the registerer would be nil, and the metrics registration would fail.

A pattern used in some places is to allow a nil registerer and treat it as "I don't want any metrics to get registered". I.e. you simply have to guard the Register call with if reg != nil { ... }. That's actually quite convenient if you want to use the library in a test or something where you aren't interested in the metrics; instead of passing in a dummy registry, you can just say nil.
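A minimal sketch of that guard, assuming a hypothetical registerSDMetrics helper, with a stand-in for prometheus.Registerer so the snippet stays dependency-free (the metric name is illustrative):

```go
package main

import "fmt"

// Registerer stands in for prometheus.Registerer to keep the sketch
// self-contained.
type Registerer interface {
	Register(name string) error
}

type listRegistry struct{ names []string }

func (l *listRegistry) Register(name string) error {
	l.names = append(l.names, name)
	return nil
}

// registerSDMetrics treats a nil Registerer as "metrics no-op mode":
// nothing is registered and no error is returned.
func registerSDMetrics(reg Registerer) error {
	if reg == nil {
		return nil // caller opted out of metrics, e.g. in a test
	}
	return reg.Register("prometheus_sd_refresh_failures_total")
}

func main() {
	fmt.Println(registerSDMetrics(nil)) // <nil>: silently skipped

	reg := &listRegistry{}
	fmt.Println(registerSDMetrics(reg), len(reg.names)) // <nil> 1
}
```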

@beorn7
Member

beorn7 commented Sep 5, 2023

And BTW, I chatted with @roidelapluie . Generally, he likes what you are doing here. So let's brush up final things here (like the nil registerer case discussed above), and then I can do the final code review.

@ptodev ptodev force-pushed the prefer-to-not-register-metrics-globally branch from 66c6c68 to ffcdbd0 Compare September 20, 2023 10:20
and unregistration for SD components.

The metrics for the Discovery Manager are no longer global and are
now registered when the Manager is created, instead of in an init().

The Prometheus Registry which the main function uses is now defined
in only one place.
@ptodev
Contributor Author

ptodev commented Sep 21, 2023

Situations where an SD might get recreated, in which case the metrics will get registered again to the same registry, causing a failure.

If that ever happens legitimately, then we would need to come up with a way to unregister. Or use this pattern (though that is a bit frowned upon). For now, I would expect that this case doesn't happen.

Unfortunately I just realised there is a situation where we could indeed register the same metric twice. It would happen during a config reload. I see this error when trying a reload locally:

ts=2023-09-20T11:15:41.195Z caller=manager.go:319 level=error component="discovery manager scrape" msg="Cannot create service discovery" err="failed to register metric: duplicate metrics collector registration attempted" type=file config=prometheus

I think a nice way to solve this is to have the discovery.Manager manage metric registration and unregistration. I pushed a commit which implements this for Consul SD. Would you mind taking a look and letting me know if this would be ok please?

The bad thing about unregistration is that metrics will be cleared each time there is a config reload (e.g. counters would start from 0). However, I think this is the expected behaviour - there are other places in the code which do delete series with certain labels when config is reloaded (e.g. the stop() function for the scrape pool).
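A toy illustration of the reload failure and the unregister-on-teardown fix; the registry type here is a stand-in with prometheus.Registry-like Register/Unregister semantics, and the metric name is illustrative:

```go
package main

import "fmt"

// registry mimics a prometheus.Registry's duplicate detection and
// unregistration, with no external dependencies.
type registry struct{ seen map[string]bool }

func (r *registry) Register(name string) error {
	if r.seen[name] {
		return fmt.Errorf("duplicate metrics collector registration attempted")
	}
	r.seen[name] = true
	return nil
}

func (r *registry) Unregister(name string) { delete(r.seen, name) }

func main() {
	reg := &registry{seen: map[string]bool{}}
	const metric = "prometheus_sd_file_scan_duration_seconds"

	// Initial config load registers the SD's metrics.
	fmt.Println(reg.Register(metric) == nil) // true

	// A config reload that recreates the SD without cleanup hits the
	// duplicate-registration error from the log above.
	fmt.Println(reg.Register(metric)) // duplicate metrics collector registration attempted

	// If the Manager unregisters on teardown, the reload succeeds
	// (counters restart from zero, as noted).
	reg.Unregister(metric)
	fmt.Println(reg.Register(metric) == nil) // true
}
```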

Places where we are not setting a registerer in DiscovererOptions. In that case the registerer would be nil, and the metrics registration would fail.

A pattern used at some places is to allow a nil registerer and treat it as "I don't want any metrics to get registered". I.e. you simply have to guard the Register call by a if reg != nil { ... }. That's actually quite convenient if you want to use the library in a test or something, where you aren't interested in the metrics, and instead of passing in a dummy registry, you can just say nil.

The first time I worked on this PR, I did actually implement it this way - if a nil registerer was passed to a SD, then it was ignored with no error. However, later I changed it to always requiring a registerer to be present. I like this approach more even in tests, because it helped me catch certain errors. Would you mind if we continue treating a nil registerer as an error please?

I just realised the current implementation will probably panic if a nil Registerer is passed to an SD. I could change it to returning an error instead of panicking?

```go
// A slice of prometheus collectors which need to be registered and unregistered:
// * Registration should happen prior to calling Run().
// * Unregistration should happen after Run() returns.
GetCollectors() []prometheus.Collector
```
Contributor Author


@beorn7 Do you think this new function is a good approach? If it looks good to you, I could implement it for all SD mechanisms.

If the manager handles registration, then it doesn't even need to pass a registerer to the SD. I personally think it would help reduce boilerplate and unintentional errors.

Another approach would be to have each SD manage its own registration and unregistration inside the Run() function, but I'm afraid this would be too manual and lead to errors.

Member


This would be an entirely new pattern (AFAIK). It feels weird to me that a Discoverer has to "hand out" internals, like the prometheus.Collector instances it owns, to the caller for some house keeping.

I would rather complete the Discoverer life cycle management by having it register all the Collectors on its own (as initially done in this PR) and then, before returning from the Run method, unregister them all again. In this way, we would also get rid of all the metrics related to an SD if that SD gets completely removed by a config change (rather than just restarted).

(Disclaimer: I'm not very familiar with the SD code. My idea above might not work.)

Contributor Author


This would be an entirely new pattern (AFAIK). It feels weird to me that a Discoverer has to "hand out" internals, like the prometheus.Collector instances it owns, to the caller for some house keeping.

I agree, it does look weird.

I would rather complete the Discoverer life cycle management by having it register all the Collectors on its own (as initially done in this PR) and then, before returning from the Run method, unregister them all again. In this way, we would also get rid of all the metrics related to an SD if that SD gets completely removed by a config change (rather than just restarted).

I'd be happy to let the SDs manage their own registration, but would you mind if we do it a bit differently than before? Instead of registering the metrics as part of the NewDiscoverer() call, wouldn't it be better to pass a registerer via the Run() function? This way we wouldn't "leak" a metric in case NewDiscoverer is called but Run never gets called.

Member


This sounds good to me (given my limited insight ;) ). I'd say go ahead with that idea and see what breaks and if any of the devs with more SD experience intervene.

Member


So yes, I think the Registerer should be passed with the Run method, so that you don't need a separate Destroy method (as I have seen in the latest commit, but maybe that was added before you wrote the above comment).

Let me know when I should commence reviewing.

Contributor Author


Apologies, I meant to write a GitHub comment last week about the Destroy method, but didn't get around to it. I think we have a choice between three approaches:

  1. Register metrics when creating a new SD and unregister them by calling Destroy from the Manager:

    • This is what I did in the latest commit - e30ce0f12758
    • If a metric registration fails, it will fail during the creation of the SD, which to me seems more correct than failing during Run(). Metric registration feels like something which should be part of SD creation (the NewDiscoverer() method). Otherwise it would be possible to "create" multiple of the same SDs and only realise after calling Run() that all instances but the first fail, since all but the first will fail to register their metrics.
    • This means the SD Manager can unregister metrics by calling Destroy() even if the Run() function was never run.
    • A downside to this approach is that it's easy for the Manager to miss calling a Destroy() in some situations. E.g. what if the creation of an SD failed, but metrics were still registered? Then the Manager would still have to call Destroy().
  2. Register metrics during Run() and unregister them during Run()

    • The nice thing is that SD Manager doesn't have to call Destroy()
    • The bad thing is we only find out the registration failed after calling Run(). It feels a bit weird to me that the creation of an SD was successful, only to fail due to metric registration.
  3. Register metrics during creation of the SD, and unregister them during Run()

    • Again, the nice thing is Manager doesn't have to call a Destroy().
    • Run() should not be used for cleaning up resources required outside of Run() IMO. So I'd really prefer not to go with this approach.

Approach 1 seems the cleanest to me, but I'm happy to implement any approach that you think is reasonable. The big question is - which function do you think should fail if metric registration fails - NewDiscoverer() or Run()?

BTW I am planning to close this PR and instead will open separate PRs for different parts of the codebase - one PR for the scrape system, and another for the SDs. But it would be nice if we agree on an approach for the SDs prior to the new SD PR.

Member


I like (2) best. This is under the assumption that calling the New... function usually happens in close connection with calling the Run method, so that it actually doesn't make a big difference which method fails due to metrics collisions.

And even if calling Run happens much later after calling New..., there is the question what you will do if it fails. I assume in most cases you are probably going to exit the program execution anyway, which again means it doesn't make a big difference where the error is returned.

@ptodev ptodev requested a review from beorn7 September 21, 2023 09:30
@beorn7
Member

beorn7 commented Sep 21, 2023

All good points. First the simple answers:

  • Offering a "metrics no-op mode" when nil is passed in as a Registerer is a convenience feature for those using the Prometheus code as a library. We should ask them what they want. And I guess you are their representative here. If you think you don't need it, that's fine.
  • If you want to enforce passing in a non-nil Registerer, I think it is fine to panic because it can be seen as a programming error rather than something that happens based on user input or other external circumstances. But you should clearly document that a nil Registerer will cause a panic.
  • Unregistering metrics and starting counters from zero upon config reloads is probably fine. Depending on the situation, it might even be desired (for example if the used labels change due to the config reload – you don't want unused metrics to hang around).

I'll comment on the double registration problem as a follow-up to the separate comment above.

@ptodev
Contributor Author

ptodev commented Oct 9, 2023

@beorn7, thank you so much for the help with this PR! It is a very big one, so I am going to close it and will instead open two PRs in its place - one for the scrape package, and another for the SD.

The PR for the scrape package is open for review - please feel free to take a look at #12958

I hope that by having the changes in two PRs they will be easier to review and easier to test.



Development

Successfully merging this pull request may close these issues.

Prefer to not register metrics globally