Add optional TLS support to /metrics endpoint#7255
Conversation
89c8dbf to
7ac7560
Compare
|
@chrisohaver @SuperQ Would you have time to review? :) |
|
@chrisohaver @SuperQ can you please take a look on this PR? It would be nice to have this PR moving forward |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #7255 +/- ##
==========================================
+ Coverage 55.70% 63.19% +7.49%
==========================================
Files 224 278 +54
Lines 10016 15129 +5113
==========================================
+ Hits 5579 9561 +3982
- Misses 3978 4879 +901
- Partials 459 689 +230 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
It would be really nice to have someone review this and start the discussion rolling |
|
It would be helpful if someone could take a look at this. |
7ac7560 to
f12b5dd
Compare
|
@jameshartig @miekg @SuperQ @greenpau @Tantalor93 Can I get a review on this? What are the next steps needed to get this merged? |
f12b5dd to
ce041ac
Compare
|
I consider TLS on metric endpoints to be an anti feature |
I understand the desire to minimise complexity—especially on internal metrics endpoints—but in a zero-trust architecture, all network traffic, regardless of source or function, must be both authenticated and encrypted. Tampering with metrics may seem low-impact, but it can lead to unpredictable or harmful behavior. For example:
In multi-tenant or regulated environments where the network is not implicitly trusted, TLS is not an anti-feature—it’s a baseline security control. Even when the content isn’t sensitive, preserving integrity and controlling access are essential. To be clear, the PR does not propose mandatory TLS—it introduces optional support, as indicated in the title: “Add optional TLS support to /metrics endpoint”. Rejecting TLS outright disregards the realities of deployment models built on zero-trust principles. |
ce041ac to
8f81c13
Compare
9239ecc to
2b84a94
Compare
|
I've simplified this TLS implementation quite a bit. It's still using the Prometheus exporter-toolkit but now in the intended way. There is no more creating any temp files (this was the concern brought up in this comment: #7255 (comment)). @jameshartig @miekg @SuperQ @greenpau @Tantalor93 Can I get a review on this? What are the next steps needed to get this merged? |
2b84a94 to
8c5515c
Compare
kashifest
left a comment
There was a problem hiding this comment.
lgtm, it looks ok. It would nice to have someone with review rights have a look into this.
1fbe753 to
cd49281
Compare
|
@johnbelamaric would you please take a look at this one? |
|
@SuperQ any additional feedback on the updated PR? |
cd49281 to
5c93f42
Compare
|
@johnbelamaric @SuperQ Can we please decide how to proceed with this PR? |
5c93f42 to
839f096
Compare
|
@peppi-lotta can you rebase? |
839f096 to
ef076fe
Compare
|
@johnbelamaric I've rebased now :) |
| select { | ||
| case err := <-startResult: | ||
| return err | ||
| case <-time.After(200 * time.Millisecond): |
There was a problem hiding this comment.
I don't love this. Is there instead a status or something we can check on the control structure?
There was a problem hiding this comment.
Hi @johnbelamaric, joining the discussion.
I have been discussing this with @peppi-lotta, and I think there is a way to avoid waiting a predefined time for the service to be ready. I also agree that a timer-based approach can be problematic for various reasons. The ServeTLS method in net/http, which web.Serve ultimately calls, unfortunately does not provide a usable hook for checking service status. I don't think building an additional health-check mechanism that sends probe messages is a feasible solution either.
However, we know that ServeTLS essentially follows two paths: it either exits with an error or it calls Accept. ServeTLS creates a new listener but uses the listener argument to accept connections. Therefore, we can wrap the listener passed to web.Serve like this, e.g.:
// startupListener wraps a net.Listener to detect when Accept() is first called
type startupListener struct {
readyOnce sync.Once
ready chan struct{}
net.Listener
}
func newStartupListener(l net.Listener) *startupListener {
return &startupListener{
Listener: l,
ready: make(chan struct{}),
}
}
func (sl *startupListener) Accept() (net.Conn, error) {
// Signal ready on first Accept() call (server is running)
sl.readyOnce.Do(func() {
close(sl.ready)
})
return sl.Listener.Accept()
}
func (sl *startupListener) Ready() <-chan struct{} {
return sl.ready
}
...
ln, err := reuseport.Listen("tcp", m.Addr)
m.ln := newStartupListener(ln)
...
err := web.Serve(m.ln, server, webConfig, logger)
// Wait for server to be ready (first Accept() called) or error
select {
case <-startupListener.Ready():
log.Println("Server is ready and accepting connections")
case err := <-serverErr:
log.Fatalf("Server failed to start: %v", err)
}
What do you think?
There was a problem hiding this comment.
I pushed a commit implementing the suggested approach. Seems to work well :)
| WebConfigFile: &m.tlsConfigPath, | ||
| } | ||
|
|
||
| logger := slog.New(slog.NewTextHandler(os.Stderr, nil)) |
There was a problem hiding this comment.
I assume we need this particular object for the prometheus exporter? How does this interplay with our existing logging?
There was a problem hiding this comment.
Yes the prometheus web toolkit requires a slog.Logger type object to be passed to it. I think it interplays nicely. There is not a lot of logging when the the TLS setup is successful as can be seen from bellow.
$ kubectl logs -n kube-system coredns-67ddfdcbd-6lhfq
maxprocs: Leaving GOMAXPROCS=22: CPU quota undefined
time=2026-03-11T06:43:05.148Z level=INFO msg="Listening on" address=[::]:9153
time=2026-03-11T06:43:05.148Z level=INFO msg="TLS is enabled." http2=true address=[::]:9153
.:53
[INFO] plugin/reload: Running configuration SHA512 = 815c08e687bdc067019d64a774daf517a0fd3bb19531a7596741f57e78e3abb3dade4ae3bbe4898839f7bab7622c2bb91b27ce2f2c92d52b53e992dcd26ecbad
CoreDNS-1.14.2
linux/amd64, go1.25.5,
ef076fe to
b10bd56
Compare
…dpoint Signed-off-by: peppi-lotta <peppi-lotta.saari@est.tech>
b10bd56 to
8ee9ace
Compare
Signed-off-by: peppi-lotta <peppi-lotta.saari@est.tech>
|
thanks! |
* Use exporter-toolkit to enable optional TLS encryption on /metrics endpoint Signed-off-by: peppi-lotta <peppi-lotta.saari@est.tech> * Implement startup listener to signal server readiness Signed-off-by: peppi-lotta <peppi-lotta.saari@est.tech> --------- Signed-off-by: peppi-lotta <peppi-lotta.saari@est.tech>
1. Why is this pull request needed?
It adds optional TLS support to the CoreDNS metrics endpoint, allowing metrics to be served over HTTPS. This improves security by encrypting metrics traffic. TLS is disabled by default, so existing configurations continue to work.
2. Related issues
#7109
3. Documentation changes
4. Backward compatibility
No breaking changes. TLS is optional and off by default.