Bug Report
I noticed that my Prometheus server is leaking goroutines, and that the goroutine count correlates with the number of established TCP connections (graph omitted).
The TCP connections in question all go to the same host, which is on the other end of a pretty unreliable VPN tunnel.
This is the top of the goroutine profile for this Prometheus server:
goroutine profile: total 6062
1638 @ 0x42e0cb 0x42e173 0x405a8e 0x40577b 0x6c7bb7 0x6c948d 0x6d374e 0x45c0c1
# 0x6c7bb6 net/http.(*persistConn).addTLS+0x1a6 /usr/local/go/src/net/http/transport.go:1177
# 0x6c948c net/http.(*Transport).dialConn+0x15ac /usr/local/go/src/net/http/transport.go:1250
# 0x6d374d net/http.(*Transport).getConn.func4+0x6d /usr/local/go/src/net/http/transport.go:999
1637 @ 0x42e0cb 0x4293f9 0x428aa6 0x4963fa 0x49650d 0x497259 0x534b1f 0x548298 0x845bba 0x60a2e9 0x60a7fd 0x60ce0a 0x60fa6a 0x60f757 0x60e481 0x6d3852 0x45c0c1
# 0x428aa5 internal/poll.runtime_pollWait+0x65 /usr/local/go/src/runtime/netpoll.go:173
# 0x4963f9 internal/poll.(*pollDesc).wait+0x99 /usr/local/go/src/internal/poll/fd_poll_runtime.go:85
# 0x49650c internal/poll.(*pollDesc).waitRead+0x3c /usr/local/go/src/internal/poll/fd_poll_runtime.go:90
# 0x497258 internal/poll.(*FD).Read+0x178 /usr/local/go/src/internal/poll/fd_unix.go:169
# 0x534b1e net.(*netFD).Read+0x4e /usr/local/go/src/net/fd_unix.go:202
# 0x548297 net.(*conn).Read+0x67 /usr/local/go/src/net/net.go:177
# 0x60a2e8 crypto/tls.(*block).readFromUntil+0x88 /usr/local/go/src/crypto/tls/conn.go:492
# 0x60a7fc crypto/tls.(*Conn).readRecord+0xdc /usr/local/go/src/crypto/tls/conn.go:593
# 0x60ce09 crypto/tls.(*Conn).readHandshake+0x99 /usr/local/go/src/crypto/tls/conn.go:955
# 0x60fa69 crypto/tls.(*clientHandshakeState).handshake+0xa9 /usr/local/go/src/crypto/tls/handshake_client.go:191
# 0x60f756 crypto/tls.(*Conn).clientHandshake+0x396 /usr/local/go/src/crypto/tls/handshake_client.go:168
# 0x60e480 crypto/tls.(*Conn).Handshake+0xf0 /usr/local/go/src/crypto/tls/conn.go:1272
# 0x6d3851 net/http.(*persistConn).addTLS.func2+0x41 /usr/local/go/src/net/http/transport.go:1171
1636 @ 0x42e0cb 0x42e173 0x405a8e 0x40577b 0x6d354a 0x45c0c1
# 0x6d3549 net/http.(*Transport).getConn.func2.1+0x49 /usr/local/go/src/net/http/transport.go:965
The relevant part of net/http.(*persistConn).addTLS() is:

    var timer *time.Timer // for canceling TLS handshake
    if d := pconn.t.TLSHandshakeTimeout; d != 0 {
        timer = time.AfterFunc(d, func() {
            errc <- tlsHandshakeTimeoutError{}
        })
    }
    go func() {
        if trace != nil && trace.TLSHandshakeStart != nil {
            trace.TLSHandshakeStart()
        }
        err := tlsConn.Handshake()
        if timer != nil {
            timer.Stop()
        }
        errc <- err
    }()
    if err := <-errc; err != nil {
        plainConn.Close()
        if trace != nil && trace.TLSHandshakeDone != nil {
            trace.TLSHandshakeDone(tls.ConnectionState{}, err)
        }
        return err
    }

As you can see, it can hang forever if TLSHandshakeTimeout is set to zero: no timer is armed, so if the peer never completes the handshake, the goroutine running tlsConn.Handshake() blocks indefinitely in Read(), and addTLS() waits forever on <-errc. That matches the three big stacks in the profile above.
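To make the failure mode easy to see outside Prometheus, here is a minimal reproduction sketch of my own (illustrative only, not code from Prometheus): a listener that accepts TCP connections but never answers the TLS ClientHello, plus a client whose transport leaves TLSHandshakeTimeout at zero.

    package main

    import (
        "fmt"
        "net"
        "net/http"
        "runtime"
        "time"
    )

    func main() {
        // Simulate the unresponsive peer behind the flaky VPN: accept TCP
        // connections and then go silent, never completing the TLS handshake.
        ln, err := net.Listen("tcp", "127.0.0.1:0")
        if err != nil {
            panic(err)
        }
        var conns []net.Conn // keep references so the connections stay open
        go func() {
            for {
                c, err := ln.Accept()
                if err != nil {
                    return
                }
                conns = append(conns, c)
            }
        }()

        // TLSHandshakeTimeout is left at its zero value, i.e. no limit.
        client := &http.Client{Transport: &http.Transport{}}
        for i := 0; i < 10; i++ {
            // Each request blocks in the TLS handshake and never returns.
            go client.Get("https://" + ln.Addr().String())
        }

        time.Sleep(2 * time.Second)
        // Every stuck request pins several goroutines (getConn, addTLS, and
        // the handshake reader), matching the stacks in the profile above.
        fmt.Println("goroutines:", runtime.NumGoroutine())
    }

With TLSHandshakeTimeout set to a nonzero value, the same program settles back down once the timer fires, because addTLS() returns and closes the plain connection, which unblocks the handshake goroutine. And in Prometheus, the HTTP client's transport comes from github.com/prometheus/common/config's NewRoundTripperFromConfig(), where TLSHandshakeTimeout isn't explicitly set (so it defaults to zero):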
    var rt http.RoundTripper = &http.Transport{
        Proxy:               http.ProxyURL(cfg.ProxyURL.URL),
        MaxIdleConns:        20000,
        MaxIdleConnsPerHost: 1000, // see https://github.com/golang/go/issues/13801
        DisableKeepAlives:   false,
        TLSClientConfig:     tlsConfig,
        DisableCompression:  true,
        // 5 minutes is typically above the maximum sane scrape interval. So we can
        // use keepalive for all configurations.
        IdleConnTimeout: 5 * time.Minute,
        DialContext: conntrack.NewDialContextFunc(
            conntrack.DialWithTracing(),
            conntrack.DialWithName(name),
        ),
    }

I tried setting TLSHandshakeTimeout to 10 seconds, and ran the patched and unpatched versions of Prometheus with the same configuration.
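The patch under test boils down to a single extra field on that transport. As a sketch (my paraphrase, not the exact diff; all the other fields stay as shown above):

    package main

    import (
        "net/http"
        "time"
    )

    // newPatchedTransport mirrors the transport above; only the new field is
    // spelled out here, the others are elided for brevity.
    func newPatchedTransport() *http.Transport {
        return &http.Transport{
            // ... Proxy, MaxIdleConns, TLSClientConfig, etc. as above ...
            // Bound the TLS handshake the same way net/http.DefaultTransport
            // does, so a silent peer can pin a dial for at most 10 seconds.
            TLSHandshakeTimeout: 10 * time.Second,
        }
    }

    func main() { _ = newPatchedTransport() }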
Here are the goroutine and TCP session graphs for the patched and unpatched versions (graphs omitted). And the top of the goroutine profile for the patched version:
goroutine profile: total 1227
566 @ 0x42e0cb 0x43e0f6 0x17b0172 0x45c0c1
# 0x17b0171 github.com/prometheus/prometheus/scrape.(*scrapeLoop).run+0x821 /home/knecht/dev/git/prometheus/prometheus/scrape/scrape.go:886
219 @ 0x42e0cb 0x43e0f6 0x6cc2d3 0x45c0c1
# 0x6cc2d2 net/http.(*persistConn).writeLoop+0x112 /usr/lib/go-1.11/src/net/http/transport.go:1885
143 @ 0x42e0cb 0x4293f9 0x428aa6 0x4963fa 0x49650d 0x497259 0x534b1f 0x548298 0x845bba 0x6ca1c5 0x5b4dcf 0x5b4f2f 0x6cad12 0x45c0c1
# 0x428aa5 internal/poll.runtime_pollWait+0x65 /usr/lib/go-1.11/src/runtime/netpoll.go:173
# 0x4963f9 internal/poll.(*pollDesc).wait+0x99 /usr/lib/go-1.11/src/internal/poll/fd_poll_runtime.go:85
# 0x49650c internal/poll.(*pollDesc).waitRead+0x3c /usr/lib/go-1.11/src/internal/poll/fd_poll_runtime.go:90
# 0x497258 internal/poll.(*FD).Read+0x178 /usr/lib/go-1.11/src/internal/poll/fd_unix.go:169
# 0x534b1e net.(*netFD).Read+0x4e /usr/lib/go-1.11/src/net/fd_unix.go:202
# 0x548297 net.(*conn).Read+0x67 /usr/lib/go-1.11/src/net/net.go:177
# 0x6ca1c4 net/http.(*persistConn).Read+0x74 /usr/lib/go-1.11/src/net/http/transport.go:1497
# 0x5b4dce bufio.(*Reader).fill+0x10e /usr/lib/go-1.11/src/bufio/bufio.go:100
# 0x5b4f2e bufio.(*Reader).Peek+0x3e /usr/lib/go-1.11/src/bufio/bufio.go:132
# 0x6cad11 net/http.(*persistConn).readLoop+0x1a1 /usr/lib/go-1.11/src/net/http/transport.go:1645
I'm going to submit a PR for github.com/prometheus/common to set TLSHandshakeTimeout to the same value as net/http.DefaultTransport (10 seconds).
Other Prometheus components may be affected. I've seen the same issue in the blackbox_exporter at least.
Environment
- System information:
Linux 4.15.0-30-generic x86_64
- Prometheus version:
prometheus, version 2.8.0 (branch: HEAD, revision: 59369491cfdfe8dcb325723d6d28a837887a07b9)
build user: root@4c4d5c29b71f
build date: 20190312-07:46:58
go version: go1.11.5