
Prometheus can leak goroutines when scraping targets over unreliable connection #5394

@BenoitKnecht


Bug Report

I noticed that my Prometheus server is leaking goroutines, and that the leak correlates with the number of established TCP connections:

[Graph: goroutine count rising in step with established TCP connections]

The TCP connections in question all go to the same host, which is on the other end of a pretty unreliable VPN tunnel.

This is the top of the goroutine profile for this Prometheus server:

goroutine profile: total 6062
1638 @ 0x42e0cb 0x42e173 0x405a8e 0x40577b 0x6c7bb7 0x6c948d 0x6d374e 0x45c0c1
#       0x6c7bb6        net/http.(*persistConn).addTLS+0x1a6            /usr/local/go/src/net/http/transport.go:1177
#       0x6c948c        net/http.(*Transport).dialConn+0x15ac           /usr/local/go/src/net/http/transport.go:1250
#       0x6d374d        net/http.(*Transport).getConn.func4+0x6d        /usr/local/go/src/net/http/transport.go:999

1637 @ 0x42e0cb 0x4293f9 0x428aa6 0x4963fa 0x49650d 0x497259 0x534b1f 0x548298 0x845bba 0x60a2e9 0x60a7fd 0x60ce0a 0x60fa6a 0x60f757 0x60e481 0x6d3852 0x45c0c1
#       0x428aa5        internal/poll.runtime_pollWait+0x65                     /usr/local/go/src/runtime/netpoll.go:173
#       0x4963f9        internal/poll.(*pollDesc).wait+0x99                     /usr/local/go/src/internal/poll/fd_poll_runtime.go:85
#       0x49650c        internal/poll.(*pollDesc).waitRead+0x3c                 /usr/local/go/src/internal/poll/fd_poll_runtime.go:90
#       0x497258        internal/poll.(*FD).Read+0x178                          /usr/local/go/src/internal/poll/fd_unix.go:169
#       0x534b1e        net.(*netFD).Read+0x4e                                  /usr/local/go/src/net/fd_unix.go:202
#       0x548297        net.(*conn).Read+0x67                                   /usr/local/go/src/net/net.go:177
#       0x60a2e8        crypto/tls.(*block).readFromUntil+0x88                  /usr/local/go/src/crypto/tls/conn.go:492
#       0x60a7fc        crypto/tls.(*Conn).readRecord+0xdc                      /usr/local/go/src/crypto/tls/conn.go:593
#       0x60ce09        crypto/tls.(*Conn).readHandshake+0x99                   /usr/local/go/src/crypto/tls/conn.go:955
#       0x60fa69        crypto/tls.(*clientHandshakeState).handshake+0xa9       /usr/local/go/src/crypto/tls/handshake_client.go:191
#       0x60f756        crypto/tls.(*Conn).clientHandshake+0x396                /usr/local/go/src/crypto/tls/handshake_client.go:168
#       0x60e480        crypto/tls.(*Conn).Handshake+0xf0                       /usr/local/go/src/crypto/tls/conn.go:1272
#       0x6d3851        net/http.(*persistConn).addTLS.func2+0x41               /usr/local/go/src/net/http/transport.go:1171

1636 @ 0x42e0cb 0x42e173 0x405a8e 0x40577b 0x6d354a 0x45c0c1
#       0x6d3549        net/http.(*Transport).getConn.func2.1+0x49      /usr/local/go/src/net/http/transport.go:965

The relevant part of net/http.(*persistConn).addTLS() is:

var timer *time.Timer // for canceling TLS handshake
if d := pconn.t.TLSHandshakeTimeout; d != 0 {
        timer = time.AfterFunc(d, func() {
                errc <- tlsHandshakeTimeoutError{}
        })
}
go func() {
        if trace != nil && trace.TLSHandshakeStart != nil {
                trace.TLSHandshakeStart()
        }
        err := tlsConn.Handshake()
        if timer != nil {
                timer.Stop()
        }
        errc <- err
}()
if err := <-errc; err != nil {
        plainConn.Close()
        if trace != nil && trace.TLSHandshakeDone != nil {
                trace.TLSHandshakeDone(tls.ConnectionState{}, err)
        }
        return err
}

As you can see, if TLSHandshakeTimeout is zero, no timer is armed and the handshake goroutine can hang forever. And in Prometheus, the HTTP client's transport comes from github.com/prometheus/common/config.NewRoundTripperFromConfig(), where TLSHandshakeTimeout isn't explicitly set (so it defaults to zero):

var rt http.RoundTripper = &http.Transport{
        Proxy:               http.ProxyURL(cfg.ProxyURL.URL),
        MaxIdleConns:        20000,
        MaxIdleConnsPerHost: 1000, // see https://github.com/golang/go/issues/13801
        DisableKeepAlives:   false,
        TLSClientConfig:     tlsConfig,
        DisableCompression:  true,
        // 5 minutes is typically above the maximum sane scrape interval. So we can
        // use keepalive for all configurations. 
        IdleConnTimeout: 5 * time.Minute,
        DialContext: conntrack.NewDialContextFunc(
                conntrack.DialWithTracing(),
                conntrack.DialWithName(name),
        ),
}

I tried setting TLSHandshakeTimeout to 10 seconds, and ran the patched and unpatched versions of Prometheus with the same configuration.

Here are the goroutine and TCP session graphs for the patched version:

[Graph: goroutines and TCP sessions, patched (TLSHandshakeTimeout = 10s)]

and the unpatched version:

[Graph: goroutines and TCP sessions, unpatched]

And here is the top of the goroutine profile for the patched version:

goroutine profile: total 1227
566 @ 0x42e0cb 0x43e0f6 0x17b0172 0x45c0c1
#	0x17b0171	github.com/prometheus/prometheus/scrape.(*scrapeLoop).run+0x821	/home/knecht/dev/git/prometheus/prometheus/scrape/scrape.go:886

219 @ 0x42e0cb 0x43e0f6 0x6cc2d3 0x45c0c1
#	0x6cc2d2	net/http.(*persistConn).writeLoop+0x112	/usr/lib/go-1.11/src/net/http/transport.go:1885

143 @ 0x42e0cb 0x4293f9 0x428aa6 0x4963fa 0x49650d 0x497259 0x534b1f 0x548298 0x845bba 0x6ca1c5 0x5b4dcf 0x5b4f2f 0x6cad12 0x45c0c1
#	0x428aa5	internal/poll.runtime_pollWait+0x65	/usr/lib/go-1.11/src/runtime/netpoll.go:173
#	0x4963f9	internal/poll.(*pollDesc).wait+0x99	/usr/lib/go-1.11/src/internal/poll/fd_poll_runtime.go:85
#	0x49650c	internal/poll.(*pollDesc).waitRead+0x3c	/usr/lib/go-1.11/src/internal/poll/fd_poll_runtime.go:90
#	0x497258	internal/poll.(*FD).Read+0x178		/usr/lib/go-1.11/src/internal/poll/fd_unix.go:169
#	0x534b1e	net.(*netFD).Read+0x4e			/usr/lib/go-1.11/src/net/fd_unix.go:202
#	0x548297	net.(*conn).Read+0x67			/usr/lib/go-1.11/src/net/net.go:177
#	0x6ca1c4	net/http.(*persistConn).Read+0x74	/usr/lib/go-1.11/src/net/http/transport.go:1497
#	0x5b4dce	bufio.(*Reader).fill+0x10e		/usr/lib/go-1.11/src/bufio/bufio.go:100
#	0x5b4f2e	bufio.(*Reader).Peek+0x3e		/usr/lib/go-1.11/src/bufio/bufio.go:132
#	0x6cad11	net/http.(*persistConn).readLoop+0x1a1	/usr/lib/go-1.11/src/net/http/transport.go:1645

I'm going to submit a PR for github.com/prometheus/common to set TLSHandshakeTimeout to the same value as net/http.DefaultTransport.

Other Prometheus components may be affected as well; I've seen the same issue at least in blackbox_exporter.

Environment

  • System information:
Linux 4.15.0-30-generic x86_64
  • Prometheus version:
prometheus, version 2.8.0 (branch: HEAD, revision: 59369491cfdfe8dcb325723d6d28a837887a07b9)
  build user:       root@4c4d5c29b71f
  build date:       20190312-07:46:58
  go version:       go1.11.5
