
Prometheus can leak goroutines when scraping targets over unreliable connection #5394

@BenoitKnecht


Bug Report

I noticed that my Prometheus server is leaking goroutines, and that the leak correlates with the number of established TCP connections:

[Graph: goroutine count rising in step with established TCP connections]

The TCP connections in question all go to the same host, which is on the other end of a pretty unreliable VPN tunnel.

This is the top of the goroutine profile for this Prometheus server:

goroutine profile: total 6062
1638 @ 0x42e0cb 0x42e173 0x405a8e 0x40577b 0x6c7bb7 0x6c948d 0x6d374e 0x45c0c1
#       0x6c7bb6        net/http.(*persistConn).addTLS+0x1a6            /usr/local/go/src/net/http/transport.go:1177
#       0x6c948c        net/http.(*Transport).dialConn+0x15ac           /usr/local/go/src/net/http/transport.go:1250
#       0x6d374d        net/http.(*Transport).getConn.func4+0x6d        /usr/local/go/src/net/http/transport.go:999

1637 @ 0x42e0cb 0x4293f9 0x428aa6 0x4963fa 0x49650d 0x497259 0x534b1f 0x548298 0x845bba 0x60a2e9 0x60a7fd 0x60ce0a 0x60fa6a 0x60f757 0x60e481 0x6d3852 0x45c0c1
#       0x428aa5        internal/poll.runtime_pollWait+0x65                     /usr/local/go/src/runtime/netpoll.go:173
#       0x4963f9        internal/poll.(*pollDesc).wait+0x99                     /usr/local/go/src/internal/poll/fd_poll_runtime.go:85
#       0x49650c        internal/poll.(*pollDesc).waitRead+0x3c                 /usr/local/go/src/internal/poll/fd_poll_runtime.go:90
#       0x497258        internal/poll.(*FD).Read+0x178                          /usr/local/go/src/internal/poll/fd_unix.go:169
#       0x534b1e        net.(*netFD).Read+0x4e                                  /usr/local/go/src/net/fd_unix.go:202
#       0x548297        net.(*conn).Read+0x67                                   /usr/local/go/src/net/net.go:177
#       0x60a2e8        crypto/tls.(*block).readFromUntil+0x88                  /usr/local/go/src/crypto/tls/conn.go:492
#       0x60a7fc        crypto/tls.(*Conn).readRecord+0xdc                      /usr/local/go/src/crypto/tls/conn.go:593
#       0x60ce09        crypto/tls.(*Conn).readHandshake+0x99                   /usr/local/go/src/crypto/tls/conn.go:955
#       0x60fa69        crypto/tls.(*clientHandshakeState).handshake+0xa9       /usr/local/go/src/crypto/tls/handshake_client.go:191
#       0x60f756        crypto/tls.(*Conn).clientHandshake+0x396                /usr/local/go/src/crypto/tls/handshake_client.go:168
#       0x60e480        crypto/tls.(*Conn).Handshake+0xf0                       /usr/local/go/src/crypto/tls/conn.go:1272
#       0x6d3851        net/http.(*persistConn).addTLS.func2+0x41               /usr/local/go/src/net/http/transport.go:1171

1636 @ 0x42e0cb 0x42e173 0x405a8e 0x40577b 0x6d354a 0x45c0c1
#       0x6d3549        net/http.(*Transport).getConn.func2.1+0x49      /usr/local/go/src/net/http/transport.go:965

The relevant part of net/http.(*persistConn).addTLS() is:

var timer *time.Timer // for canceling TLS handshake
if d := pconn.t.TLSHandshakeTimeout; d != 0 {
        timer = time.AfterFunc(d, func() {
                errc <- tlsHandshakeTimeoutError{}
        })
}
go func() {
        if trace != nil && trace.TLSHandshakeStart != nil {
                trace.TLSHandshakeStart()
        }
        err := tlsConn.Handshake()
        if timer != nil {
                timer.Stop()
        }
        errc <- err
}()
if err := <-errc; err != nil {
        plainConn.Close()
        if trace != nil && trace.TLSHandshakeDone != nil {
                trace.TLSHandshakeDone(tls.ConnectionState{}, err)
        }
        return err
}

As you can see, if TLSHandshakeTimeout is zero, no timer is armed and the handshake goroutine can hang forever. And in Prometheus, the HTTP client's transport comes from github.com/prometheus/common/config.NewRoundTripperFromConfig(), where TLSHandshakeTimeout isn't explicitly set (so it defaults to zero):

var rt http.RoundTripper = &http.Transport{
        Proxy:               http.ProxyURL(cfg.ProxyURL.URL),
        MaxIdleConns:        20000,
        MaxIdleConnsPerHost: 1000, // see https://github.com/golang/go/issues/13801
        DisableKeepAlives:   false,
        TLSClientConfig:     tlsConfig,
        DisableCompression:  true,
        // 5 minutes is typically above the maximum sane scrape interval. So we can
        // use keepalive for all configurations. 
        IdleConnTimeout: 5 * time.Minute,
        DialContext: conntrack.NewDialContextFunc(
                conntrack.DialWithTracing(),
                conntrack.DialWithName(name),
        ),
}

I tried setting TLSHandshakeTimeout to 10 seconds, and ran the patched and unpatched versions of Prometheus with the same configuration.

Here are the goroutine and TCP session graphs for the patched version:

[Graph: goroutines and TCP sessions, patched (TLSHandshakeTimeout = 10s)]

and the unpatched version:

[Graph: goroutines and TCP sessions, unpatched]

And here is the top of the goroutine profile for the patched version:

goroutine profile: total 1227
566 @ 0x42e0cb 0x43e0f6 0x17b0172 0x45c0c1
#	0x17b0171	github.com/prometheus/prometheus/scrape.(*scrapeLoop).run+0x821	/home/knecht/dev/git/prometheus/prometheus/scrape/scrape.go:886

219 @ 0x42e0cb 0x43e0f6 0x6cc2d3 0x45c0c1
#	0x6cc2d2	net/http.(*persistConn).writeLoop+0x112	/usr/lib/go-1.11/src/net/http/transport.go:1885

143 @ 0x42e0cb 0x4293f9 0x428aa6 0x4963fa 0x49650d 0x497259 0x534b1f 0x548298 0x845bba 0x6ca1c5 0x5b4dcf 0x5b4f2f 0x6cad12 0x45c0c1
#	0x428aa5	internal/poll.runtime_pollWait+0x65	/usr/lib/go-1.11/src/runtime/netpoll.go:173
#	0x4963f9	internal/poll.(*pollDesc).wait+0x99	/usr/lib/go-1.11/src/internal/poll/fd_poll_runtime.go:85
#	0x49650c	internal/poll.(*pollDesc).waitRead+0x3c	/usr/lib/go-1.11/src/internal/poll/fd_poll_runtime.go:90
#	0x497258	internal/poll.(*FD).Read+0x178		/usr/lib/go-1.11/src/internal/poll/fd_unix.go:169
#	0x534b1e	net.(*netFD).Read+0x4e			/usr/lib/go-1.11/src/net/fd_unix.go:202
#	0x548297	net.(*conn).Read+0x67			/usr/lib/go-1.11/src/net/net.go:177
#	0x6ca1c4	net/http.(*persistConn).Read+0x74	/usr/lib/go-1.11/src/net/http/transport.go:1497
#	0x5b4dce	bufio.(*Reader).fill+0x10e		/usr/lib/go-1.11/src/bufio/bufio.go:100
#	0x5b4f2e	bufio.(*Reader).Peek+0x3e		/usr/lib/go-1.11/src/bufio/bufio.go:132
#	0x6cad11	net/http.(*persistConn).readLoop+0x1a1	/usr/lib/go-1.11/src/net/http/transport.go:1645

I'm going to submit a PR for github.com/prometheus/common to set TLSHandshakeTimeout to the same value as net/http.DefaultTransport.

Other Prometheus components may be affected as well; I've seen the same issue at least in blackbox_exporter.

Environment

  • System information:
Linux 4.15.0-30-generic x86_64
  • Prometheus version:
prometheus, version 2.8.0 (branch: HEAD, revision: 59369491cfdfe8dcb325723d6d28a837887a07b9)
  build user:       root@4c4d5c29b71f
  build date:       20190312-07:46:58
  go version:       go1.11.5
