
Envoy does not release healthcheck log fd on SIGUSR1 #8249

@perlun


Title: http_health_check log file descriptor is not being released on SIGUSR1

Description:
After rotating the logs and sending a SIGUSR1 to the envoy process (technically, to the hot restarter script), we still see healthcheck log entries being written to the old (rotated) file. This is confirmed by the list of files held open by the envoy process:

hibox@some-server:/var/log/envoy$ ps -fe | grep envoy
root      8679     1  0 Aug28 ?        00:00:01 python /usr/local/sbin/envoy-hot-restarter.py /usr/local/sbin/envoy-hot-restarter.sh
root      8743  8679  0 Aug28 ?        03:37:30 /usr/bin/envoy -c /etc/envoy/envoy.yaml --restart-epoch 1 --service-cluster hibox --service-node some-server --service-zone unknown
hibox    25054 27885  0 11:55 pts/0    00:00:00 grep --color=auto envoy
hibox@some-server:/var/log/envoy$ sudo ls -l /proc/8743/fd | grep envoy
lrwx------ 1 root root 64 Sep 16 11:10 12 -> /var/log/envoy/admin_access.log
lrwx------ 1 root root 64 Sep 16 11:10 7 -> /dev/shm/envoy_shared_memory_0
lrwx------ 1 root root 64 Sep 16 11:10 78 -> /var/log/envoy/healthchecks.log.1
lrwx------ 1 root root 64 Sep 16 11:10 80 -> /var/log/envoy/http_access.log
lrwx------ 1 root root 64 Sep 16 11:10 82 -> /var/log/envoy/metrics_access.log 

Repro steps:
envoy --version is e349fb6/1.11.1/clean-getenvoy-930d4a5/RELEASE/BoringSSL

The relevant part of the configuration looks roughly like this:

  clusters:
    - name: cluster_tomcat
      connect_timeout: 0.25s
      type: STATIC
      dns_lookup_family: V4_ONLY
      lb_policy: ROUND_ROBIN
      common_lb_config: {
        healthy_panic_threshold: {
          value: 10
        }
      }

      health_checks: [
        {
          timeout: 0.5s,
          interval: 5s,
          unhealthy_threshold: 1,
          healthy_threshold: 2,

          http_health_check: {
            path: /some-path/api
          },

          event_log_path: /var/log/envoy/healthchecks.log
        }
      ]

My logrotate configuration file looks like this:

/var/log/envoy/*.log {
	rotate 15
	weekly
	compress
	delaycompress	# Workaround for files being incorrectly removed
	missingok
	notifempty
	postrotate
		pkill -USR1 -f envoy-hot-restarter.py
	endscript
	sharedscripts
}
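As an aside: if the signal path turns out to be at fault, a common logrotate-side workaround (an assumption on my part, not a confirmed fix for this issue) is `copytruncate`, which copies and then truncates the live file in place. The process keeps writing to the same fd, so no signal is needed at all, at the cost of possibly losing a few lines written during the copy window. A sketch for the same file set:

```
/var/log/envoy/*.log {
	rotate 15
	weekly
	compress
	delaycompress
	missingok
	notifempty
	copytruncate	# truncate in place; fd stays valid, no SIGUSR1 required
}
```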

An interesting thing to note: I do get an error from logrotate when performing the log rotation:

error: error running shared postrotate script for '/var/log/envoy/*.log '

The syslog indicates that the SIGUSR1 was correctly received by Envoy, though:

Sep 16 11:43:40 some-server envoy-hot-restarter.py[8679]: [2019-09-16 11:43:40.005][8743][warning][main] [external/envoy/source/server/server.cc:473] caught SIGUSR1
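For context, the restarter's only role in this path is to relay the signal to the Envoy child. A minimal hypothetical sketch of that relay (illustrative only, not the actual envoy-hot-restarter.py code):

```python
import os
import signal

# Hypothetical sketch of the signal path: the wrapper catches SIGUSR1 and
# relays it to the managed Envoy child, which is then expected to reopen
# its access/event log files.

def forward_signal(child_pid: int, signum: int = signal.SIGUSR1) -> None:
    """Relay a received signal to the managed child process."""
    os.kill(child_pid, signum)

def install_forwarder(child_pid: int) -> None:
    """Register a SIGUSR1 handler that forwards the signal to the child.

    child_pid must be a different process; forwarding to our own pid
    would re-trigger the handler indefinitely.
    """
    signal.signal(
        signal.SIGUSR1,
        lambda signum, frame: forward_signal(child_pid, signum),
    )
```

Since Envoy itself logs `caught SIGUSR1` here, the relay appears to be working, which suggests the healthcheck event log is simply not part of Envoy's reopen-on-SIGUSR1 path.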

The timestamps here also correlate well with the time when the other *.log.1 files were created:

hibox@some-server:/var/log/envoy$ ls -l
total 124452
-rw-r--r-- 1 root root     38986 Sep 16 12:08 admin_access.log
-rw-r--r-- 1 root root     12352 Sep 16 11:43 admin_access.log.1
-rw-r--r-- 1 root root         0 Sep 16 11:27 healthchecks.log
-rw-r--r-- 1 root root     77279 Sep 14 21:00 healthchecks.log.1
-rw-r--r-- 1 root root    620140 Sep 16 12:08 http_access.log
-rw-r--r-- 1 root root    198509 Sep 16 11:43 http_access.log.1
-rw-r--r-- 1 root root         0 Sep 16 11:27 https_access.log
-rw-r--r-- 1 root root     60766 Mar 29 08:48 https_access.log.1
-rw-r--r-- 1 root root     38784 Sep 16 12:08 metrics_access.log
-rw-r--r-- 1 root root     12288 Sep 16 11:43 metrics_access.log.1

Any ideas? It feels superficially similar to #4060 but I don't think we ever concluded that there was a problem with Envoy itself back then.
