
Envoy does not release healthcheck log fd on SIGUSR1 #8249

@perlun


Title: http_health_check log file descriptor is not being released on SIGUSR1

Description:
After rotating the logs and sending a SIGUSR1 to the envoy process (technically, to the hot restarter script), we still see healthcheck log entries being written to the old (rotated) file. This is confirmed by the list of files held open by the envoy process:

hibox@some-server:/var/log/envoy$ ps -fe | grep envoy
root      8679     1  0 Aug28 ?        00:00:01 python /usr/local/sbin/envoy-hot-restarter.py /usr/local/sbin/envoy-hot-restarter.sh
root      8743  8679  0 Aug28 ?        03:37:30 /usr/bin/envoy -c /etc/envoy/envoy.yaml --restart-epoch 1 --service-cluster hibox --service-node some-server --service-zone unknown
hibox    25054 27885  0 11:55 pts/0    00:00:00 grep --color=auto envoy
hibox@some-server:/var/log/envoy$ sudo ls -l /proc/8743/fd | grep envoy
lrwx------ 1 root root 64 Sep 16 11:10 12 -> /var/log/envoy/admin_access.log
lrwx------ 1 root root 64 Sep 16 11:10 7 -> /dev/shm/envoy_shared_memory_0
lrwx------ 1 root root 64 Sep 16 11:10 78 -> /var/log/envoy/healthchecks.log.1
lrwx------ 1 root root 64 Sep 16 11:10 80 -> /var/log/envoy/http_access.log
lrwx------ 1 root root 64 Sep 16 11:10 82 -> /var/log/envoy/metrics_access.log 

Repro steps:
envoy --version is e349fb6/1.11.1/clean-getenvoy-930d4a5/RELEASE/BoringSSL

The relevant part of the configuration looks roughly like this:

  clusters:
    - name: cluster_tomcat
      connect_timeout: 0.25s
      type: STATIC
      dns_lookup_family: V4_ONLY
      lb_policy: ROUND_ROBIN
      common_lb_config: {
        healthy_panic_threshold: {
          value: 10
        }
      }

      health_checks: [
        {
          timeout: 0.5s,
          interval: 5s,
          unhealthy_threshold: 1,
          healthy_threshold: 2,

          http_health_check: {
            path: /some-path/api
          },

          event_log_path: /var/log/envoy/healthchecks.log
        }
      ]

My logrotate configuration file looks like this:

/var/log/envoy/*.log {
	rotate 15
	weekly
	compress
	delaycompress	# Workaround for files being incorrectly removed
	missingok
	notifempty
	postrotate
		pkill -USR1 -f envoy-hot-restarter.py
	endscript
	sharedscripts
}
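As an aside: if the signal path turns out to be at fault, a common logrotate-side workaround (an assumption on my part, not a confirmed fix for this issue) is `copytruncate`, which copies and then truncates the live file in place. The process keeps writing to the same fd, so no signal is needed at all, at the cost of possibly losing a few lines written during the copy window. A sketch for the same file set:

```
/var/log/envoy/*.log {
	rotate 15
	weekly
	compress
	delaycompress
	missingok
	notifempty
	copytruncate	# truncate in place; fd stays valid, no SIGUSR1 required
}
```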

An interesting thing to note: I do get an error from logrotate when performing the log rotation:

error: error running shared postrotate script for '/var/log/envoy/*.log '

The syslog indicates that the SIGUSR1 was correctly received by Envoy, though:

Sep 16 11:43:40 some-server envoy-hot-restarter.py[8679]: [2019-09-16 11:43:40.005][8743][warning][main] [external/envoy/source/server/server.cc:473] caught SIGUSR1
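For context, the restarter's only role in this path is to relay the signal to the Envoy child. A minimal hypothetical sketch of that relay (illustrative only, not the actual envoy-hot-restarter.py code):

```python
import os
import signal

# Hypothetical sketch of the signal path: the wrapper catches SIGUSR1 and
# relays it to the managed Envoy child, which is then expected to reopen
# its access/event log files.

def forward_signal(child_pid: int, signum: int = signal.SIGUSR1) -> None:
    """Relay a received signal to the managed child process."""
    os.kill(child_pid, signum)

def install_forwarder(child_pid: int) -> None:
    """Register a SIGUSR1 handler that forwards the signal to the child.

    child_pid must be a different process; forwarding to our own pid
    would re-trigger the handler indefinitely.
    """
    signal.signal(
        signal.SIGUSR1,
        lambda signum, frame: forward_signal(child_pid, signum),
    )
```

Since Envoy itself logs `caught SIGUSR1` here, the relay appears to be working, which suggests the healthcheck event log is simply not part of Envoy's reopen-on-SIGUSR1 path.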

The timestamps here also correlate well with the time when the other *.log.1 files were created:

hibox@some-server:/var/log/envoy$ ls -l
total 124452
-rw-r--r-- 1 root root     38986 Sep 16 12:08 admin_access.log
-rw-r--r-- 1 root root     12352 Sep 16 11:43 admin_access.log.1
-rw-r--r-- 1 root root         0 Sep 16 11:27 healthchecks.log
-rw-r--r-- 1 root root     77279 Sep 14 21:00 healthchecks.log.1
-rw-r--r-- 1 root root    620140 Sep 16 12:08 http_access.log
-rw-r--r-- 1 root root    198509 Sep 16 11:43 http_access.log.1
-rw-r--r-- 1 root root         0 Sep 16 11:27 https_access.log
-rw-r--r-- 1 root root     60766 Mar 29 08:48 https_access.log.1
-rw-r--r-- 1 root root     38784 Sep 16 12:08 metrics_access.log
-rw-r--r-- 1 root root     12288 Sep 16 11:43 metrics_access.log.1

Any ideas? It feels superficially similar to #4060 but I don't think we ever concluded that there was a problem with Envoy itself back then.
