-
Notifications
You must be signed in to change notification settings - Fork 5.3k
Description
Title: http_health_check log file descriptor is not being released on SIGUSR1
Description:
After rotating the logs and sending a SIGUSR1 to the envoy process (technically, to the hot restarter script), we are still seeing healthcheck logs being written to the old (rotated) file. This is validated by looking at the list of open files by the envoy process:
hibox@some-server:/var/log/envoy$ ps -fe | grep envoy
root 8679 1 0 Aug28 ? 00:00:01 python /usr/local/sbin/envoy-hot-restarter.py /usr/local/sbin/envoy-hot-restarter.sh
root 8743 8679 0 Aug28 ? 03:37:30 /usr/bin/envoy -c /etc/envoy/envoy.yaml --restart-epoch 1 --service-cluster hibox --service-node some-server --service-zone unknown
hibox 25054 27885 0 11:55 pts/0 00:00:00 grep --color=auto envoy
hibox@some-server:/var/log/envoy$ sudo ls -l /proc/8743/fd | grep envoy
lrwx------ 1 root root 64 Sep 16 11:10 12 -> /var/log/envoy/admin_access.log
lrwx------ 1 root root 64 Sep 16 11:10 7 -> /dev/shm/envoy_shared_memory_0
lrwx------ 1 root root 64 Sep 16 11:10 78 -> /var/log/envoy/healthchecks.log.1
lrwx------ 1 root root 64 Sep 16 11:10 80 -> /var/log/envoy/http_access.log
lrwx------ 1 root root 64 Sep 16 11:10 82 -> /var/log/envoy/metrics_access.log
Repro steps:
envoy --version is e349fb6/1.11.1/clean-getenvoy-930d4a5/RELEASE/BoringSSL
The relevant part of the configuration looks roughly like this:
clusters:
- name: cluster_tomcat
connect_timeout: 0.25s
type: STATIC
dns_lookup_family: V4_ONLY
lb_policy: ROUND_ROBIN
common_lb_config: {
healthy_panic_threshold: {
value: 10
}
}
health_checks: [
{
timeout: 0.5s,
interval: 5s,
unhealthy_threshold: 1,
healthy_threshold: 2,
http_health_check: {
path: /some-path/api
},
event_log_path: /var/log/envoy/healthchecks.log
}
]My logrotate configuration file looks like this:
/var/log/envoy/*.log {
rotate 15
weekly
compress
delaycompress # Workaround for files being incorrectly removed
missingok
notifempty
postrotate
pkill -USR1 -f envoy-hot-restarter.py
endscript
sharedscripts
}An interesting thing to note: I do get an error from logrotate when performing the log rotation:
error: error running shared postrotate script for '/var/log/envoy/*.log '
The syslog indicates that the SIGUSR1 was correctly received by Envoy, though:
Sep 16 11:43:40 some-server envoy-hot-restarter.py[8679]: [2019-09-16 11:43:40.005][8743][warning][main] [external/envoy/source/server/server.cc:473] caught SIGUSR1
The timestamps here also correlate well with the time when the other *.log.1 files were created:
hibox@some-server:/var/log/envoy$ ls -l
total 124452
-rw-r--r-- 1 root root 38986 Sep 16 12:08 admin_access.log
-rw-r--r-- 1 root root 12352 Sep 16 11:43 admin_access.log.1
-rw-r--r-- 1 root root 0 Sep 16 11:27 healthchecks.log
-rw-r--r-- 1 root root 77279 Sep 14 21:00 healthchecks.log.1
-rw-r--r-- 1 root root 620140 Sep 16 12:08 http_access.log
-rw-r--r-- 1 root root 198509 Sep 16 11:43 http_access.log.1
-rw-r--r-- 1 root root 0 Sep 16 11:27 https_access.log
-rw-r--r-- 1 root root 60766 Mar 29 08:48 https_access.log.1
-rw-r--r-- 1 root root 38784 Sep 16 12:08 metrics_access.log
-rw-r--r-- 1 root root 12288 Sep 16 11:43 metrics_access.log.1
Any ideas? It feels superficially similar to #4060 but I don't think we ever concluded that there was a problem with Envoy itself back then.