dockerd eating too much memory and keeps growing #45842
Description
We have deployed Kubernetes (RKE) on a single node along with some applications. At times the memory usage of the dockerd process keeps growing, and its CPU usage is also very high. Eventually it consumes all of the memory on the machine and we have to reboot.
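For reference, this is roughly how we watch the growth (an illustrative sketch, not the exact commands we ran; it just samples dockerd RSS and CPU once a minute on a Linux host):

```sh
# Sample dockerd resident memory (RSS, in KiB) and CPU usage once a minute.
# Assumes a Linux host where the daemon process is named "dockerd".
while true; do
    ps -o rss=,pcpu= -C dockerd | awk -v ts="$(date -Is)" '{print ts, "rss_kib=" $1, "cpu_pct=" $2}'
    sleep 60
done
```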
Reproduce
I don't think this can be reproduced easily; it is probably related to the applications. Roughly, the steps would be: deploy an RKE Kubernetes cluster, run Grafana's Loki stack, and run some applications that produce a large volume of logs.
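To approximate just the log pressure without the full RKE + Loki stack, an illustrative (hypothetical, untested here) way to push a high line rate through the json-file driver would be:

```sh
# Hypothetical load generator: one container emitting roughly 10k log lines per
# second through the default json-file logging driver (name and rate are illustrative).
docker run -d --name log-flood busybox sh -c \
  'while true; do seq 1 10000 | sed "s/^/log line /"; sleep 1; done'
```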
Expected behavior
No response
docker version
Client: Docker Engine - Community
Version: 24.0.2
API version: 1.43
Go version: go1.20.4
Git commit: cb74dfc
Built: Thu May 25 21:52:22 2023
OS/Arch: linux/amd64
Context: default
Server: Docker Engine - Community
Engine:
Version: 24.0.2
API version: 1.43 (minimum version 1.12)
Go version: go1.20.4
Git commit: 659604f
Built: Thu May 25 21:52:22 2023
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.21
GitCommit: 3dce8eb055cbb6872793272b4f20ed16117344f8
runc:
Version: 1.1.7
GitCommit: v1.1.7-0-g860f061
docker-init:
Version: 0.19.0
GitCommit: de40ad0

docker info
Client: Docker Engine - Community
Version: 24.0.2
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.10.5
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.18.1
Path: /usr/libexec/docker/cli-plugins/docker-compose
Server:
Containers: 486
Running: 302
Paused: 0
Stopped: 184
Images: 264
Server Version: 24.0.2
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 nvidia runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 3dce8eb055cbb6872793272b4f20ed16117344f8
runc version: v1.1.7-0-g860f061
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: builtin
Kernel Version: 5.15.0-73-generic
Operating System: Ubuntu 20.04.6 LTS
OSType: linux
Architecture: x86_64
CPUs: 48
Total Memory: 502.6GiB
Name: server
ID: R3MZ:NUMU:RKFE:VSG2:V5PB:VOFO:UB3U:7NUS:F45X:73TR:RAXZ:KUSY
Docker Root Dir: /var/lib/docker
Debug Mode: true
File Descriptors: 5270
Goroutines: 2397
System Time: 2023-06-29T10:56:13.096261094+08:00
EventsListeners: 0
Experimental: false
Insecure Registries:
127.0.0.0/8
Registry Mirrors:
https://hub-mirror.c.163.com/
Live Restore Enabled: true

Additional Info
I've run go pprof to capture CPU and memory profiles while dockerd memory was growing fast; the two flame-graph images below can be clicked to view at full size.
The original pprof files are also attached.
dockerd_prof.zip
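For reference, the profiles were collected from the daemon's Go pprof endpoints. With "Debug Mode: true" the daemon serves the standard net/http/pprof handlers on its API socket, so something along these lines should reproduce the capture (socket path and duration are assumptions):

```sh
# Grab a heap snapshot and a 30-second CPU profile from the debug-enabled daemon.
curl --unix-socket /var/run/docker.sock -o dockerd_heap.pprof \
  http://localhost/debug/pprof/heap
curl --unix-socket /var/run/docker.sock -o dockerd_cpu.pprof \
  "http://localhost/debug/pprof/profile?seconds=30"

# Inspect locally with the Go tooling.
go tool pprof -top dockerd_heap.pprof
```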
The applications running on this Kubernetes cluster are one that creates many Argo Workflows for a kind of experiment, plus a log scraper (Grafana's Loki + Promtail). The log volume is quite high (sometimes more than 10k lines/s); I don't know whether that could be responsible for this.
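Not a fix for the growth itself, but given the log rate we are considering capping json-file logs with rotation. A sketch of the change (sizes are placeholders, and it would need to be merged with the existing /etc/docker/daemon.json):

```sh
# Sketch: json-file log rotation (values are illustrative; merge with existing
# daemon.json keys such as "debug", "live-restore", "registry-mirrors").
cat <<'EOF' > daemon.json.example
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "3"
  }
}
EOF
# After merging into /etc/docker/daemon.json, restart the daemon.
# Rotation settings only apply to containers created afterwards.
sudo systemctl restart docker
```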