Description
Hi,
About 2 days ago, I upgraded our production servers to the latest docker version (27.4.0). Since then, the docker daemon of all servers is eating memory like candy and getting killed by the oom killer once every couple hours. The result is not pretty.
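To quantify the growth, I sampled the daemon's resident set size from /proc. A minimal sketch of what I ran (Linux only; the pid is a placeholder you would take from `pidof dockerd` — below it samples its own process just to show the mechanism):

```python
import os
import time

def rss_kb(pid: int) -> int:
    """Resident set size of `pid` in kB, read from /proc/<pid>/status."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # second field is the value in kB
    raise RuntimeError(f"no VmRSS line for pid {pid}")

if __name__ == "__main__":
    # Placeholder: substitute the real daemon pid, e.g. from `pidof dockerd`.
    pid = os.getpid()
    for _ in range(3):
        print(f"pid {pid}: VmRSS {rss_kb(pid)} kB")
        time.sleep(1)
```

Pointing this at dockerd showed RSS climbing steadily between OOM kills.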
I collected pprof output from the running docker daemon on several servers. The profiles all look the same across servers. Here is a typical example:
mathieu@Host-001:~/$ go tool pprof -top ./heap-dump-de1-2-2024-12-12-10-56
File: dockerd
Build ID: 64214f994238df60afd9aa8c14b0322a8fc3412a
Type: inuse_space
Time: Dec 12, 2024 at 10:56am (CET)
Showing nodes accounting for 3326.65MB, 99.71% of 3336.46MB total
Dropped 142 nodes (cum <= 16.68MB)
flat flat% sum% cum cum%
2440.59MB 73.15% 73.15% 2440.59MB 73.15% github.com/moby/buildkit/util/tracing/detect.(*TraceRecorder).ExportSpans
502.54MB 15.06% 88.21% 502.54MB 15.06% go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).interfaceArrayToEventArray (inline)
383.52MB 11.49% 99.71% 383.52MB 11.49% go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).SetAttributes
0 0% 99.71% 2942.62MB 88.20% github.com/containerd/containerd/tracing.(*Span).End (inline)
0 0% 99.71% 383.52MB 11.49% github.com/containerd/containerd/tracing.(*Span).SetAttributes (inline)
0 0% 99.71% 3326.15MB 99.69% github.com/docker/docker/daemon/logger/loggerutils.(*LogFile).readLogsLocked
0 0% 99.71% 3326.15MB 99.69% github.com/docker/docker/daemon/logger/loggerutils.(*follow).Do
0 0% 99.71% 3326.15MB 99.69% github.com/docker/docker/daemon/logger/loggerutils.(*follow).forward
0 0% 99.71% 3326.15MB 99.69% github.com/docker/docker/daemon/logger/loggerutils.(*forwarder).Do
0 0% 99.71% 3326.15MB 99.69% github.com/docker/docker/daemon/logger/loggerutils.(*forwarder).Do.func1
0 0% 99.71% 2943.12MB 88.21% go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).End
0 0% 99.71% 502.54MB 15.06% go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).snapshot
0 0% 99.71% 2440.59MB 73.15% go.opentelemetry.io/otel/sdk/trace.(*simpleSpanProcessor).OnEnd
The above points to OpenTelemetry tracing as the likely source of the memory leak. So far, I have been unable to find a way to disable this tracing and revert to a sane(r) setup in production.
Reproduce
Unfortunately, I am not able to share the list of containers and images needed to reproduce this. I would be happy to collect more data if needed.
Expected behavior
Dockerd uses much less memory.
docker version
Client: Docker Engine - Community
Version: 27.4.0
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.19.2
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.31.0
Path: /usr/libexec/docker/cli-plugins/docker-compose
Server:
Containers: 24
Running: 6
Paused: 0
Stopped: 18
Images: 8
Server Version: 27.4.0
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 88bf19b2105c8b17560993bee28a01ddc2f97182
runc version: v1.2.2-0-g7cb3632
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: builtin
cgroupns
Kernel Version: 6.1.0-18-cloud-amd64
Operating System: Debian GNU/Linux 12 (bookworm)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 6.625GiB
Name: hive-prod-node-de1-2
ID: c7645856-eac3-4fdb-b1de-f1bf7ff5b355
Docker Root Dir: /var/lib/docker
Debug Mode: false
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Additional Info
root@prod-node-de2-1:/home/debian# cat /etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-file": "5",
    "max-size": "10m"
  }
}
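For reference, collecting heap dumps like the one above requires the daemon's debug mode, which (as far as I understand) exposes Go's pprof endpoints on the API socket. A sketch of the same daemon.json with debug enabled:

```json
{
  "debug": true,
  "log-driver": "json-file",
  "log-opts": {
    "max-file": "5",
    "max-size": "10m"
  }
}
```

After reloading the daemon, something along the lines of `curl --unix-socket /var/run/docker.sock http://localhost/debug/pprof/heap > heap-dump` should produce a profile readable with `go tool pprof`, as shown at the top of this report.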