Memory leak #49075

@hivenet-mathieu-lacage

Description

Hi,

About two days ago, I upgraded our production servers to the latest Docker version (27.4.0). Since then, the Docker daemon on every server has been eating memory like candy and getting killed by the OOM killer every couple of hours. The result is not pretty.

I collected pprof output from the running Docker daemon. The profiles all look the same across servers. Here is a typical example:

mathieu@Host-001:~/$ go tool pprof -top ./heap-dump-de1-2-2024-12-12-10-56 
File: dockerd
Build ID: 64214f994238df60afd9aa8c14b0322a8fc3412a
Type: inuse_space
Time: Dec 12, 2024 at 10:56am (CET)
Showing nodes accounting for 3326.65MB, 99.71% of 3336.46MB total
Dropped 142 nodes (cum <= 16.68MB)
      flat  flat%   sum%        cum   cum%
 2440.59MB 73.15% 73.15%  2440.59MB 73.15%  github.com/moby/buildkit/util/tracing/detect.(*TraceRecorder).ExportSpans
  502.54MB 15.06% 88.21%   502.54MB 15.06%  go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).interfaceArrayToEventArray (inline)
  383.52MB 11.49% 99.71%   383.52MB 11.49%  go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).SetAttributes
         0     0% 99.71%  2942.62MB 88.20%  github.com/containerd/containerd/tracing.(*Span).End (inline)
         0     0% 99.71%   383.52MB 11.49%  github.com/containerd/containerd/tracing.(*Span).SetAttributes (inline)
         0     0% 99.71%  3326.15MB 99.69%  github.com/docker/docker/daemon/logger/loggerutils.(*LogFile).readLogsLocked
         0     0% 99.71%  3326.15MB 99.69%  github.com/docker/docker/daemon/logger/loggerutils.(*follow).Do
         0     0% 99.71%  3326.15MB 99.69%  github.com/docker/docker/daemon/logger/loggerutils.(*follow).forward
         0     0% 99.71%  3326.15MB 99.69%  github.com/docker/docker/daemon/logger/loggerutils.(*forwarder).Do
         0     0% 99.71%  3326.15MB 99.69%  github.com/docker/docker/daemon/logger/loggerutils.(*forwarder).Do.func1
         0     0% 99.71%  2943.12MB 88.21%  go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).End
         0     0% 99.71%   502.54MB 15.06%  go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).snapshot
         0     0% 99.71%  2440.59MB 73.15%  go.opentelemetry.io/otel/sdk/trace.(*simpleSpanProcessor).OnEnd

The above points to OpenTelemetry (otel) tracing as the likely source of the memory leak. So far, I have been unable to find a way to disable this tracing and revert to a sane(r) setup in production.

Reproduce

Unfortunately, I am not able to share the list of containers and images needed to reproduce this. I would be happy to collect more data if needed.

Expected behavior

Dockerd uses much less memory.

docker info

Client: Docker Engine - Community
 Version:    27.4.0
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.19.2
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.31.0
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 24
  Running: 6
  Paused: 0
  Stopped: 18
 Images: 8
 Server Version: 27.4.0
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 88bf19b2105c8b17560993bee28a01ddc2f97182
 runc version: v1.2.2-0-g7cb3632
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.1.0-18-cloud-amd64
 Operating System: Debian GNU/Linux 12 (bookworm)
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 6.625GiB
 Name: hive-prod-node-de1-2
 ID: c7645856-eac3-4fdb-b1de-f1bf7ff5b355
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Additional Info

root@prod-node-de2-1:/home/debian# cat /etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-file": "5",
    "max-size": "10m"
  }
}
