Description
After upgrading from Docker (CE) 17.05.0 to 17.12.0, these kernel messages started showing up on my machines:
[43073.004575] SLUB: Unable to allocate memory on node -1 (gfp=0x8020)
[43073.022211] cache: ip6_dst_cache(0:18479a49f76b66461d8fedd17a6eba407f08cc8960bd0246d39df8eb21f20b4f), object size: 448, buffer size: 448, default order: 2, min order: 0
[43073.061420] node 0: slabs: 74, objs: 2610, free: 0
[43073.077875] node 1: slabs: 88, objs: 3141, free: 0
I was able to reproduce this on Docker 17.06.0, and eventually traced it to the change introduced by #1350, which enables kmem accounting for all containers. But as Docker helpfully points out, kmemcg is experimental before Linux 4.0:
$ sudo docker update --kernel-memory 1000g 18479a49f76b
You specified a kernel memory limit on a kernel older than 4.0. Kernel memory limits are experimental on older kernels, it won't work as expected and can cause your system to be unstable.
After a few more tests I was able to reproduce the issue on 17.05.0 by passing --kernel-memory 1000g to docker run. The kernel log slowly fills up (roughly 10 messages per hour) with SLUB warnings, and containers are noticeably less stable than normal (i.e. they crash).
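For reference, the 17.05.0 repro looked roughly like the sketch below. The image name is a placeholder (any workload that builds up a lot of page cache should do); the key part is that setting --kernel-memory switches on kmem accounting for the container's memcg:

```shell
# Start a container with an explicit kernel-memory limit; on 17.05.0
# this is what turns on kmem accounting (17.06.0+ does it unconditionally).
docker run -d --name kmem-repro \
    --kernel-memory 1000g --memory 40g \
    cassandra:3.11   # placeholder image; any disk-heavy workload works

# Then watch the kernel log for the SLUB allocation failures.
dmesg --follow | grep -E 'SLUB|ip6_dst_cache'
```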
Steps to reproduce
- CentOS 7, latest kernel (3.10.0-693.17.1.el7.x86_64)
- Docker 17.06.0+ or runc with equivalent settings
- kmem limit unset, mem limit 40G, with 80G free memory left for page caches
- An application that very heavily uses the local disk to cause caches to build up (in my case Apache Cassandra)
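To double-check that kmem accounting actually got enabled for a container even with no explicit limit set, the cgroup v1 memcg files can be inspected directly. This is a sketch assuming the default cgroupfs layout on CentOS 7, using the container ID from the log above:

```shell
CID=18479a49f76b   # container ID from the kernel log above

# runc enables kmem accounting by touching the kmem limit file; after
# #1350 this happens even when no --kernel-memory limit is requested.
cat /sys/fs/cgroup/memory/docker/$CID*/memory.kmem.limit_in_bytes

# A nonzero value here means kernel-memory accounting is active for
# the container.
cat /sys/fs/cgroup/memory/docker/$CID*/memory.kmem.usage_in_bytes
```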
Eventually, these messages will start popping up in the kernel logs, and in rare cases it leads to an application being killed or crashing.
With all of the above said, I'm 99% sure this is a kernel bug related to running an ancient kernel, and a runc patch would at best be a workaround. If only I had a way to get Red Hat's attention so they can fix it 😃