
kubelet incorrectly configures kubepods.slice systemd unit when kube-reserved and/or system-reserved is used #88197

@olegch

Description

What happened:

Kubelet is started on CentOS 7 (systemd) with the following switches (among others): --cgroup-driver=systemd, --enforce-node-allocatable=pods, --kube-reserved=cpu=100m,ephemeral-storage=2Gi,memory=1.5Gi

With these settings kubelet correctly creates the kubepods.slice cgroup with a memory limit set to the total RAM on the node (~4Gi in this example) minus kube-reserved (1.5Gi in this example):

# free -b
              total        used        free      shared  buff/cache   available
Mem:     3973447680   167432192  3489378304      528384   316637184  3584466944

# cat /sys/fs/cgroup/memory/kubepods.slice/memory.limit_in_bytes
2362834944
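The cgroup value is simply the node's total RAM minus the kube-reserved memory. A quick arithmetic sketch (values taken from this node; 1.5Gi = 1610612736 bytes):

```shell
# Expected kubepods memory limit = total RAM - kube-reserved memory
# (values from the node above; kube-reserved memory is 1.5Gi)
total=3973447680                                # from `free -b`
kube_reserved=$((3 * 1024 * 1024 * 1024 / 2))   # 1.5Gi in bytes = 1610612736
expected=$((total - kube_reserved))
echo "$expected"                                # prints 2362834944
```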

At the same time, the corresponding kubepods.slice systemd unit created by kubelet has its MemoryLimit property incorrectly set to the total RAM of the node (without subtracting the kube-reserved amount):

# systemctl show kubepods.slice
Slice=-.slice
ControlGroup=/kubepods.slice
...
MemoryLimit=3973447680
...

# systemctl cat kubepods.slice
...
# /run/systemd/system/kubepods.slice.d/50-MemoryLimit.conf
[Slice]
MemoryLimit=3973447680
...

This leads to systemd sometimes resetting the limit in the /sys/fs/cgroup/memory/kubepods.slice/memory.limit_in_bytes cgroup file back to the total RAM on the node, without accounting for the kube-reserved (and/or system-reserved) amount.
This in turn can lead to pods using more memory than expected, OOM kill events, and generally unstable node behavior.

What you expected to happen:

The MemoryLimit property of the kubepods.slice systemd unit should be set to the same value as /sys/fs/cgroup/memory/kubepods.slice/memory.limit_in_bytes, taking into account the kube-reserved and system-reserved amounts.

When kubelet is restarted with different kube-reserved and/or system-reserved values, both /sys/fs/cgroup/memory/kubepods.slice/memory.limit_in_bytes and the MemoryLimit property of the kubepods.slice systemd unit should be updated to the same value, even if the cgroup and the systemd unit already exist.
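Until this is fixed, a manual stopgap (a sketch only, using this node's numbers; this is not something kubelet does itself) is to overwrite the drop-in that kubelet generated, shown above, so that systemd and the cgroup filesystem agree, and then run systemctl daemon-reload:

```
# /run/systemd/system/kubepods.slice.d/50-MemoryLimit.conf
[Slice]
MemoryLimit=2362834944
```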

How to reproduce it (as minimally and precisely as possible):

Run kubelet on any systemd-based Linux with the switches specified above and observe the discrepancy between the memory limits defined in /sys/fs/cgroup/memory/kubepods.slice/memory.limit_in_bytes and the MemoryLimit property of the kubepods.slice systemd unit.
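To make the discrepancy easy to spot, a small check function (a hypothetical helper, assuming cgroup v1 paths as on the node above) can compare the two values:

```shell
#!/bin/sh
# compare_limits: report whether systemd's MemoryLimit for kubepods.slice
# matches the value in the cgroup filesystem (hypothetical helper name).
compare_limits() {
    systemd_limit=$1  # e.g. systemctl show kubepods.slice -p MemoryLimit | cut -d= -f2
    cgroup_limit=$2   # e.g. cat /sys/fs/cgroup/memory/kubepods.slice/memory.limit_in_bytes
    if [ "$systemd_limit" -eq "$cgroup_limit" ]; then
        echo "OK: limits agree ($cgroup_limit)"
    else
        echo "MISMATCH: systemd=$systemd_limit cgroupfs=$cgroup_limit"
    fi
}

# Values reported in this issue:
compare_limits 3973447680 2362834944   # prints MISMATCH: systemd=3973447680 cgroupfs=2362834944
```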

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
    1.16.0, but also applies to earlier and later versions
  • Cloud provider or hardware configuration:
    VirtualBox VM, but applies to any infrastructure
  • OS (e.g: cat /etc/os-release):
    CentOS 7, but applies to any systemd-based Linux flavor
  • Kernel (e.g. uname -a):
    3.10.0-1062.9.1.el7.x86_64 #1 SMP Fri Dec 6 15:49:49 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux, but applies to any other kernel
  • Install tools:
    Kublr, but applies to any other tool
  • Network plugin and version (if this is a network-related bug): N/A
  • Others: N/A

Labels

  • kind/bug: Categorizes issue or PR as related to a bug.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
  • sig/node: Categorizes an issue or PR as relevant to SIG Node.
