Big CPU consumption following deletion of thousands of identities #8360

@joulaud

Bug report

General Information

  • Cilium version
    Client: 1.5.1 6db814405 2019-05-16T12:54:32-07:00 go version go1.12.5 linux/amd64
    Daemon: 1.5.1 6db814405 2019-05-16T12:54:32-07:00 go version go1.12.5 linux/amd64
    
  • Kernel version (Linux ip-10-71-33-200 4.19.0-0.bpo.2-amd64 #1 SMP Debian 4.19.16-1~bpo9+1 (2019-02-07) x86_64 x86_64 x86_64 GNU/Linux)
  • Orchestration system version in use : kubectl
    Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.7", GitCommit:"6f482974b76db3f1e0f5d24605a9d1d38fad9a2b", GitTreeState:"clean", BuildDate:"2019-03-25T02:52:13Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.8", GitCommit:"4e209c9383fa00631d124c8adcc011d617339b3c", GitTreeState:"clean", BuildDate:"2019-02-28T18:40:05Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
    
  • sysdump transmitted to @aanm via slack

How to reproduce the issue

  1. Generate thousands of unused identities (e.g. by repeatedly creating and deleting Kubernetes pods with different labels each time) on a cluster running without cilium-operator.
  2. Add cilium-operator, which starts deleting all unused identities.
  3. Observe the CPU consumption of many cilium-agents spike. In addition, it takes several tens of minutes before new identities are taken into account and the corresponding pods can actually be reached.

Figure: CPU consumption

We see here that all agents show a large spike in CPU consumption, but depending on the agent it either subsides quickly or lasts much longer. In the end we killed the agents to return to a normal state more quickly, and that worked. While the high CPU consumption lasted, the affected nodes dropped egress traffic to new identities (because regeneration had not completed).

Figure: endpoint state and regeneration time

One suggestion from @aanm was to patch Cilium so that it avoids garbage-collecting so many unused identities in a single loop.

See the Slack conversation at https://cilium.slack.com/archives/C1MATJ5U5/p1561039462124100 for some details.

    Labels

    kind/community-report: This was reported by a user in the Cilium community, e.g. via Slack.
    kind/performance: There is a performance impact of this.
    stale: The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale.