
Increased "hubble observe" CPU usage after upgrade from 1.15.15 to 1.15.16 or newer #42564

@tporeba

Description


I was planning to upgrade the cilium images I use to run hubble observe, and I noticed that newer images use significantly more CPU to do the same work.

I started from quay.io/cilium/cilium:v1.15.5, and switching to 1.15.16 or newer (1.17.x, 1.18.x) causes a ~5x increase in CPU usage in my setup.

I have 10 pods organized as a DaemonSet; each runs 3 hubble observe containers, each fetching a subset of the flows. In total these processes produce ~30 MB/min of logs for the whole system.

# log for business traffic
# 20 exclusions total
hubble observe --follow --print-node-name --time-format RFC3339Milli  \
  --not --namespace kube-system \
  --not --namespace A \
  --not --namespace B \
  --not --namespace C 
...

# log for technical traffic
# 20 inclusions total
hubble observe --follow --print-node-name --time-format RFC3339Milli \
  --namespace kube-system \
  --namespace A \
  --namespace B \
  --namespace C
...

# log for dropped traffic
hubble observe --follow --print-node-name --time-format RFC3339Milli \
  --type drop --type l7 --verdict DROPPED --not --to-ip ff02::/16
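
To confirm that the extra CPU is burned by the hubble observe processes themselves, a rough per-process check could look like the sketch below. The DaemonSet and container names ("hubble-logs", "business-traffic") are placeholders for my setup, and pidstat from the sysstat package is assumed to be available in the image (otherwise top -b -p <pid> works too):

# Sample CPU of one hubble observe process: 6 samples, 5 s apart.
kubectl exec -n logs ds/hubble-logs -c business-traffic -- sh -c \
  'pidstat -u -p "$(pgrep -f "hubble observe" | head -1)" 5 6'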

I tested a number of cilium versions:

  • 1.15.5, 1.15.6, 1.15.12, 1.15.14, 1.15.15 --> these behave normally; my Grafana shows that the whole DaemonSet uses < 1 CPU in total
  • 1.15.16, 1.15.19, 1.16.16, 1.17.6, 1.17.9, 1.18.3 --> for these I see increased CPU usage of almost 5 CPUs in total (spot check below)
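
For a quick spot check of those numbers without Grafana (assuming metrics-server is installed in the cluster):

# Per-container CPU of the log-collector pods
kubectl top pods -n logs --containers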

Here is a screenshot from Grafana after I downgraded back to 1.15.15:

[Image: Grafana CPU usage graph for the DaemonSet]

This is from the standard Grafana dashboard Kubernetes / Compute Resources / Namespace (Workloads); the metric plotted is more or less:

sum(
  node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate{namespace="logs"}
* on(namespace,pod)
  group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{namespace="logs"}
) by (workload, workload_type)
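
For reference, roughly the same signal can be pulled from the raw cAdvisor metric without the kube-prometheus recording rules; the Prometheus URL here is a placeholder:

curl -sG 'http://prometheus.monitoring:9090/api/v1/query' \
  --data-urlencode \
  'query=sum(rate(container_cpu_usage_seconds_total{namespace="logs",container!=""}[5m])) by (pod)'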

Here is also the hubble version from inside the containers for the two closest image tags:

v1.15.15 >  hubble version
hubble v1.17.1@HEAD-0d65c11 compiled with go1.23.6 on linux/amd64

v1.15.16 > hubble version
hubble v1.17.2@HEAD-aba36c0 compiled with go1.23.7 on linux/amd64
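
Since the jump maps to the bundled hubble CLI going from v1.17.1 to v1.17.2, the standalone CLI releases can also be A/B tested outside the cilium image; the release-asset naming below is assumed from the cilium/hubble release conventions:

# Fetch both CLI versions side by side for comparison (asset name assumed)
for v in v1.17.1 v1.17.2; do
  mkdir -p "/tmp/hubble-${v}"
  curl -sL "https://github.com/cilium/hubble/releases/download/${v}/hubble-linux-amd64.tar.gz" \
    | tar -xz -C "/tmp/hubble-${v}"
done
/tmp/hubble-v1.17.1/hubble version
/tmp/hubble-v1.17.2/hubble version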

Is this an intended effect or a bug?
Is the new hubble version doing something more that requires more CPU?

Metadata

Labels

  • affects/v1.15: This issue affects v1.15 branch
  • affects/v1.16: This issue affects v1.16 branch
  • affects/v1.17: This issue affects v1.17 branch
  • affects/v1.18: This issue affects v1.18 branch
  • area/hubble: Impacts hubble server or relay
  • kind/performance: There is a performance impact of this.
  • kind/regression: This functionality worked fine before, but was broken in a newer release of Cilium.
