lb: BPF reconciler panics with nil pointer dereference on nil slot during ResetAndRestore #44896
Description
Is there an existing issue for this?
- I have searched the existing issues
Version
lower than v1.19.0
What happened?
The Cilium agent crashes with a nil pointer dereference in BPFOps.ResetAndRestore() when iterating over quarantined backend slots during BPF map restoration.
In pkg/loadbalancer/reconciler/bpf_reconciler.go, the loop that processes quarantined backend slots:

```go
for _, slot := range slots[1+master.GetCount():] {
	if addr, found := backendIDToAddress[slot.GetBackendID()]; found {
		backends.Insert(addr)
	}
}
```

does not guard against nil entries. The crash occurs during agent startup or restore, which makes it especially disruptive: the agent cannot recover without clearing the BPF map.
How can we reproduce the issue?
Haven't been able to replicate the issue on an actual cluster, but the following steps should lead to it. The precondition is a BPF map state in which the slot IDs for the backends were published incorrectly:
- Service has active + terminating backends. A service with 2 active and 2 terminating backends is reconciled. In the BPF maps, slots 1 and 2 hold active backends; slots 3 and 4 hold quarantined (terminating) backends.
- Agent crashes and restarts. All in-memory state is lost, but the BPF map survives in the kernel. The agent reads the BPF map and repopulates its quarantined-backend state with the two terminating backends from slots 3 and 4.
- Concurrently with the restart, the service scales up with 2 ready pods and 1 not-ready pod. New active backends and a not-ready backend (ready false and terminating false, i.e. Maintenance state in the agent's in-memory view) are added. The slot 3 and slot 4 pods are still terminating.
- First reconciliation fires. Slot IDs are allocated incrementally in v1.18 (i.e. `i+1`), so the agent skips the Maintenance-state backend but still increments `i` and writes the remaining backends with a gap in the slot IDs (`1, 2, 3, 4, 6, 7` instead of `1, 2, 3, 4, 5, 6`), while the internal reference state still references the Maintenance backend.
- Second reconciliation now sees the mismatch between the map state and the agent's backend references and deletes the highest slot from the map.
- When the agent restarts again, it tries to access slot 5 during the restore loop but panics because the slot does not exist, and it never recovers.
Why doesn't this happen more often?
The length check in the reconciler, `len(slots) == 1 + count + qcount`, is what saves most restarts. After the first buggy reconciliation (the first-reconciliation step above), the map has an extra slot (7) beyond what the master entry stores as the total count, so the guard fails (8 ≠ 7) and the quarantine loop is safely skipped. It takes the second reconciliation to delete that extra slot and make the math line up, so only this specific sequence hits the panic.
Cilium Version
v1.18.6
Kernel Version
Not able to get from the cluster
Kubernetes Version
v1.34
Regression
No response
Sysdump
No response
Relevant log output
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x34fed87]
goroutine 1 [running]:
github.com/cilium/cilium/pkg/loadbalancer/reconciler.(*BPFOps).ResetAndRestore(0xc000d7cd20)
/go/src/github.com/cilium/cilium/pkg/loadbalancer/reconciler/bpf_reconciler.go:313 +0xde7
github.com/cilium/cilium/pkg/loadbalancer/reconciler.(*BPFOps).start(...)
/go/src/github.com/cilium/cilium/pkg/loadbalancer/reconciler/bpf_reconciler.go:226
github.com/cilium/hive/cell.Hook.Start(...)
/go/src/github.com/cilium/cilium/vendor/github.com/cilium/hive/cell/lifecycle.go:43
github.com/cilium/hive/cell.(*DefaultLifecycle).Start(0xc0008751a0, 0xc000486ff0, {0x50f16e0?, 0xc001c3e000?})
/go/src/github.com/cilium/cilium/vendor/github.com/cilium/hive/cell/lifecycle.go:128 +0x2fd
github.com/cilium/hive.(*Hive).Start(0xc0009b1b30, 0xc000486ff0, {0x50f16e0, 0xc001c3e000})
/go/src/github.com/cilium/cilium/vendor/github.com/cilium/hive/hive.go:359 +0x131
github.com/cilium/hive.(*Hive).Run(0xc0009b1b30, 0xc000486ff0)
/go/src/github.com/cilium/cilium/vendor/github.com/cilium/hive/hive.go:231 +0x85
github.com/cilium/cilium/daemon/cmd.NewAgentCmd.func1(0xc000304f08, {0x49c6cfe?, 0x4?, 0x49c6b9a?})
/go/src/github.com/cilium/cilium/daemon/cmd/root.go:52 +0x1f9
github.com/spf13/cobra.(*Command).execute(0xc000304f08, {0xc0001be110, 0x1, 0x1})
/go/src/github.com/cilium/cilium/vendor/github.com/spf13/cobra/command.go:1019 +0xa91
github.com/spf13/cobra.(*Command).ExecuteC(0xc000304f08)
/go/src/github.com/cilium/cilium/vendor/github.com/spf13/cobra/command.go:1148 +0x46f
github.com/spf13/cobra.(*Command).Execute(...)
/go/src/github.com/cilium/cilium/vendor/github.com/spf13/cobra/command.go:1071
github.com/cilium/cilium/daemon/cmd.Execute(0x4c29a38?)
/go/src/github.com/cilium/cilium/daemon/cmd/root.go:90 +0x13
main.main()
	/go/src/github.com/cilium/cilium/daemon/main.go:15 +0x1f
Anything else?
No response
Cilium Users Document
- Are you a user of Cilium? Please add yourself to the Users doc
Code of Conduct
- I agree to follow this project's Code of Conduct