At scale, Cilium users often face mysterious TCP connection failures from unexpected RST packets. This session explores a critical bug where Cilium's BPF-based SNAT and its LRU eviction policy prematurely terminate active sessions. We will dissect the root cause in the eBPF datapath and reveal the elegant fix, now merged upstream in Pull Request #37747: proactively restoring the original NAT entry on the reverse traffic path. This solution, born from a real-world production issue, reduced connection failures from up to 10% to nearly zero.
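For readers who want intuition before the session, the failure mode and the fix can be sketched as a toy model. This is not Cilium's datapath code: the class `ToyNat`, the `restore_on_reverse` flag, the port numbers, and the table capacity are all hypothetical, and the sketch assumes the problem arises when the forward (egress) entry of a connection is evicted from a shared LRU table while its reverse entry is kept alive by reply traffic.

```python
from collections import OrderedDict


class ToyNat:
    """Toy SNAT table: one shared LRU holding forward and reverse entries.

    Hypothetical model for illustration only; not Cilium's implementation.
    """

    def __init__(self, capacity, restore_on_reverse):
        self.capacity = capacity
        self.restore_on_reverse = restore_on_reverse
        self.table = OrderedDict()  # least recently used entry comes first
        self.next_port = 40000      # arbitrary starting port for the sketch

    def _put(self, key, value):
        self.table[key] = value
        self.table.move_to_end(key)
        while len(self.table) > self.capacity:
            self.table.popitem(last=False)  # evict the least recently used entry

    def _get(self, key):
        if key in self.table:
            self.table.move_to_end(key)  # a lookup refreshes the entry
            return self.table[key]
        return None

    def egress(self, src):
        """Outgoing packet: return the translated source port."""
        port = self._get(("fwd", src))
        if port is None:
            # No forward entry: allocate a fresh mapping. For an established
            # connection this changes the source port seen by the peer.
            port = self.next_port
            self.next_port += 1
            self._put(("fwd", src), port)
            self._put(("rev", port), src)
        return port

    def ingress(self, port):
        """Reply packet: return the original source, or None if unknown."""
        src = self._get(("rev", port))
        if src is not None and self.restore_on_reverse:
            if ("fwd", src) not in self.table:
                # The idea from the abstract: restore the original entry
                # while handling reverse traffic.
                self._put(("fwd", src), port)
        return src


def run(restore_on_reverse):
    nat = ToyNat(capacity=4, restore_on_reverse=restore_on_reverse)
    first = nat.egress("A")   # connection A is established
    nat.egress("B")           # other traffic fills the table
    nat.ingress(first)        # replies keep A's reverse entry fresh
    nat.egress("C")           # pressure evicts A's forward entry
    nat.ingress(first)        # another reply for A arrives
    second = nat.egress("A")  # A sends again
    return first, second


print(run(restore_on_reverse=False))
print(run(restore_on_reverse=True))
```

In this toy run, connection A is mapped to a different port after eviction when restoration is off, which a remote peer would treat as an unknown connection and might answer with an RST. With restoration on, the reply packet re-creates the forward entry and A keeps its original mapping.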
This talk is a must for operators debugging network instability and developers tackling real-world eBPF challenges. You will leave with a clear diagnosis for this "silent killer" and key insights into building robust, high-performance cloud networking.
Gyutae Bae is a software engineer on NAVER Corp.’s Container Platform team. He works on large-scale Kubernetes networking with Cilium/eBPF, focusing on reliability and performance at scale. He diagnosed and fixed a connection-stability issue in Cilium and contributed the...