Roll back the recent change from 089e2c1 ("fix(deps): update all go dependencies main"). It breaks the GKE conformance workflow. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
This is causing breakage on GKE. Revert for now, we can re-apply once it also works there. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
e7373c3 to 997d61d
/ci-gke
/ci-gke
/scale-5
/net-perf-gke
/ci-ipsec
/test
github.com/cilium/coverbee v0.3.3-0.20240723084546-664438750fce
github.com/cilium/dns v1.1.51-0.20240603182237-af788769786a
github.com/cilium/ebpf v0.19.1-0.20251016154102-8f23ed69cf93
github.com/cilium/ebpf v0.19.1-0.20251013125301-c27ff922fc6a
@dylandreimerink @ti-mo heads-up, I'd expect that cilium/ebpf#1858 is the culprit here.
Sorry, which failure? Are there any logs available? The few failed runs I've spot-checked in the GKE conformance workflow are all sysdumps without (agent) logs.
Ack, I didn't find anything useful either. Simply bisected down to this change.
GKE COS sets net.core.bpf_jit_harden to 2 I believe (related). We noticed yesterday that Cilium stopped starting up in such environments, since startup probes like HaveBPFJIT and HaveDeadCodeElim would hit the ErrRestrictedKernel error.
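To confirm whether a node is in this state, the sysctl can be read directly from procfs. The following is a minimal sketch, not code from Cilium: the helper names `parseJITHarden` and `hardensAllUsers` are made up for illustration, and the value meanings (0 = off, 1 = unprivileged users only, 2 = all users) follow the kernel's sysctl documentation.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// jitHardenPath is the procfs file backing the net.core.bpf_jit_harden sysctl.
const jitHardenPath = "/proc/sys/net/core/bpf_jit_harden"

// parseJITHarden (hypothetical helper) converts the raw file contents,
// e.g. "2\n", into the numeric sysctl value and rejects anything outside 0..2.
func parseJITHarden(raw string) (int, error) {
	v, err := strconv.Atoi(strings.TrimSpace(raw))
	if err != nil {
		return 0, fmt.Errorf("parsing bpf_jit_harden value %q: %w", raw, err)
	}
	if v < 0 || v > 2 {
		return 0, fmt.Errorf("unexpected bpf_jit_harden value %d", v)
	}
	return v, nil
}

// hardensAllUsers (hypothetical helper) reports whether hardening also
// applies to privileged users, which is the setting discussed above.
func hardensAllUsers(v int) bool {
	return v == 2
}

func main() {
	raw, err := os.ReadFile(jitHardenPath)
	if err != nil {
		fmt.Println("could not read sysctl:", err)
		return
	}
	v, err := parseJITHarden(string(raw))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("net.core.bpf_jit_harden=%d, hardens all users: %t\n", v, hardensAllUsers(v))
}
```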
At a glance, I only see a handful of spots where prog.Info() is used in a way that would fail with the new ErrRestrictedKernel check, so I wonder if a forward fix is feasible:
- `HaveDeadCodeElim` uses `info.Instructions()` to ensure that there are no jump instructions in the final program.
- `HaveBPFJIT` probe checks for `info.JitedSize()`
- Unused map pruning: `verifyUnusedMaps` uses `info.Instructions` as well
As for the probes, there are environments like COS where these probes won't work, so I wonder if we can just print a warning if JitedSize or Instructions hits ErrRestrictedKernel here instead of failing startup altogether.
Patching these probes in such a way gets past the startup issues when bpf_jit_harden is enabled according to a simple kind cluster test I ran. With the unused map pruning, you get
time=2025-10-22T18:34:01.562199217Z level=debug source=/go/src/github.com/cilium/cilium/pkg/bpf/collection.go:256 msg="verifying unused maps: getting instructions for program tail_icmp6_send_time_exceeded: instructions: restricted by kernel.kptr_restrict and/or net.core.bpf_jit_harden sysctls" module=agent.datapath.loader
but it doesn't seem to prevent datapath regeneration from completing. Maybe the map pruning could just be disabled altogether in such a case to prevent log spam, but as a separate PR.
Environments like GKE COS enable BPF JIT hardening (net.core.bpf_jit_harden=2). With this enabled, BPF_OBJ_GET_INFO_BY_FD does not provide xlated instructions, JITed instructions, or other related info in its response when querying program info. So, any Cilium feature that relies on these fields won't work. Currently, there are three such places:

1. The `HaveBPFJIT` probe calls info.JitedSize() to determine if the program was JITed.
2. The `HaveDeadCodeElim` probe calls info.Instructions() to inspect the final BPF instructions to see if dead code elimination was applied.
3. `verifyUnusedMaps` calls info.Instructions() in an attempt to walk the final program instructions and see which maps are referenced after dead code elimination.

Recently, cilium/ebpf#1858 started returning ErrRestrictedKernel when querying these fields in such environments [1]. Before, these probes would silently fail, but now they fail loudly, blocking Cilium startup.

`verifyUnusedMaps` hits ErrRestrictedKernel when querying program instructions. AFAICT this does not impact functionality, but may result in unused maps being loaded.

```
msg="verifying unused maps: getting instructions for program tail_icmp6_send_time_exceeded: instructions: restricted by kernel.kptr_restrict and/or net.core.bpf_jit_harden sysctls" module=agent.datapath.loader
```

The version bump to cilium/ebpf in cilium/cilium was reverted in cilium#42327 [2] to fix these probe failures, so undo this and forward fix Cilium by tolerating ErrRestrictedKernel in the probes. This reverts Cilium back to its previous behavior, silently ignoring these probe failures in environments where BPF JIT hardening is enabled, while allowing cilium/ebpf to be upgraded.

[1]: cilium/ebpf#1858
[2]: cilium#42327

Signed-off-by: Jordan Rife <jrife@google.com>
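For the third item, one way to picture "tolerate the error, possibly leaving unused maps loaded" is the sketch below. It is an illustration under stated assumptions, not Cilium's `verifyUnusedMaps`: the `program` type, its `mapsInInstructions` method, the map names, and `errRestrictedKernel` (a stand-in for `ebpf.ErrRestrictedKernel`) are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errRestrictedKernel stands in for ebpf.ErrRestrictedKernel (assumption).
var errRestrictedKernel = errors.New("restricted by kernel.kptr_restrict and/or net.core.bpf_jit_harden sysctls")

// program is a hypothetical stand-in for a loaded BPF program.
type program struct {
	name           string
	referencedMaps []string // what walking the final instructions would yield
	restricted     bool     // simulates a kernel that hides instructions
}

// mapsInInstructions mimics reading the final instructions of a program.
func (p program) mapsInInstructions() ([]string, error) {
	if p.restricted {
		return nil, fmt.Errorf("instructions: %w", errRestrictedKernel)
	}
	return p.referencedMaps, nil
}

// unusedMaps returns the loaded maps that no program references. The bool is
// false when the kernel hid the instructions, in which case nothing is pruned.
func unusedMaps(loaded []string, progs []program) ([]string, bool, error) {
	used := make(map[string]bool)
	for _, p := range progs {
		maps, err := p.mapsInInstructions()
		if errors.Is(err, errRestrictedKernel) {
			// Give up on the whole check once instead of logging per program.
			return nil, false, nil
		}
		if err != nil {
			return nil, false, fmt.Errorf("getting instructions for program %s: %w", p.name, err)
		}
		for _, m := range maps {
			used[m] = true
		}
	}
	var unused []string
	for _, m := range loaded {
		if !used[m] {
			unused = append(unused, m)
		}
	}
	return unused, true, nil
}

func main() {
	loaded := []string{"map_a", "map_b"}
	progs := []program{{name: "tail_icmp6_send_time_exceeded", restricted: true}}
	unused, verified, err := unusedMaps(loaded, progs)
	fmt.Println(unused, verified, err)
}
```

Returning early on the first restricted program reflects the suggestion in the thread that pruning could be disabled altogether in such environments to avoid log spam.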
Revert some changes to address a bunch of breakages that affect the GKE conformance workflow on main. One is described in #42121; the other was introduced over the weekend by a renovate dependency bump. We'll need to follow up with a patch to exclude the ebpf library from renovate updates until the problem is addressed.