unbreak GKE workflows #42327

Merged
julianwiedmann merged 2 commits into main from pr/jwi/main/gke-unbreak
Oct 23, 2025

@julianwiedmann
Member

@julianwiedmann julianwiedmann commented Oct 22, 2025

Revert some changes to address a bunch of breakages that affect the GKE conformance workflow on main.

One is described in #42121; the other was introduced over the weekend by a Renovate dependency bump. We'll need to follow up with a patch that excludes the ebpf library from Renovate updates until the problem is addressed.

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label (The author needs to describe the release impact of these changes.) label Oct 22, 2025
Roll back the recent change from
089e2c1 ("fix(deps): update all go dependencies main").

It breaks the GKE conformance workflow.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
This is causing breakage on GKE. Revert for now, we can re-apply once it
also works there.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
@julianwiedmann julianwiedmann force-pushed the pr/jwi/main/gke-unbreak branch from e7373c3 to 997d61d October 22, 2025 07:50
@julianwiedmann
Member Author

/ci-gke

@julianwiedmann
Member Author

/ci-gke

@julianwiedmann
Member Author

/scale-5

@julianwiedmann
Member Author

/net-perf-gke

@julianwiedmann
Member Author

/ci-ipsec

@julianwiedmann julianwiedmann changed the title from "Pr/jwi/main/gke unbreak" to "unbreak GKE workflow" Oct 22, 2025
@julianwiedmann julianwiedmann added the area/CI (Continuous Integration testing issue or flake), area/datapath (Impacts bpf/ or low-level forwarding details, including map management and monitor messages.), release-note/misc (This PR makes changes that have no direct user impact.), and dependencies (Pull requests that update a dependency file) labels Oct 22, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label (The author needs to describe the release impact of these changes.) label Oct 22, 2025
@julianwiedmann julianwiedmann changed the title from "unbreak GKE workflow" to "unbreak GKE workflows" Oct 22, 2025
@julianwiedmann julianwiedmann added the integration/cloud (Related to integration with cloud environments such as AKS, EKS, GKE, etc.) label Oct 22, 2025
@julianwiedmann
Member Author

/test

  github.com/cilium/coverbee v0.3.3-0.20240723084546-664438750fce
  github.com/cilium/dns v1.1.51-0.20240603182237-af788769786a
- github.com/cilium/ebpf v0.19.1-0.20251016154102-8f23ed69cf93
+ github.com/cilium/ebpf v0.19.1-0.20251013125301-c27ff922fc6a
Member Author

@dylandreimerink @ti-mo heads-up, I'd expect that cilium/ebpf#1858 is the culprit here.

Contributor

Sorry, which failure? Are there any logs available? The few failed runs I've spot-checked in the GKE conformance workflow are all sysdumps without (agent) logs.

Member Author

Ack, I didn't find anything useful either. Simply bisected down to this change.

Contributor

GKE COS sets net.core.bpf_jit_harden to 2 I believe (related). We noticed yesterday that Cilium stopped starting up in such environments, since startup probes like HaveBPFJIT and HaveDeadCodeElim would hit the ErrRestrictedKernel error.
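
To confirm whether a given node is in this state, the sysctl can be read from procfs. The following is a minimal sketch, assuming the standard `/proc/sys/net/core/bpf_jit_harden` path; `parseJITHarden` and `readJITHarden` are hypothetical helper names used for illustration, not Cilium or cilium/ebpf APIs.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseJITHarden parses the raw contents of the bpf_jit_harden sysctl file.
// Kernel-documented values: 0 = disabled, 1 = hardening for unprivileged
// users only, 2 = hardening for all users.
// Hypothetical helper, not part of Cilium or cilium/ebpf.
func parseJITHarden(raw string) (int, error) {
	v, err := strconv.Atoi(strings.TrimSpace(raw))
	if err != nil {
		return 0, fmt.Errorf("parsing bpf_jit_harden: %w", err)
	}
	if v < 0 || v > 2 {
		return 0, fmt.Errorf("unexpected bpf_jit_harden value %d", v)
	}
	return v, nil
}

// readJITHarden reads the sysctl from procfs (path assumed to be mounted).
func readJITHarden() (int, error) {
	raw, err := os.ReadFile("/proc/sys/net/core/bpf_jit_harden")
	if err != nil {
		return 0, err
	}
	return parseJITHarden(string(raw))
}

func main() {
	v, err := readJITHarden()
	if err != nil {
		fmt.Println("could not read sysctl:", err)
		return
	}
	fmt.Printf("net.core.bpf_jit_harden=%d\n", v)
}
```

A value of 2 would match the GKE COS behaviour described in this comment.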

Contributor

At a glance, I only see a handful of spots where prog.Info() is used in a way that would fail with the new ErrRestrictedKernel check, so I wonder if a forward fix is feasible:

  • HaveDeadCodeElim uses info.Instructions() to ensure that there are no jump instructions in the final program.
  • HaveBPFJIT probe checks for info.JitedSize()
  • Unused map pruning: verifyUnusedMaps uses info.Instructions as well

As for the probes, there are environments like COS where these probes won't work, so I wonder if we can just print a warning if JitedSize or Instructions hits ErrRestrictedKernel here instead of failing startup altogether.
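
The warn-instead-of-fail idea for the probes could be sketched roughly as follows. This is an illustration only, not the actual Cilium patch: `errRestrictedKernel` is a local stand-in for ErrRestrictedKernel so the example runs without cilium/ebpf, and `haveBPFJIT`, `probeResult`, and the accessor signature are assumptions made for the sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// errRestrictedKernel is a local stand-in for the ErrRestrictedKernel error
// discussed in this thread, so the sketch runs without importing cilium/ebpf.
var errRestrictedKernel = errors.New("restricted by kernel.kptr_restrict and/or net.core.bpf_jit_harden sysctls")

// probeResult distinguishes "unknown" from a definite yes or no.
type probeResult int

const (
	probeUnsupported probeResult = iota
	probeSupported
	probeUnknown // the kernel withheld the information
)

// haveBPFJIT is a hypothetical probe. jitedSize stands in for a call such as
// info.JitedSize(); its exact signature here is an assumption.
func haveBPFJIT(jitedSize func() (uint32, error)) (probeResult, error) {
	size, err := jitedSize()
	switch {
	case errors.Is(err, errRestrictedKernel):
		// Hardened kernel: report "unknown" rather than failing startup.
		return probeUnknown, nil
	case err != nil:
		return probeUnsupported, fmt.Errorf("getting JITed size: %w", err)
	case size == 0:
		return probeUnsupported, nil
	}
	return probeSupported, nil
}

func main() {
	// Simulate a hardened node: the accessor returns a wrapped restricted error.
	restricted := func() (uint32, error) {
		return 0, fmt.Errorf("jited size: %w", errRestrictedKernel)
	}
	res, err := haveBPFJIT(restricted)
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	if res == probeUnknown {
		fmt.Println("warning: JIT status unknown, continuing startup")
	}
}
```

Using `errors.Is` rather than a direct comparison keeps the check working when the error is wrapped with extra context along the way.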

According to a simple kind cluster test I ran, patching the probes this way gets past the startup issues when bpf_jit_harden is enabled. With the unused map pruning, you get:

time=2025-10-22T18:34:01.562199217Z level=debug source=/go/src/github.com/cilium/cilium/pkg/bpf/collection.go:256 msg="verifying unused maps: getting instructions for program tail_icmp6_send_time_exceeded: instructions: restricted by kernel.kptr_restrict and/or net.core.bpf_jit_harden sysctls" module=agent.datapath.loader

but it doesn't seem to prevent datapath regeneration from completing. Maybe the map pruning could just be disabled altogether in such a case to prevent log spam, but as a separate PR.
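
One way to avoid the log spam mentioned above would be to remember the first restricted error and skip verification from then on. The sketch below assumes that approach; `unusedMapVerifier`, its `instructions` callback, and the local `errRestrictedKernel` stand-in are all hypothetical and do not mirror the real `verifyUnusedMaps` code.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// errRestrictedKernel is a local stand-in for ErrRestrictedKernel, so the
// sketch runs without importing cilium/ebpf.
var errRestrictedKernel = errors.New("restricted by kernel.kptr_restrict and/or net.core.bpf_jit_harden sysctls")

// unusedMapVerifier is a hypothetical wrapper around the pruning check.
type unusedMapVerifier struct {
	// instructions stands in for fetching a program's final instructions;
	// in this sketch it only returns an instruction count.
	instructions func(prog string) (int, error)
	// restricted is set once the kernel has refused to expose instructions.
	restricted atomic.Bool
}

// verify inspects each program and returns how many were inspected.
// After the first restricted error it becomes a no-op, so the same message
// is not logged again on every datapath regeneration.
func (v *unusedMapVerifier) verify(progs []string) (int, error) {
	if v.restricted.Load() {
		return 0, nil
	}
	inspected := 0
	for _, name := range progs {
		if _, err := v.instructions(name); err != nil {
			if errors.Is(err, errRestrictedKernel) {
				v.restricted.Store(true)
				return inspected, nil
			}
			return inspected, fmt.Errorf("getting instructions for program %s: %w", name, err)
		}
		inspected++
	}
	return inspected, nil
}

func main() {
	v := &unusedMapVerifier{
		instructions: func(string) (int, error) { return 0, errRestrictedKernel },
	}
	for i := 0; i < 3; i++ {
		n, err := v.verify([]string{"tail_icmp6_send_time_exceeded"})
		fmt.Println(n, err, v.restricted.Load())
	}
}
```

The trade-off is the one already noted in the thread: with verification skipped, unused maps may stay loaded on hardened kernels.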

@julianwiedmann julianwiedmann marked this pull request as ready for review October 22, 2025 10:56
@julianwiedmann julianwiedmann requested review from a team as code owners October 22, 2025 10:56
@julianwiedmann julianwiedmann added this pull request to the merge queue Oct 23, 2025
Merged via the queue into main with commit ee7ca87 Oct 23, 2025
566 of 575 checks passed
@julianwiedmann julianwiedmann deleted the pr/jwi/main/gke-unbreak branch October 23, 2025 11:44
jrife added a commit to jrife/cilium that referenced this pull request Oct 23, 2025
Environments like GKE COS enable BPF JIT hardening
(net.core.bpf_jit_harden=2). With this enabled, BPF_OBJ_GET_INFO_BY_FD
does not provide xlated instructions, JITed instructions, or other
related info in its response when querying program info. So, any Cilium
feature that relies on these fields won't work. Currently, there are
three such places:

1. The `HaveBPFJIT` probe calls info.JitedSize() to determine if the
   program was JITed.
2. The `HaveDeadCodeElim` probe calls info.Instructions() to inspect the
   final BPF instructions to see if dead code elimination was applied.
3. `verifyUnusedMaps` calls info.Instructions() in an attempt to walk
   the final program instructions and see which maps are referenced
   after dead code elimination.

Recently, cilium/ebpf#1858 started returning ErrRestrictedKernel when
querying these fields in such environments [1]. Previously, these probes
would fail silently; now they fail loudly, blocking Cilium startup.

`verifyUnusedMaps` hits ErrRestrictedKernel when querying program
instructions. AFAICT this does not impact functionality, but may result
in unused maps being loaded.

```
msg="verifying unused maps: getting instructions for program tail_icmp6_send_time_exceeded: instructions: restricted by kernel.kptr_restrict and/or net.core.bpf_jit_harden sysctls" module=agent.datapath.loader
```

The version bump to cilium/ebpf in cilium/cilium was reverted in
cilium#42327 [2] to fix these probe failures, so undo this and
forward fix Cilium by tolerating ErrRestrictedKernel in the probes.
This reverts Cilium back to its previous behavior, silently ignoring
these probe failures in environments where BPF JIT hardening is enabled,
while allowing cilium/ebpf to be upgraded.

[1]: cilium/ebpf#1858
[2]: cilium#42327

Signed-off-by: Jordan Rife <jrife@google.com>
github-merge-queue bot pushed a commit that referenced this pull request Oct 28, 2025
@cilium-release-bot cilium-release-bot bot moved this to Released in cilium v1.19.0 Feb 3, 2026

Labels

area/CI: Continuous Integration testing issue or flake
area/datapath: Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
dependencies: Pull requests that update a dependency file
integration/cloud: Related to integration with cloud environments such as AKS, EKS, GKE, etc.
release-note/misc: This PR makes changes that have no direct user impact.

Projects

No open projects
Status: Released

5 participants