Skip to content

cilium-cli: Skip external IP NodePort validation on GKE#44014

Merged
aanm merged 1 commit intocilium:mainfrom
pillai-ashwin:fix/43749-gke-nodeport-external-ip
Jan 27, 2026
Merged

cilium-cli: Skip external IP NodePort validation on GKE#44014
aanm merged 1 commit intocilium:mainfrom
pillai-ashwin:fix/43749-gke-nodeport-external-ip

Conversation

@pillai-ashwin
Copy link
Copy Markdown
Member

@pillai-ashwin pillai-ashwin commented Jan 26, 2026

Please ensure your pull request adheres to the following guidelines:

  • For first time contributors, read Submitting a pull request
  • All code is covered by unit and/or runtime tests where feasible.
  • All commits contain a well written commit description including a title,
    description and a Fixes: #XXX line if the commit addresses a particular
    GitHub issue.
  • If your commit description contains a Fixes: <commit-id> tag, then
    please add the commit author[s] as reviewer[s] to this issue.
  • All commits are signed off. See the section Developer's Certificate of Origin
  • Provide a title or release-note blurb suitable for the release notes.
  • Are you a user of Cilium? Please add yourself to the Users doc
  • Thanks for contributing!

Summary

Fixes #43749 by skipping external IP NodePort validation on GKE, where external IPs are not reachable from inside the cluster.

Details

PR #40812 enhanced NodePort validation to check all node IP addresses (including external IPs) to fix dual-stack clusters. However, on GKE, external IPs are not reachable from within the cluster due to firewall restrictions.

The actual connectivity tests already handle this correctly by skipping external IPs on GKE (service.go:252-257), but the upfront validation code did not have this check, causing test failures.

Changes

  • Added GKE-aware external IP skip logic to NodePort validation in deployment.go (lines 2742-2749)
  • Rebased on main to incorporate the new IPv4/IPv6 family filtering check
  • Works alongside the existing IP family validation (lines 2735-2740)
  • Uses the same pattern already present elsewhere in the codebase (service.go:252-257, deployment.go:1781-1786)
  • Only affects GKE environments (no change for other platforms)
  • Preserves the dual-stack fix from PR cilium-cli: Validate NodePort readiness on all node IP addresses #40812

Root Cause Analysis

Current Code Structure (After Rebase)

The NodePort validation now has three sequential checks to filter addresses:

for _, node := range ct.Nodes() {
    for _, addr := range node.Status.Addresses {
        // 1. Only check IP addresses (skip DNS names, hostnames)
        if addr.Type != slimcorev1.NodeInternalIP && addr.Type != slimcorev1.NodeExternalIP {
            continue
        }

        // 2. Skip IP addresses of families which are not enabled in Cilium
        //    (Added by upstream after PR #40812)
        addrFamily := features.GetIPFamily(addr.Address)
        if (addrFamily == features.IPFamilyV4 && !ct.Features[features.IPv4].Enabled) ||
            (addrFamily == features.IPFamilyV6 && !ct.Features[features.IPv6].Enabled) {
            continue
        }

        // 3. [THIS PR] On GKE, ExternalIP is not reachable from inside a cluster
        //    Skip validation for external IPs in GKE to match the actual test behavior
        if addr.Type == slimcorev1.NodeExternalIP {
            if f, ok := ct.Feature(features.Flavor); ok && f.Enabled && f.Mode == "gke" {
                ct.Debugf("Skipping NodePort validation for external IP %s on GKE", addr.Address)
                continue
            }
        }

        // Validate NodePort readiness
        for _, s := range ct.echoServices {
            if err := WaitForNodePorts(ctx, ct, *client, addr.Address, s); err != nil {
                return err
            }
        }
    }
}

What Broke

In PR #40812, the upfront NodePort validation was enhanced from:

// Old: Only validated HostIP (one IP per node, typically IPv4)
for _, ciliumPod := range ct.ciliumPods {
    hostIP := ciliumPod.Pod.Status.HostIP
    WaitForNodePorts(ctx, ct, *client, hostIP, s)
}

To:

// New: Validates ALL node addresses (internal + external)
for _, node := range ct.Nodes() {
    for _, addr := range node.Status.Addresses {
        if addr.Type != slimcorev1.NodeInternalIP && addr.Type != slimcorev1.NodeExternalIP {
            continue
        }
        WaitForNodePorts(ctx, ct, *client, addr.Address, s)  // Validates external IPs too
    }
}

This correctly fixed dual-stack IPv6 validation but inadvertently started validating external IPs that are unreachable on GKE due to firewall restrictions.

The Fix

The actual connectivity tests already skip external IPs on GKE:

// From service.go:252-257
if addr.Type == slimcorev1.NodeExternalIP {
    if f, ok := t.Context().Feature(features.Flavor); ok && f.Enabled && f.Mode == "gke" {
        continue  // ← Tests skip external IPs on GKE
    }
}

This same pattern exists in another part of deployment.go:

// From deployment.go:1781-1786
// On GKE ExternalIP is not reachable from inside a cluster
if addr.Type == slimcorev1.NodeExternalIP {
    if f, ok := ct.Feature(features.Flavor); ok && f.Enabled && f.Mode == "gke" {
        continue
    }
}

This PR adds the same check to the NodePort validation phase for consistency:

// In deployment.go:2742-2749 (this PR)
// On GKE, ExternalIP is not reachable from inside a cluster.
// Skip validation for external IPs in GKE to match the actual test behavior.
if addr.Type == slimcorev1.NodeExternalIP {
    if f, ok := ct.Feature(features.Flavor); ok && f.Enabled && f.Mode == "gke" {
        ct.Debugf("Skipping NodePort validation for external IP %s on GKE", addr.Address)
        continue
    }
}

Why External IPs Are Unreachable on GKE

On GKE:

  • Internal IPs (e.g., 10.128.0.2): ✅ Reachable from inside the cluster
  • External IPs (e.g., 34.106.17.216): ❌ Blocked by GKE firewall, not reachable from inside the cluster

The error logs from issue #43749 show timeouts on external IPs:

⌛ Waiting for NodePort 34.106.17.216:31639...
timeout reached waiting for NodePort 34.106.17.216:31639

Testing

This fix should resolve failures in:

  • GKE no-tunnel config
  • GKE tunnel config

Without affecting:

  • GKE tunnel-ingress-controller config (KPR)
  • Dual-stack IPv6 functionality (still validates IPv6 internal addresses)
  • Other cloud providers (EKS, AKS, bare metal, etc.)
  • IPv4/IPv6 family filtering

References

Fixes: #43749
See-also: #39793 (original dual-stack issue fixed by #40812)
See-also: #40812 (PR that introduced the validation enhancement)

Fix GKE conformance test NodePort timeouts by skipping unreachable external IP validation on GKE

@pillai-ashwin pillai-ashwin requested a review from a team as a code owner January 26, 2026 11:25
@pillai-ashwin pillai-ashwin requested a review from nebril January 26, 2026 11:25
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jan 26, 2026
@github-actions github-actions bot added cilium-cli This PR contains changes related with cilium-cli cilium-cli-exclusive This PR only impacts cilium-cli binary labels Jan 26, 2026
@julianwiedmann julianwiedmann added release-note/bug This PR fixes an issue in a previous release of Cilium. integration/cloud Related to integration with cloud environments such as AKS, EKS, GKE, etc. labels Jan 26, 2026
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Jan 26, 2026
@julianwiedmann
Copy link
Copy Markdown
Member

(note that this needs a rebase)

@julianwiedmann julianwiedmann added the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Jan 26, 2026
@pillai-ashwin pillai-ashwin force-pushed the fix/43749-gke-nodeport-external-ip branch from ac34763 to f92de4d Compare January 26, 2026 11:45
@pillai-ashwin
Copy link
Copy Markdown
Member Author

Rebased on latest main (e382a33). Resolved merge conflict by keeping both the new IPv4/IPv6 family check and the GKE external IP skip logic.

PR cilium#40812 added validation for NodePorts on all node IP addresses
(internal and external) to fix dual-stack clusters. However, on GKE,
external IPs are not reachable from inside the cluster due to firewall
restrictions.

The connectivity tests already skip external IPs on GKE,
but the upfront validation added by cilium#40812 did not, causing timeouts
during test setup.

This commit adds the same GKE-aware external IP skip logic to the
NodePort validation phase, making it consistent with the actual test
behavior.

Fixes: cilium#43749
See-also: cilium#39793

Signed-off-by: Ashwin Pillai <pillaiashwin96@gmail.com>
@pillai-ashwin pillai-ashwin force-pushed the fix/43749-gke-nodeport-external-ip branch from f92de4d to f1a5a64 Compare January 26, 2026 11:46
@julianwiedmann
Copy link
Copy Markdown
Member

/ci-gke

@julianwiedmann julianwiedmann removed the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Jan 26, 2026
@julianwiedmann
Copy link
Copy Markdown
Member

/test

@gandro
Copy link
Copy Markdown
Member

gandro commented Jan 26, 2026

The linked issue says the problem does not occur if KPR is enabled. Do we know why? This PR feels more like a workaround without providing a root cause analysis.

@julianwiedmann
Copy link
Copy Markdown
Member

The linked issue says the problem does not occur if KPR is enabled. Do we know why? This PR feels more like a workaround without providing a root cause analysis.

just a hunch - but with KPR enabled, we most likely also have SocketLB attached to pods. Which then wildcard-matches all accesses to NodePort services, and applies service translation at the source (without the request ever actually hitting the remote node). And thus the remote node's external IP doesn't need to be accessible from within the cluster.

@aanm aanm enabled auto-merge January 26, 2026 15:18
@aanm aanm added this pull request to the merge queue Jan 27, 2026
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jan 27, 2026
Merged via the queue into cilium:main with commit 0a65fb9 Jan 27, 2026
77 checks passed
lambchop4prez added a commit to lambchop4prez/network that referenced this pull request Feb 26, 2026
This PR contains the following updates:

| Package | Update | Change |
|---|---|---|
|
[aqua:cilium/cilium-cli](https://redirect.github.com/cilium/cilium-cli)
| patch | `0.19.0` → `0.19.2` |

---

### Release Notes

<details>
<summary>cilium/cilium-cli (aqua:cilium/cilium-cli)</summary>

###
[`v0.19.2`](https://redirect.github.com/cilium/cilium-cli/compare/v0.19.1...v0.19.2)

[Compare
Source](https://redirect.github.com/cilium/cilium-cli/compare/v0.19.1...v0.19.2)

###
[`v0.19.1`](https://redirect.github.com/cilium/cilium-cli/releases/tag/v0.19.1)

[Compare
Source](https://redirect.github.com/cilium/cilium-cli/compare/v0.19.0...v0.19.1)

## Summary of Changes

**Minor Changes:**

- cli: clustermesh: use ca bundle to connect clusters
([cilium/cilium#42833](https://redirect.github.com/cilium/cilium/issues/42833),
[@&#8203;MrFreezeex](https://redirect.github.com/MrFreezeex))

**Bugfixes:**

- cilium-cli: Fix NodePort deployment check in dual-stack clusters
([cilium/cilium#43888](https://redirect.github.com/cilium/cilium/issues/43888),
[@&#8203;gandro](https://redirect.github.com/gandro))
- Fix GKE conformance test NodePort timeouts by skipping unreachable
external IP validation on GKE
([cilium/cilium#44014](https://redirect.github.com/cilium/cilium/issues/44014),
[@&#8203;pillai-ashwin](https://redirect.github.com/pillai-ashwin))

**CI Changes:**

- cli: Relax warning exclusion for "unable to find key in local cache"
([cilium/cilium#44149](https://redirect.github.com/cilium/cilium/issues/44149),
[@&#8203;brb](https://redirect.github.com/brb))

**Misc Changes:**

- chore(deps): update all-dependencies (main)
([cilium/cilium#43700](https://redirect.github.com/cilium/cilium/issues/43700),
[@&#8203;cilium-renovate](https://redirect.github.com/cilium-renovate)\[bot])
- chore(deps): update all-dependencies (main)
([cilium/cilium#43824](https://redirect.github.com/cilium/cilium/issues/43824),
[@&#8203;cilium-renovate](https://redirect.github.com/cilium-renovate)\[bot])
- chore(deps): update all-dependencies (main)
([cilium/cilium#43965](https://redirect.github.com/cilium/cilium/issues/43965),
[@&#8203;cilium-renovate](https://redirect.github.com/cilium-renovate)\[bot])
- chore(deps): update all-dependencies (main)
([cilium/cilium#44090](https://redirect.github.com/cilium/cilium/issues/44090),
[@&#8203;cilium-renovate](https://redirect.github.com/cilium-renovate)\[bot])
- chore(deps): update all-dependencies (main)
([cilium/cilium#44235](https://redirect.github.com/cilium/cilium/issues/44235),
[@&#8203;cilium-renovate](https://redirect.github.com/cilium-renovate)\[bot])
- chore(deps): update base-images (main)
([cilium/cilium#43827](https://redirect.github.com/cilium/cilium/issues/43827),
[@&#8203;cilium-renovate](https://redirect.github.com/cilium-renovate)\[bot])
- chore(deps): update base-images (main)
([cilium/cilium#43969](https://redirect.github.com/cilium/cilium/issues/43969),
[@&#8203;cilium-renovate](https://redirect.github.com/cilium-renovate)\[bot])
- chore(deps): update base-images (main)
([cilium/cilium#44239](https://redirect.github.com/cilium/cilium/issues/44239),
[@&#8203;cilium-renovate](https://redirect.github.com/cilium-renovate)\[bot])
- cilium-cli: Fix container name in connectivity test logs
([cilium/cilium#44076](https://redirect.github.com/cilium/cilium/issues/44076),
[@&#8203;HadrienPatte](https://redirect.github.com/HadrienPatte))
- docs: fix typos in comments
([cilium/cilium#43821](https://redirect.github.com/cilium/cilium/issues/43821),
[@&#8203;NAM-MAN](https://redirect.github.com/NAM-MAN))
- feat(cilium-cli): Add -r(estart) parameter to cilium upgrade
([cilium/cilium#43722](https://redirect.github.com/cilium/cilium/issues/43722),
[@&#8203;alagoutte](https://redirect.github.com/alagoutte))
- Introduce end-to-end tests for Cilium's ZTunnel integration.
([cilium/cilium#43166](https://redirect.github.com/cilium/cilium/issues/43166),
[@&#8203;ldelossa](https://redirect.github.com/ldelossa))
- Replace Index{,Byte} with Cut,Contains
([cilium/cilium#43708](https://redirect.github.com/cilium/cilium/issues/43708),
[@&#8203;joestringer](https://redirect.github.com/joestringer))
- sysdump: Use label selectors for Hubble UI/Relay deployment collection
([cilium/cilium#44227](https://redirect.github.com/cilium/cilium/issues/44227),
[@&#8203;darox](https://redirect.github.com/darox))
- chore(deps): update dependency cilium/cilium to v1.18.6 by
[@&#8203;renovate](https://redirect.github.com/renovate)\[bot] in
[#&#8203;3168](https://redirect.github.com/cilium/cilium-cli/pull/3168)
- Update stable release to v0.19.0 by
[@&#8203;tklauser](https://redirect.github.com/tklauser) in
[#&#8203;3169](https://redirect.github.com/cilium/cilium-cli/pull/3169)
- chore(deps): update go to v1.25.6 by
[@&#8203;renovate](https://redirect.github.com/renovate)\[bot] in
[#&#8203;3170](https://redirect.github.com/cilium/cilium-cli/pull/3170)
- chore(deps): update docker.io/library/golang:1.25.6 docker digest to
[`ce63a16`](https://redirect.github.com/cilium/cilium-cli/commit/ce63a16)
by [@&#8203;renovate](https://redirect.github.com/renovate)\[bot] in
[#&#8203;3172](https://redirect.github.com/cilium/cilium-cli/pull/3172)
- chore(deps): update actions/checkout action to v6.0.2 by
[@&#8203;renovate](https://redirect.github.com/renovate)\[bot] in
[#&#8203;3171](https://redirect.github.com/cilium/cilium-cli/pull/3171)
- ci: Harden the image build process by
[@&#8203;ferozsalam](https://redirect.github.com/ferozsalam) in
[#&#8203;3174](https://redirect.github.com/cilium/cilium-cli/pull/3174)
- chore(deps): update gcr.io/distroless/static:latest docker digest to
[`972618c`](https://redirect.github.com/cilium/cilium-cli/commit/972618c)
by [@&#8203;renovate](https://redirect.github.com/renovate)\[bot] in
[#&#8203;3176](https://redirect.github.com/cilium/cilium-cli/pull/3176)
- chore(deps): update all github action dependencies by
[@&#8203;renovate](https://redirect.github.com/renovate)\[bot] in
[#&#8203;3175](https://redirect.github.com/cilium/cilium-cli/pull/3175)
- chore(deps): update go to v1.25.7 by
[@&#8203;renovate](https://redirect.github.com/renovate)\[bot] in
[#&#8203;3178](https://redirect.github.com/cilium/cilium-cli/pull/3178)
- chore(deps): update golangci/golangci-lint docker tag to v2.9.0 by
[@&#8203;renovate](https://redirect.github.com/renovate)\[bot] in
[#&#8203;3179](https://redirect.github.com/cilium/cilium-cli/pull/3179)
- chore(deps): update go to v1.26.0 by
[@&#8203;renovate](https://redirect.github.com/renovate)\[bot] in
[#&#8203;3181](https://redirect.github.com/cilium/cilium-cli/pull/3181)
- chore(deps): update docker/build-push-action action to v6.19.0 by
[@&#8203;renovate](https://redirect.github.com/renovate)\[bot] in
[#&#8203;3180](https://redirect.github.com/cilium/cilium-cli/pull/3180)
- chore(deps): update docker/build-push-action action to v6.19.2 by
[@&#8203;renovate](https://redirect.github.com/renovate)\[bot] in
[#&#8203;3183](https://redirect.github.com/cilium/cilium-cli/pull/3183)
- Prepare for v0.19.1 release by
[@&#8203;tklauser](https://redirect.github.com/tklauser) in
[#&#8203;3184](https://redirect.github.com/cilium/cilium-cli/pull/3184)

#### New Contributors

- [@&#8203;ferozsalam](https://redirect.github.com/ferozsalam) made
their first contribution in
[#&#8203;3174](https://redirect.github.com/cilium/cilium-cli/pull/3174)

**Full Changelog**:
<cilium/cilium-cli@v0.19.0...v0.19.1>

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined),
Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you
are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the
rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update
again.

---

- [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check
this box

---

This PR was generated by [Mend Renovate](https://mend.io/renovate/).
View the [repository job
log](https://developer.mend.io/github/lambchop4prez/network).

<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiI0My44LjUiLCJ1cGRhdGVkSW5WZXIiOiI0My4zNi4yIiwidGFyZ2V0QnJhbmNoIjoibWFpbiIsImxhYmVscyI6WyJyZW5vdmF0ZS9naXRodWItcmVsZWFzZXMiLCJ0eXBlL3BhdGNoIl19-->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cilium-cli This PR contains changes related with cilium-cli cilium-cli-exclusive This PR only impacts cilium-cli binary integration/cloud Related to integration with cloud environments such as AKS, EKS, GKE, etc. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Conformance GKE: timeout reached waiting for NodePort

5 participants