Conversation

Member

@kvaps kvaps commented Dec 16, 2025

Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

What this PR does

Integrate fix:
  • kubevirt/cloud-provider-kubevirt#379

Release note

[kubernetes] Fix endpoints for cilium-gateway

Summary by CodeRabbit

  • Bug Fixes
    • Improved handling of incomplete endpoint data by introducing fallback detection mechanisms.
    • Enhanced service discovery to gather endpoints from all available resources when standard detection fails.
    • Updated logging to provide better visibility into fallback operations and current resource status.

✏️ Tip: You can customize this high-level summary in your review settings.

@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Dec 16, 2025
Contributor

coderabbitai bot commented Dec 16, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

This change adds a fallback mechanism to handle endpoints without NodeName in the KubeVirt cloud provider's endpoint discovery. When endpoints lack NodeName, the controller now retrieves endpoints from all VMIs via a new helper function instead of filtering by node. Logging is updated to reflect the fallback behavior.
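
The decision rule can be read straight off this description. A minimal, self-contained Go sketch of it (the helper name is illustrative; hasEndpointsWithoutNodeName is the flag named in the walkthrough, but this is not the upstream code):

package main

import "fmt"

// shouldFallBackToAllVMIs mirrors the walkthrough's rule: fall back only when
// NodeName-based filtering yielded nothing AND endpoints without NodeName exist.
func shouldFallBackToAllVMIs(nodeNameEndpoints int, hasEndpointsWithoutNodeName bool) bool {
	return nodeNameEndpoints == 0 && hasEndpointsWithoutNodeName
}

func main() {
	fmt.Println(shouldFallBackToAllVMIs(0, true))  // true: call getAllVMIEndpoints()
	fmt.Println(shouldFallBackToAllVMIs(2, true))  // false: normal NodeName-based path
	fmt.Println(shouldFallBackToAllVMIs(0, false)) // false: nothing to fall back from
}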

Changes

Cohort: KubeVirt Endpoints Controller Enhancement
File(s): pkg/controller/kubevirteps/kubevirteps_controller.go
Summary: Added hasEndpointsWithoutNodeName tracking to detect endpoints lacking NodeName. Introduced a fallback flow that calls the new getAllVMIEndpoints() helper when standard NodeName-based filtering yields no results but endpoints exist. Updated logging to emit a fallback notice and reflect the current node set. Extended endpoint construction to derive endpoints from VMI default interface IPs and to set endpoint conditions (Ready, Serving, Terminating) based on VMI status. The new getAllVMIEndpoints() method lists VMIs in the infra namespace and constructs Endpoint objects with error handling and warnings for missing fields.

Sequence Diagram(s)

sequenceDiagram
    participant Ctrl as Controller
    participant Api as Kubernetes API
    participant Log as Logger
    
    rect rgb(200, 220, 255)
    Note over Ctrl,Api: Original Flow: Endpoints with NodeName
    Ctrl->>Api: Get endpoints by NodeName
    Api-->>Ctrl: Return filtered endpoints
    Ctrl->>Log: Log "Desired nodes"
    end
    
    rect rgb(255, 220, 200)
    Note over Ctrl,Api: Fallback Flow: Endpoints without NodeName
    Ctrl->>Ctrl: Check hasEndpointsWithoutNodeName
    alt NodeName endpoints empty AND orphan endpoints exist
        Ctrl->>Api: getAllVMIEndpoints() - list all VMIs
        Api-->>Ctrl: Return all VMIs from infra namespace
        Ctrl->>Ctrl: Construct endpoints from VMI IPs<br/>(Ready, Serving, Terminating conditions)
        Ctrl->>Log: Emit fallback notice
        Ctrl->>Log: Log current node set
    else Standard path sufficient
        Ctrl->>Log: Normal logging
    end
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A hop through the endpoints divine,
Where nodes may hide or fail to align,
Our fallback dance finds VMIs all,
Through namespaces' sprawl, we heed the call,
No orphaned endpoint left behind! 🌟

Pre-merge checks

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title clearly and specifically describes the main change: fixing endpoints for cilium-gateway by integrating a fix from the kubevirt cloud-provider project.


Contributor

Summary of Changes

Hello @kvaps, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refines the Kubernetes endpoint management within a Kubevirt environment, specifically targeting the cilium-gateway. It introduces a more resilient approach to service endpoint discovery by implementing a fallback mechanism. This ensures that services remain accessible even when endpoint slices might not contain explicit node assignments, by dynamically leveraging all available Virtual Machine Instances as potential traffic destinations.

Highlights

  • Improved Endpoint Handling: Enhanced the kubevirteps_controller to gracefully handle endpoint slices that may lack NodeName information, preventing potential service disruption.
  • Fallback Mechanism for Missing NodeNames: Introduced a new fallback logic: if no endpoints with NodeName are found but there are endpoints without one, the system will dynamically generate endpoints for all Virtual Machine Instances (VMIs) in the infrastructure namespace, similar to ExternalTrafficPolicy=Cluster behavior.
  • New getAllVMIEndpoints Function: Added a dedicated function to list and convert all VMIs into Kubernetes discovery.Endpoint objects, ensuring service continuity by providing a comprehensive set of potential traffic destinations.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review: /gemini review performs a code review for the current pull request in its current state.
  • Pull Request Summary: /gemini summary provides a summary of the current pull request in its current state.
  • Comment: @gemini-code-assist responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help: /gemini help displays a list of available commands.

Customization

To customize Gemini Code Assist for the GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@dosubot dosubot bot added the bug Something isn't working label Dec 16, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request modifies kubevirteps_controller.go to introduce a fallback mechanism in getDesiredEndpoints. This mechanism, triggered when no endpoints with NodeName are found but endpoints without NodeName exist, calls a new getAllVMIEndpoints function. The new function lists all VMIs in the infra namespace and converts them into discovery.Endpoint objects. A review comment suggests refactoring this change by extracting the common logic for converting a VirtualMachineInstance to a discovery.Endpoint into a shared helper function to avoid code duplication between getDesiredEndpoints and getAllVMIEndpoints.

Comment on lines +99 to +155
+// getAllVMIEndpoints returns endpoints for all VMIs in the infra namespace.
+// This is used as a fallback when tenant endpoints don't have NodeName specified,
+// similar to ExternalTrafficPolicy=Cluster behavior where traffic is distributed to all nodes.
+func (c *Controller) getAllVMIEndpoints() []*discovery.Endpoint {
+	var endpoints []*discovery.Endpoint
+
+	// List all VMIs in the infra namespace
+	vmiList, err := c.infraDynamic.
+		Resource(kubevirtv1.VirtualMachineInstanceGroupVersionKind.GroupVersion().WithResource("virtualmachineinstances")).
+		Namespace(c.infraNamespace).
+		List(context.TODO(), metav1.ListOptions{})
+	if err != nil {
+		klog.Errorf("Failed to list VMIs in namespace %q: %v", c.infraNamespace, err)
+		return endpoints
+	}
+
+	for _, obj := range vmiList.Items {
+		vmi := &kubevirtv1.VirtualMachineInstance{}
+		err = runtime.DefaultUnstructuredConverter.FromUnstructured(obj.Object, vmi)
+		if err != nil {
+			klog.Errorf("Failed to convert Unstructured to VirtualMachineInstance: %v", err)
+			continue
+		}
+
+		if vmi.Status.NodeName == "" {
+			klog.Warningf("Skipping VMI %s/%s: NodeName is empty", vmi.Namespace, vmi.Name)
+			continue
+		}
+		nodeNamePtr := &vmi.Status.NodeName
+
+		ready := vmi.Status.Phase == kubevirtv1.Running
+		serving := vmi.Status.Phase == kubevirtv1.Running
+		terminating := vmi.Status.Phase == kubevirtv1.Failed || vmi.Status.Phase == kubevirtv1.Succeeded
+
+		for _, i := range vmi.Status.Interfaces {
+			if i.Name == "default" {
+				if i.IP == "" {
+					klog.Warningf("VMI %s/%s interface %q has no IP, skipping", vmi.Namespace, vmi.Name, i.Name)
+					continue
+				}
+				endpoints = append(endpoints, &discovery.Endpoint{
+					Addresses: []string{i.IP},
+					Conditions: discovery.EndpointConditions{
+						Ready:       &ready,
+						Serving:     &serving,
+						Terminating: &terminating,
+					},
+					NodeName: nodeNamePtr,
+				})
+				break
+			}
+		}
+	}
+
+	klog.Infof("Fallback: created %d endpoints from all VMIs in namespace %s", len(endpoints), c.infraNamespace)
+	return endpoints
+}
Contributor


medium

This new function getAllVMIEndpoints duplicates a significant amount of logic from getDesiredEndpoints for converting a VirtualMachineInstance into a discovery.Endpoint. To improve maintainability and avoid code duplication, I recommend extracting this logic into a separate helper function. This function could accept a *kubevirtv1.VirtualMachineInstance and return a *discovery.Endpoint, or nil if conversion is not possible.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
packages/apps/kubernetes/images/kubevirt-cloud-provider/patches/379.diff (2)

99-155: Consider extracting shared endpoint creation logic.

The endpoint creation logic (NodeName validation, phase checks, interface iteration, IP validation, Endpoint construction) is duplicated between getAllVMIEndpoints and the main loop in getDesiredEndpoints.

Extracting a helper would reduce duplication and ensure consistent behavior:

// createEndpointFromVMI creates a discovery.Endpoint from a VMI, or returns nil if the VMI is not ready
func (c *Controller) createEndpointFromVMI(vmi *kubevirtv1.VirtualMachineInstance) *discovery.Endpoint {
    if vmi.Status.NodeName == "" {
        klog.Warningf("Skipping VMI %s/%s: NodeName is empty", vmi.Namespace, vmi.Name)
        return nil
    }
    
    ready := vmi.Status.Phase == kubevirtv1.Running
    serving := vmi.Status.Phase == kubevirtv1.Running
    terminating := vmi.Status.Phase == kubevirtv1.Failed || vmi.Status.Phase == kubevirtv1.Succeeded
    
    for _, i := range vmi.Status.Interfaces {
        if i.Name == "default" {
            if i.IP == "" {
                klog.Warningf("VMI %s/%s interface %q has no IP, skipping", vmi.Namespace, vmi.Name, i.Name)
                return nil
            }
            return &discovery.Endpoint{
                Addresses:  []string{i.IP},
                Conditions: discovery.EndpointConditions{
                    Ready:       &ready,
                    Serving:     &serving,
                    Terminating: &terminating,
                },
                NodeName: &vmi.Status.NodeName,
            }
        }
    }
    return nil
}

169-219: Good test coverage for the skip behavior.

The test correctly validates that endpoints and VMIs without NodeName are skipped. Consider adding an additional test case for the fallback path (getAllVMIEndpoints) to ensure full coverage of the new functionality - specifically testing that when hasEndpointsWithoutNodeName is true and nodeSet is empty, the fallback returns endpoints from all valid VMIs.
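
A minimal sketch of such a fallback test, assuming it lives in the same package (so the Controller's unexported infraDynamic and infraNamespace fields, named as in the patch, can be set directly) and assuming client-go's fake dynamic client; illustrative only, not a drop-in for the upstream suite:

package kubevirteps

import (
	"testing"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime"
	dynamicfake "k8s.io/client-go/dynamic/fake"
	kubevirtv1 "kubevirt.io/api/core/v1"
)

func TestGetAllVMIEndpointsFallback(t *testing.T) {
	scheme := runtime.NewScheme()
	if err := kubevirtv1.AddToScheme(scheme); err != nil {
		t.Fatal(err)
	}

	// A running VMI with a NodeName and a "default" interface IP; the
	// fallback should turn it into exactly one endpoint.
	vmi := &kubevirtv1.VirtualMachineInstance{
		ObjectMeta: metav1.ObjectMeta{Name: "worker-1", Namespace: "infra"},
		Status: kubevirtv1.VirtualMachineInstanceStatus{
			Phase:    kubevirtv1.Running,
			NodeName: "node-a",
			Interfaces: []kubevirtv1.VirtualMachineInstanceNetworkInterface{
				{Name: "default", IP: "10.0.0.5"},
			},
		},
	}

	// Convert to unstructured so the fake dynamic client can serve it.
	content, err := runtime.DefaultUnstructuredConverter.ToUnstructured(vmi)
	if err != nil {
		t.Fatal(err)
	}
	u := &unstructured.Unstructured{Object: content}
	u.SetGroupVersionKind(kubevirtv1.VirtualMachineInstanceGroupVersionKind)

	c := &Controller{
		infraDynamic:   dynamicfake.NewSimpleDynamicClient(scheme, u),
		infraNamespace: "infra",
	}

	endpoints := c.getAllVMIEndpoints()
	if len(endpoints) != 1 {
		t.Fatalf("expected 1 endpoint, got %d", len(endpoints))
	}
	if endpoints[0].Addresses[0] != "10.0.0.5" || *endpoints[0].NodeName != "node-a" {
		t.Fatalf("unexpected endpoint: %#v", endpoints[0])
	}
}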

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1f0b5ff and 4700287.

📒 Files selected for processing (1)
  • packages/apps/kubernetes/images/kubevirt-cloud-provider/patches/379.diff (4 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-27T11:21:45.808Z
Learnt from: CR
Repo: cozystack/cozystack PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-27T11:21:45.808Z
Learning: Applies to **/*.go : Use Controller-runtime patterns and kubebuilder style for Go code

Applied to files:

  • packages/apps/kubernetes/images/kubevirt-cloud-provider/patches/379.diff
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build
🔇 Additional comments (4)
packages/apps/kubernetes/images/kubevirt-cloud-provider/patches/379.diff (4)

17-28: LGTM - Proper handling for endpoints without NodeName.

The flag tracking and skip logic with appropriate warning logging is well implemented. This enables the fallback behavior when needed.


31-36: Well-designed fallback mechanism.

The conditional correctly ensures fallback only when endpoints exist but lack NodeName information. The comment explaining the ExternalTrafficPolicy=Cluster similarity is helpful for future maintainers.


56-67: Good reliability improvement - removing Fatal crash.

Replacing klog.Fatal(err) with continue is an important fix. The controller will now gracefully skip problematic VMIs instead of crashing entirely. The added NodeName validation provides additional robustness.


76-92: Correct semantic change from continue to break.

The IP validation is a good defensive addition. Changing to break is semantically correct - once the "default" interface is processed, there's no need to iterate remaining interfaces.

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Dec 26, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
hack/e2e-apps/gateway-api.bats (2)

159-165: Consider increasing request sample size for load balancing verification.

With only 10 requests, there's a small probability that both backends might not be hit, especially if the load balancer has warm-up behavior or sticky sessions. Consider increasing to 20-30 requests to improve test reliability.

🔎 Proposed enhancement
  # Test load balancing
  responses=""
- for i in {1..10}; do
+ for i in {1..20}; do
    response=$(curl -s --max-time 5 $test_url)
    responses="$responses$response\n"
  done
  # Check that both backends are hit
  echo -e "$responses" | grep -q "Hello from backend1"
  echo -e "$responses" | grep -q "Hello from backend2"

167-173: Consider trap-based cleanup for robustness.

The current cleanup only runs if the test completes successfully. If the test fails earlier, resources may be leaked. Consider using a trap to ensure cleanup runs even on test failure.

🔎 Example trap-based cleanup pattern
@test "test gateway api tcp load balancer in tenant cluster" {
  name=test
  
  # Setup trap for cleanup
  cleanup() {
    if [ -n "$PF_PID" ]; then kill $PF_PID 2>/dev/null || true; fi
    kubectl delete deployment backend1 backend2 --namespace tenant-test --ignore-not-found
    kubectl delete configmap backend1-config backend2-config --namespace tenant-test --ignore-not-found
    kubectl delete gateway ${name}-gateway --namespace tenant-test --ignore-not-found
    kubectl delete tcproute ${name}-tcproute --namespace tenant-test --ignore-not-found
    kubectl delete service backend-svc --namespace tenant-test --ignore-not-found
  }
  trap cleanup EXIT
  
  # ... rest of test ...
}
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4700287 and 8f2bdbc.

📒 Files selected for processing (2)
  • hack/e2e-apps/gateway-api.bats
  • hack/e2e-install-cozystack.bats
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build
🔇 Additional comments (3)
hack/e2e-install-cozystack.bats (1)

127-129: LGTM! Gateway API enablement is correctly positioned.

The configmap patch appropriately enables Gateway API as a precondition before tenant configuration, ensuring the feature is available for the subsequent tests.

hack/e2e-apps/gateway-api.bats (2)

151-151: Verify netcat availability in test environment.

The nc -z command requires netcat to be installed. Ensure it's available in the test environment, or consider using an alternative like checking the port-forward process health or using a curl retry loop.

Alternative approach without nc dependency
- timeout 30 sh -ec "until nc -z localhost 8080; do sleep 1; done"
+ timeout 30 sh -ec "until curl -s --max-time 1 http://localhost:8080 > /dev/null 2>&1; do sleep 1; done"

149-149: No issues found. The service name cilium-gateway-${name}-gateway follows the official Cilium Gateway API naming convention <gatewayClassName>-gateway-<gatewayName>, so the port-forward fallback will correctly target the service created by Cilium.

gatewayClassName: cilium
listeners:
- name: tcp
hostname: "*.example.com"
Contributor


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

Kubernetes Gateway API hostname field TCP protocol listener support

💡 Result:

Short answer: No — the Listener.hostname is ignored for TCP listeners. Hostname matching is defined for TLS/HTTP/HTTPS (SNI/Host header) in the Gateway API; protocols that don't require hostname‑based matching (e.g., TCP, UDP) ignore the hostname field. [1][2]

Sources:

  • Gateway API Listener spec (hostname rules: applied for TLS/HTTP/HTTPS; ignored for protocols that don’t require hostname matching). [1]
  • Gateway API TCP guide and TCPRoute examples (shows TCP listeners identified by port/sectionName; no hostname matching). [2]

References:
[1] gateway-api.sigs.k8s.io — Listener / Hostname description.
[2] gateway-api.sigs.k8s.io — TCP routing guide (TCPRoute examples).


Remove hostname field from TCP listener.

The hostname: "*.example.com" field has no effect on TCP listeners. According to the Kubernetes Gateway API specification, hostname matching applies only to TLS/HTTP/HTTPS protocols; TCP listeners ignore this field since they operate at L4 and don't perform hostname-based routing. Remove it to avoid confusion.

🤖 Prompt for AI Agents
In hack/e2e-apps/gateway-api.bats around line 98, the TCP listener includes a
hostname: "*.example.com" field which has no effect for L4/TCP listeners; remove
the hostname line from the TCP listener definition so the listener contains only
fields valid for TCP (port, protocol, routes/etc.) to avoid misleading
configuration and match the Gateway API spec.

# Test load balancing
responses=""
for i in {1..10}; do
response=$(curl -s $test_url)
Contributor


🛠️ Refactor suggestion | 🟠 Major

Add timeout to curl command.

The curl request lacks a timeout, which could cause the test to hang indefinitely if the service is unresponsive.

🔎 Proposed fix
-    response=$(curl -s $test_url)
+    response=$(curl -s --max-time 5 $test_url)
🤖 Prompt for AI Agents
In hack/e2e-apps/gateway-api.bats around line 160, the curl invocation uses no
timeout and can hang; update the curl command to include timeouts (for example
add a short connect timeout and a total/max-time, e.g., --connect-timeout 5 and
--max-time 10), keep silent output (-s) but consider using -S/--fail so failures
return non-zero, and ensure any new flags still capture response in the response
variable so the test fails fast instead of hanging.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
hack/e2e-install-cozystack.bats (1)

127-129: LGTM! Gateway API enablement is correctly placed.

The ConfigMap patch to enable Gateway API is syntactically correct and appropriately positioned before the tenant patch. The change aligns with the PR objective to fix cilium-gateway endpoints and supports the new gateway API tests.

Optional: Add verification step

While the current pattern is consistent with other ConfigMap patches in this file (lines 148, 175), you could optionally add a verification step to ensure the Gateway API feature is fully reconciled before proceeding:

 # Enable Gateway API for testing
 kubectl patch configmap/cozystack -n cozy-system --type merge -p '{"data":{"addons.gatewayAPI.enabled":"true"}}'
+
+# Wait for Gateway API resources to be available
+timeout 60 sh -ec 'until kubectl get crd gateways.gateway.networking.k8s.io >/dev/null 2>&1; do sleep 1; done'

This would guard against potential race conditions, though the current implementation may be sufficient if the Gateway API controller reconciles quickly.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8f2bdbc and 4639044.

📒 Files selected for processing (1)
  • hack/e2e-install-cozystack.bats
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
hack/e2e-apps/loadbalancer-service.bats (3)

31-36: Consider increasing readiness probe delays.

The readiness probe has aggressive timing (initialDelaySeconds: 2, periodSeconds: 2). In slower test environments or under load, nginx might not be fully ready within 2 seconds, potentially causing test flakiness.

🔎 Suggested adjustment
         readinessProbe:
           httpGet:
             path: /
             port: 80
-          initialDelaySeconds: 2
-          periodSeconds: 2
+          initialDelaySeconds: 5
+          periodSeconds: 5

111-114: Consider adding functional verification of the LoadBalancer service.

The test verifies that endpoints are registered but doesn't confirm the LoadBalancer service is actually reachable and functional. For a complete LoadBalancer test, consider making an HTTP request through the LB IP/hostname to verify end-to-end connectivity.

🔎 Suggested addition after endpoint verification
  # Get LoadBalancer IP or hostname
  lb_address=$(kubectl get svc $name -n $ns \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
  if [ -z "$lb_address" ]; then
    lb_address=$(kubectl get svc $name -n $ns \
      -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
  fi
  
  # Verify service is reachable through LoadBalancer
  timeout 30 sh -ec "
    until curl -sf http://$lb_address/ | grep -q 'Hello from backend'; do
      sleep 2
    done
  "

116-118: Consider more robust cleanup.

The cleanup doesn't use --ignore-not-found, which could cause errors if resources weren't created due to earlier test failures. While BATS typically doesn't fail on cleanup errors, adding this flag improves robustness.

🔎 Suggested improvement
   # Cleanup
-  kubectl delete deployment backend1 backend2 -n $ns
-  kubectl delete service $name -n $ns
+  kubectl delete deployment backend1 backend2 -n $ns --ignore-not-found=true
+  kubectl delete service $name -n $ns --ignore-not-found=true
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4639044 and 4752cb6.

📒 Files selected for processing (1)
  • hack/e2e-apps/loadbalancer-service.bats
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build
🔇 Additional comments (4)
hack/e2e-apps/loadbalancer-service.bats (4)

83-97: LGTM!

The LoadBalancer service configuration is correct. The selector properly targets both backend deployments.


99-101: LGTM!

Properly waits for both deployments to be available with reasonable timeouts.


103-109: LGTM!

The LoadBalancer provisioning wait logic is correct and uses appropriate timeout and polling intervals.


111-114: Endpoint verification assumes all endpoints are ready.

The jsonpath query {.subsets[0].addresses[*].ip} only counts endpoints in the addresses array, which contains ready endpoints. If endpoints exist but are not ready (in notReadyAddresses), this check will incorrectly pass or fail.

Given the PR's focus on fixing endpoint resolution and the controller's handling of Ready/Serving/Terminating conditions (as mentioned in the AI summary), this verification should explicitly confirm endpoint readiness.

🔎 Proposed fix to verify ready endpoints
-  # Ensure exactly 2 ready endpoints
-  endpoints_count=$(kubectl get endpoints $name -n $ns \
-    -o jsonpath='{.subsets[0].addresses[*].ip}' | wc -w)
-  [ "$endpoints_count" -eq 2 ]
+  # Ensure exactly 2 ready endpoints
+  ready_endpoints=$(kubectl get endpoints $name -n $ns \
+    -o jsonpath='{range .subsets[*].addresses[*]}{.ip}{"\n"}{end}')
+  endpoints_count=$(echo "$ready_endpoints" | grep -c '^')
+  [ "$endpoints_count" -eq 2 ]

Likely an incorrect or invalid review comment.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
hack/e2e-apps/run-kubernetes.sh (1)

188-197: Consider fixing indentation for consistency.

These lines lack the 2-space indentation used elsewhere in the function, making the code harder to read.

🔎 Proposed fix
-LB_ADDR=$(
-  kubectl get svc --kubeconfig tenantkubeconfig-${test_name} "${test_name}-backend" \
-    -n tenant-test \
-    -o jsonpath='{.status.loadBalancer.ingress[0].ip}{.status.loadBalancer.ingress[0].hostname}'
-)
-
-if [ -z "$LB_ADDR" ]; then
-  echo "LoadBalancer address is empty" >&2
-  exit 1
-fi
+  LB_ADDR=$(
+    kubectl get svc --kubeconfig tenantkubeconfig-${test_name} "${test_name}-backend" \
+      -n tenant-test \
+      -o jsonpath='{.status.loadBalancer.ingress[0].ip}{.status.loadBalancer.ingress[0].hostname}'
+  )
+
+  if [ -z "$LB_ADDR" ]; then
+    echo "LoadBalancer address is empty" >&2
+    exit 1
+  fi
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4752cb6 and 68bb84f.

📒 Files selected for processing (1)
  • hack/e2e-apps/run-kubernetes.sh
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build
🔇 Additional comments (4)
hack/e2e-apps/run-kubernetes.sh (4)

75-75: LGTM!

Increasing the timeout from 2m to 5m for the Kubernetes resources readiness check is reasonable for E2E environments.


103-103: LGTM!

Creating the namespace after nodes are ready is the correct sequence.


177-186: LGTM!

The wait logic for deployment availability and LoadBalancer provisioning is correct.


199-216: LGTM!

The health check using a temporary curl pod with retries is a reasonable approach for validating LoadBalancer connectivity. The cleanup properly removes the test resources.

Comment on lines 89 to 90
# Set up port forwarding to the Kubernetes API server for a 200 second timeout
bash -c 'timeout 300s kubectl port-forward service/kubernetes-'"${test_name}"' -n tenant-test '"${port}"':6443 > /dev/null 2>&1 &'
bash -c 'timeout 500s kubectl port-forward service/kubernetes-'"${test_name}"' -n tenant-test '"${port}"':6443 > /dev/null 2>&1 &'
Contributor


⚠️ Potential issue | 🟡 Minor

Stale comment: timeout value doesn't match.

The comment says "200 second timeout" but the actual timeout is 500s.

🔎 Proposed fix
-  # Set up port forwarding to the Kubernetes API server for a 200 second timeout
+  # Set up port forwarding to the Kubernetes API server for a 500 second timeout
   bash -c 'timeout 500s kubectl port-forward service/kubernetes-'"${test_name}"' -n tenant-test '"${port}"':6443 > /dev/null 2>&1 &'
🤖 Prompt for AI Agents
In hack/e2e-apps/run-kubernetes.sh around lines 89 to 90, the inline comment
says "200 second timeout" but the command uses timeout 500s; update one of them
so they match — either change the comment to "500 second timeout" or change the
timeout invocation to "200s" depending on the intended behavior; keep wording
precise and consistent with the chosen timeout value.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
hack/e2e-apps/run-kubernetes.sh (2)

89-90: Stale comment: timeout value doesn't match.

The comment says "200 second timeout" but the actual timeout is 500s. This was previously flagged in past reviews but remains unfixed.

🔎 Proposed fix
-  # Set up port forwarding to the Kubernetes API server for a 200 second timeout
+  # Set up port forwarding to the Kubernetes API server for a 500 second timeout
   bash -c 'timeout 500s kubectl port-forward service/kubernetes-'"${test_name}"' -n tenant-test '"${port}"':6443 > /dev/null 2>&1 &'

128-174: Critical: Missing namespace and service selector mismatch.

Two issues prevent this LoadBalancer test from working:

  1. Missing namespace creation: The Deployment and Service reference namespace: tenant-test (lines 134, 166) in the tenant cluster, but this namespace is never created. The kubectl apply commands (lines 129, 161) will fail because the namespace doesn't exist in the tenant cluster.

  2. Service selector mismatch: The Service selector app: "${test_name}-backend" (line 170) won't match the pod labels app: backend and backend: "${test_name}-backend" (lines 139-140, 144-145). The LoadBalancer will have no endpoints and health checks will fail. This issue was flagged in previous reviews but remains unfixed.

🔎 Proposed fixes

Fix 1: Create the namespace in the tenant cluster

Add this before line 128:

+  # Create namespace in tenant cluster for backend test
+  kubectl create namespace tenant-test --kubeconfig tenantkubeconfig-${test_name} --dry-run=client -o yaml | \
+    kubectl apply --kubeconfig tenantkubeconfig-${test_name} -f-
+
   # Backend 1

Fix 2: Align service selector with pod labels

   selector:
-    app: "${test_name}-backend"
+    app: backend
+    backend: "${test_name}-backend"
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 68bb84f and 44a331a.

📒 Files selected for processing (1)
  • hack/e2e-apps/run-kubernetes.sh
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build
🔇 Additional comments (1)
hack/e2e-apps/run-kubernetes.sh (1)

75-75: LGTM: Reasonable timeout increase.

Increasing the readiness timeout from 2m to 5m provides more buffer for resource provisioning in E2E tests.

Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@kvaps kvaps force-pushed the fix-cilium-gateway branch from 691d126 to f596652 on January 4, 2026 09:35
@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Jan 4, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
packages/apps/kubernetes/images/kubevirt-cloud-provider/patches/379.diff (1)

99-155: Address code duplication as previously noted.

This function duplicates significant logic from getDesiredEndpoints (lines 40-94). The conversion of a VMI to an Endpoint object is repeated in both places.

Consider extracting the VMI-to-Endpoint conversion logic into a separate helper function as suggested in the previous review. This would improve maintainability and reduce the risk of divergence between the two code paths.

🔎 Example helper function structure
// vmiToEndpoint converts a VirtualMachineInstance to a discovery.Endpoint.
// Returns nil if the VMI cannot be converted (missing NodeName, no default interface, etc.)
func vmiToEndpoint(vmi *kubevirtv1.VirtualMachineInstance) *discovery.Endpoint {
	if vmi.Status.NodeName == "" {
		klog.Warningf("Skipping VMI %s/%s: NodeName is empty", vmi.Namespace, vmi.Name)
		return nil
	}
	
	ready := vmi.Status.Phase == kubevirtv1.Running
	serving := vmi.Status.Phase == kubevirtv1.Running
	terminating := vmi.Status.Phase == kubevirtv1.Failed || vmi.Status.Phase == kubevirtv1.Succeeded
	
	for _, i := range vmi.Status.Interfaces {
		if i.Name == "default" {
			if i.IP == "" {
				klog.Warningf("VMI %s/%s interface %q has no IP, skipping", vmi.Namespace, vmi.Name, i.Name)
				return nil
			}
			nodeNamePtr := &vmi.Status.NodeName
			return &discovery.Endpoint{
				Addresses: []string{i.IP},
				Conditions: discovery.EndpointConditions{
					Ready:       &ready,
					Serving:     &serving,
					Terminating: &terminating,
				},
				NodeName: nodeNamePtr,
			}
		}
	}
	return nil
}

Then use it in both functions:

// In getDesiredEndpoints
if endpoint := vmiToEndpoint(vmi); endpoint != nil {
	desiredEndpoints = append(desiredEndpoints, endpoint)
}

// In getAllVMIEndpoints
if endpoint := vmiToEndpoint(vmi); endpoint != nil {
	endpoints = append(endpoints, endpoint)
}
🧹 Nitpick comments (1)
packages/apps/kubernetes/images/kubevirt-cloud-provider/patches/379.diff (1)

169-219: Consider adding test coverage for the fallback behavior.

This test validates that endpoints without NodeName are skipped, but it doesn't test the new fallback logic (lines 31-36) where getAllVMIEndpoints() is called. Consider adding additional test cases to cover:

  1. The fallback to getAllVMIEndpoints() when all tenant endpoints lack NodeName
  2. Direct testing of the getAllVMIEndpoints() function with various VMI states
  3. Scenarios with valid VMIs that have both NodeName and IP addresses
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 691d126 and f596652.

📒 Files selected for processing (1)
  • packages/apps/kubernetes/images/kubevirt-cloud-provider/patches/379.diff
🧰 Additional context used
📓 Path-based instructions (1)
packages/apps/**

📄 CodeRabbit inference engine (AGENTS.md)

Use Helm Chart umbrella pattern with vendored upstream charts in charts/ directory

Files:

  • packages/apps/kubernetes/images/kubevirt-cloud-provider/patches/379.diff
🧠 Learnings (3)
📚 Learning: 2025-12-25T09:45:26.524Z
Learnt from: CR
Repo: cozystack/cozystack PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-25T09:45:26.524Z
Learning: Applies to pkg/**/*.go : Follow controller-runtime patterns and kubebuilder style for Go code in pkg directory

Applied to files:

  • packages/apps/kubernetes/images/kubevirt-cloud-provider/patches/379.diff
📚 Learning: 2025-12-25T09:45:26.524Z
Learnt from: CR
Repo: cozystack/cozystack PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-25T09:45:26.524Z
Learning: Applies to internal/**/*.go : Follow controller-runtime patterns and kubebuilder style for Go code in internal directory

Applied to files:

  • packages/apps/kubernetes/images/kubevirt-cloud-provider/patches/379.diff
📚 Learning: 2025-12-25T09:45:26.524Z
Learnt from: CR
Repo: cozystack/cozystack PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-25T09:45:26.524Z
Learning: Applies to cmd/**/*.go : Follow controller-runtime patterns and kubebuilder style for Go code in cmd directory

Applied to files:

  • packages/apps/kubernetes/images/kubevirt-cloud-provider/patches/379.diff
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build
🔇 Additional comments (5)
packages/apps/kubernetes/images/kubevirt-cloud-provider/patches/379.diff (5)

56-62: Good improvement to error handling.

Replacing klog.Fatal(err) with continue is a critical fix. Fatal would crash the entire controller process, whereas continue allows the controller to skip problematic VMIs and continue processing others.


64-68: Good validation added for NodeName.

The validation check prevents creating endpoints with nil NodeName pointers, and the early continue ensures the pointer created on line 68 always points to a non-empty string.


76-79: Good validation added for IP address.

Skipping interfaces without IP addresses prevents creating invalid endpoints.


91-91: Correct use of break instead of continue.

Using break here is correct since the code should stop iterating through interfaces once the "default" interface is found and processed.


31-36: The fallback mechanism is intentional and defensive. It only triggers when no endpoints with NodeName are found in the tenant slices AND endpoints without NodeName exist—meaning there are no valid endpoints to route traffic to anyway. This is not a change to routing behavior but rather a safeguard against incomplete endpoint data.

The patch documents this explicitly: "distribute traffic to all VMIs (similar to ExternalTrafficPolicy=Cluster behavior)" and the test case validates that endpoints lacking NodeName are properly skipped. When the fallback does activate, it re-applies the same validation (skipping VMIs without NodeName and IPs), ensuring consistency.

This behavior does not conflict with ExternalTrafficPolicy; it only applies when the normal endpoint selection path produces no results due to malformed data.

@kvaps kvaps added the backport Should change be backported on previous release label Jan 4, 2026
@kvaps kvaps merged commit 07b406e into main Jan 5, 2026
26 checks passed
@kvaps kvaps deleted the fix-cilium-gateway branch January 5, 2026 17:11

github-actions bot commented Jan 5, 2026

Successfully created backport PR for release-0.39:

kvaps added a commit that referenced this pull request Jan 5, 2026
kvaps added a commit that referenced this pull request Jan 8, 2026
kvaps added a commit that referenced this pull request Jan 8, 2026
kvaps added a commit that referenced this pull request Jan 9, 2026