Add Istio Ambient critical upstream failure detection rule #87

Merged: tonymeehan merged 6 commits into prequel-dev:main from varshith257:istio-1 on Jul 2, 2025

@varshith257 varshith257 commented Jun 21, 2025

This PR adds a new CRE rule to detect Ambient CNI sandbox-creation failures in Istio Ambient mode. The rule catches one high-severity failure mode:

No ztunnel connection: the CNI plugin cannot contact the node-level ztunnel agent

This failure leaves pods stuck permanently in ContainerCreating (or Pending), preventing any workloads from starting in the Istio Ambient mesh.
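
To illustrate the failure signature the rule targets, here is a minimal Python sketch that matches it in an event message. This is not the CRE rule added by this PR; the pattern and the function name `is_no_ztunnel_failure` are assumptions, built only from the wording of the FailedCreatePodSandBox message quoted in the reproducer.

```python
import re

# Hypothetical pattern, derived only from the event message quoted in this PR.
# It requires the istio-cni plugin marker, the agent-contact failure, and the
# "no ztunnel connection" cause to appear in that order.
NO_ZTUNNEL = re.compile(
    r'plugin type="istio-cni".*'
    r"failed to contact node Istio CNI agent.*"
    r"no ztunnel connection"
)


def is_no_ztunnel_failure(message: str) -> bool:
    """Return True if the message carries the 'no ztunnel connection' signature."""
    return NO_ZTUNNEL.search(message) is not None


# Event message copied from the reproducer output in this PR.
sample = (
    "Failed to create pod sandbox: rpc error: code = Unknown desc = failed to "
    'setup network for sandbox "80a85c307b25f18b974eb09330ea2ff6314efe11f61e68dd8fea371b73733548": '
    'plugin type="istio-cni" name="istio-cni" failed (add): istio-cni cmdAdd '
    "failed to contact node Istio CNI agent: unable to push CNI event "
    "(status code 500): no ztunnel connection"
)

unrelated = "Readiness probe failed: connect: connection refused"

print(is_no_ztunnel_failure(sample))
print(is_no_ztunnel_failure(unrelated))
```

Requiring all three fragments keeps the match narrow, so other sandbox-creation errors from the same plugin would not be flagged by this sketch.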

Part of #81
Closes #81

Test Environment

Live CRE link: CRE Playground Link

Reproducer: Ambient CNI Sandbox Creation Failure

Follow these steps to reproduce the “Failed to create pod sandbox” error due to no ztunnel connection in Istio Ambient mode.

1. Create the test pod

~/git/istio$ kubectl run sandbox-fail \
  --image=busybox \
  --restart=Never \
  -- sleep 3600

2. Verify Pod status

~/git/istio$ kubectl get pod sandbox-fail

You should see output similar to:

NAME           READY   STATUS              RESTARTS   AGE
sandbox-fail   0/1     ContainerCreating   0          10s
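
The status check above can also be scripted. The following is a minimal sketch, assuming the default `kubectl get pod` table layout shown here; the helper names `parse_pod_table` and `stuck_pods` are hypothetical, and the check looks only at the STATUS column, not at how long the pod has been in that state.

```python
# Statuses the PR description associates with this failure.
STUCK_STATUSES = {"ContainerCreating", "Pending"}


def parse_pod_table(text: str) -> list[dict]:
    """Parse whitespace-separated `kubectl get pod` output into row dicts.

    Only NAME, READY and STATUS are relied on; later columns can contain
    spaces in some kubectl output, so they are not used here.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def stuck_pods(text: str) -> list[str]:
    """Return the names of pods whose STATUS is in STUCK_STATUSES."""
    return [row["NAME"] for row in parse_pod_table(text)
            if row["STATUS"] in STUCK_STATUSES]


# Output copied from the reproducer step above.
output = """\
NAME           READY   STATUS              RESTARTS   AGE
sandbox-fail   0/1     ContainerCreating   0          10s
"""

print(stuck_pods(output))
```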

3. Inspect failure events with real timestamps

~/git/istio$ kubectl get events \
  --field-selector involvedObject.name=sandbox-fail,involvedObject.kind=Pod \
  --sort-by='.metadata.creationTimestamp' \
  -o custom-columns='TIME:.lastTimestamp,REASON:.reason,MESSAGE:.message'

Expected output:


TIME                   REASON                   MESSAGE
2025-06-21T16:42:03Z   Created                  Created container istio-proxy
2025-06-21T16:42:03Z   Started                  Started container istio-proxy
2025-06-21T16:42:46Z   Killing                  Stopping container istio-proxy
2025-06-21T16:42:46Z   Killing                  Stopping container sandbox-fail
2025-06-21T16:43:10Z   Unhealthy                Readiness probe failed: Get "http://10.244.0.15:15021/healthz/ready": dial tcp 10.244.0.15:15021: connect: connection refused
2025-06-21T16:43:26Z   Scheduled                Successfully assigned default/sandbox-fail to istio-demo-control-plane
2025-06-21T16:43:26Z   FailedCreatePodSandBox   Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "80a85c307b25f18b974eb09330ea2ff6314efe11f61e68dd8fea371b73733548": plugin type="istio-cni" name="istio-cni" failed (add): istio-cni cmdAdd failed to contact node Istio CNI agent: unable to push CNI event (status code 500): no ztunnel connection
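
To pick the relevant event out of this listing programmatically, the TIME/REASON/MESSAGE columns can be split on whitespace. The sketch below assumes the three-column `custom-columns` layout used in the command above; `parse_events` and `ztunnel_failures` are hypothetical helper names, and the sample is a subset of the rows shown.

```python
def parse_events(text: str) -> list[dict]:
    """Split each row into TIME, REASON and MESSAGE.

    TIME and REASON contain no spaces, so splitting at most twice on
    whitespace leaves the full message intact in the third field.
    """
    events = []
    for line in text.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3 or parts[0] == "TIME":
            continue  # skip blank lines and the header row
        events.append({"time": parts[0], "reason": parts[1], "message": parts[2]})
    return events


def ztunnel_failures(text: str) -> list[dict]:
    """Keep sandbox-creation failures caused by a missing ztunnel connection."""
    return [e for e in parse_events(text)
            if e["reason"] == "FailedCreatePodSandBox"
            and "no ztunnel connection" in e["message"]]


# Subset of the rows from the expected output above.
sample = """\
TIME                   REASON                   MESSAGE
2025-06-21T16:42:46Z   Killing                  Stopping container sandbox-fail
2025-06-21T16:43:26Z   Scheduled                Successfully assigned default/sandbox-fail to istio-demo-control-plane
2025-06-21T16:43:26Z   FailedCreatePodSandBox   Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "80a85c307b25f18b974eb09330ea2ff6314efe11f61e68dd8fea371b73733548": plugin type="istio-cni" name="istio-cni" failed (add): istio-cni cmdAdd failed to contact node Istio CNI agent: unable to push CNI event (status code 500): no ztunnel connection
"""

for event in ztunnel_failures(sample):
    print(event["time"], event["reason"])
```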

DEMO

clideo_editor_a9fa5d1e22a64929847c791a4fb8be1f.mp4

@varshith257 (Contributor, Author)

A few rules are left; I will open all the PRs once they are done.

cc: @Lyndon-prequel

@varshith257 varshith257 changed the title from "Add Istio Ambient critical upstream failure detection rules - 1" to "Add Istio Ambient critical upstream failure detection rule" Jun 26, 2025
@varshith257 varshith257 requested a review from tonymeehan June 27, 2025 19:19
@tonymeehan tonymeehan merged commit 70f658a into prequel-dev:main Jul 2, 2025
2 checks passed
Linked issue (may be closed by merging this pull request): Istio Ambient Troubleshooting Rules