Skip to content

nightly-e2e: network-policy-e2e fails — NVIDIA API timeout (infra flake) #3033

@hunglp6d

Description

@hunglp6d

Nightly E2E Failure — network-policy-e2e: NVIDIA API timeout during onboard

Workflow run: 25349684174
Branch: codex/openshell-docker-gpu-onboard
Commit: ad613cf
Failed job: network-policy-e2e

Root Cause

The network-policy-e2e job failed during the NemoClaw onboard step when the NVIDIA API inference validation endpoint (https://integrate.api.nvidia.com/v1/chat/completions) timed out. curl exited with code 28 (operation timed out), causing the onboard process to abort with exit code 1 before the test script itself ran.

This is an infrastructure flake — the NVIDIA API endpoint was transiently unavailable during the CI window. No code change in the repository caused this failure.

Evidence

Log excerpt (from job 74326422530):

[preflight] ✓ Docker is running.
[preflight] ✓ openshell CLI v0.6.11 found at /usr/local/bin/openshell.
[preflight] ✓ Node.js v22.16.0 meets minimum v22.0.0.
[preflight] ✓ Sandbox image check passed.
...
[onboard] [5/8] Validating inference endpoint...
curl: (28) Operation timed out after 30001 milliseconds with 0 bytes received
Error: Inference validation failed (exit code 28).

The curl timeout (exit code 28) indicates a network-level timeout reaching the NVIDIA API endpoint, not an application-level error. The onboard flow's preflight and sandbox creation succeeded — only the inference validation call failed.

Classification

Field Value
failure_class infra_flake
confidence high
suggested_fix No code change — retry the nightly run. Consider adding retry logic or a longer timeout for the inference validation curl call in src/lib/onboard.ts.

Suggested Follow-Up

  1. Re-run the nightly — this failure is transient and should not recur with stable API availability.
  2. Optional hardening — the inference validation curl in src/lib/onboard.ts uses a 30s timeout. For nightly CI where transient API delays are expected, consider:
    • Adding --retry 2 --retry-delay 5 to the curl call
    • Increasing the timeout to 60s in CI environments (e.g., gated on CI=true)

Related

  • PR: none (infra flake — no code fix needed)
  • Other failures in this run: double-onboard-e2e and onboard-repair-e2e failed due to a separate Docker-driver sandbox lifecycle issue (tracked separately)

Auto-diagnosed by nemoclaw-diagnosis skill • run 25349684174

Metadata

Metadata

Assignees

Labels

VRDCIssues and PRs submitted by NVIDIA VRDC test team.area: ciCI workflows, checks, release automation, or GitHub Actionsarea: e2eEnd-to-end tests, nightly failures, or validation infrastructureauto-diagnosedAutomatically diagnosed by CI agentci-failureAuto-created by nemoclaw-diagnosis skillneeds: triageAwaiting maintainer classification

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions