nightly-e2e: network-policy-e2e fails — NVIDIA API timeout (infra flake) #3033
Closed
Labels
- VRDC (Issues and PRs submitted by NVIDIA VRDC test team.)
- area: ci (CI workflows, checks, release automation, or GitHub Actions)
- area: e2e (End-to-end tests, nightly failures, or validation infrastructure)
- auto-diagnosed (Automatically diagnosed by CI agent)
- ci-failure (Auto-created by nemoclaw-diagnosis skill)
- needs: triage (Awaiting maintainer classification)
Nightly E2E Failure - network-policy-e2e: NVIDIA API timeout during onboard
Workflow run: 25349684174
Branch: codex/openshell-docker-gpu-onboard
Commit: ad613cf
Failed job: network-policy-e2e
Root Cause
The network-policy-e2e job failed during the NemoClaw onboard step when the NVIDIA API inference validation endpoint (https://integrate.api.nvidia.com/v1/chat/completions) timed out. curl exited with code 28 (operation timed out), causing the onboard process to abort with exit code 1 before the test script itself ran.

This is an infrastructure flake: the NVIDIA API endpoint was transiently unavailable during the CI window. No code change in the repository caused this failure.
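The exit-code reasoning above can be sketched as a small classifier. This is a minimal illustration only: `classifyCurlExit` and `TRANSIENT_CURL_EXIT_CODES` are hypothetical names that do not come from the repository, and which codes count as transient is a judgment call based on curl's documented exit codes. Code 28 is the one observed in this run.

```typescript
// Hypothetical helper for illustration; not taken from the NemoClaw repository.
type CurlOutcome = "ok" | "transient_network" | "http_error" | "other";

// curl exit codes that usually point at the network path rather than the request itself.
const TRANSIENT_CURL_EXIT_CODES = new Set<number>([
  6,  // could not resolve host
  7,  // failed to connect to host
  28, // operation timed out (the code seen in this run)
  52, // server returned nothing
  56, // failure receiving network data
]);

function classifyCurlExit(code: number): CurlOutcome {
  if (code === 0) return "ok";
  if (TRANSIENT_CURL_EXIT_CODES.has(code)) return "transient_network";
  if (code === 22) return "http_error"; // only produced when curl runs with --fail
  return "other";
}

console.log(classifyCurlExit(28)); // prints "transient_network"
```

A classifier like this would let a diagnosis step separate infrastructure flakes from request-level failures without parsing log text.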
Evidence
Log excerpt (from job 74326422530):
The curl timeout (exit code 28) indicates a network-level timeout reaching the NVIDIA API endpoint, not an application-level error. The onboard flow's preflight and sandbox creation succeeded; only the inference validation call failed.
Classification
failure_class: infra_flake
confidence: high
suggested_fix: src/lib/onboard.ts.
Suggested Follow-Up
src/lib/onboard.ts uses a 30s timeout. For nightly CI where transient API delays are expected, consider:
- adding --retry 2 --retry-delay 5 to the curl call
- [...] CI=true)
Related
double-onboard-e2e and onboard-repair-e2e failed due to a separate Docker-driver sandbox lifecycle issue (tracked separately)

Auto-diagnosed by nemoclaw-diagnosis skill • run 25349684174
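
The retry idea from Suggested Follow-Up could also be expressed in TypeScript around whatever function performs the inference validation call. The sketch below is an assumption-laden illustration, not the code in src/lib/onboard.ts: `withRetry`, `RetryOptions`, and `sleep` are hypothetical names, and the `retries` and `delayMs` values simply mirror the `--retry 2 --retry-delay 5` flags mentioned above.

```typescript
// Hypothetical sketch; names and defaults are illustrative, not from src/lib/onboard.ts.
interface RetryOptions {
  retries: number; // extra attempts after the first one (curl: --retry 2)
  delayMs: number; // pause between attempts (curl: --retry-delay 5 would be 5000)
  isRetryable: (err: unknown) => boolean; // decides whether an error is transient
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(
  op: () => Promise<T>,
  opts: RetryOptions,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= opts.retries; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      const outOfAttempts = attempt === opts.retries;
      // Give up immediately on non-transient errors or when attempts are exhausted.
      if (outOfAttempts || !opts.isRetryable(err)) throw err;
      await sleep(opts.delayMs);
    }
  }
  throw lastError; // not reached in practice; keeps the return type sound
}
```

With `retries: 2` and `delayMs: 5000`, a validation call that times out once or twice would still pass, while a persistent outage would fail after three attempts and roughly ten extra seconds of waiting. Whether such a wrapper should be active only in CI is left open here, as it is in the issue.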