nightly-e2e: network-policy-e2e fails — NVIDIA API timeout (infra flake)

## Nightly E2E Failure — `network-policy-e2e`: NVIDIA API timeout during onboard

**Workflow run:** [25349684174](https://github.com/NVIDIA/NemoClaw/actions/runs/25349684174)
**Branch:** `codex/openshell-docker-gpu-onboard`
**Commit:** `ad613cf`
**Failed job:** [`network-policy-e2e`](https://github.com/NVIDIA/NemoClaw/actions/runs/25349684174/job/74326422530)

### Root Cause

The `network-policy-e2e` job failed during the NemoClaw onboard step when the NVIDIA API inference validation endpoint (`https://integrate.api.nvidia.com/v1/chat/completions`) timed out. `curl` exited with code 28 (operation timed out), causing the onboard process to abort with exit code 1 before the test script itself ran.

This is an infrastructure flake — the NVIDIA API endpoint was transiently unavailable during the CI window. No code change in the repository caused this failure.

### Evidence

**Log excerpt** (from job 74326422530):
```
[preflight] ✓ Docker is running.
[preflight] ✓ openshell CLI v0.6.11 found at /usr/local/bin/openshell.
[preflight] ✓ Node.js v22.16.0 meets minimum v22.0.0.
[preflight] ✓ Sandbox image check passed.
...
[onboard] [5/8] Validating inference endpoint...
curl: (28) Operation timed out after 30001 milliseconds with 0 bytes received
Error: Inference validation failed (exit code 28).
```

The `curl` timeout (exit code 28) indicates a network-level timeout reaching the NVIDIA API endpoint, not an application-level error. The onboard flow's preflight and sandbox creation succeeded — only the inference validation call failed.

### Classification

| Field | Value |
|---|---|
| `failure_class` | `infra_flake` |
| `confidence` | `high` |
| `suggested_fix` | No code change — retry the nightly run. Consider adding retry logic or a longer timeout for the inference validation curl call in `src/lib/onboard.ts`. |

### Suggested Follow-Up

1. **Re-run the nightly** — this failure is transient and should not recur with stable API availability.
2. **Optional hardening** — the inference validation curl in `src/lib/onboard.ts` uses a 30s timeout. For nightly CI where transient API delays are expected, consider:
   - Adding `--retry 2 --retry-delay 5` to the curl call
   - Increasing the timeout to 60s in CI environments (e.g., gated on `CI=true`)

### Related

- **PR:** none (infra flake — no code fix needed)
- **Other failures in this run:** `double-onboard-e2e` and `onboard-repair-e2e` failed due to a separate Docker-driver sandbox lifecycle issue (tracked separately)

---
*Auto-diagnosed by `nemoclaw-diagnosis` skill • run [25349684174](https://github.com/NVIDIA/NemoClaw/actions/runs/25349684174)*


Field	Value
`failure_class`	`infra_flake`
`confidence`	`high`
`suggested_fix`	No code change — retry the nightly run. Consider adding retry logic or a longer timeout for the inference validation curl call in `src/lib/onboard.ts`.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

nightly-e2e: network-policy-e2e fails — NVIDIA API timeout (infra flake) #3033

Nightly E2E Failure — `network-policy-e2e`: NVIDIA API timeout during onboard

Root Cause

Evidence

Classification

Suggested Follow-Up

Related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

nightly-e2e: network-policy-e2e fails — NVIDIA API timeout (infra flake) #3033

Description

Nightly E2E Failure — network-policy-e2e: NVIDIA API timeout during onboard

Root Cause

Evidence

Classification

Suggested Follow-Up

Related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Nightly E2E Failure — `network-policy-e2e`: NVIDIA API timeout during onboard