nightly-e2e: issue-2478-crash-loop-recovery-e2e fails with HTTP 502 from NVIDIA Endpoints #2980

# Bug Report

## Description

### Problem Statement

The `issue-2478-crash-loop-recovery-e2e` nightly job failed during onboarding phase [3/8] "Configuring inference (NIM)" because the NVIDIA Endpoints API returned an HTTP 502 Bad Gateway response. The endpoint validation step calls the Chat Completions API to verify that the configured inference provider is reachable, and the upstream NVIDIA Endpoints service was temporarily unavailable at the time of the nightly run (2026-05-02 ~05:06 UTC).

This is a transient upstream failure — no code change in NemoClaw caused the regression, and the same test passes on subsequent runs when the API is healthy.

### Proposed Design

Requires human investigation — no PR opened. This is an infrastructure flake caused by the upstream NVIDIA Endpoints service returning 502. Potential mitigations:
1. Add a retry loop (with backoff) around the endpoint validation call in the onboarding flow.
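2. Auto-retry the job on transient failures in the nightly workflow (up to 2 retries). GitHub Actions has no native job-level `retry:` key, so this needs a retry wrapper around the failing step or a re-run of failed jobs, for example `gh run rerun <run-id> --failed`.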
3. Accept the flake and track its frequency.
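
A minimal sketch of mitigation 1, assuming the validation call surfaces the HTTP status on the thrown error. All names here (`retryWithBackoff`, `httpError`, `isTransient`) are hypothetical and do not come from the NemoClaw codebase, and treating 502, 503, and 504 as the retryable set is an assumption rather than a decision the team has made.

```typescript
// Hypothetical sketch: none of these names exist in NemoClaw.
type Sleep = (ms: number) => Promise<void>;

interface RetryOptions {
  maxAttempts: number; // total attempts, including the first one
  baseDelayMs: number; // delay before the second attempt; doubles afterwards
  sleep?: Sleep;       // injectable so tests do not have to wait
}

// Assumption: gateway-style statuses are the ones worth retrying.
const TRANSIENT_STATUSES = [502, 503, 504];

// Builds an Error that carries the HTTP status of the failed response.
function httpError(status: number): Error & { status: number } {
  return Object.assign(new Error("HTTP " + status), { status });
}

// True only for errors that carry one of the transient statuses.
function isTransient(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const status = (err as { status?: unknown }).status;
  return typeof status === "number" && TRANSIENT_STATUSES.indexOf(status) !== -1;
}

const defaultSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `operation`, retrying transient failures with exponential backoff.
// Non-transient errors and the final failed attempt are rethrown unchanged.
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  options: RetryOptions,
): Promise<T> {
  const sleep = options.sleep ?? defaultSleep;
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      if (attempt >= options.maxAttempts || !isTransient(err)) throw err;
      await sleep(options.baseDelayMs * Math.pow(2, attempt - 1));
    }
  }
}
```

With `maxAttempts: 4` and `baseDelayMs: 100`, the waits between attempts are 100 ms, 200 ms, and 400 ms. Errors such as an invalid API key would not carry a transient status, so they still fail fast instead of being masked by the retry loop.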

Follow-up: open a fix PR once the team decides on the preferred mitigation strategy.

### Alternatives Considered

None — the root cause is upstream and no NemoClaw code change caused this failure.

### Category

infra_flake

## Reproduction Steps

1. Re-run `issue-2478-crash-loop-recovery-e2e` via `gh workflow run nightly-e2e.yaml --repo NVIDIA/NemoClaw --ref main`. Note that `--ref main` runs against the current tip of `main`, which may have moved past the failing commit `a8dfa27`.
2. If the NVIDIA Endpoints API is healthy, the test will pass. The failure is intermittent.

## Environment

- OS: Ubuntu 24.04.4 LTS (GitHub-hosted runner)
- Node.js: v22.x (from runner image)
- Docker: Docker CE (runner-provided)
- NemoClaw: commit a8dfa272ec98392a12677e1ba1539944343c5960 (main)
- Other: Workflow run 25244372371, job ID 74025843234

## Debug Output

```shell
[3/8] Configuring inference (NIM)
──────────────────────────────────────────────────
[non-interactive] Provider: build
NVIDIA Endpoints endpoint validation failed.
Chat Completions API: HTTP 502: <html> <head><title>502 Bad Gateway</title></head> <body> <center><h1>502 Bad Gateway</h1></center> </body> </html>
##[error]Process completed with exit code 1.
```

## Logs

N/A

## Checklist

- [x] I confirmed this bug is reproducible *(required)*
- [x] I searched existing issues and this is not a duplicate *(required)*
