nightly-e2e: issue-2478-crash-loop-recovery-e2e fails with HTTP 502 from NVIDIA Endpoints #2980

@gaveezy

Bug Report

Description

Problem Statement

The issue-2478-crash-loop-recovery-e2e nightly job failed during onboarding phase [3/8] "Configuring inference (NIM)" because the NVIDIA Endpoints API returned an HTTP 502 Bad Gateway response. The endpoint validation step calls the Chat Completions API to verify that the configured inference provider is reachable, and the upstream NVIDIA Endpoints service was temporarily unavailable at the time of the nightly run (2026-05-02 ~05:06 UTC).

This is a transient upstream failure — no code change in NemoClaw caused the regression, and the same test passes on subsequent runs when the API is healthy.

Proposed Design

Requires human investigation — no PR opened. This is an infrastructure flake caused by the upstream NVIDIA Endpoints service returning 502. Potential mitigations:

  1. Add a retry loop (with backoff) around the endpoint validation call in the onboarding flow.
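  2. Auto-retry the job up to 2 times in the nightly workflow on transient failures. GitHub Actions has no built-in job-level retry key, so this needs a retry wrapper around the failing step or an automated re-run of failed jobs.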
  3. Accept the flake and track its frequency.
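Mitigation 1 could look roughly like the sketch below. It is illustrative only and not taken from the NemoClaw code base: `probe` is a hypothetical stand-in for the endpoint validation request, and the set of retryable status codes, the attempt count, and the delays are all assumptions.

```python
import random
import time
from typing import Callable

# Status codes treated as transient upstream failures (an assumption of this sketch).
TRANSIENT_STATUSES = {502, 503, 504}


def validate_with_retry(
    probe: Callable[[], int],
    max_attempts: int = 3,
    base_delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `probe` until it returns a non-transient status or attempts run out.

    `probe` is a hypothetical stand-in for the endpoint validation request;
    it returns the HTTP status code of the response.
    """
    status = probe()
    for attempt in range(1, max_attempts):
        if status not in TRANSIENT_STATUSES:
            break
        # Exponential backoff (2s, 4s, ...) plus up to 50% random jitter.
        delay = base_delay_s * 2 ** (attempt - 1)
        sleep(delay + random.uniform(0, delay / 2))
        status = probe()
    return status


if __name__ == "__main__":
    # Simulate two 502 responses followed by a healthy one, without real sleeps.
    responses = iter([502, 502, 200])
    print(validate_with_retry(lambda: next(responses), sleep=lambda _: None))
```

Only gateway-style errors are retried, so a misconfigured key or model (for example a 401 or 404) would still fail on the first attempt instead of being hidden by the retry loop.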

Follow-up: open a fix PR once the team decides on the preferred mitigation strategy.

Alternatives Considered

None — the root cause is upstream and no NemoClaw code change caused this failure.

Category

infra_flake

Reproduction Steps

  1. Re-run issue-2478-crash-loop-recovery-e2e via gh workflow run nightly-e2e.yaml --repo NVIDIA/NemoClaw --ref main. The failure was observed on commit a8dfa27; --ref main runs the current head of main, which may be newer.
  2. If the NVIDIA Endpoints API is healthy, the test will pass. The failure is intermittent.

Environment

  • OS: Ubuntu 24.04.4 LTS (GitHub-hosted runner)
  • Node.js: v22.x (from runner image)
  • Docker: Docker CE (runner-provided)
  • NemoClaw: commit a8dfa27 (main)
  • Other: Workflow run 25244372371, job ID 74025843234

Debug Output

[3/8] Configuring inference (NIM)
──────────────────────────────────────────────────
[non-interactive] Provider: build
NVIDIA Endpoints endpoint validation failed.
Chat Completions API: HTTP 502: <html> <head><title>502 Bad Gateway</title></head> <body> <center><h1>502 Bad Gateway</h1></center> </body> </html>
##[error]Process completed with exit code 1.
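The infra_flake categorization can be derived mechanically from output like the above. The following is a minimal sketch, not the actual diagnosis logic: the function name, the "needs_investigation" label, and the set of transient status codes are assumptions made for illustration.

```python
import re

# Matches the status code in lines such as "Chat Completions API: HTTP 502: <html>...".
HTTP_STATUS_RE = re.compile(r"\bHTTP (\d{3})\b")

# Gateway-style errors treated as transient (an assumption of this sketch).
TRANSIENT_STATUSES = {502, 503, 504}


def classify_failure(log_text: str) -> str:
    """Classify a failed run from its log text.

    Returns "infra_flake" when the first HTTP status found in the log is a
    transient gateway error, and the hypothetical label "needs_investigation"
    otherwise (including when no status is found).
    """
    match = HTTP_STATUS_RE.search(log_text)
    if match and int(match.group(1)) in TRANSIENT_STATUSES:
        return "infra_flake"
    return "needs_investigation"


if __name__ == "__main__":
    line = "Chat Completions API: HTTP 502: <html> <head><title>502 Bad Gateway</title></head>"
    print(classify_failure(line))
```

A classifier of this kind would also support mitigation 3, since counting "infra_flake" results per nightly run gives the flake frequency.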

Logs

N/A

Checklist

  • I confirmed this bug is reproducible (required)
  • I searched existing issues and this is not a duplicate (required)

Metadata

Labels

  • VRDC (Issues and PRs submitted by NVIDIA VRDC test team.)
  • area: ci (CI workflows, checks, release automation, or GitHub Actions)
  • area: e2e (End-to-end tests, nightly failures, or validation infrastructure)
  • auto-diagnosed (Automatically diagnosed by CI agent)
  • ci-failure (Auto-created by nemoclaw-diagnosis skill)
  • needs: triage (Awaiting maintainer classification)
