nightly-e2e: issue-2478-crash-loop-recovery-e2e fails with HTTP 502 from NVIDIA Endpoints #2980
Closed
Labels
VRDC (Issues and PRs submitted by NVIDIA VRDC test team)
area: ci (CI workflows, checks, release automation, or GitHub Actions)
area: e2e (End-to-end tests, nightly failures, or validation infrastructure)
auto-diagnosed (Automatically diagnosed by CI agent)
ci-failure (Auto-created by nemoclaw-diagnosis skill)
needs: triage (Awaiting maintainer classification)
Bug Report
Description
Problem Statement
The issue-2478-crash-loop-recovery-e2e nightly job failed during onboarding phase [3/8] "Configuring inference (NIM)" because the NVIDIA Endpoints API returned an HTTP 502 Bad Gateway response. The endpoint validation step calls the Chat Completions API to verify that the configured inference provider is reachable, and the upstream NVIDIA Endpoints service was temporarily unavailable at the time of the nightly run (2026-05-02 ~05:06 UTC).
This is a transient upstream failure: no code change in NemoClaw caused the regression, and the same test passes on subsequent runs when the API is healthy.
Proposed Design
Requires human investigation; no PR opened. This is an infrastructure flake caused by the upstream NVIDIA Endpoints service returning 502. Potential mitigations:
retry: 2 in the nightly workflow to auto-retry on transient failures.
Follow-up: open a fix PR once the team decides on the preferred mitigation strategy.
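To illustrate what auto-retrying a transient failure could look like at the validation step itself, here is a minimal sketch. It is not NemoClaw's actual code: `validate_with_retry`, `probe`, and `TRANSIENT_STATUSES` are hypothetical names, and `probe` stands in for whatever call performs the endpoint check and returns an HTTP status code. The default of two retries mirrors the `retry: 2` value mentioned above; the exponential backoff is an assumption.

```python
import time
from typing import Callable

# Assumption: gateway-style errors are treated as transient and worth retrying.
TRANSIENT_STATUSES = frozenset({502, 503, 504})


def validate_with_retry(
    probe: Callable[[], int],
    retries: int = 2,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call probe() and retry while it returns a transient status.

    probe is a hypothetical callable returning an HTTP status code.
    Returns the last status observed, so the caller decides how to fail.
    """
    status = probe()
    for attempt in range(retries):
        if status not in TRANSIENT_STATUSES:
            break  # success or a non-transient error: stop retrying
        sleep(base_delay * 2 ** attempt)  # wait 1x, 2x, 4x ... base_delay
        status = probe()
    return status
```

Non-transient responses such as 401 are returned immediately without a retry, so a genuine misconfiguration would still fail fast while a short upstream outage like the one in this run would be absorbed.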
Alternatives Considered
None. The root cause is upstream, and no NemoClaw code change caused this failure.
Category
infra_flake
Reproduction Steps
issue-2478-crash-loop-recovery-e2e on commit a8dfa27 via gh workflow run nightly-e2e.yaml --repo NVIDIA/NemoClaw --ref main.
Environment
Debug Output
Logs
N/A
Checklist