fix(e2e): stabilize runtime overrides and Docker auth by cv · Pull Request #4928 · NVIDIA/NemoClaw

cv · 2026-06-08T01:51:21Z

Summary

Builds on the fd 3 output bypass from #4924 to stabilize runtime override E2E captures, and extends the same protection to config hash probes that can hit the same short-lived container stdout race. Docker Hub auth now retries transient registry timeouts and falls back to anonymous pulls so unrelated nightly jobs can still execute.

Related Issue

Fixes #4926

Changes

Write runtime override config and hash-check probe output directly to fd 3 to avoid entrypoint tee teardown races.
Validate and retry captured runtime override configs before extracting required fields.
Retry Docker Hub login in the reusable and direct nightly E2E workflows, then warn and continue anonymously after repeated timeouts.
Extend workflow contract coverage for Docker Hub retry and fallback behavior.

Type of Change

Code change (feature, bug fix, or refactor)
Code change with doc updates
Doc only (prose changes, no code sample modifications)
Doc only (includes code sample changes)

Verification

npx prek run --all-files passes
npm test passes
Tests added or updated for new or changed behavior
No secrets, API keys, or credentials committed
Docs updated for user-facing behavior changes
npm run docs builds without warnings (doc changes only)
Doc pages follow the style guide (doc changes only)
New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Carlos Villela cvillela@nvidia.com

Summary by CodeRabbit

Chores
- Strengthened CI/CD pipeline reliability by implementing automatic retry logic for Docker Hub authentication with timeout protection and graceful fallback to anonymous Docker pulls when authentication fails.
Tests
- Expanded test coverage to validate the enhanced Docker authentication retry mechanism and fallback behavior.

Signed-off-by: Carlos Villela <cvillela@nvidia.com>

coderabbitai · 2026-06-08T01:51:33Z

📝 Walkthrough

Walkthrough

This PR addresses two independent CI/E2E reliability issues: Docker Hub authentication failures in workflows are now retried up to three times with timeouts and backoff, and a race condition in E2E test config capture is fixed by routing JSON output through a dedicated file descriptor to bypass tee buffering.

Changes

Docker Hub Authentication Retry Logic

Layer / File(s)	Summary
Workflow retry implementation `.github/workflows/e2e-script.yaml`, `.github/workflows/nightly-e2e.yaml`	Both workflows replace single `docker login` with 3-attempt retry loops using 30s timeout per attempt, logging warnings on intermediate failures and continuing with anonymous pulls on total failure.
Workflow test assertions `test/e2e-script-workflow.test.ts`	Test assertions extended to verify that both the reusable runner and direct nightly E2E jobs' authentication steps contain retry-loop syntax, timeout-wrapped docker login, and multi-attempt failure messages.

E2E Config Capture Reliability

Layer / File(s)	Summary
Config capture helper functions `test/e2e/test-runtime-overrides.sh`	Introduces `run_override`, `run_config_hash_check`, `valid_config`, `capture_config` (with structured validation and up to 3 retries), and `jq_config` to extract config fields reliably by routing output to dedicated fd 3.
Override test refactoring `test/e2e/test-runtime-overrides.sh`	Baseline setup and all override test scenarios (model, context window, max tokens, reasoning, CORS origins, combined, and rejected override) refactored to use the new reliable capture/validation pipeline and `jq_config` extraction helpers.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

NVIDIA/NemoClaw#4697: Added initial Docker Hub guarded login/auth flow and secrets to workflows; this PR updates the same authentication step's run script with retry/timeout logic and matching test assertions.

Suggested labels

E2E, nightly-e2e, Docker, CI/CD, area: ci, area: e2e, fix

Suggested reviewers

prekshivyas
cjagwani

Poem

🐰 A race condition fled and retries now bloom,
When Docker's gates stumble, we try once more soon,
File descriptors flutter through fd 3 so keen,
The flakiest tests find a fix so serene!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	Title directly addresses both main changes: stabilizing Docker Hub authentication (retry logic) and fixing flaky runtime overrides test (capture reliability).
Linked Issues check	✅ Passed	Changes fully implement the proposed fix [`#4926`]: writing baseline config directly to fd 3, validating with retries, and adding Docker auth retry logic with fallback to anonymous pulls.
Out of Scope Changes check	✅ Passed	All changes align with stated objectives: fixing container capture races, adding config validation/retries, implementing Docker auth retries, and extending test coverage.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch codex/e2e-flake-hardening

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-06-08T01:52:44Z

E2E Advisor Recommendation

Required E2E: runtime-overrides-e2e
Optional E2E: cloud-inference-e2e, docs-validation-e2e

Dispatch hint: runtime-overrides-e2e

Auto-dispatched E2E: runtime-overrides-e2e via nightly-e2e.yaml at 34d695a8b33bedf98839d40602ebfe8aa4a1955a — nightly run

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

runtime-overrides-e2e (medium; timeout 45 minutes): This PR changes test/e2e/test-runtime-overrides.sh itself. Run the owning E2E job to verify the new fd-3 capture path, retry/validation behavior, config hash check helper, and all runtime override scenarios still pass in the real Docker image/entrypoint environment.

Optional E2E

cloud-inference-e2e (medium; timeout 30 minutes): Useful representative reusable-workflow smoke for .github/workflows/e2e-script.yaml because it runs through the shared E2E script runner used by most nightly jobs. It provides confidence that checkout, env export, secret gating, install, and inference still work with the workflow changes.
docs-validation-e2e (low; timeout 15 minutes): Useful low-cost direct nightly job using the shared direct-job Docker Hub auth anchor changed in nightly-e2e.yaml. It helps validate the direct-job path distinct from reusable-workflow jobs.

New E2E recommendations

E2E CI workflow runner / Docker Hub authentication (medium): Selective target_ref dispatches usually execute the trusted main workflow definition, so PR changes to workflow YAML retry logic may not be exercised by existing E2E jobs. Add a hermetic workflow-level integration test or dry-run job that mocks docker login failures/successes and verifies retry, timeout, anonymous fallback, and secret-withholding behavior for both reusable and direct-job auth steps.
- Suggested test: Add an E2E workflow-auth smoke/dry-run job for Docker Hub auth retry and target_ref secret gating.

Dispatch hint

Workflow: nightly-e2e.yaml
jobs input: runtime-overrides-e2e

github-actions · 2026-06-08T01:52:45Z

E2E Scenario Advisor Recommendation

Required scenario E2E: None
Optional scenario E2E: None

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

None. Changed files affect legacy/non-scenario E2E workflows and tests only; no files under test/e2e-scenario/ or .github/workflows/e2e-scenarios*.yaml were changed, and the diff does not affect scenario E2E runner/runtime/catalog behavior.

Optional scenario E2E

None.

Relevant changed files

None.

github-actions · 2026-06-08T01:56:24Z

Selective E2E Results — ✅ All requested jobs passed

Run: 27111643288
Target ref: 761eb274f125947ba4f0bd01f8b6c81b60a2cb7d
Workflow ref: main
Requested jobs: runtime-overrides-e2e,cloud-inference-e2e
Summary: 0 passed, 0 failed, 0 skipped

Job	Result
cloud-inference-e2e	⚠️ cancelled
runtime-overrides-e2e	⚠️ cancelled

coderabbitai

🧹 Nitpick comments (1)

test/e2e-script-workflow.test.ts (1)

120-123: ⚡ Quick win

Strengthen contract coverage for retry warnings and backoff.

Add assertions for the intermediate warning and exponential backoff expression so the tests lock the full retry contract, not only loop/timeout/final fallback.

♻️ Suggested assertion additions

     expect(authStep?.run).toContain("for attempt in 1 2 3");
     expect(authStep?.run).toContain("timeout 30s docker login");
+    expect(authStep?.run).toContain("Docker Hub login attempt ${attempt} failed; retrying.");
+    expect(authStep?.run).toContain("sleep $((attempt * 5))");
     expect(authStep?.run).toContain("Docker Hub login failed after 3 attempts");

       expect(authStep?.run, name).toContain("for attempt in 1 2 3");
       expect(authStep?.run, name).toContain("timeout 30s docker login");
+      expect(authStep?.run, name).toContain("Docker Hub login attempt ${attempt} failed; retrying.");
+      expect(authStep?.run, name).toContain("sleep $((attempt * 5))");
       expect(authStep?.run, name).toContain("Docker Hub login failed after 3 attempts");

As per coding guidelines, the workflow contract requires warnings on attempts 1-2 and exponential backoff (sleep $((attempt * 5))), so tests should assert those strings too.

Also applies to: 169-172

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e-script-workflow.test.ts` around lines 120 - 123, The test must
assert the intermediate retry warnings and backoff expression in addition to the
loop/timeout/fallback checks: add expects against authStep?.run (and the other
analogous block at the 169-172 range) that .toContain("Docker Hub login failed
on attempt 1"), .toContain("Docker Hub login failed on attempt 2") and
.toContain("sleep $((attempt * 5))") so the test locks the warnings for attempts
1–2 and the exponential backoff expression.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@test/e2e-script-workflow.test.ts`:
- Around line 120-123: The test must assert the intermediate retry warnings and
backoff expression in addition to the loop/timeout/fallback checks: add expects
against authStep?.run (and the other analogous block at the 169-172 range) that
.toContain("Docker Hub login failed on attempt 1"), .toContain("Docker Hub login
failed on attempt 2") and .toContain("sleep $((attempt * 5))") so the test locks
the warnings for attempts 1–2 and the exponential backoff expression.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e48cb5cc-4529-493f-b99e-d07fe4331bad

📥 Commits

Reviewing files that changed from the base of the PR and between 5aa2c95 and 761eb27.

📒 Files selected for processing (4)

.github/workflows/e2e-script.yaml
.github/workflows/nightly-e2e.yaml
test/e2e-script-workflow.test.ts
test/e2e/test-runtime-overrides.sh

github-actions · 2026-06-08T01:58:57Z

PR Review Advisor

Findings: 0 needs attention, 3 worth checking, 0 nice ideas
Top item: Clarify fd 3 workaround ownership/removal and overlap with #4924

Review findings

🛠️ Needs attention

None.

🔎 Worth checking

Source-of-truth review needed: test/e2e/test-runtime-overrides.sh fd 3 output bypass: The advisor marked localized patch analysis as needs_followup.
- Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
- Evidence: test/e2e/test-runtime-overrides.sh adds `>&3` in run_override and run_config_hash_check; scripts/nemoclaw-start.sh establishes fd 3 before redirecting stdout through tee.
Clarify fd 3 workaround ownership/removal and overlap with fix(e2e): write config capture to fd 3, bypassing tee race #4924 (test/e2e/test-runtime-overrides.sh:54): The fd 3 capture bypass is a localized workaround for entrypoint tee teardown races and this PR overlaps active PR fix(e2e): write config capture to fd 3, bypassing tee race #4924 on the same file and linked issue. The code identifies the invalid state and uses the right fd 3 boundary, but it does not document when the workaround can be removed, and the comment references the competing PR rather than the linked issue/root source. This creates source-of-truth drift risk if both PRs evolve the same workaround differently.
- Recommendation: Coordinate with or supersede fix(e2e): write config capture to fd 3, bypassing tee race #4924 explicitly, reference the issue/root cause in the code comment, and add a short removal condition such as removing the bypass once entrypoint logging no longer routes command stdout through a teardown-prone tee process.
- Evidence: Open PR overlap data shows fix(e2e): write config capture to fd 3, bypassing tee race #4924 changes test/e2e/test-runtime-overrides.sh for duplicate linked issue nightly-e2e: runtime-overrides-e2e baseline capture race — tee flush vs container teardown #4926. The new comment says "write captured payloads to fd 3 to bypass that pipe (fix(e2e): write config capture to fd 3, bypassing tee race #4924)" while the root source is scripts/nemoclaw-start.sh's `exec 3>&1` followed by stdout tee redirection.
Docker auth retry path lacks behavioral negative coverage (.github/workflows/e2e-script.yaml:184): The workflow trusted-code boundary remains intact and no credential leak was found, but the new Docker Hub login recovery path is covered only by workflow string assertions. For a secret-handling path, behavior-level coverage would better prove failed logins retry exactly as intended, continue without exposing credentials, and preserve the target_ref secret-withholding boundary.
- Recommendation: Add or identify a small behavioral test using a fake `docker` command/script fixture for the auth shell block that verifies three failed login attempts, generic warnings only, continuation after failure, and no execution of auth when workflow_dispatch supplies an explicit target_ref.
- Evidence: The retry/fallback logic loops `for attempt in 1 2 3`, runs `timeout 30s docker login`, and warns after failure in .github/workflows/e2e-script.yaml and the nightly anchor. test/e2e-script-workflow.test.ts only checks that these strings are present.

🌱 Nice ideas

None.

Consider writing more tests for

**Runtime validation** — Docker Hub auth retries exactly three failed docker login attempts and logs only generic warnings before continuing.. The changed workflow and sandbox/E2E capture paths depend on runtime behavior: GitHub Actions secret guards, Docker CLI retry/fallback, and entrypoint fd inheritance. Existing tests are useful contract/string checks plus the E2E script itself, but behavior-level validation would improve confidence.
**Runtime validation** — Docker Hub auth skips credential exposure on workflow_dispatch with an explicit target_ref after retry logic is added.. The changed workflow and sandbox/E2E capture paths depend on runtime behavior: GitHub Actions secret guards, Docker CLI retry/fallback, and entrypoint fd inheritance. Existing tests are useful contract/string checks plus the E2E script itself, but behavior-level validation would improve confidence.
**Runtime validation** — runtime-overrides capture rejects empty or truncated JSON instead of deriving empty baseline fields.. The changed workflow and sandbox/E2E capture paths depend on runtime behavior: GitHub Actions secret guards, Docker CLI retry/fallback, and entrypoint fd inheritance. Existing tests are useful contract/string checks plus the E2E script itself, but behavior-level validation would improve confidence.
**Runtime validation** — runtime-overrides hash probe uses the same fd 3 capture path as config capture.. The changed workflow and sandbox/E2E capture paths depend on runtime behavior: GitHub Actions secret guards, Docker CLI retry/fallback, and entrypoint fd inheritance. Existing tests are useful contract/string checks plus the E2E script itself, but behavior-level validation would improve confidence.
**test/e2e/test-runtime-overrides.sh fd 3 output bypass** — The E2E harness now writes config and hash outputs to fd 3 and validates config shape before extraction, but there is no isolated test proving fd 3 inheritance or the hash-probe path.. test/e2e/test-runtime-overrides.sh adds `>&3` in run_override and run_config_hash_check; scripts/nemoclaw-start.sh establishes fd 3 before redirecting stdout through tee.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

github-actions · 2026-06-08T02:03:53Z

Selective E2E Results — ✅ All requested jobs passed

Run: 27111736171
Target ref: 34d695a8b33bedf98839d40602ebfe8aa4a1955a
Workflow ref: main
Requested jobs: runtime-overrides-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job	Result
runtime-overrides-e2e	✅ success

fix(e2e): stabilize runtime overrides and Docker auth

761eb27

Signed-off-by: Carlos Villela <cvillela@nvidia.com>

cv self-assigned this Jun 8, 2026

Merge branch 'main' into codex/e2e-flake-hardening

34d695a

coderabbitai Bot reviewed Jun 8, 2026

View reviewed changes

cv added the v0.0.61 Release target label Jun 8, 2026

cv merged commit 40d68d7 into main Jun 8, 2026
33 checks passed

cv deleted the codex/e2e-flake-hardening branch June 8, 2026 02:41

cv mentioned this pull request Jun 8, 2026

fix(e2e): write config capture to fd 3, bypassing tee race #4924

Closed

3 tasks

hunglp6d mentioned this pull request Jun 8, 2026

nightly-e2e: vm-driver-privileged-exec-routing-e2e failed — Docker Hub timeout #4925

Closed

wscurran added the bug-fix PR fixes a bug or regression label Jun 8, 2026

Conversation

cv commented Jun 8, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Related Issue

Changes

Type of Change

Verification

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

github-actions Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

E2E Advisor Recommendation

E2E Recommendation Advisor

Required E2E

Optional E2E

New E2E recommendations

Dispatch hint

Uh oh!

github-actions Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

E2E Scenario Advisor Recommendation

E2E Scenario Advisor

Required scenario E2E

Optional scenario E2E

Relevant changed files

Uh oh!

github-actions Bot commented Jun 8, 2026

Selective E2E Results — ✅ All requested jobs passed

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jun 8, 2026

PR Review Advisor

🛠️ Needs attention

🔎 Worth checking

🌱 Nice ideas

Uh oh!

github-actions Bot commented Jun 8, 2026

Selective E2E Results — ✅ All requested jobs passed

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

cv commented Jun 8, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 8, 2026 •

edited

Loading

github-actions Bot commented Jun 8, 2026 •

edited

Loading

github-actions Bot commented Jun 8, 2026 •

edited

Loading