fix(e2e): fix double-onboard assertion strings, CI timeout, and per-phase diagnostics by jyaunches · Pull Request #2658 · NVIDIA/NemoClaw

jyaunches · 2026-04-29T03:09:57Z

Summary

Fix the double-onboard-e2e nightly failures caused by a test assertion string mismatch (Phase 3 + Phase 4) and a CI job-level timeout exhaustion (Phase 4). The product code is correct — the gateway IS being reused — the test just could not detect it because it grepped for a string the product never emits. Additionally, the 60-minute CI budget was insufficient for 5 sequential sandbox operations.

Related Issue

Fixes #2654

Changes

test/e2e/test-double-onboard.sh: Changed grep assertions from "Reusing existing NemoClaw gateway" to "Reusing healthy NemoClaw gateway" to match the actual string emitted by onboard.ts line 7150
.github/workflows/nightly-e2e.yaml: Increased double-onboard-e2e job timeout-minutes from 60 to 90 to accommodate 5 sequential sandbox operations
test/e2e/test-double-onboard.sh: Added per-phase run_with_timeout wrapping via the existing e2e-timeout.sh helper (default 1200s, overridable via NEMOCLAW_E2E_PHASE_TIMEOUT)
test/e2e/test-double-onboard.sh: Added elapsed-time logging (phase_start_time/phase_elapsed) at the start and end of each onboard phase
test/e2e/test-double-onboard.sh: Added dump_diagnostics() that dumps openshell status, openshell sandbox list, and docker ps on any phase timeout (exit 124) or failure
test/e2e/test-double-onboard.sh: Added TODO(#2562) markers for future unified timeout abstraction

Type of Change

Code change (feature, bug fix, or refactor)
Code change with doc updates
Doc only (prose changes, no code sample modifications)
Doc only (includes code sample changes)

Verification

npx prek run --all-files passes (shell-relevant hooks: shellcheck, shfmt, yaml, gitleaks all pass; ESLint/vitest/tsc failures are pre-existing missing node_modules in worktree)
npm test passes
Tests added or updated for new or changed behavior
No secrets, API keys, or credentials committed
Docs updated for user-facing behavior changes
make docs builds without warnings (doc changes only)
Doc pages follow the style guide (doc changes only)
New doc pages include SPDX header and frontmatter (new pages only)

AI Disclosure

AI-assisted — tool: Claude (pi agent)

Signed-off-by: Julie Yaunches jyaunches@nvidia.com

Summary by CodeRabbit

Chores
- Increased CI job timeout to allow longer end-to-end test runs.
- Improved test harness with configurable phase timeouts, robust timeout handling, and per-phase elapsed-time logging.
- Added richer diagnostic capture on failures, stricter runtime checks during multi-node onboarding, and extra cleanup/status probes to help recover stale state.

…hase diagnostics - Fix grep assertions in test-double-onboard.sh to match the actual product string 'Reusing healthy NemoClaw gateway' instead of the non-existent 'Reusing existing NemoClaw gateway' (RC-1: Phase 3 + Phase 4 failures) - Increase double-onboard-e2e job timeout from 60 to 90 minutes to prevent Phase 4 timeout caused by 5 sequential sandbox operations (RC-2) - Add per-phase run_with_timeout wrapping via e2e-timeout.sh helper - Add elapsed-time logging for each onboard phase - Add dump_diagnostics() with openshell status, sandbox list, and docker ps on any phase timeout or failure - Add TODO(#2562) markers for future unified timeout abstraction Fixes #2654

coderabbitai · 2026-04-29T03:10:07Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0ccaf592-75f3-4ce2-b424-4a2acd7570ba

📥 Commits

Reviewing files that changed from the base of the PR and between 00da573 and f70342d.

📒 Files selected for processing (1)

test/e2e/test-double-onboard.sh

🚧 Files skipped from review as they are similar to previous changes (1)

test/e2e/test-double-onboard.sh

📝 Walkthrough

Walkthrough

Increases the nightly E2E job timeout from 60 to 90 minutes and enhances the double-onboard test script with per-phase timeout enforcement, elapsed-time logging, explicit timeout detection (exit code 124), richer diagnostic dumps on failures, stricter gateway-reuse matching, and additional cleanup/status calls.

Changes

Cohort / File(s)	Summary
CI workflow timeout `.github/workflows/nightly-e2e.yaml`	Increases `double-onboard-e2e` job `timeout-minutes` from `60` to `90`.
Double-onboard E2E script `test/e2e/test-double-onboard.sh`	Adds `PHASE_TIMEOUT` and `run_with_timeout` wrapper; treats exit `124` as timeout; logs per-phase elapsed times; invokes `dump_diagnostics` on timeouts and non-zero exits; tightens gateway reuse check to match `"Reusing healthy NemoClaw gateway"`; adds best-effort `nemoclaw status` calls for both sandboxes in cleanup to trigger registry reconciliation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hopped through logs in the pale moonlight,
Timers set snug to keep phases in sight,
When things timeout I shout and I dig for the cause,
Dumping the clues with diagnostic applause.
Now gateways and sandboxes dance through the night. 🥕✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the main changes: fixing double-onboard assertion strings, CI timeout, and per-phase diagnostics.
Linked Issues check	✅ Passed	The PR directly addresses all stated objectives from issue `#2654`: corrects gateway reuse assertion strings [Phase 3], increases CI timeout for Phase 4 multi-sandbox operations, and adds per-phase timeouts and diagnostics.
Out of Scope Changes check	✅ Passed	All changes are scoped to fixing the double-onboard test: timeout configuration in CI workflow and test script modifications for assertions and diagnostics.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch issue-2654-double-onboard-gateway-reuse-fix

_{Review rate limit: 8/10 reviews remaining, refill in 9 minutes and 10 seconds.}

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-04-29T03:10:26Z

Selective E2E Results — ✅ All requested jobs passed

Run: 25088973729
Branch: issue-2654-double-onboard-gateway-reuse-fix
Requested jobs: double-onboard-e2e
Summary: 0 passed, 0 failed, 22 skipped

Job	Result
cloud-e2e	⏭️ skipped
cloud-inference-e2e	⏭️ skipped
cloud-onboard-e2e	⏭️ skipped
deployment-services-e2e	⏭️ skipped
diagnostics-e2e	⏭️ skipped
docs-validation-e2e	⏭️ skipped
gpu-e2e	⏭️ skipped
hermes-e2e	⏭️ skipped
inference-routing-e2e	⏭️ skipped
messaging-providers-e2e	⏭️ skipped
network-policy-e2e	⏭️ skipped
overlayfs-autofix-e2e	⏭️ skipped
rebuild-hermes-e2e	⏭️ skipped
rebuild-openclaw-e2e	⏭️ skipped
sandbox-operations-e2e	⏭️ skipped
sandbox-survival-e2e	⏭️ skipped
shields-config-e2e	⏭️ skipped
skill-agent-e2e	⏭️ skipped
skip-permissions-e2e	⏭️ skipped
snapshot-commands-e2e	⏭️ skipped
token-rotation-e2e	⏭️ skipped
upgrade-stale-sandbox-e2e	⏭️ skipped

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/test-double-onboard.sh`:
- Around line 45-46: The per-phase PHASE_TIMEOUT (variable PHASE_TIMEOUT set
from NEMOCLAW_E2E_PHASE_TIMEOUT) exceeds the global script wrapper timeout
NEMOCLAW_E2E_DEFAULT_TIMEOUT applied by sourcing e2e-timeout.sh, so align them:
update the script to derive NEMOCLAW_E2E_DEFAULT_TIMEOUT from the desired
per-phase budget (or set PHASE_TIMEOUT to not exceed
NEMOCLAW_E2E_DEFAULT_TIMEOUT), i.e., ensure NEMOCLAW_E2E_DEFAULT_TIMEOUT >=
PHASE_TIMEOUT (or set
PHASE_TIMEOUT="${NEMOCLAW_E2E_PHASE_TIMEOUT:-${NEMOCLAW_E2E_DEFAULT_TIMEOUT}}")
so the global wrapper won't preempt per-phase timeouts; modify the variables
where they are defined/used (PHASE_TIMEOUT, NEMOCLAW_E2E_PHASE_TIMEOUT,
NEMOCLAW_E2E_DEFAULT_TIMEOUT and the e2e-timeout.sh sourcing) to enforce this
invariant.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6ad494d6-5c82-4996-a3ba-28c1f6449a56

📥 Commits

Reviewing files that changed from the base of the PR and between 9113dca and 00da573.

📒 Files selected for processing (2)

.github/workflows/nightly-e2e.yaml
test/e2e/test-double-onboard.sh

coderabbitai · 2026-04-29T03:13:55Z

+# Per-phase timeout in seconds (20 min per onboard phase, generous for CI)
+PHASE_TIMEOUT="${NEMOCLAW_E2E_PHASE_TIMEOUT:-1200}"


⚠️ Potential issue | 🟠 Major

Per-phase 1200s timeout is effectively preempted by the 900s global script timeout.

e2e-timeout.sh is sourced at Line [21] and self-wraps the entire script using NEMOCLAW_E2E_DEFAULT_TIMEOUT (set to 900 at Line [18]). That means the script can be terminated around 15 minutes before a 1200s phase timeout path (and its diagnostics) is reached.

Suggested fix (align global timeout with phase budget)

-export NEMOCLAW_E2E_DEFAULT_TIMEOUT=900 +export NEMOCLAW_E2E_DEFAULT_TIMEOUT="${NEMOCLAW_E2E_DEFAULT_TIMEOUT:-4500}" @@ -# Per-phase timeout in seconds (20 min per onboard phase, generous for CI) -PHASE_TIMEOUT="${NEMOCLAW_E2E_PHASE_TIMEOUT:-1200}" +# Per-phase timeout in seconds (20 min per onboard phase, generous for CI) +PHASE_TIMEOUT="${NEMOCLAW_E2E_PHASE_TIMEOUT:-1200}"

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

# Per-phase timeout in seconds (20 min per onboard phase, generous for CI)

PHASE_TIMEOUT="${NEMOCLAW_E2E_PHASE_TIMEOUT:-1200}"

export NEMOCLAW_E2E_DEFAULT_TIMEOUT="${NEMOCLAW_E2E_DEFAULT_TIMEOUT:-4500}"

# ...

# Per-phase timeout in seconds (20 min per onboard phase, generous for CI)

PHASE_TIMEOUT="${NEMOCLAW_E2E_PHASE_TIMEOUT:-1200}"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@test/e2e/test-double-onboard.sh` around lines 45 - 46, The per-phase PHASE_TIMEOUT (variable PHASE_TIMEOUT set from NEMOCLAW_E2E_PHASE_TIMEOUT) exceeds the global script wrapper timeout NEMOCLAW_E2E_DEFAULT_TIMEOUT applied by sourcing e2e-timeout.sh, so align them: update the script to derive NEMOCLAW_E2E_DEFAULT_TIMEOUT from the desired per-phase budget (or set PHASE_TIMEOUT to not exceed NEMOCLAW_E2E_DEFAULT_TIMEOUT), i.e., ensure NEMOCLAW_E2E_DEFAULT_TIMEOUT >= PHASE_TIMEOUT (or set PHASE_TIMEOUT="${NEMOCLAW_E2E_PHASE_TIMEOUT:-${NEMOCLAW_E2E_DEFAULT_TIMEOUT}}") so the global wrapper won't preempt per-phase timeouts; modify the variables where they are defined/used (PHASE_TIMEOUT, NEMOCLAW_E2E_PHASE_TIMEOUT, NEMOCLAW_E2E_DEFAULT_TIMEOUT and the e2e-timeout.sh sourcing) to enforce this invariant.

After Phase 6 stops the gateway, 'nemoclaw destroy' restarts it to delete the sandbox but may fail to clean its own registry entry when the gateway was in a degraded state. Add 'nemoclaw <name> status' calls after destroy to trigger the stale-entry reconciliation path, ensuring the registry is clean before the assertion checks. Fixes #2654

github-actions · 2026-04-29T03:52:19Z

Selective E2E Results — ✅ All requested jobs passed

Run: 25090010575
Branch: issue-2654-double-onboard-gateway-reuse-fix
Requested jobs: double-onboard-e2e
Summary: 0 passed, 0 failed, 22 skipped

Job	Result
cloud-e2e	⏭️ skipped
cloud-inference-e2e	⏭️ skipped
cloud-onboard-e2e	⏭️ skipped
deployment-services-e2e	⏭️ skipped
diagnostics-e2e	⏭️ skipped
docs-validation-e2e	⏭️ skipped
gpu-e2e	⏭️ skipped
hermes-e2e	⏭️ skipped
inference-routing-e2e	⏭️ skipped
messaging-providers-e2e	⏭️ skipped
network-policy-e2e	⏭️ skipped
overlayfs-autofix-e2e	⏭️ skipped
rebuild-hermes-e2e	⏭️ skipped
rebuild-openclaw-e2e	⏭️ skipped
sandbox-operations-e2e	⏭️ skipped
sandbox-survival-e2e	⏭️ skipped
shields-config-e2e	⏭️ skipped
skill-agent-e2e	⏭️ skipped
skip-permissions-e2e	⏭️ skipped
snapshot-commands-e2e	⏭️ skipped
token-rotation-e2e	⏭️ skipped
upgrade-stale-sandbox-e2e	⏭️ skipped

…800s (#2676) ## Summary Follow-up to #2658 — the script-level self-wrap timeout (`NEMOCLAW_E2E_DEFAULT_TIMEOUT`) of 900s (15 min) is far too tight for `test-double-onboard.sh`, which runs three sequential sandbox creations that each take 5-7 minutes on CI. Phase 4 was being killed by the script timeout (exit 124) before the per-phase timeout could fire. This commit was pushed to #2658 after it was already merged. ## Related Issue Fixes #2654 ## Changes - **`test/e2e/test-double-onboard.sh`**: Increase `NEMOCLAW_E2E_DEFAULT_TIMEOUT` from 900s to 4800s (80 min), leaving a 10-min buffer under the 90-min CI job timeout ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [ ] `npx prek run --all-files` passes - [ ] `npm test` passes - [ ] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Verified via nightly-e2e workflow_dispatch run 25090813418 — double-onboard-e2e passed 33/33 with this change - [ ] Docs updated for user-facing behavior changes ## AI Disclosure - [x] AI-assisted — tool: Claude (pi agent) --- Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>  ## Summary by CodeRabbit * **Tests** * Extended E2E test timeout to improve reliability for complex test scenarios requiring extended execution time.

…hase diagnostics (NVIDIA#2658) ## Summary Fix the `double-onboard-e2e` nightly failures caused by a test assertion string mismatch (Phase 3 + Phase 4) and a CI job-level timeout exhaustion (Phase 4). The product code is correct — the gateway IS being reused — the test just could not detect it because it grepped for a string the product never emits. Additionally, the 60-minute CI budget was insufficient for 5 sequential sandbox operations. ## Related Issue Fixes NVIDIA#2654 ## Changes - **`test/e2e/test-double-onboard.sh`**: Changed grep assertions from `"Reusing existing NemoClaw gateway"` to `"Reusing healthy NemoClaw gateway"` to match the actual string emitted by `onboard.ts` line 7150 - **`.github/workflows/nightly-e2e.yaml`**: Increased `double-onboard-e2e` job `timeout-minutes` from 60 to 90 to accommodate 5 sequential sandbox operations - **`test/e2e/test-double-onboard.sh`**: Added per-phase `run_with_timeout` wrapping via the existing `e2e-timeout.sh` helper (default 1200s, overridable via `NEMOCLAW_E2E_PHASE_TIMEOUT`) - **`test/e2e/test-double-onboard.sh`**: Added elapsed-time logging (`phase_start_time`/`phase_elapsed`) at the start and end of each onboard phase - **`test/e2e/test-double-onboard.sh`**: Added `dump_diagnostics()` that dumps `openshell status`, `openshell sandbox list`, and `docker ps` on any phase timeout (exit 124) or failure - **`test/e2e/test-double-onboard.sh`**: Added `TODO(NVIDIA#2562)` markers for future unified timeout abstraction ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes (shell-relevant hooks: shellcheck, shfmt, yaml, gitleaks all pass; ESLint/vitest/tsc failures are pre-existing missing node_modules in worktree) - [ ] `npm test` passes - [ ] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) - [ ] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) ## AI Disclosure - [x] AI-assisted — tool: Claude (pi agent) --- Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>  ## Summary by CodeRabbit * **Chores** * Increased CI job timeout to allow longer end-to-end test runs. * Improved test harness with configurable phase timeouts, robust timeout handling, and per-phase elapsed-time logging. * Added richer diagnostic capture on failures, stricter runtime checks during multi-node onboarding, and extra cleanup/status probes to help recover stale state.

…800s (NVIDIA#2676) ## Summary Follow-up to NVIDIA#2658 — the script-level self-wrap timeout (`NEMOCLAW_E2E_DEFAULT_TIMEOUT`) of 900s (15 min) is far too tight for `test-double-onboard.sh`, which runs three sequential sandbox creations that each take 5-7 minutes on CI. Phase 4 was being killed by the script timeout (exit 124) before the per-phase timeout could fire. This commit was pushed to NVIDIA#2658 after it was already merged. ## Related Issue Fixes NVIDIA#2654 ## Changes - **`test/e2e/test-double-onboard.sh`**: Increase `NEMOCLAW_E2E_DEFAULT_TIMEOUT` from 900s to 4800s (80 min), leaving a 10-min buffer under the 90-min CI job timeout ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [ ] `npx prek run --all-files` passes - [ ] `npm test` passes - [ ] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Verified via nightly-e2e workflow_dispatch run 25090813418 — double-onboard-e2e passed 33/33 with this change - [ ] Docs updated for user-facing behavior changes ## AI Disclosure - [x] AI-assisted — tool: Claude (pi agent) --- Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>  ## Summary by CodeRabbit * **Tests** * Extended E2E test timeout to improve reliability for complex test scenarios requiring extended execution time.

jyaunches self-assigned this Apr 29, 2026

coderabbitai Bot reviewed Apr 29, 2026

View reviewed changes

cv enabled auto-merge (squash) April 29, 2026 04:07

cv approved these changes Apr 29, 2026

View reviewed changes

cv merged commit d861564 into main Apr 29, 2026
49 of 50 checks passed

jyaunches mentioned this pull request Apr 29, 2026

fix(e2e): increase double-onboard script-level timeout from 900s to 4800s #2676

Merged

11 tasks

jyaunches mentioned this pull request Apr 29, 2026

test(e2e): cover dashboard port auto-alloc in nightly #2455

Closed

7 tasks

wscurran added the bug-fix PR fixes a bug or regression label Jun 8, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(e2e): fix double-onboard assertion strings, CI timeout, and per-phase diagnostics#2658

fix(e2e): fix double-onboard assertion strings, CI timeout, and per-phase diagnostics#2658
cv merged 2 commits into
mainfrom
issue-2654-double-onboard-gateway-reuse-fix

jyaunches commented Apr 29, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 29, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

github-actions Bot commented Apr 29, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Apr 29, 2026

Uh oh!

github-actions Bot commented Apr 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		# Per-phase timeout in seconds (20 min per onboard phase, generous for CI)
		PHASE_TIMEOUT="${NEMOCLAW_E2E_PHASE_TIMEOUT:-1200}"

Conversation

jyaunches commented Apr 29, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Related Issue

Changes

Type of Change

Verification

AI Disclosure

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

github-actions Bot commented Apr 29, 2026

Selective E2E Results — ✅ All requested jobs passed

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Apr 29, 2026

Selective E2E Results — ✅ All requested jobs passed

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jyaunches commented Apr 29, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 29, 2026 •

edited

Loading