Skip to content

fix(e2e): fix double-onboard assertion strings, CI timeout, and per-phase diagnostics#2658

Merged
cv merged 2 commits into
mainfrom
issue-2654-double-onboard-gateway-reuse-fix
Apr 29, 2026
Merged

fix(e2e): fix double-onboard assertion strings, CI timeout, and per-phase diagnostics#2658
cv merged 2 commits into
mainfrom
issue-2654-double-onboard-gateway-reuse-fix

Conversation

@jyaunches

@jyaunches jyaunches commented Apr 29, 2026

Copy link
Copy Markdown
Contributor

Summary

Fix the double-onboard-e2e nightly failures caused by a test assertion string mismatch (Phase 3 + Phase 4) and a CI job-level timeout exhaustion (Phase 4). The product code is correct — the gateway IS being reused — the test just could not detect it because it grepped for a string the product never emits. Additionally, the 60-minute CI budget was insufficient for 5 sequential sandbox operations.

Related Issue

Fixes #2654

Changes

  • test/e2e/test-double-onboard.sh: Changed grep assertions from "Reusing existing NemoClaw gateway" to "Reusing healthy NemoClaw gateway" to match the actual string emitted by onboard.ts line 7150
  • .github/workflows/nightly-e2e.yaml: Increased double-onboard-e2e job timeout-minutes from 60 to 90 to accommodate 5 sequential sandbox operations
  • test/e2e/test-double-onboard.sh: Added per-phase run_with_timeout wrapping via the existing e2e-timeout.sh helper (default 1200s, overridable via NEMOCLAW_E2E_PHASE_TIMEOUT)
  • test/e2e/test-double-onboard.sh: Added elapsed-time logging (phase_start_time/phase_elapsed) at the start and end of each onboard phase
  • test/e2e/test-double-onboard.sh: Added dump_diagnostics() that dumps openshell status, openshell sandbox list, and docker ps on any phase timeout (exit 124) or failure
  • test/e2e/test-double-onboard.sh: Added TODO(#2562) markers for future unified timeout abstraction

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes (shell-relevant hooks: shellcheck, shfmt, yaml, gitleaks all pass; ESLint/vitest/tsc failures are pre-existing missing node_modules in worktree)
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

AI Disclosure

  • AI-assisted — tool: Claude (pi agent)

Signed-off-by: Julie Yaunches jyaunches@nvidia.com

Summary by CodeRabbit

  • Chores
    • Increased CI job timeout to allow longer end-to-end test runs.
    • Improved test harness with configurable phase timeouts, robust timeout handling, and per-phase elapsed-time logging.
    • Added richer diagnostic capture on failures, stricter runtime checks during multi-node onboarding, and extra cleanup/status probes to help recover stale state.

…hase diagnostics

- Fix grep assertions in test-double-onboard.sh to match the actual product
  string 'Reusing healthy NemoClaw gateway' instead of the non-existent
  'Reusing existing NemoClaw gateway' (RC-1: Phase 3 + Phase 4 failures)
- Increase double-onboard-e2e job timeout from 60 to 90 minutes to prevent
  Phase 4 timeout caused by 5 sequential sandbox operations (RC-2)
- Add per-phase run_with_timeout wrapping via e2e-timeout.sh helper
- Add elapsed-time logging for each onboard phase
- Add dump_diagnostics() with openshell status, sandbox list, and docker ps
  on any phase timeout or failure
- Add TODO(#2562) markers for future unified timeout abstraction

Fixes #2654
@jyaunches jyaunches self-assigned this Apr 29, 2026
@coderabbitai

coderabbitai Bot commented Apr 29, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0ccaf592-75f3-4ce2-b424-4a2acd7570ba

📥 Commits

Reviewing files that changed from the base of the PR and between 00da573 and f70342d.

📒 Files selected for processing (1)
  • test/e2e/test-double-onboard.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/e2e/test-double-onboard.sh

📝 Walkthrough

Walkthrough

Increases the nightly E2E job timeout from 60 to 90 minutes and enhances the double-onboard test script with per-phase timeout enforcement, elapsed-time logging, explicit timeout detection (exit code 124), richer diagnostic dumps on failures, stricter gateway-reuse matching, and additional cleanup/status calls.

Changes

Cohort / File(s) Summary
CI workflow timeout
.github/workflows/nightly-e2e.yaml
Increases double-onboard-e2e job timeout-minutes from 60 to 90.
Double-onboard E2E script
test/e2e/test-double-onboard.sh
Adds PHASE_TIMEOUT and run_with_timeout wrapper; treats exit 124 as timeout; logs per-phase elapsed times; invokes dump_diagnostics on timeouts and non-zero exits; tightens gateway reuse check to match "Reusing healthy NemoClaw gateway"; adds best-effort nemoclaw status calls for both sandboxes in cleanup to trigger registry reconciliation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hopped through logs in the pale moonlight,
Timers set snug to keep phases in sight,
When things timeout I shout and I dig for the cause,
Dumping the clues with diagnostic applause.
Now gateways and sandboxes dance through the night. 🥕✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main changes: fixing double-onboard assertion strings, CI timeout, and per-phase diagnostics.
Linked Issues check ✅ Passed The PR directly addresses all stated objectives from issue #2654: corrects gateway reuse assertion strings [Phase 3], increases CI timeout for Phase 4 multi-sandbox operations, and adds per-phase timeouts and diagnostics.
Out of Scope Changes check ✅ Passed All changes are scoped to fixing the double-onboard test: timeout configuration in CI workflow and test script modifications for assertions and diagnostics.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch issue-2654-double-onboard-gateway-reuse-fix

Review rate limit: 8/10 reviews remaining, refill in 9 minutes and 10 seconds.

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25088973729
Branch: issue-2654-double-onboard-gateway-reuse-fix
Requested jobs: double-onboard-e2e
Summary: 0 passed, 0 failed, 22 skipped

Job Result
cloud-e2e ⏭️ skipped
cloud-inference-e2e ⏭️ skipped
cloud-onboard-e2e ⏭️ skipped
deployment-services-e2e ⏭️ skipped
diagnostics-e2e ⏭️ skipped
docs-validation-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
hermes-e2e ⏭️ skipped
inference-routing-e2e ⏭️ skipped
messaging-providers-e2e ⏭️ skipped
network-policy-e2e ⏭️ skipped
overlayfs-autofix-e2e ⏭️ skipped
rebuild-hermes-e2e ⏭️ skipped
rebuild-openclaw-e2e ⏭️ skipped
sandbox-operations-e2e ⏭️ skipped
sandbox-survival-e2e ⏭️ skipped
shields-config-e2e ⏭️ skipped
skill-agent-e2e ⏭️ skipped
skip-permissions-e2e ⏭️ skipped
snapshot-commands-e2e ⏭️ skipped
token-rotation-e2e ⏭️ skipped
upgrade-stale-sandbox-e2e ⏭️ skipped

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/test-double-onboard.sh`:
- Around line 45-46: The per-phase PHASE_TIMEOUT (variable PHASE_TIMEOUT set
from NEMOCLAW_E2E_PHASE_TIMEOUT) exceeds the global script wrapper timeout
NEMOCLAW_E2E_DEFAULT_TIMEOUT applied by sourcing e2e-timeout.sh, so align them:
update the script to derive NEMOCLAW_E2E_DEFAULT_TIMEOUT from the desired
per-phase budget (or set PHASE_TIMEOUT to not exceed
NEMOCLAW_E2E_DEFAULT_TIMEOUT), i.e., ensure NEMOCLAW_E2E_DEFAULT_TIMEOUT >=
PHASE_TIMEOUT (or set
PHASE_TIMEOUT="${NEMOCLAW_E2E_PHASE_TIMEOUT:-${NEMOCLAW_E2E_DEFAULT_TIMEOUT}}")
so the global wrapper won't preempt per-phase timeouts; modify the variables
where they are defined/used (PHASE_TIMEOUT, NEMOCLAW_E2E_PHASE_TIMEOUT,
NEMOCLAW_E2E_DEFAULT_TIMEOUT and the e2e-timeout.sh sourcing) to enforce this
invariant.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6ad494d6-5c82-4996-a3ba-28c1f6449a56

📥 Commits

Reviewing files that changed from the base of the PR and between 9113dca and 00da573.

📒 Files selected for processing (2)
  • .github/workflows/nightly-e2e.yaml
  • test/e2e/test-double-onboard.sh

Comment on lines +45 to +46
# Per-phase timeout in seconds (20 min per onboard phase, generous for CI)
PHASE_TIMEOUT="${NEMOCLAW_E2E_PHASE_TIMEOUT:-1200}"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Per-phase 1200s timeout is effectively preempted by the 900s global script timeout.

e2e-timeout.sh is sourced at Line [21] and self-wraps the entire script using NEMOCLAW_E2E_DEFAULT_TIMEOUT (set to 900 at Line [18]). That means the script can be terminated around 15 minutes before a 1200s phase timeout path (and its diagnostics) is reached.

Suggested fix (align global timeout with phase budget)
-export NEMOCLAW_E2E_DEFAULT_TIMEOUT=900
+export NEMOCLAW_E2E_DEFAULT_TIMEOUT="${NEMOCLAW_E2E_DEFAULT_TIMEOUT:-4500}"
@@
-# Per-phase timeout in seconds (20 min per onboard phase, generous for CI)
-PHASE_TIMEOUT="${NEMOCLAW_E2E_PHASE_TIMEOUT:-1200}"
+# Per-phase timeout in seconds (20 min per onboard phase, generous for CI)
+PHASE_TIMEOUT="${NEMOCLAW_E2E_PHASE_TIMEOUT:-1200}"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Per-phase timeout in seconds (20 min per onboard phase, generous for CI)
PHASE_TIMEOUT="${NEMOCLAW_E2E_PHASE_TIMEOUT:-1200}"
export NEMOCLAW_E2E_DEFAULT_TIMEOUT="${NEMOCLAW_E2E_DEFAULT_TIMEOUT:-4500}"
# ...
# Per-phase timeout in seconds (20 min per onboard phase, generous for CI)
PHASE_TIMEOUT="${NEMOCLAW_E2E_PHASE_TIMEOUT:-1200}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-double-onboard.sh` around lines 45 - 46, The per-phase
PHASE_TIMEOUT (variable PHASE_TIMEOUT set from NEMOCLAW_E2E_PHASE_TIMEOUT)
exceeds the global script wrapper timeout NEMOCLAW_E2E_DEFAULT_TIMEOUT applied
by sourcing e2e-timeout.sh, so align them: update the script to derive
NEMOCLAW_E2E_DEFAULT_TIMEOUT from the desired per-phase budget (or set
PHASE_TIMEOUT to not exceed NEMOCLAW_E2E_DEFAULT_TIMEOUT), i.e., ensure
NEMOCLAW_E2E_DEFAULT_TIMEOUT >= PHASE_TIMEOUT (or set
PHASE_TIMEOUT="${NEMOCLAW_E2E_PHASE_TIMEOUT:-${NEMOCLAW_E2E_DEFAULT_TIMEOUT}}")
so the global wrapper won't preempt per-phase timeouts; modify the variables
where they are defined/used (PHASE_TIMEOUT, NEMOCLAW_E2E_PHASE_TIMEOUT,
NEMOCLAW_E2E_DEFAULT_TIMEOUT and the e2e-timeout.sh sourcing) to enforce this
invariant.

After Phase 6 stops the gateway, 'nemoclaw destroy' restarts it to
delete the sandbox but may fail to clean its own registry entry when
the gateway was in a degraded state. Add 'nemoclaw <name> status'
calls after destroy to trigger the stale-entry reconciliation path,
ensuring the registry is clean before the assertion checks.

Fixes #2654
@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25090010575
Branch: issue-2654-double-onboard-gateway-reuse-fix
Requested jobs: double-onboard-e2e
Summary: 0 passed, 0 failed, 22 skipped

Job Result
cloud-e2e ⏭️ skipped
cloud-inference-e2e ⏭️ skipped
cloud-onboard-e2e ⏭️ skipped
deployment-services-e2e ⏭️ skipped
diagnostics-e2e ⏭️ skipped
docs-validation-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
hermes-e2e ⏭️ skipped
inference-routing-e2e ⏭️ skipped
messaging-providers-e2e ⏭️ skipped
network-policy-e2e ⏭️ skipped
overlayfs-autofix-e2e ⏭️ skipped
rebuild-hermes-e2e ⏭️ skipped
rebuild-openclaw-e2e ⏭️ skipped
sandbox-operations-e2e ⏭️ skipped
sandbox-survival-e2e ⏭️ skipped
shields-config-e2e ⏭️ skipped
skill-agent-e2e ⏭️ skipped
skip-permissions-e2e ⏭️ skipped
snapshot-commands-e2e ⏭️ skipped
token-rotation-e2e ⏭️ skipped
upgrade-stale-sandbox-e2e ⏭️ skipped

@cv cv enabled auto-merge (squash) April 29, 2026 04:07
@cv cv merged commit d861564 into main Apr 29, 2026
49 of 50 checks passed
cv pushed a commit that referenced this pull request Apr 29, 2026
…800s (#2676)

## Summary

Follow-up to #2658 — the script-level self-wrap timeout
(`NEMOCLAW_E2E_DEFAULT_TIMEOUT`) of 900s (15 min) is far too tight for
`test-double-onboard.sh`, which runs three sequential sandbox creations
that each take 5-7 minutes on CI. Phase 4 was being killed by the script
timeout (exit 124) before the per-phase timeout could fire. This commit
was pushed to #2658 after it was already merged.

## Related Issue

Fixes #2654

## Changes

- **`test/e2e/test-double-onboard.sh`**: Increase
`NEMOCLAW_E2E_DEFAULT_TIMEOUT` from 900s to 4800s (80 min), leaving a
10-min buffer under the 90-min CI job timeout

## Type of Change
- [x] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification
- [ ] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [ ] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Verified via nightly-e2e workflow_dispatch run 25090813418 —
double-onboard-e2e passed 33/33 with this change
- [ ] Docs updated for user-facing behavior changes

## AI Disclosure
- [x] AI-assisted — tool: Claude (pi agent)

---
Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Tests**
* Extended E2E test timeout to improve reliability for complex test
scenarios requiring extended execution time.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
DemianHeyGen pushed a commit to DemianHeyGen/NemoClaw that referenced this pull request Apr 30, 2026
…hase diagnostics (NVIDIA#2658)

## Summary

Fix the `double-onboard-e2e` nightly failures caused by a test assertion
string mismatch (Phase 3 + Phase 4) and a CI job-level timeout
exhaustion (Phase 4). The product code is correct — the gateway IS being
reused — the test just could not detect it because it grepped for a
string the product never emits. Additionally, the 60-minute CI budget
was insufficient for 5 sequential sandbox operations.

## Related Issue

Fixes NVIDIA#2654

## Changes

- **`test/e2e/test-double-onboard.sh`**: Changed grep assertions from
`"Reusing existing NemoClaw gateway"` to `"Reusing healthy NemoClaw
gateway"` to match the actual string emitted by `onboard.ts` line 7150
- **`.github/workflows/nightly-e2e.yaml`**: Increased
`double-onboard-e2e` job `timeout-minutes` from 60 to 90 to accommodate
5 sequential sandbox operations
- **`test/e2e/test-double-onboard.sh`**: Added per-phase
`run_with_timeout` wrapping via the existing `e2e-timeout.sh` helper
(default 1200s, overridable via `NEMOCLAW_E2E_PHASE_TIMEOUT`)
- **`test/e2e/test-double-onboard.sh`**: Added elapsed-time logging
(`phase_start_time`/`phase_elapsed`) at the start and end of each
onboard phase
- **`test/e2e/test-double-onboard.sh`**: Added `dump_diagnostics()` that
dumps `openshell status`, `openshell sandbox list`, and `docker ps` on
any phase timeout (exit 124) or failure
- **`test/e2e/test-double-onboard.sh`**: Added `TODO(NVIDIA#2562)` markers for
future unified timeout abstraction

## Type of Change
- [x] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification
- [x] `npx prek run --all-files` passes (shell-relevant hooks:
shellcheck, shfmt, yaml, gitleaks all pass; ESLint/vitest/tsc failures
are pre-existing missing node_modules in worktree)
- [ ] `npm test` passes
- [ ] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [ ] Docs updated for user-facing behavior changes
- [ ] `make docs` builds without warnings (doc changes only)
- [ ] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

## AI Disclosure
- [x] AI-assisted — tool: Claude (pi agent)

---
Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Chores**
  * Increased CI job timeout to allow longer end-to-end test runs.
* Improved test harness with configurable phase timeouts, robust timeout
handling, and per-phase elapsed-time logging.
* Added richer diagnostic capture on failures, stricter runtime checks
during multi-node onboarding, and extra cleanup/status probes to help
recover stale state.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
DemianHeyGen pushed a commit to DemianHeyGen/NemoClaw that referenced this pull request Apr 30, 2026
…800s (NVIDIA#2676)

## Summary

Follow-up to NVIDIA#2658 — the script-level self-wrap timeout
(`NEMOCLAW_E2E_DEFAULT_TIMEOUT`) of 900s (15 min) is far too tight for
`test-double-onboard.sh`, which runs three sequential sandbox creations
that each take 5-7 minutes on CI. Phase 4 was being killed by the script
timeout (exit 124) before the per-phase timeout could fire. This commit
was pushed to NVIDIA#2658 after it was already merged.

## Related Issue

Fixes NVIDIA#2654

## Changes

- **`test/e2e/test-double-onboard.sh`**: Increase
`NEMOCLAW_E2E_DEFAULT_TIMEOUT` from 900s to 4800s (80 min), leaving a
10-min buffer under the 90-min CI job timeout

## Type of Change
- [x] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification
- [ ] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [ ] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Verified via nightly-e2e workflow_dispatch run 25090813418 —
double-onboard-e2e passed 33/33 with this change
- [ ] Docs updated for user-facing behavior changes

## AI Disclosure
- [x] AI-assisted — tool: Claude (pi agent)

---
Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Tests**
* Extended E2E test timeout to improve reliability for complex test
scenarios requiring extended execution time.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@wscurran wscurran added the bug-fix PR fixes a bug or regression label Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug-fix PR fixes a bug or regression

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug(e2e): double-onboard-e2e fails — gateway not reused on re-onboard + Phase 4 timeout

3 participants