fix(e2e): fix double-onboard assertion strings, CI timeout, and per-phase diagnostics#2658
Conversation
…hase diagnostics - Fix grep assertions in test-double-onboard.sh to match the actual product string 'Reusing healthy NemoClaw gateway' instead of the non-existent 'Reusing existing NemoClaw gateway' (RC-1: Phase 3 + Phase 4 failures) - Increase double-onboard-e2e job timeout from 60 to 90 minutes to prevent Phase 4 timeout caused by 5 sequential sandbox operations (RC-2) - Add per-phase run_with_timeout wrapping via e2e-timeout.sh helper - Add elapsed-time logging for each onboard phase - Add dump_diagnostics() with openshell status, sandbox list, and docker ps on any phase timeout or failure - Add TODO(#2562) markers for future unified timeout abstraction Fixes #2654
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughIncreases the nightly E2E job timeout from 60 to 90 minutes and enhances the double-onboard test script with per-phase timeout enforcement, elapsed-time logging, explicit timeout detection (exit code 124), richer diagnostic dumps on failures, stricter gateway-reuse matching, and additional cleanup/status calls. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Review rate limit: 8/10 reviews remaining, refill in 9 minutes and 10 seconds. Comment |
Selective E2E Results — ✅ All requested jobs passedRun: 25088973729
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/test-double-onboard.sh`:
- Around line 45-46: The per-phase PHASE_TIMEOUT (variable PHASE_TIMEOUT set
from NEMOCLAW_E2E_PHASE_TIMEOUT) exceeds the global script wrapper timeout
NEMOCLAW_E2E_DEFAULT_TIMEOUT applied by sourcing e2e-timeout.sh, so align them:
update the script to derive NEMOCLAW_E2E_DEFAULT_TIMEOUT from the desired
per-phase budget (or set PHASE_TIMEOUT to not exceed
NEMOCLAW_E2E_DEFAULT_TIMEOUT), i.e., ensure NEMOCLAW_E2E_DEFAULT_TIMEOUT >=
PHASE_TIMEOUT (or set
PHASE_TIMEOUT="${NEMOCLAW_E2E_PHASE_TIMEOUT:-${NEMOCLAW_E2E_DEFAULT_TIMEOUT}}")
so the global wrapper won't preempt per-phase timeouts; modify the variables
where they are defined/used (PHASE_TIMEOUT, NEMOCLAW_E2E_PHASE_TIMEOUT,
NEMOCLAW_E2E_DEFAULT_TIMEOUT and the e2e-timeout.sh sourcing) to enforce this
invariant.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 6ad494d6-5c82-4996-a3ba-28c1f6449a56
📒 Files selected for processing (2)
.github/workflows/nightly-e2e.yamltest/e2e/test-double-onboard.sh
| # Per-phase timeout in seconds (20 min per onboard phase, generous for CI) | ||
| PHASE_TIMEOUT="${NEMOCLAW_E2E_PHASE_TIMEOUT:-1200}" |
There was a problem hiding this comment.
Per-phase 1200s timeout is effectively preempted by the 900s global script timeout.
e2e-timeout.sh is sourced at Line [21] and self-wraps the entire script using NEMOCLAW_E2E_DEFAULT_TIMEOUT (set to 900 at Line [18]). That means the script can be terminated around 15 minutes before a 1200s phase timeout path (and its diagnostics) is reached.
Suggested fix (align global timeout with phase budget)
-export NEMOCLAW_E2E_DEFAULT_TIMEOUT=900
+export NEMOCLAW_E2E_DEFAULT_TIMEOUT="${NEMOCLAW_E2E_DEFAULT_TIMEOUT:-4500}"
@@
-# Per-phase timeout in seconds (20 min per onboard phase, generous for CI)
-PHASE_TIMEOUT="${NEMOCLAW_E2E_PHASE_TIMEOUT:-1200}"
+# Per-phase timeout in seconds (20 min per onboard phase, generous for CI)
+PHASE_TIMEOUT="${NEMOCLAW_E2E_PHASE_TIMEOUT:-1200}"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Per-phase timeout in seconds (20 min per onboard phase, generous for CI) | |
| PHASE_TIMEOUT="${NEMOCLAW_E2E_PHASE_TIMEOUT:-1200}" | |
| export NEMOCLAW_E2E_DEFAULT_TIMEOUT="${NEMOCLAW_E2E_DEFAULT_TIMEOUT:-4500}" | |
| # ... | |
| # Per-phase timeout in seconds (20 min per onboard phase, generous for CI) | |
| PHASE_TIMEOUT="${NEMOCLAW_E2E_PHASE_TIMEOUT:-1200}" |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/e2e/test-double-onboard.sh` around lines 45 - 46, The per-phase
PHASE_TIMEOUT (variable PHASE_TIMEOUT set from NEMOCLAW_E2E_PHASE_TIMEOUT)
exceeds the global script wrapper timeout NEMOCLAW_E2E_DEFAULT_TIMEOUT applied
by sourcing e2e-timeout.sh, so align them: update the script to derive
NEMOCLAW_E2E_DEFAULT_TIMEOUT from the desired per-phase budget (or set
PHASE_TIMEOUT to not exceed NEMOCLAW_E2E_DEFAULT_TIMEOUT), i.e., ensure
NEMOCLAW_E2E_DEFAULT_TIMEOUT >= PHASE_TIMEOUT (or set
PHASE_TIMEOUT="${NEMOCLAW_E2E_PHASE_TIMEOUT:-${NEMOCLAW_E2E_DEFAULT_TIMEOUT}}")
so the global wrapper won't preempt per-phase timeouts; modify the variables
where they are defined/used (PHASE_TIMEOUT, NEMOCLAW_E2E_PHASE_TIMEOUT,
NEMOCLAW_E2E_DEFAULT_TIMEOUT and the e2e-timeout.sh sourcing) to enforce this
invariant.
After Phase 6 stops the gateway, 'nemoclaw destroy' restarts it to delete the sandbox but may fail to clean its own registry entry when the gateway was in a degraded state. Add 'nemoclaw <name> status' calls after destroy to trigger the stale-entry reconciliation path, ensuring the registry is clean before the assertion checks. Fixes #2654
Selective E2E Results — ✅ All requested jobs passedRun: 25090010575
|
…800s (#2676) ## Summary Follow-up to #2658 — the script-level self-wrap timeout (`NEMOCLAW_E2E_DEFAULT_TIMEOUT`) of 900s (15 min) is far too tight for `test-double-onboard.sh`, which runs three sequential sandbox creations that each take 5-7 minutes on CI. Phase 4 was being killed by the script timeout (exit 124) before the per-phase timeout could fire. This commit was pushed to #2658 after it was already merged. ## Related Issue Fixes #2654 ## Changes - **`test/e2e/test-double-onboard.sh`**: Increase `NEMOCLAW_E2E_DEFAULT_TIMEOUT` from 900s to 4800s (80 min), leaving a 10-min buffer under the 90-min CI job timeout ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [ ] `npx prek run --all-files` passes - [ ] `npm test` passes - [ ] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Verified via nightly-e2e workflow_dispatch run 25090813418 — double-onboard-e2e passed 33/33 with this change - [ ] Docs updated for user-facing behavior changes ## AI Disclosure - [x] AI-assisted — tool: Claude (pi agent) --- Signed-off-by: Julie Yaunches <jyaunches@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Tests** * Extended E2E test timeout to improve reliability for complex test scenarios requiring extended execution time. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
…hase diagnostics (NVIDIA#2658) ## Summary Fix the `double-onboard-e2e` nightly failures caused by a test assertion string mismatch (Phase 3 + Phase 4) and a CI job-level timeout exhaustion (Phase 4). The product code is correct — the gateway IS being reused — the test just could not detect it because it grepped for a string the product never emits. Additionally, the 60-minute CI budget was insufficient for 5 sequential sandbox operations. ## Related Issue Fixes NVIDIA#2654 ## Changes - **`test/e2e/test-double-onboard.sh`**: Changed grep assertions from `"Reusing existing NemoClaw gateway"` to `"Reusing healthy NemoClaw gateway"` to match the actual string emitted by `onboard.ts` line 7150 - **`.github/workflows/nightly-e2e.yaml`**: Increased `double-onboard-e2e` job `timeout-minutes` from 60 to 90 to accommodate 5 sequential sandbox operations - **`test/e2e/test-double-onboard.sh`**: Added per-phase `run_with_timeout` wrapping via the existing `e2e-timeout.sh` helper (default 1200s, overridable via `NEMOCLAW_E2E_PHASE_TIMEOUT`) - **`test/e2e/test-double-onboard.sh`**: Added elapsed-time logging (`phase_start_time`/`phase_elapsed`) at the start and end of each onboard phase - **`test/e2e/test-double-onboard.sh`**: Added `dump_diagnostics()` that dumps `openshell status`, `openshell sandbox list`, and `docker ps` on any phase timeout (exit 124) or failure - **`test/e2e/test-double-onboard.sh`**: Added `TODO(NVIDIA#2562)` markers for future unified timeout abstraction ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes (shell-relevant hooks: shellcheck, shfmt, yaml, gitleaks all pass; ESLint/vitest/tsc failures are pre-existing missing node_modules in worktree) - [ ] `npm test` passes - [ ] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) - [ ] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) ## AI Disclosure - [x] AI-assisted — tool: Claude (pi agent) --- Signed-off-by: Julie Yaunches <jyaunches@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Chores** * Increased CI job timeout to allow longer end-to-end test runs. * Improved test harness with configurable phase timeouts, robust timeout handling, and per-phase elapsed-time logging. * Added richer diagnostic capture on failures, stricter runtime checks during multi-node onboarding, and extra cleanup/status probes to help recover stale state. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
…800s (NVIDIA#2676) ## Summary Follow-up to NVIDIA#2658 — the script-level self-wrap timeout (`NEMOCLAW_E2E_DEFAULT_TIMEOUT`) of 900s (15 min) is far too tight for `test-double-onboard.sh`, which runs three sequential sandbox creations that each take 5-7 minutes on CI. Phase 4 was being killed by the script timeout (exit 124) before the per-phase timeout could fire. This commit was pushed to NVIDIA#2658 after it was already merged. ## Related Issue Fixes NVIDIA#2654 ## Changes - **`test/e2e/test-double-onboard.sh`**: Increase `NEMOCLAW_E2E_DEFAULT_TIMEOUT` from 900s to 4800s (80 min), leaving a 10-min buffer under the 90-min CI job timeout ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [ ] `npx prek run --all-files` passes - [ ] `npm test` passes - [ ] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Verified via nightly-e2e workflow_dispatch run 25090813418 — double-onboard-e2e passed 33/33 with this change - [ ] Docs updated for user-facing behavior changes ## AI Disclosure - [x] AI-assisted — tool: Claude (pi agent) --- Signed-off-by: Julie Yaunches <jyaunches@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Tests** * Extended E2E test timeout to improve reliability for complex test scenarios requiring extended execution time. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
Fix the
double-onboard-e2enightly failures caused by a test assertion string mismatch (Phase 3 + Phase 4) and a CI job-level timeout exhaustion (Phase 4). The product code is correct — the gateway IS being reused — the test just could not detect it because it grepped for a string the product never emits. Additionally, the 60-minute CI budget was insufficient for 5 sequential sandbox operations.Related Issue
Fixes #2654
Changes
test/e2e/test-double-onboard.sh: Changed grep assertions from"Reusing existing NemoClaw gateway"to"Reusing healthy NemoClaw gateway"to match the actual string emitted byonboard.tsline 7150.github/workflows/nightly-e2e.yaml: Increaseddouble-onboard-e2ejobtimeout-minutesfrom 60 to 90 to accommodate 5 sequential sandbox operationstest/e2e/test-double-onboard.sh: Added per-phaserun_with_timeoutwrapping via the existinge2e-timeout.shhelper (default 1200s, overridable viaNEMOCLAW_E2E_PHASE_TIMEOUT)test/e2e/test-double-onboard.sh: Added elapsed-time logging (phase_start_time/phase_elapsed) at the start and end of each onboard phasetest/e2e/test-double-onboard.sh: Addeddump_diagnostics()that dumpsopenshell status,openshell sandbox list, anddocker pson any phase timeout (exit 124) or failuretest/e2e/test-double-onboard.sh: AddedTODO(#2562)markers for future unified timeout abstractionType of Change
Verification
npx prek run --all-filespasses (shell-relevant hooks: shellcheck, shfmt, yaml, gitleaks all pass; ESLint/vitest/tsc failures are pre-existing missing node_modules in worktree)npm testpassesmake docsbuilds without warnings (doc changes only)AI Disclosure
Signed-off-by: Julie Yaunches jyaunches@nvidia.com
Summary by CodeRabbit