fix(cli): restore pre-#2398 gateway recovery (fixes E2E hangs)#2471
Conversation
Reverts the nemoclaw.ts changes from #2398 (dashboard delivery chain refactor) which introduced hangs in `nemoclaw status` that cause sandbox-survival, skip-permissions, and sandbox-operations E2E failures. Restores the original implementations of: - isSandboxGatewayRunning (probes via curl -sf, not http_code) - recoverSandboxProcesses (direct SSH recovery, no CORS/download) - ensureSandboxPortForward (simple forward stop/start) - checkAndRecoverSandboxProcesses (uses the above, no dashboard chain) The dashboard-contract, dashboard-health, and dashboard-recover modules from #2398 are left in place since onboard.ts depends on them. Only the nemoclaw.ts consumer (status/recovery path) is reverted. Bisect evidence confirmed #2398 as sole culprit (run 24921496960).
📝 WalkthroughWalkthroughThe sandbox gateway health check mechanism is refactored to probe a health endpoint and interpret explicit Changes
Sequence Diagram(s)sequenceDiagram
participant Host as Host Process
participant Check as checkAndRecover<br/>SandboxProcesses
participant Health as isSandboxGateway<br/>Running
participant Recover as recoverSandbox<br/>Processes
participant Sandbox as Sandbox<br/>Environment
participant Gateway as Gateway HTTP<br/>Endpoint
participant Forward as ensureSandbox<br/>PortForward
Host->>Check: Initiate recovery check
Check->>Health: Probe gateway health
Health->>Gateway: curl -sf (health endpoint)
Gateway-->>Health: Response (RUNNING/STOPPED)
Health-->>Check: Gateway status
alt Gateway is DOWN
Check->>Recover: Trigger recovery
Recover->>Sandbox: Execute recovery script<br/>(agent-provided or fallback)
Sandbox->>Sandbox: Clean lock/log files
Sandbox->>Sandbox: Launch: openclaw gateway run
Recover-->>Check: Recovery complete
Check->>Health: Re-validate gateway
Health->>Gateway: curl -sf (health endpoint)
Gateway-->>Health: RUNNING
Health-->>Check: Gateway restored
Check->>Forward: Re-establish port forward
Forward->>Host: host → sandbox forward
Forward-->>Check: Forward active
else Gateway is RUNNING
Check-->>Host: No recovery needed
end
Check->>Host: Emit completion log
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
…#2615) ## Summary Add automated E2E test recommendations to PR reviews and selective job dispatch to the nightly E2E workflow. Closes #2564 (Phases 1–3). ## What changed ### 1. CodeRabbit `path_instructions` for E2E recommendations (`.coderabbit.yaml`) 15 new `path_instructions` entries map sensitive file paths to the nightly E2E jobs that exercise them. When a PR touches a mapped path, CodeRabbit posts a review comment recommending specific jobs and a copy-pasteable `gh workflow run` command. | Path Pattern | Recommended Jobs | |-------------|-----------------| | `scripts/nemoclaw-start.sh`, `scripts/lib/sandbox-init.sh` | `sandbox-survival-e2e`, `sandbox-operations-e2e`, `cloud-e2e` | | `Dockerfile`, `Dockerfile.base` | `cloud-e2e`, `sandbox-survival-e2e`, `hermes-e2e`, `rebuild-openclaw-e2e` | | `nemoclaw-blueprint/scripts/http-proxy-fix.js` | `cloud-e2e`, `inference-routing-e2e` | | `src/lib/onboard.ts` | `cloud-e2e`, `sandbox-operations-e2e`, `rebuild-openclaw-e2e` | | `src/nemoclaw.ts` | `sandbox-survival-e2e`, `sandbox-operations-e2e`, `skip-permissions-e2e` | | `src/lib/cluster-image-patch.ts`, `src/lib/preflight.ts` | `overlayfs-autofix-e2e` | | `src/lib/deploy.ts` | `deployment-services-e2e` | | `src/lib/sandbox-state.ts` | `snapshot-commands-e2e`, `rebuild-openclaw-e2e` | | `src/lib/shields*.ts` | `shields-config-e2e` | | `agents/hermes/**` | `hermes-e2e`, `rebuild-hermes-e2e` | | `nemoclaw-blueprint/policies/**` | `network-policy-e2e`, `skip-permissions-e2e` | | `.github/workflows/nightly-e2e.yaml` | Reminds to add CodeRabbit coverage for new jobs | ### 2. Selective job dispatch (`nightly-e2e.yaml`) Added a `jobs` input to `workflow_dispatch` so maintainers can run a subset of nightly jobs on any branch: ``` gh workflow run nightly-e2e.yaml --ref <branch> -f jobs=sandbox-survival-e2e,sandbox-operations-e2e ``` - All 18 E2E jobs get a conditional guard: unselected jobs are skipped - Empty `jobs` input (or scheduled runs) still runs everything - `notify-on-failure` is unaffected: skipped jobs produce `result: 'skipped'`, not `'failure'` ### 3. Cross-validation test (`test/validate-e2e-coverage.test.ts`) Keeps the mapping up to date as files and jobs evolve: | Assertion | What it catches | |-----------|----------------| | Job names in CodeRabbit match `nightly-e2e.yaml` | Renamed/removed jobs | | Path globs match at least one file on disk | Renamed/deleted source files | | Every nightly job has selective dispatch guard | New jobs added without the `if:` pattern | | Advisory: nightly jobs with no CodeRabbit coverage | New jobs added without `path_instructions` | ## Validation - [x] All 4 cross-validation tests pass locally - [x] Existing `validate-config-schemas` tests still pass - [x] Selective dispatch validated: [run 25052625486](https://github.com/NVIDIA/NemoClaw/actions/runs/25052625486) — triggered with `-f jobs=diagnostics-e2e`, 17/18 jobs correctly skipped - [x] `notify-on-failure` does not false-alarm on selective run — [run 25052625486](https://github.com/NVIDIA/NemoClaw/actions/runs/25052625486) confirmed: `notify-on-failure` was skipped (not triggered) - [ ] CodeRabbit posts recommendations on a PR touching a mapped file (post-merge validation) ## Context - Issue: #2564 - Weekend incident: #2471, #2472, #2482, #2490 - E2E strategy: `cloud-experimental-e2e` removal in #2472 left a coverage gap that would have been flagged by these recommendations <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Chores** * Expanded review automation to map sensitive paths to targeted nightly E2E jobs and inject instructions for running relevant subsets. * Added manual workflow dispatch allowing selective E2E job execution via a jobs input. * **New Features** * Added a reporting step that, on manual runs, posts a PR comment summarizing passed/failed/skipped E2E jobs. * **Tests** * Added a validation suite that cross-checks review-to-workflow mappings and dispatch guards, warning on uncovered jobs. <!-- end of auto-generated comment: release notes by coderabbit.ai --> ### 4. Substring match fix (`nightly-e2e.yaml`) CodeRabbit review correctly identified that `contains(inputs.jobs, 'cloud-e2e')` performs substring matching — e.g., passing `jobs=e2e` would match every job. All 18 job guards now use delimiter-wrapping: ```yaml contains(format(',{0},', inputs.jobs), ',<job-name>,') ``` This ensures exact token matching within the comma-separated input. The cross-validation test was updated to enforce the new pattern.
…VIDIA#2471) ## Summary - Reverts the `nemoclaw.ts` changes from NVIDIA#2398 that caused `nemoclaw status` to hang indefinitely - Restores the original `isSandboxGatewayRunning`, `recoverSandboxProcesses`, `ensureSandboxPortForward`, and `checkAndRecoverSandboxProcesses` implementations - Leaves the dashboard-contract/health/recover modules in place (onboard.ts depends on them) ## Root cause PR NVIDIA#2398 replaced the simple gateway recovery path with `recoverDashboardChain()` which introduced unbounded calls that hang in CI. The bisect confirmed NVIDIA#2398 as the sole culprit: | Run | Commit | Result | |-----|--------|--------| | Last good | `de97a00d` (Apr 24 16:06) | pass | | Bisect 4 | `9fbfbaca` (NVIDIA#2398 only) | **hang** | | First bad | `79c8e2a9` (Apr 25 00:10) | hang | sandbox-survival, skip-permissions, sandbox-operations, and cloud-e2e all hang at `nemoclaw <name> status` after NVIDIA#2398. ## What this reopens - NVIDIA#2342 (false "Offline" with device auth) — the old `curl -sf /` probe returns failure on 401. This needs a targeted fix, not a refactor. - NVIDIA#2390 — dashboard delivery chain consolidation goal. Can be re-attempted with proper E2E validation. ## Test plan - [ ] `npx tsc -p tsconfig.src.json --noEmit` passes - [ ] Dashboard unit tests pass - [ ] Nightly E2E: sandbox-survival, skip-permissions, sandbox-operations, cloud-e2e pass <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Improved gateway health check reliability by using explicit status outputs instead of HTTP code interpretation. * Enhanced automatic recovery mechanism for sandbox gateway outages with better restart and port forwarding restoration. * Updated diagnostic logging to better reflect gateway restart and connectivity restoration activities. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
…NVIDIA#2615) ## Summary Add automated E2E test recommendations to PR reviews and selective job dispatch to the nightly E2E workflow. Closes NVIDIA#2564 (Phases 1–3). ## What changed ### 1. CodeRabbit `path_instructions` for E2E recommendations (`.coderabbit.yaml`) 15 new `path_instructions` entries map sensitive file paths to the nightly E2E jobs that exercise them. When a PR touches a mapped path, CodeRabbit posts a review comment recommending specific jobs and a copy-pasteable `gh workflow run` command. | Path Pattern | Recommended Jobs | |-------------|-----------------| | `scripts/nemoclaw-start.sh`, `scripts/lib/sandbox-init.sh` | `sandbox-survival-e2e`, `sandbox-operations-e2e`, `cloud-e2e` | | `Dockerfile`, `Dockerfile.base` | `cloud-e2e`, `sandbox-survival-e2e`, `hermes-e2e`, `rebuild-openclaw-e2e` | | `nemoclaw-blueprint/scripts/http-proxy-fix.js` | `cloud-e2e`, `inference-routing-e2e` | | `src/lib/onboard.ts` | `cloud-e2e`, `sandbox-operations-e2e`, `rebuild-openclaw-e2e` | | `src/nemoclaw.ts` | `sandbox-survival-e2e`, `sandbox-operations-e2e`, `skip-permissions-e2e` | | `src/lib/cluster-image-patch.ts`, `src/lib/preflight.ts` | `overlayfs-autofix-e2e` | | `src/lib/deploy.ts` | `deployment-services-e2e` | | `src/lib/sandbox-state.ts` | `snapshot-commands-e2e`, `rebuild-openclaw-e2e` | | `src/lib/shields*.ts` | `shields-config-e2e` | | `agents/hermes/**` | `hermes-e2e`, `rebuild-hermes-e2e` | | `nemoclaw-blueprint/policies/**` | `network-policy-e2e`, `skip-permissions-e2e` | | `.github/workflows/nightly-e2e.yaml` | Reminds to add CodeRabbit coverage for new jobs | ### 2. Selective job dispatch (`nightly-e2e.yaml`) Added a `jobs` input to `workflow_dispatch` so maintainers can run a subset of nightly jobs on any branch: ``` gh workflow run nightly-e2e.yaml --ref <branch> -f jobs=sandbox-survival-e2e,sandbox-operations-e2e ``` - All 18 E2E jobs get a conditional guard: unselected jobs are skipped - Empty `jobs` input (or scheduled runs) still runs everything - `notify-on-failure` is unaffected: skipped jobs produce `result: 'skipped'`, not `'failure'` ### 3. Cross-validation test (`test/validate-e2e-coverage.test.ts`) Keeps the mapping up to date as files and jobs evolve: | Assertion | What it catches | |-----------|----------------| | Job names in CodeRabbit match `nightly-e2e.yaml` | Renamed/removed jobs | | Path globs match at least one file on disk | Renamed/deleted source files | | Every nightly job has selective dispatch guard | New jobs added without the `if:` pattern | | Advisory: nightly jobs with no CodeRabbit coverage | New jobs added without `path_instructions` | ## Validation - [x] All 4 cross-validation tests pass locally - [x] Existing `validate-config-schemas` tests still pass - [x] Selective dispatch validated: [run 25052625486](https://github.com/NVIDIA/NemoClaw/actions/runs/25052625486) — triggered with `-f jobs=diagnostics-e2e`, 17/18 jobs correctly skipped - [x] `notify-on-failure` does not false-alarm on selective run — [run 25052625486](https://github.com/NVIDIA/NemoClaw/actions/runs/25052625486) confirmed: `notify-on-failure` was skipped (not triggered) - [ ] CodeRabbit posts recommendations on a PR touching a mapped file (post-merge validation) ## Context - Issue: NVIDIA#2564 - Weekend incident: NVIDIA#2471, NVIDIA#2472, NVIDIA#2482, NVIDIA#2490 - E2E strategy: `cloud-experimental-e2e` removal in NVIDIA#2472 left a coverage gap that would have been flagged by these recommendations <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Chores** * Expanded review automation to map sensitive paths to targeted nightly E2E jobs and inject instructions for running relevant subsets. * Added manual workflow dispatch allowing selective E2E job execution via a jobs input. * **New Features** * Added a reporting step that, on manual runs, posts a PR comment summarizing passed/failed/skipped E2E jobs. * **Tests** * Added a validation suite that cross-checks review-to-workflow mappings and dispatch guards, warning on uncovered jobs. <!-- end of auto-generated comment: release notes by coderabbit.ai --> ### 4. Substring match fix (`nightly-e2e.yaml`) CodeRabbit review correctly identified that `contains(inputs.jobs, 'cloud-e2e')` performs substring matching — e.g., passing `jobs=e2e` would match every job. All 18 job guards now use delimiter-wrapping: ```yaml contains(format(',{0},', inputs.jobs), ',<job-name>,') ``` This ensures exact token matching within the comma-separated input. The cross-validation test was updated to enforce the new pattern.
…VIDIA#2471) ## Summary - Reverts the `nemoclaw.ts` changes from NVIDIA#2398 that caused `nemoclaw status` to hang indefinitely - Restores the original `isSandboxGatewayRunning`, `recoverSandboxProcesses`, `ensureSandboxPortForward`, and `checkAndRecoverSandboxProcesses` implementations - Leaves the dashboard-contract/health/recover modules in place (onboard.ts depends on them) ## Root cause PR NVIDIA#2398 replaced the simple gateway recovery path with `recoverDashboardChain()` which introduced unbounded calls that hang in CI. The bisect confirmed NVIDIA#2398 as the sole culprit: | Run | Commit | Result | |-----|--------|--------| | Last good | `de97a00d` (Apr 24 16:06) | pass | | Bisect 4 | `9fbfbaca` (NVIDIA#2398 only) | **hang** | | First bad | `79c8e2a9` (Apr 25 00:10) | hang | sandbox-survival, skip-permissions, sandbox-operations, and cloud-e2e all hang at `nemoclaw <name> status` after NVIDIA#2398. ## What this reopens - NVIDIA#2342 (false "Offline" with device auth) — the old `curl -sf /` probe returns failure on 401. This needs a targeted fix, not a refactor. - NVIDIA#2390 — dashboard delivery chain consolidation goal. Can be re-attempted with proper E2E validation. ## Test plan - [ ] `npx tsc -p tsconfig.src.json --noEmit` passes - [ ] Dashboard unit tests pass - [ ] Nightly E2E: sandbox-survival, skip-permissions, sandbox-operations, cloud-e2e pass <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Improved gateway health check reliability by using explicit status outputs instead of HTTP code interpretation. * Enhanced automatic recovery mechanism for sandbox gateway outages with better restart and port forwarding restoration. * Updated diagnostic logging to better reflect gateway restart and connectivity restoration activities. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
nemoclaw.tschanges from refactor(cli): extract dashboard delivery chain into contract/health/recover modules #2398 that causednemoclaw statusto hang indefinitelyisSandboxGatewayRunning,recoverSandboxProcesses,ensureSandboxPortForward, andcheckAndRecoverSandboxProcessesimplementationsRoot cause
PR #2398 replaced the simple gateway recovery path with
recoverDashboardChain()which introduced unbounded calls that hang in CI. The bisect confirmed #2398 as the sole culprit:de97a00d(Apr 24 16:06)9fbfbaca(#2398 only)79c8e2a9(Apr 25 00:10)sandbox-survival, skip-permissions, sandbox-operations, and cloud-e2e all hang at
nemoclaw <name> statusafter #2398.What this reopens
curl -sf /probe returns failure on 401. This needs a targeted fix, not a refactor.Test plan
npx tsc -p tsconfig.src.json --noEmitpassesSummary by CodeRabbit