Skip to content

fix(cli): restore pre-#2398 gateway recovery (fixes E2E hangs)#2471

Merged
ericksoa merged 2 commits into
mainfrom
revert/2398-dashboard-refactor
Apr 25, 2026
Merged

fix(cli): restore pre-#2398 gateway recovery (fixes E2E hangs)#2471
ericksoa merged 2 commits into
mainfrom
revert/2398-dashboard-refactor

Conversation

@ericksoa

@ericksoa ericksoa commented Apr 25, 2026

Copy link
Copy Markdown
Contributor

Summary

Root cause

PR #2398 replaced the simple gateway recovery path with recoverDashboardChain() which introduced unbounded calls that hang in CI. The bisect confirmed #2398 as the sole culprit:

Run Commit Result
Last good de97a00d (Apr 24 16:06) pass
Bisect 4 9fbfbaca (#2398 only) hang
First bad 79c8e2a9 (Apr 25 00:10) hang

sandbox-survival, skip-permissions, sandbox-operations, and cloud-e2e all hang at nemoclaw <name> status after #2398.

What this reopens

Test plan

  • npx tsc -p tsconfig.src.json --noEmit passes
  • Dashboard unit tests pass
  • Nightly E2E: sandbox-survival, skip-permissions, sandbox-operations, cloud-e2e pass

Summary by CodeRabbit

  • Bug Fixes
    • Improved gateway health check reliability by using explicit status outputs instead of HTTP code interpretation.
    • Enhanced automatic recovery mechanism for sandbox gateway outages with better restart and port forwarding restoration.
    • Updated diagnostic logging to better reflect gateway restart and connectivity restoration activities.

Reverts the nemoclaw.ts changes from #2398 (dashboard delivery chain
refactor) which introduced hangs in `nemoclaw status` that cause
sandbox-survival, skip-permissions, and sandbox-operations E2E failures.

Restores the original implementations of:
- isSandboxGatewayRunning (probes via curl -sf, not http_code)
- recoverSandboxProcesses (direct SSH recovery, no CORS/download)
- ensureSandboxPortForward (simple forward stop/start)
- checkAndRecoverSandboxProcesses (uses the above, no dashboard chain)

The dashboard-contract, dashboard-health, and dashboard-recover modules
from #2398 are left in place since onboard.ts depends on them. Only the
nemoclaw.ts consumer (status/recovery path) is reverted.

Bisect evidence confirmed #2398 as sole culprit (run 24921496960).
@coderabbitai

coderabbitai Bot commented Apr 25, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

The sandbox gateway health check mechanism is refactored to probe a health endpoint and interpret explicit RUNNING/STOPPED outputs instead of HTTP status codes. A new recovery function restarts the gateway using an agent-provided or fallback recovery script, and the recovery orchestration is updated to skip link-aware dashboard-chain recovery, restart processes, and re-establish port forwarding.

Changes

Cohort / File(s) Summary
Sandbox Gateway Health & Recovery
src/nemoclaw.ts
Refactored isSandboxGatewayRunning to use curl-based probing and explicit state outputs. Added recoverSandboxProcesses function with agent-provided or fallback recovery scripts. Updated checkAndRecoverSandboxProcesses to remove link-aware dashboard recovery, restart processes, re-validate gateway responsiveness, and re-establish port forwarding with updated log messages.

Sequence Diagram(s)

sequenceDiagram
    participant Host as Host Process
    participant Check as checkAndRecover<br/>SandboxProcesses
    participant Health as isSandboxGateway<br/>Running
    participant Recover as recoverSandbox<br/>Processes
    participant Sandbox as Sandbox<br/>Environment
    participant Gateway as Gateway HTTP<br/>Endpoint
    participant Forward as ensureSandbox<br/>PortForward

    Host->>Check: Initiate recovery check
    Check->>Health: Probe gateway health
    Health->>Gateway: curl -sf (health endpoint)
    Gateway-->>Health: Response (RUNNING/STOPPED)
    Health-->>Check: Gateway status
    
    alt Gateway is DOWN
        Check->>Recover: Trigger recovery
        Recover->>Sandbox: Execute recovery script<br/>(agent-provided or fallback)
        Sandbox->>Sandbox: Clean lock/log files
        Sandbox->>Sandbox: Launch: openclaw gateway run
        Recover-->>Check: Recovery complete
        
        Check->>Health: Re-validate gateway
        Health->>Gateway: curl -sf (health endpoint)
        Gateway-->>Health: RUNNING
        Health-->>Check: Gateway restored
        
        Check->>Forward: Re-establish port forward
        Forward->>Host: host → sandbox forward
        Forward-->>Check: Forward active
    else Gateway is RUNNING
        Check-->>Host: No recovery needed
    end
    
    Check->>Host: Emit completion log
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Hop hop, the gateway now stands tall,
With probes that echo through the hall,
Recovery scripts dance and play,
Ports forward in a brand new way,
Health checks true, no false alarms—
The sandbox thrives with open arms! 🌟

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title clearly and specifically identifies the main change: restoring pre-#2398 gateway recovery to fix E2E hangs, which directly matches the PR's core objective.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch revert/2398-dashboard-refactor

Comment @coderabbitai help to get the list of available commands and usage tips.

@ericksoa ericksoa changed the title revert: restore pre-#2398 gateway recovery (fixes E2E hangs) fix(cli): restore pre-#2398 gateway recovery (fixes E2E hangs) Apr 25, 2026
@ericksoa ericksoa merged commit 9c5a42a into main Apr 25, 2026
16 of 17 checks passed
jyaunches added a commit that referenced this pull request Apr 28, 2026
…#2615)

## Summary

Add automated E2E test recommendations to PR reviews and selective job
dispatch to the nightly E2E workflow.

Closes #2564 (Phases 1–3).

## What changed

### 1. CodeRabbit `path_instructions` for E2E recommendations
(`.coderabbit.yaml`)

15 new `path_instructions` entries map sensitive file paths to the
nightly E2E jobs that exercise them. When a PR touches a mapped path,
CodeRabbit posts a review comment recommending specific jobs and a
copy-pasteable `gh workflow run` command.

| Path Pattern | Recommended Jobs |
|-------------|-----------------|
| `scripts/nemoclaw-start.sh`, `scripts/lib/sandbox-init.sh` |
`sandbox-survival-e2e`, `sandbox-operations-e2e`, `cloud-e2e` |
| `Dockerfile`, `Dockerfile.base` | `cloud-e2e`, `sandbox-survival-e2e`,
`hermes-e2e`, `rebuild-openclaw-e2e` |
| `nemoclaw-blueprint/scripts/http-proxy-fix.js` | `cloud-e2e`,
`inference-routing-e2e` |
| `src/lib/onboard.ts` | `cloud-e2e`, `sandbox-operations-e2e`,
`rebuild-openclaw-e2e` |
| `src/nemoclaw.ts` | `sandbox-survival-e2e`, `sandbox-operations-e2e`,
`skip-permissions-e2e` |
| `src/lib/cluster-image-patch.ts`, `src/lib/preflight.ts` |
`overlayfs-autofix-e2e` |
| `src/lib/deploy.ts` | `deployment-services-e2e` |
| `src/lib/sandbox-state.ts` | `snapshot-commands-e2e`,
`rebuild-openclaw-e2e` |
| `src/lib/shields*.ts` | `shields-config-e2e` |
| `agents/hermes/**` | `hermes-e2e`, `rebuild-hermes-e2e` |
| `nemoclaw-blueprint/policies/**` | `network-policy-e2e`,
`skip-permissions-e2e` |
| `.github/workflows/nightly-e2e.yaml` | Reminds to add CodeRabbit
coverage for new jobs |

### 2. Selective job dispatch (`nightly-e2e.yaml`)

Added a `jobs` input to `workflow_dispatch` so maintainers can run a
subset of nightly jobs on any branch:

```
gh workflow run nightly-e2e.yaml --ref <branch> -f jobs=sandbox-survival-e2e,sandbox-operations-e2e
```

- All 18 E2E jobs get a conditional guard: unselected jobs are skipped
- Empty `jobs` input (or scheduled runs) still runs everything
- `notify-on-failure` is unaffected: skipped jobs produce `result:
'skipped'`, not `'failure'`

### 3. Cross-validation test (`test/validate-e2e-coverage.test.ts`)

Keeps the mapping up to date as files and jobs evolve:

| Assertion | What it catches |
|-----------|----------------|
| Job names in CodeRabbit match `nightly-e2e.yaml` | Renamed/removed
jobs |
| Path globs match at least one file on disk | Renamed/deleted source
files |
| Every nightly job has selective dispatch guard | New jobs added
without the `if:` pattern |
| Advisory: nightly jobs with no CodeRabbit coverage | New jobs added
without `path_instructions` |

## Validation

- [x] All 4 cross-validation tests pass locally
- [x] Existing `validate-config-schemas` tests still pass
- [x] Selective dispatch validated: [run
25052625486](https://github.com/NVIDIA/NemoClaw/actions/runs/25052625486)
— triggered with `-f jobs=diagnostics-e2e`, 17/18 jobs correctly skipped
- [x] `notify-on-failure` does not false-alarm on selective run — [run
25052625486](https://github.com/NVIDIA/NemoClaw/actions/runs/25052625486)
confirmed: `notify-on-failure` was skipped (not triggered)
- [ ] CodeRabbit posts recommendations on a PR touching a mapped file
(post-merge validation)

## Context

- Issue: #2564
- Weekend incident: #2471, #2472, #2482, #2490
- E2E strategy: `cloud-experimental-e2e` removal in #2472 left a
coverage gap that would have been flagged by these recommendations


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Chores**
* Expanded review automation to map sensitive paths to targeted nightly
E2E jobs and inject instructions for running relevant subsets.
* Added manual workflow dispatch allowing selective E2E job execution
via a jobs input.

* **New Features**
* Added a reporting step that, on manual runs, posts a PR comment
summarizing passed/failed/skipped E2E jobs.

* **Tests**
* Added a validation suite that cross-checks review-to-workflow mappings
and dispatch guards, warning on uncovered jobs.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

### 4. Substring match fix (`nightly-e2e.yaml`)

CodeRabbit review correctly identified that `contains(inputs.jobs,
'cloud-e2e')` performs substring matching — e.g., passing `jobs=e2e`
would match every job. All 18 job guards now use delimiter-wrapping:

```yaml
contains(format(',{0},', inputs.jobs), ',<job-name>,')
```

This ensures exact token matching within the comma-separated input. The
cross-validation test was updated to enforce the new pattern.
DemianHeyGen pushed a commit to DemianHeyGen/NemoClaw that referenced this pull request Apr 30, 2026
…VIDIA#2471)

## Summary

- Reverts the `nemoclaw.ts` changes from NVIDIA#2398 that caused `nemoclaw
status` to hang indefinitely
- Restores the original `isSandboxGatewayRunning`,
`recoverSandboxProcesses`, `ensureSandboxPortForward`, and
`checkAndRecoverSandboxProcesses` implementations
- Leaves the dashboard-contract/health/recover modules in place
(onboard.ts depends on them)

## Root cause

PR NVIDIA#2398 replaced the simple gateway recovery path with
`recoverDashboardChain()` which introduced unbounded calls that hang in
CI. The bisect confirmed NVIDIA#2398 as the sole culprit:

| Run | Commit | Result |
|-----|--------|--------|
| Last good | `de97a00d` (Apr 24 16:06) | pass |
| Bisect 4 | `9fbfbaca` (NVIDIA#2398 only) | **hang** |
| First bad | `79c8e2a9` (Apr 25 00:10) | hang |

sandbox-survival, skip-permissions, sandbox-operations, and cloud-e2e
all hang at `nemoclaw <name> status` after NVIDIA#2398.

## What this reopens

- NVIDIA#2342 (false "Offline" with device auth) — the old `curl -sf /` probe
returns failure on 401. This needs a targeted fix, not a refactor.
- NVIDIA#2390 — dashboard delivery chain consolidation goal. Can be
re-attempted with proper E2E validation.

## Test plan

- [ ] `npx tsc -p tsconfig.src.json --noEmit` passes
- [ ] Dashboard unit tests pass
- [ ] Nightly E2E: sandbox-survival, skip-permissions,
sandbox-operations, cloud-e2e pass

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Improved gateway health check reliability by using explicit status
outputs instead of HTTP code interpretation.
* Enhanced automatic recovery mechanism for sandbox gateway outages with
better restart and port forwarding restoration.
* Updated diagnostic logging to better reflect gateway restart and
connectivity restoration activities.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
DemianHeyGen pushed a commit to DemianHeyGen/NemoClaw that referenced this pull request Apr 30, 2026
…NVIDIA#2615)

## Summary

Add automated E2E test recommendations to PR reviews and selective job
dispatch to the nightly E2E workflow.

Closes NVIDIA#2564 (Phases 1–3).

## What changed

### 1. CodeRabbit `path_instructions` for E2E recommendations
(`.coderabbit.yaml`)

15 new `path_instructions` entries map sensitive file paths to the
nightly E2E jobs that exercise them. When a PR touches a mapped path,
CodeRabbit posts a review comment recommending specific jobs and a
copy-pasteable `gh workflow run` command.

| Path Pattern | Recommended Jobs |
|-------------|-----------------|
| `scripts/nemoclaw-start.sh`, `scripts/lib/sandbox-init.sh` |
`sandbox-survival-e2e`, `sandbox-operations-e2e`, `cloud-e2e` |
| `Dockerfile`, `Dockerfile.base` | `cloud-e2e`, `sandbox-survival-e2e`,
`hermes-e2e`, `rebuild-openclaw-e2e` |
| `nemoclaw-blueprint/scripts/http-proxy-fix.js` | `cloud-e2e`,
`inference-routing-e2e` |
| `src/lib/onboard.ts` | `cloud-e2e`, `sandbox-operations-e2e`,
`rebuild-openclaw-e2e` |
| `src/nemoclaw.ts` | `sandbox-survival-e2e`, `sandbox-operations-e2e`,
`skip-permissions-e2e` |
| `src/lib/cluster-image-patch.ts`, `src/lib/preflight.ts` |
`overlayfs-autofix-e2e` |
| `src/lib/deploy.ts` | `deployment-services-e2e` |
| `src/lib/sandbox-state.ts` | `snapshot-commands-e2e`,
`rebuild-openclaw-e2e` |
| `src/lib/shields*.ts` | `shields-config-e2e` |
| `agents/hermes/**` | `hermes-e2e`, `rebuild-hermes-e2e` |
| `nemoclaw-blueprint/policies/**` | `network-policy-e2e`,
`skip-permissions-e2e` |
| `.github/workflows/nightly-e2e.yaml` | Reminds to add CodeRabbit
coverage for new jobs |

### 2. Selective job dispatch (`nightly-e2e.yaml`)

Added a `jobs` input to `workflow_dispatch` so maintainers can run a
subset of nightly jobs on any branch:

```
gh workflow run nightly-e2e.yaml --ref <branch> -f jobs=sandbox-survival-e2e,sandbox-operations-e2e
```

- All 18 E2E jobs get a conditional guard: unselected jobs are skipped
- Empty `jobs` input (or scheduled runs) still runs everything
- `notify-on-failure` is unaffected: skipped jobs produce `result:
'skipped'`, not `'failure'`

### 3. Cross-validation test (`test/validate-e2e-coverage.test.ts`)

Keeps the mapping up to date as files and jobs evolve:

| Assertion | What it catches |
|-----------|----------------|
| Job names in CodeRabbit match `nightly-e2e.yaml` | Renamed/removed
jobs |
| Path globs match at least one file on disk | Renamed/deleted source
files |
| Every nightly job has selective dispatch guard | New jobs added
without the `if:` pattern |
| Advisory: nightly jobs with no CodeRabbit coverage | New jobs added
without `path_instructions` |

## Validation

- [x] All 4 cross-validation tests pass locally
- [x] Existing `validate-config-schemas` tests still pass
- [x] Selective dispatch validated: [run
25052625486](https://github.com/NVIDIA/NemoClaw/actions/runs/25052625486)
— triggered with `-f jobs=diagnostics-e2e`, 17/18 jobs correctly skipped
- [x] `notify-on-failure` does not false-alarm on selective run — [run
25052625486](https://github.com/NVIDIA/NemoClaw/actions/runs/25052625486)
confirmed: `notify-on-failure` was skipped (not triggered)
- [ ] CodeRabbit posts recommendations on a PR touching a mapped file
(post-merge validation)

## Context

- Issue: NVIDIA#2564
- Weekend incident: NVIDIA#2471, NVIDIA#2472, NVIDIA#2482, NVIDIA#2490
- E2E strategy: `cloud-experimental-e2e` removal in NVIDIA#2472 left a
coverage gap that would have been flagged by these recommendations


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Chores**
* Expanded review automation to map sensitive paths to targeted nightly
E2E jobs and inject instructions for running relevant subsets.
* Added manual workflow dispatch allowing selective E2E job execution
via a jobs input.

* **New Features**
* Added a reporting step that, on manual runs, posts a PR comment
summarizing passed/failed/skipped E2E jobs.

* **Tests**
* Added a validation suite that cross-checks review-to-workflow mappings
and dispatch guards, warning on uncovered jobs.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

### 4. Substring match fix (`nightly-e2e.yaml`)

CodeRabbit review correctly identified that `contains(inputs.jobs,
'cloud-e2e')` performs substring matching — e.g., passing `jobs=e2e`
would match every job. All 18 job guards now use delimiter-wrapping:

```yaml
contains(format(',{0},', inputs.jobs), ',<job-name>,')
```

This ensures exact token matching within the comma-separated input. The
cross-validation test was updated to enforce the new pattern.
ksapru pushed a commit to ksapru/NemoClaw that referenced this pull request May 12, 2026
…VIDIA#2471)

## Summary

- Reverts the `nemoclaw.ts` changes from NVIDIA#2398 that caused `nemoclaw
status` to hang indefinitely
- Restores the original `isSandboxGatewayRunning`,
`recoverSandboxProcesses`, `ensureSandboxPortForward`, and
`checkAndRecoverSandboxProcesses` implementations
- Leaves the dashboard-contract/health/recover modules in place
(onboard.ts depends on them)

## Root cause

PR NVIDIA#2398 replaced the simple gateway recovery path with
`recoverDashboardChain()` which introduced unbounded calls that hang in
CI. The bisect confirmed NVIDIA#2398 as the sole culprit:

| Run | Commit | Result |
|-----|--------|--------|
| Last good | `de97a00d` (Apr 24 16:06) | pass |
| Bisect 4 | `9fbfbaca` (NVIDIA#2398 only) | **hang** |
| First bad | `79c8e2a9` (Apr 25 00:10) | hang |

sandbox-survival, skip-permissions, sandbox-operations, and cloud-e2e
all hang at `nemoclaw <name> status` after NVIDIA#2398.

## What this reopens

- NVIDIA#2342 (false "Offline" with device auth) — the old `curl -sf /` probe
returns failure on 401. This needs a targeted fix, not a refactor.
- NVIDIA#2390 — dashboard delivery chain consolidation goal. Can be
re-attempted with proper E2E validation.

## Test plan

- [ ] `npx tsc -p tsconfig.src.json --noEmit` passes
- [ ] Dashboard unit tests pass
- [ ] Nightly E2E: sandbox-survival, skip-permissions,
sandbox-operations, cloud-e2e pass

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Improved gateway health check reliability by using explicit status
outputs instead of HTTP code interpretation.
* Enhanced automatic recovery mechanism for sandbox gateway outages with
better restart and port forwarding restoration.
* Updated diagnostic logging to better reflect gateway restart and
connectivity restoration activities.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@wscurran wscurran added the bug-fix PR fixes a bug or regression label Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug-fix PR fixes a bug or regression

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants