fix(sandbox): recover stale sandbox on rebuild instead of dead-ending (#4497) by yimoj · Pull Request #5034 · NVIDIA/NemoClaw

yimoj · 2026-06-09T07:15:30Z

Summary

rebuild --yes no longer dead-ends on a stale sandbox. When a sandbox is registered locally but absent from a healthy OpenShell gateway (stuck/diverged provision, container reaped), rebuild now treats it as a recovery: it skips the impossible backup and recreates from the preserved registry + onboard-session metadata, instead of aborting with "Cannot back up state."

Related Issue

Fixes #4497

Changes

rebuild: when the sandbox is missing from the active gateway's sandbox list, confirm absence authoritatively against the named nemoclaw gateway (getReconciledSandboxGatewayState runs sandbox get). Only enter the destructive stale-recovery path once it is genuinely gone on a healthy named gateway.
Stale recovery skips backup/restore and recreates from registry/session metadata; built-in policy presets are re-applied from sb.policies (the same source the backup manifest uses).
Multi-gateway data-loss guards: never recreate-from-scratch when (a) a foreign gateway is active (a present/list result there is not trusted), or (b) the sandbox is recorded on a non-default per-port gateway (fix(onboard): bind gateway name, state dir, and marker per gateway port (#4422) #4645) — the operator is pointed at openshell gateway select instead.
Crash-safety: on recreate failure, restore the captured registry entry verbatim (target-only, under the registry lock — no clobbering concurrent changes) and reclaim the default-sandbox pointer, so rebuild/connect stay retryable.
Shields: reset the gone sandbox's stale lock seal only after a successful recreate (a failed recreate keeps the lockdown record for retry), and prompt the operator to re-apply shields up if it had been locked.
New registry.restoreSandboxEntry and shields.clearShieldsState helpers.

Type of Change

Code change (feature, bug fix, or refactor)

Verification

npm test passes (only the pre-existing test/ssrf-parity.test.ts fails, identically on clean upstream/main — unrelated to this change)
Tests added or updated for new or changed behavior
No secrets, API keys, or credentials committed
npx biome check clean on changed files; shfmt/shellcheck clean on the E2E script; tsc type-check passes

Local end-to-end proof (exact reporter workflow on a real OpenShell gateway)

Drove status -> connect -> rebuild --yes against a real healthy gateway with a registered-but-not-live sandbox:

⚠ Sandbox 'repro4497stale' is registered locally but absent from the live OpenShell gateway.
No live workspace state to back up — recreating from the preserved registry metadata.
Deleting old sandbox...
✓ Old sandbox deleted
Creating new sandbox with current image...

No "Cannot back up state", no "Backing up sandbox state" — rebuild crossed the gate that previously dead-ended and proceeded into the recreate (onboard --resume). The user's other registry entries were untouched.

Regression + pipeline coverage

test/rebuild-stale-recovery.test.ts — recover, live-control, foreign-gateway guard, non-default per-port guard, failed-recreate registry rollback (incl. defaultSandbox).
test/gateway-state-reconcile-2276.test.ts Scenario 14 and rebuild-gateway-drift.test.ts updated to assert recovery rather than the dead-end.
test/e2e/test-double-onboard.sh Phase 5 strengthened so the exact reporter workflow must recreate a live sandbox (the prior probe only checked for "does not exist" and would have passed against this bug).

Known limitation (pre-existing, shared with normal rebuild)

Custom policy-add --from-file/--from-dir egress rules (customPolicies) are not re-applied to the recreated sandbox — this is identical to a normal rebuild's recreate path (the registry entry is removed before onboard --resume runs) and is out of scope for this dead-end fix.

Signed-off-by: Yimo Jiang yimoj@nvidia.com

🤖 Generated with Claude Code

Summary by CodeRabbit

Bug Fixes
- Enhanced sandbox recovery when a sandbox exists locally but is missing from the live gateway, preventing unnecessary destructive operations and enabling a “stale-sandbox” recovery path.
- Improved rebuild behavior to preserve and restore local registry metadata so retrying recovery no longer fails.
Improvements
- Updated post-recreate and error messaging to include clearer context and preserved backup paths; retry guidance is now configurable.
- Shield handling adjusted to avoid unsafe unlock/relock behavior during stale recovery.
New Features
- Added APIs to restore a removed registry entry and to clear persisted shields state.
Tests
- Added and expanded end-to-end and unit tests covering stale-sandbox recovery scenarios and regressions.

coderabbitai · 2026-06-09T07:15:44Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a3d07eec-0945-48f0-95e0-db151115e22e

📥 Commits

Reviewing files that changed from the base of the PR and between d88d2cb and fc48531.

📒 Files selected for processing (3)

src/lib/actions/sandbox/gateway-state.ts
src/lib/actions/sandbox/rebuild-gateway-drift.test.ts
src/lib/actions/sandbox/rebuild.ts

🚧 Files skipped from review as they are similar to previous changes (3)

src/lib/actions/sandbox/gateway-state.ts
src/lib/actions/sandbox/rebuild-gateway-drift.test.ts
src/lib/actions/sandbox/rebuild.ts

📝 Walkthrough

Walkthrough

Adds stale-sandbox recovery to rebuild: when a sandbox exists in the local registry but is missing from the live OpenShell gateway, rebuild snapshots registry metadata, skips backup (no live workspace), recreates from preserved metadata, restores the registry entry on recreate failure, and clears stale shields on success.

Changes

Stale-Sandbox Recovery Implementation

Layer / File(s)	Summary
Supporting contracts and helpers `src/lib/actions/sandbox/gateway-state.ts`, `src/lib/shields/index.ts`, `src/lib/state/registry.ts`, `src/lib/actions/sandbox/rebuild.ts` (imports)	Adds `retryCommand` parameter to wrong-gateway guidance, introduces `clearShieldsState`, adds `restoreSandboxEntry`, and wires new imports into rebuild.
Stale recovery detection and gateway reconciliation `src/lib/actions/sandbox/rebuild.ts`	Performs authoritative named-gateway lifecycle reconciliation when sandbox is absent from live list, distinguishes missing vs ambiguous cases, and snapshots the registry entry before destructive actions.
Stale-aware rebuild execution path `src/lib/actions/sandbox/rebuild.ts`	Skips normal shields unlock/relock window and backup when stale, records prior lock state, recreates from preserved metadata, restores preserved registry entry on recreate failure, conditions restore steps on backup existence, clears stale shields on success, and updates messaging for stale-recovery.
Conditional restore flow and messaging `src/lib/actions/sandbox/rebuild.ts`	Make workspace/policy restore conditional on backup manifest, fallback to registry presets when manifest absent, and adjust success/incomplete messaging to reflect stale recovery state.

Test Coverage for Stale-Sandbox Recovery

Layer / File(s)	Summary
Rebuild gateway drift test update `src/lib/actions/sandbox/rebuild-gateway-drift.test.ts`	Replaced named-gateway recovery test to assert transition into stale-sandbox recovery, stubbed sandbox list retry and recreate failure, and assert retry count and recovery rejection.
Gateway state reconciliation test update `test/gateway-state-reconcile-2276.test.ts`	Expanded Scenario 14 to assert that `rebuild --yes` finds preserved registry entry, skips backup when absent from live gateway, and proceeds to recreate from preserved metadata.
End-to-end acceptance test `test/e2e/test-double-onboard.sh`	Replaced Phase 5 with `#4497` acceptance gate: run `rebuild --yes` on stale sandbox, assert no timeout/dead-end, verify skip-backup messaging, recreate progression, live-state recovery, and registry persistence followed by successful `destroy --yes` cleanup.
Comprehensive regression test `test/rebuild-stale-recovery.test.ts`	New Vitest suite creating a temp HOME fixture and stubbed `openshell`/`docker` to simulate stale vs live states; validates stale recovery skips backup and recreates, live case backs up normally, multi-gateway safety, and registry restoration when recreate fails.

Sequence Diagram (high-level stale-recovery flow):

sequenceDiagram
  participant Operator
  participant CLI
  participant rebuildSandbox
  participant Registry
  participant OpenShell
  participant Shields

  Operator->>CLI: nemoclaw <sandbox> rebuild --yes
  CLI->>rebuildSandbox: start
  rebuildSandbox->>OpenShell: sandbox list / status
  OpenShell-->>rebuildSandbox: missing
  rebuildSandbox->>Registry: snapshot entry (preserve)
  rebuildSandbox->>Shields: record locked state (avoid unlock window)
  rebuildSandbox->>rebuildSandbox: skip backup (no live state)
  rebuildSandbox->>OpenShell: recreate sandbox (onboard)
  OpenShell-->>rebuildSandbox: recreate success/failure
  alt failure
    rebuildSandbox->>Registry: restore preserved entry
    rebuildSandbox->>CLI: report recreate failure
  else success
    rebuildSandbox->>Shields: clear stale shields state
    rebuildSandbox->>CLI: report success (stale-recovery)
  end

🎯 4 (Complex) | ⏱️ ~45 minutes

🐰 I found a sandbox missing, though still in our file,
I saved its entry first and kept the plan agile.
Skip the impossible backup, recreate from the note,
If recreate fails, restore the saved quote.
Now rebuild retries bloom — recovery after a while.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'fix(sandbox): recover stale sandbox on rebuild instead of dead-ending (`#4497`)' is clear and directly addresses the main change: enabling recovery of stale sandboxes during rebuild instead of failing with an error.
Linked Issues check	✅ Passed	The PR implements the core requirements from issue `#4497`: preserving the local registry entry during stale recovery, enabling rebuild to proceed without 'Cannot back up state' failure, and allowing recreate from preserved metadata when the sandbox is absent from the gateway.
Out of Scope Changes check	✅ Passed	All code changes are directly aligned with the stale-sandbox recovery objective: modifications to rebuild logic, new registry/shields helper functions, and comprehensive test coverage for the stale-recovery scenario. No unrelated changes detected.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

src/lib/actions/sandbox/rebuild.ts (1)
494-605: Run the targeted nightly E2E suites for this recovery path before merge.

As per coding guidelines, changes in src/lib/actions/sandbox/rebuild.ts and related shields behavior should be validated with channels-stop-start-e2e, vm-driver-privileged-exec-routing-e2e, and shields-config-e2e.

Also applies to: 996-1002, 1226-1233
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/actions/sandbox/rebuild.ts` around lines 494 - 605, Run the targeted
nightly E2E suites that validate the stale-sandbox recovery and shields behavior
before merging: execute channels-stop-start-e2e,
vm-driver-privileged-exec-routing-e2e, and shields-config-e2e and confirm they
pass for the changes touching the rebuild flow (notably the stale-sandbox path
that sets staleRecovery and staleRegistrySnapshot and the logic around
getReconciledSandboxGatewayState / rebuild --yes); if any failures appear,
capture logs and fix the failing scenarios in rebuild.ts and related shield
handling before merging.
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/lib/actions/sandbox/rebuild.ts`:
- Around line 494-605: Run the targeted nightly E2E suites that validate the
stale-sandbox recovery and shields behavior before merging: execute
channels-stop-start-e2e, vm-driver-privileged-exec-routing-e2e, and
shields-config-e2e and confirm they pass for the changes touching the rebuild
flow (notably the stale-sandbox path that sets staleRecovery and
staleRegistrySnapshot and the logic around getReconciledSandboxGatewayState /
rebuild --yes); if any failures appear, capture logs and fix the failing
scenarios in rebuild.ts and related shield handling before merging.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fc7cba39-5453-4445-8bae-29d537e10cac

📥 Commits

Reviewing files that changed from the base of the PR and between 38c03b1 and d88d2cb.

📒 Files selected for processing (8)

src/lib/actions/sandbox/gateway-state.ts
src/lib/actions/sandbox/rebuild-gateway-drift.test.ts
src/lib/actions/sandbox/rebuild.ts
src/lib/shields/index.ts
src/lib/state/registry.ts
test/e2e/test-double-onboard.sh
test/gateway-state-reconcile-2276.test.ts
test/rebuild-stale-recovery.test.ts

yimoj · 2026-06-09T10:32:08Z

CodeRabbit E2E suite validation (channels-stop-start-e2e, vm-driver-privileged-exec-routing-e2e, shields-config-e2e)

CodeRabbit's review recommends running three nightly E2E suites for the rebuild.ts/shields changes. Status of each on this PR head (d88d2cbc):

Suite	Result	Notes
`vm-driver-privileged-exec-routing-e2e`	✅ PASS (local)	Hermetic host-side check (fake docker, no live gateway). `bash test/e2e/test-vm-driver-privileged-exec-routing.sh` → `PASS: VM and Docker privileged exec routing uses direct sandbox containers`.
`shields-config-e2e`	⚠️ Blocked (env)	Full live onboard required. Built CLI from this worktree, opened the OpenShell docker-bridge firewall, sourced a real key — onboard reached `[4/8] Setting up inference provider` then hit `transport error … Connection refused (os error 111)`. Gateway log shows `Shutdown signal received; stopping gateway` — this host's OpenShell gateway is shared across concurrent sessions and the lifecycle tore it down mid-onboard (it killed 3 unrelated sandboxes, `Exited 137`). Not reproducible here without disrupting parallel work.
`channels-stop-start-e2e`	⚠️ Not run (env)	Same shared-gateway dependency, plus a 2h budget building both agents × every channel. Same blocker.

Why I couldn't dispatch the nightly workflow myself: nightly-e2e.yaml has the isolated environment and all three jobs, but it is gated to NVIDIA/NemoClaw and my account has pull+triage only (no push/actions:write; token lacks workflow scope), so gh workflow run is not available to me. The e2e-advisor auto-dispatch / a maintainer owns that dispatch.

The #4497 reporter workflow itself is already validated in this PR:

Strengthened test/e2e/test-double-onboard.sh Phase 5 asserts status → connect → rebuild --yes recreates a live sandbox.
Real-gateway local repro (in the PR description): rebuild on a stale entry printed the recovery markers and proceeded into recreate with zero Cannot back up state.
cli-test-shards / plugin-tests / unit-vitest all green.

Ask for the maintainer at merge: please dispatch the three nightly suites on this head (or via the e2e-advisor), e.g.
gh workflow run nightly-e2e.yaml --repo NVIDIA/NemoClaw --ref main -f jobs=channels-stop-start-e2e,vm-driver-privileged-exec-routing-e2e,shields-config-e2e,rebuild-openclaw-e2e -f target_ref=d88d2cbc5471e587061748d7253adbffda8f6173 -f pr_number=5034

Signed-off-by: Yimo Jiang yimoj@nvidia.com

wscurran · 2026-06-09T15:25:26Z

✨
Related open PRs:

#4645 fix(onboard): bind gateway name, state dir, and marker per gateway port (#4422)

Related open issues:

#4497 [Ubuntu 24.04][Sandbox] stale sandbox recovery removes registry entry before rebuild can recover

…NVIDIA#4497) PR NVIDIA#4647 stopped `connect` from deleting the registry entry of a sandbox that is registered locally but absent from a healthy gateway, so the `rebuild --yes` hint `status` prints would have something to recover. But `rebuild` still dead-ended: its backup step aborted with "Sandbox '<name>' is not running. Cannot back up state." whenever the live sandbox was missing — exactly the stale state — so the recommended recovery path stayed broken and the issue reopened. Treat a registered-but-not-live sandbox as a recovery rebuild: there is no live workspace state to back up, so skip the backup/restore steps and recreate from the preserved registry + onboard-session metadata. Guard the no-backup path so it never destroys live state: - Confirm absence authoritatively against the NAMED nemoclaw gateway (getReconciledSandboxGatewayState runs `sandbox get`), and only treat `present` as live when nemoclaw is the active healthy gateway — a foreign active gateway (multi-gateway, NVIDIA#4645) or a list/get inconsistency must not trigger a destructive recreate. - Refuse the destructive path for a sandbox recorded on a non-default per-port gateway; point the operator at `openshell gateway select`. - On recreate failure, restore the captured registry entry verbatim (under the registry lock, target-only) so `rebuild`/`connect` stay retryable without clobbering concurrent changes; reclaim the default pointer. - Reset the gone sandbox's stale shields seal only after a successful recreate, and tell the operator to re-apply `shields up` if it was locked. Add a focused regression suite (rebuild-stale-recovery.test.ts) covering the recover, control, multi-gateway, per-port, and failed-recreate cases; update the NVIDIA#2276/NVIDIA#4497 reconcile Scenario 14 and the gateway-drift retry test to assert recovery instead of the dead-end; and strengthen the double-onboard E2E so the exact reporter workflow (status -> connect -> rebuild --yes) must recreate a live sandbox — the old probe only checked for "does not exist" and would have passed against this bug. Signed-off-by: Yimo Jiang <yimoj@nvidia.com>

## Summary - Add the v0.0.63 release-note section using the published development note as source context. - Update source docs for sandbox recovery, OpenClaw config restore safety, managed vLLM selection, Slack Socket Mode conflict handling, and host diagnostics. - Refresh generated `nemoclaw-user-*` skills from the updated Fern MDX docs. - Update the release-doc refresh skill so post-release docs for version `n` look up the matching announcement discussion and use the `n+1` patch release label. - Fix CLI/docs parity by avoiding a `--from <Dockerfile>` flag mention inside the `upgrade-sandboxes` command section. ## Source summary - #5034 -> `docs/reference/troubleshooting.mdx`, `docs/about/release-notes.mdx`: Document safer stale-sandbox recovery through `rebuild --yes` before recreating from scratch. - #5091 -> `docs/reference/troubleshooting.mdx`, `docs/about/release-notes.mdx`: Document Docker-driver post-reboot recovery from OpenShell container labels. - #5101, #5174, #5177 -> `docs/manage-sandboxes/backup-restore.mdx`, `docs/about/release-notes.mdx`: Document OpenClaw `openclaw.json` preservation, merge behavior, and fail-safe restore handling. - #5102 -> `docs/reference/commands.mdx`, `docs/reference/commands-nemohermes.mdx`, `docs/manage-sandboxes/lifecycle.mdx`, `docs/about/release-notes.mdx`: Document `upgrade-sandboxes` image-fingerprint drift detection. - #4201 -> `docs/reference/troubleshooting.mdx`, `docs/about/release-notes.mdx`: Document the installer diagnostic for unexpected Docker daemon access outside the `docker` group. - #5038 -> `docs/inference/inference-options.mdx`, `docs/inference/use-local-inference.mdx`, `docs/about/release-notes.mdx`: Document the interactive managed-vLLM model picker and non-interactive override behavior. - #5040, #5041 -> `docs/reference/troubleshooting.mdx`, `docs/about/release-notes.mdx`: Document Ollama auth-proxy recovery and host DNS preflight diagnostics. - #4986, #5039 -> `docs/manage-sandboxes/messaging-channels.mdx`, `docs/about/release-notes.mdx`: Document Slack validation and duplicate Slack Socket Mode sandbox handling. - #4981, #5168 -> `docs/about/release-notes.mdx`: Capture Hermes gateway secret-guard and wrapped-argv startup hardening in the release surface. - Follow-up -> `.agents/skills/nemoclaw-contributor-update-docs/SKILL.md`: Record the post-release docs workflow, discussion-announcement lookup, and next-patch release label rule. - Follow-up -> `docs/reference/commands.mdx`, `docs/reference/commands-nemohermes.mdx`: Reword custom Dockerfile sandbox text so CLI parity does not treat `--from` as an `upgrade-sandboxes` flag. ## Verification - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx` - `npm run docs` - `npm run build:cli` - `bash test/e2e/e2e-cloud-experimental/check-docs.sh --only-cli` - Skip-term scan for `docs/.docs-skip` blocked terms across generated user skills  ## Summary by CodeRabbit * **Documentation** * Enhanced local inference setup with interactive model selection prompts and environment variable overrides * Improved sandbox upgrade detection using build fingerprints and version checks * Clarified configuration restore behavior preserving user settings during rebuild/restore * Added gateway authentication as fifth security layer * Expanded Slack messaging validation with live credential checking * Enhanced troubleshooting guidance for Docker access, DNS issues, and sandbox recovery * Updated release notes for v0.0.63 featuring sandbox recovery and inference improvements

cv approved these changes Jun 9, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 9, 2026

View reviewed changes

yimoj added the v0.0.62 Release target label Jun 9, 2026

wscurran added area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression labels Jun 9, 2026

ahunnargikar-nvidia assigned cv Jun 9, 2026

yimoj force-pushed the fix/4497-rebuild-stale-recovery branch from d88d2cb to fc48531 Compare June 10, 2026 02:30

cv added v0.0.63 Release target and removed v0.0.62 Release target labels Jun 10, 2026

cv merged commit fc3c0a9 into NVIDIA:main Jun 10, 2026
34 checks passed

coderabbitai Bot mentioned this pull request Jun 10, 2026

refactor(messaging): use plan-backed channel state #5123

Merged

12 tasks

miyoungc mentioned this pull request Jun 11, 2026

docs: refresh v0.0.63 release docs #5244

Merged

coderabbitai Bot mentioned this pull request Jun 11, 2026

fix(sandbox): resolve gateway from persisted sandbox entry #5225

Merged

12 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(sandbox): recover stale sandbox on rebuild instead of dead-ending (#4497)#5034

fix(sandbox): recover stale sandbox on rebuild instead of dead-ending (#4497)#5034
cv merged 1 commit into
NVIDIA:mainfrom
yimoj:fix/4497-rebuild-stale-recovery

yimoj commented Jun 9, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 9, 2026 •

edited

Loading

Walkthrough

Changes

Uh oh!

coderabbitai Bot left a comment

Uh oh!

yimoj commented Jun 9, 2026

Uh oh!

wscurran commented Jun 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

yimoj commented Jun 9, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Related Issue

Changes

Type of Change

Verification

Local end-to-end proof (exact reporter workflow on a real OpenShell gateway)

Regression + pipeline coverage

Known limitation (pre-existing, shared with normal rebuild)

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

yimoj commented Jun 9, 2026

CodeRabbit E2E suite validation (channels-stop-start-e2e, vm-driver-privileged-exec-routing-e2e, shields-config-e2e)

Uh oh!

wscurran commented Jun 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

yimoj commented Jun 9, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 9, 2026 •

edited

Loading