Skip to content

fix(sandbox): recover stale sandbox on rebuild instead of dead-ending (#4497)#5034

Merged
cv merged 1 commit into
NVIDIA:mainfrom
yimoj:fix/4497-rebuild-stale-recovery
Jun 10, 2026
Merged

fix(sandbox): recover stale sandbox on rebuild instead of dead-ending (#4497)#5034
cv merged 1 commit into
NVIDIA:mainfrom
yimoj:fix/4497-rebuild-stale-recovery

Conversation

@yimoj

@yimoj yimoj commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Summary

rebuild --yes no longer dead-ends on a stale sandbox. When a sandbox is registered locally but absent from a healthy OpenShell gateway (stuck/diverged provision, container reaped), rebuild now treats it as a recovery: it skips the impossible backup and recreates from the preserved registry + onboard-session metadata, instead of aborting with "Cannot back up state."

Related Issue

Fixes #4497

Changes

  • rebuild: when the sandbox is missing from the active gateway's sandbox list, confirm absence authoritatively against the named nemoclaw gateway (getReconciledSandboxGatewayState runs sandbox get). Only enter the destructive stale-recovery path once it is genuinely gone on a healthy named gateway.
  • Stale recovery skips backup/restore and recreates from registry/session metadata; built-in policy presets are re-applied from sb.policies (the same source the backup manifest uses).
  • Multi-gateway data-loss guards: never recreate-from-scratch when (a) a foreign gateway is active (a present/list result there is not trusted), or (b) the sandbox is recorded on a non-default per-port gateway (fix(onboard): bind gateway name, state dir, and marker per gateway port (#4422) #4645) — the operator is pointed at openshell gateway select instead.
  • Crash-safety: on recreate failure, restore the captured registry entry verbatim (target-only, under the registry lock — no clobbering concurrent changes) and reclaim the default-sandbox pointer, so rebuild/connect stay retryable.
  • Shields: reset the gone sandbox's stale lock seal only after a successful recreate (a failed recreate keeps the lockdown record for retry), and prompt the operator to re-apply shields up if it had been locked.
  • New registry.restoreSandboxEntry and shields.clearShieldsState helpers.

Type of Change

  • Code change (feature, bug fix, or refactor)

Verification

  • npm test passes (only the pre-existing test/ssrf-parity.test.ts fails, identically on clean upstream/main — unrelated to this change)
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • npx biome check clean on changed files; shfmt/shellcheck clean on the E2E script; tsc type-check passes

Local end-to-end proof (exact reporter workflow on a real OpenShell gateway)

Drove status -> connect -> rebuild --yes against a real healthy gateway with a registered-but-not-live sandbox:

⚠ Sandbox 'repro4497stale' is registered locally but absent from the live OpenShell gateway.
No live workspace state to back up — recreating from the preserved registry metadata.
Deleting old sandbox...
✓ Old sandbox deleted
Creating new sandbox with current image...

No "Cannot back up state", no "Backing up sandbox state" — rebuild crossed the gate that previously dead-ended and proceeded into the recreate (onboard --resume). The user's other registry entries were untouched.

Regression + pipeline coverage

  • test/rebuild-stale-recovery.test.ts — recover, live-control, foreign-gateway guard, non-default per-port guard, failed-recreate registry rollback (incl. defaultSandbox).
  • test/gateway-state-reconcile-2276.test.ts Scenario 14 and rebuild-gateway-drift.test.ts updated to assert recovery rather than the dead-end.
  • test/e2e/test-double-onboard.sh Phase 5 strengthened so the exact reporter workflow must recreate a live sandbox (the prior probe only checked for "does not exist" and would have passed against this bug).

Known limitation (pre-existing, shared with normal rebuild)

Custom policy-add --from-file/--from-dir egress rules (customPolicies) are not re-applied to the recreated sandbox — this is identical to a normal rebuild's recreate path (the registry entry is removed before onboard --resume runs) and is out of scope for this dead-end fix.


Signed-off-by: Yimo Jiang yimoj@nvidia.com

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced sandbox recovery when a sandbox exists locally but is missing from the live gateway, preventing unnecessary destructive operations and enabling a “stale-sandbox” recovery path.
    • Improved rebuild behavior to preserve and restore local registry metadata so retrying recovery no longer fails.
  • Improvements

    • Updated post-recreate and error messaging to include clearer context and preserved backup paths; retry guidance is now configurable.
    • Shield handling adjusted to avoid unsafe unlock/relock behavior during stale recovery.
  • New Features

    • Added APIs to restore a removed registry entry and to clear persisted shields state.
  • Tests

    • Added and expanded end-to-end and unit tests covering stale-sandbox recovery scenarios and regressions.

@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a3d07eec-0945-48f0-95e0-db151115e22e

📥 Commits

Reviewing files that changed from the base of the PR and between d88d2cb and fc48531.

📒 Files selected for processing (3)
  • src/lib/actions/sandbox/gateway-state.ts
  • src/lib/actions/sandbox/rebuild-gateway-drift.test.ts
  • src/lib/actions/sandbox/rebuild.ts
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/lib/actions/sandbox/gateway-state.ts
  • src/lib/actions/sandbox/rebuild-gateway-drift.test.ts
  • src/lib/actions/sandbox/rebuild.ts

📝 Walkthrough

Walkthrough

Adds stale-sandbox recovery to rebuild: when a sandbox exists in the local registry but is missing from the live OpenShell gateway, rebuild snapshots registry metadata, skips backup (no live workspace), recreates from preserved metadata, restores the registry entry on recreate failure, and clears stale shields on success.

Changes

Stale-Sandbox Recovery Implementation

Layer / File(s) Summary
Supporting contracts and helpers
src/lib/actions/sandbox/gateway-state.ts, src/lib/shields/index.ts, src/lib/state/registry.ts, src/lib/actions/sandbox/rebuild.ts (imports)
Adds retryCommand parameter to wrong-gateway guidance, introduces clearShieldsState, adds restoreSandboxEntry, and wires new imports into rebuild.
Stale recovery detection and gateway reconciliation
src/lib/actions/sandbox/rebuild.ts
Performs authoritative named-gateway lifecycle reconciliation when sandbox is absent from live list, distinguishes missing vs ambiguous cases, and snapshots the registry entry before destructive actions.
Stale-aware rebuild execution path
src/lib/actions/sandbox/rebuild.ts
Skips normal shields unlock/relock window and backup when stale, records prior lock state, recreates from preserved metadata, restores preserved registry entry on recreate failure, conditions restore steps on backup existence, clears stale shields on success, and updates messaging for stale-recovery.
Conditional restore flow and messaging
src/lib/actions/sandbox/rebuild.ts
Make workspace/policy restore conditional on backup manifest, fallback to registry presets when manifest absent, and adjust success/incomplete messaging to reflect stale recovery state.

Test Coverage for Stale-Sandbox Recovery

Layer / File(s) Summary
Rebuild gateway drift test update
src/lib/actions/sandbox/rebuild-gateway-drift.test.ts
Replaced named-gateway recovery test to assert transition into stale-sandbox recovery, stubbed sandbox list retry and recreate failure, and assert retry count and recovery rejection.
Gateway state reconciliation test update
test/gateway-state-reconcile-2276.test.ts
Expanded Scenario 14 to assert that rebuild --yes finds preserved registry entry, skips backup when absent from live gateway, and proceeds to recreate from preserved metadata.
End-to-end acceptance test
test/e2e/test-double-onboard.sh
Replaced Phase 5 with #4497 acceptance gate: run rebuild --yes on stale sandbox, assert no timeout/dead-end, verify skip-backup messaging, recreate progression, live-state recovery, and registry persistence followed by successful destroy --yes cleanup.
Comprehensive regression test
test/rebuild-stale-recovery.test.ts
New Vitest suite creating a temp HOME fixture and stubbed openshell/docker to simulate stale vs live states; validates stale recovery skips backup and recreates, live case backs up normally, multi-gateway safety, and registry restoration when recreate fails.

Sequence Diagram (high-level stale-recovery flow):

sequenceDiagram
  participant Operator
  participant CLI
  participant rebuildSandbox
  participant Registry
  participant OpenShell
  participant Shields

  Operator->>CLI: nemoclaw <sandbox> rebuild --yes
  CLI->>rebuildSandbox: start
  rebuildSandbox->>OpenShell: sandbox list / status
  OpenShell-->>rebuildSandbox: missing
  rebuildSandbox->>Registry: snapshot entry (preserve)
  rebuildSandbox->>Shields: record locked state (avoid unlock window)
  rebuildSandbox->>rebuildSandbox: skip backup (no live state)
  rebuildSandbox->>OpenShell: recreate sandbox (onboard)
  OpenShell-->>rebuildSandbox: recreate success/failure
  alt failure
    rebuildSandbox->>Registry: restore preserved entry
    rebuildSandbox->>CLI: report recreate failure
  else success
    rebuildSandbox->>Shields: clear stale shields state
    rebuildSandbox->>CLI: report success (stale-recovery)
  end
Loading

🎯 4 (Complex) | ⏱️ ~45 minutes

🐰 I found a sandbox missing, though still in our file,
I saved its entry first and kept the plan agile.
Skip the impossible backup, recreate from the note,
If recreate fails, restore the saved quote.
Now rebuild retries bloom — recovery after a while.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix(sandbox): recover stale sandbox on rebuild instead of dead-ending (#4497)' is clear and directly addresses the main change: enabling recovery of stale sandboxes during rebuild instead of failing with an error.
Linked Issues check ✅ Passed The PR implements the core requirements from issue #4497: preserving the local registry entry during stale recovery, enabling rebuild to proceed without 'Cannot back up state' failure, and allowing recreate from preserved metadata when the sandbox is absent from the gateway.
Out of Scope Changes check ✅ Passed All code changes are directly aligned with the stale-sandbox recovery objective: modifications to rebuild logic, new registry/shields helper functions, and comprehensive test coverage for the stale-recovery scenario. No unrelated changes detected.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/lib/actions/sandbox/rebuild.ts (1)

494-605: Run the targeted nightly E2E suites for this recovery path before merge.

As per coding guidelines, changes in src/lib/actions/sandbox/rebuild.ts and related shields behavior should be validated with channels-stop-start-e2e, vm-driver-privileged-exec-routing-e2e, and shields-config-e2e.

Also applies to: 996-1002, 1226-1233

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/actions/sandbox/rebuild.ts` around lines 494 - 605, Run the targeted
nightly E2E suites that validate the stale-sandbox recovery and shields behavior
before merging: execute channels-stop-start-e2e,
vm-driver-privileged-exec-routing-e2e, and shields-config-e2e and confirm they
pass for the changes touching the rebuild flow (notably the stale-sandbox path
that sets staleRecovery and staleRegistrySnapshot and the logic around
getReconciledSandboxGatewayState / rebuild --yes); if any failures appear,
capture logs and fix the failing scenarios in rebuild.ts and related shield
handling before merging.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/lib/actions/sandbox/rebuild.ts`:
- Around line 494-605: Run the targeted nightly E2E suites that validate the
stale-sandbox recovery and shields behavior before merging: execute
channels-stop-start-e2e, vm-driver-privileged-exec-routing-e2e, and
shields-config-e2e and confirm they pass for the changes touching the rebuild
flow (notably the stale-sandbox path that sets staleRecovery and
staleRegistrySnapshot and the logic around getReconciledSandboxGatewayState /
rebuild --yes); if any failures appear, capture logs and fix the failing
scenarios in rebuild.ts and related shield handling before merging.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fc7cba39-5453-4445-8bae-29d537e10cac

📥 Commits

Reviewing files that changed from the base of the PR and between 38c03b1 and d88d2cb.

📒 Files selected for processing (8)
  • src/lib/actions/sandbox/gateway-state.ts
  • src/lib/actions/sandbox/rebuild-gateway-drift.test.ts
  • src/lib/actions/sandbox/rebuild.ts
  • src/lib/shields/index.ts
  • src/lib/state/registry.ts
  • test/e2e/test-double-onboard.sh
  • test/gateway-state-reconcile-2276.test.ts
  • test/rebuild-stale-recovery.test.ts

@yimoj

yimoj commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

CodeRabbit E2E suite validation (channels-stop-start-e2e, vm-driver-privileged-exec-routing-e2e, shields-config-e2e)

CodeRabbit's review recommends running three nightly E2E suites for the rebuild.ts/shields changes. Status of each on this PR head (d88d2cbc):

Suite Result Notes
vm-driver-privileged-exec-routing-e2e PASS (local) Hermetic host-side check (fake docker, no live gateway). bash test/e2e/test-vm-driver-privileged-exec-routing.shPASS: VM and Docker privileged exec routing uses direct sandbox containers.
shields-config-e2e ⚠️ Blocked (env) Full live onboard required. Built CLI from this worktree, opened the OpenShell docker-bridge firewall, sourced a real key — onboard reached [4/8] Setting up inference provider then hit transport error … Connection refused (os error 111). Gateway log shows Shutdown signal received; stopping gateway — this host's OpenShell gateway is shared across concurrent sessions and the lifecycle tore it down mid-onboard (it killed 3 unrelated sandboxes, Exited 137). Not reproducible here without disrupting parallel work.
channels-stop-start-e2e ⚠️ Not run (env) Same shared-gateway dependency, plus a 2h budget building both agents × every channel. Same blocker.

Why I couldn't dispatch the nightly workflow myself: nightly-e2e.yaml has the isolated environment and all three jobs, but it is gated to NVIDIA/NemoClaw and my account has pull+triage only (no push/actions:write; token lacks workflow scope), so gh workflow run is not available to me. The e2e-advisor auto-dispatch / a maintainer owns that dispatch.

The #4497 reporter workflow itself is already validated in this PR:

  • Strengthened test/e2e/test-double-onboard.sh Phase 5 asserts status → connect → rebuild --yes recreates a live sandbox.
  • Real-gateway local repro (in the PR description): rebuild on a stale entry printed the recovery markers and proceeded into recreate with zero Cannot back up state.
  • cli-test-shards / plugin-tests / unit-vitest all green.

Ask for the maintainer at merge: please dispatch the three nightly suites on this head (or via the e2e-advisor), e.g.
gh workflow run nightly-e2e.yaml --repo NVIDIA/NemoClaw --ref main -f jobs=channels-stop-start-e2e,vm-driver-privileged-exec-routing-e2e,shields-config-e2e,rebuild-openclaw-e2e -f target_ref=d88d2cbc5471e587061748d7253adbffda8f6173 -f pr_number=5034

Signed-off-by: Yimo Jiang yimoj@nvidia.com

@yimoj yimoj added the v0.0.62 Release target label Jun 9, 2026
@wscurran wscurran added area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression labels Jun 9, 2026
@wscurran

wscurran commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

…NVIDIA#4497)

PR NVIDIA#4647 stopped `connect` from deleting the registry entry of a sandbox
that is registered locally but absent from a healthy gateway, so the
`rebuild --yes` hint `status` prints would have something to recover. But
`rebuild` still dead-ended: its backup step aborted with "Sandbox '<name>'
is not running. Cannot back up state." whenever the live sandbox was
missing — exactly the stale state — so the recommended recovery path stayed
broken and the issue reopened.

Treat a registered-but-not-live sandbox as a recovery rebuild: there is no
live workspace state to back up, so skip the backup/restore steps and
recreate from the preserved registry + onboard-session metadata. Guard the
no-backup path so it never destroys live state:

- Confirm absence authoritatively against the NAMED nemoclaw gateway
  (getReconciledSandboxGatewayState runs `sandbox get`), and only treat
  `present` as live when nemoclaw is the active healthy gateway — a foreign
  active gateway (multi-gateway, NVIDIA#4645) or a list/get inconsistency must not
  trigger a destructive recreate.
- Refuse the destructive path for a sandbox recorded on a non-default
  per-port gateway; point the operator at `openshell gateway select`.
- On recreate failure, restore the captured registry entry verbatim (under
  the registry lock, target-only) so `rebuild`/`connect` stay retryable
  without clobbering concurrent changes; reclaim the default pointer.
- Reset the gone sandbox's stale shields seal only after a successful
  recreate, and tell the operator to re-apply `shields up` if it was locked.

Add a focused regression suite (rebuild-stale-recovery.test.ts) covering the
recover, control, multi-gateway, per-port, and failed-recreate cases; update
the NVIDIA#2276/NVIDIA#4497 reconcile Scenario 14 and the gateway-drift retry test to
assert recovery instead of the dead-end; and strengthen the double-onboard
E2E so the exact reporter workflow (status -> connect -> rebuild --yes) must
recreate a live sandbox — the old probe only checked for "does not exist"
and would have passed against this bug.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
@yimoj yimoj force-pushed the fix/4497-rebuild-stale-recovery branch from d88d2cb to fc48531 Compare June 10, 2026 02:30
@cv cv added v0.0.63 Release target and removed v0.0.62 Release target labels Jun 10, 2026
@cv cv merged commit fc3c0a9 into NVIDIA:main Jun 10, 2026
34 checks passed
miyoungc added a commit that referenced this pull request Jun 11, 2026
## Summary
- Add the v0.0.63 release-note section using the published development
note as source context.
- Update source docs for sandbox recovery, OpenClaw config restore
safety, managed vLLM selection, Slack Socket Mode conflict handling, and
host diagnostics.
- Refresh generated `nemoclaw-user-*` skills from the updated Fern MDX
docs.
- Update the release-doc refresh skill so post-release docs for version
`n` look up the matching announcement discussion and use the `n+1` patch
release label.
- Fix CLI/docs parity by avoiding a `--from <Dockerfile>` flag mention
inside the `upgrade-sandboxes` command section.

## Source summary
- #5034 -> `docs/reference/troubleshooting.mdx`,
`docs/about/release-notes.mdx`: Document safer stale-sandbox recovery
through `rebuild --yes` before recreating from scratch.
- #5091 -> `docs/reference/troubleshooting.mdx`,
`docs/about/release-notes.mdx`: Document Docker-driver post-reboot
recovery from OpenShell container labels.
- #5101, #5174, #5177 -> `docs/manage-sandboxes/backup-restore.mdx`,
`docs/about/release-notes.mdx`: Document OpenClaw `openclaw.json`
preservation, merge behavior, and fail-safe restore handling.
- #5102 -> `docs/reference/commands.mdx`,
`docs/reference/commands-nemohermes.mdx`,
`docs/manage-sandboxes/lifecycle.mdx`, `docs/about/release-notes.mdx`:
Document `upgrade-sandboxes` image-fingerprint drift detection.
- #4201 -> `docs/reference/troubleshooting.mdx`,
`docs/about/release-notes.mdx`: Document the installer diagnostic for
unexpected Docker daemon access outside the `docker` group.
- #5038 -> `docs/inference/inference-options.mdx`,
`docs/inference/use-local-inference.mdx`,
`docs/about/release-notes.mdx`: Document the interactive managed-vLLM
model picker and non-interactive override behavior.
- #5040, #5041 -> `docs/reference/troubleshooting.mdx`,
`docs/about/release-notes.mdx`: Document Ollama auth-proxy recovery and
host DNS preflight diagnostics.
- #4986, #5039 -> `docs/manage-sandboxes/messaging-channels.mdx`,
`docs/about/release-notes.mdx`: Document Slack validation and duplicate
Slack Socket Mode sandbox handling.
- #4981, #5168 -> `docs/about/release-notes.mdx`: Capture Hermes gateway
secret-guard and wrapped-argv startup hardening in the release surface.
- Follow-up ->
`.agents/skills/nemoclaw-contributor-update-docs/SKILL.md`: Record the
post-release docs workflow, discussion-announcement lookup, and
next-patch release label rule.
- Follow-up -> `docs/reference/commands.mdx`,
`docs/reference/commands-nemohermes.mdx`: Reword custom Dockerfile
sandbox text so CLI parity does not treat `--from` as an
`upgrade-sandboxes` flag.

## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user --doc-platform fern-mdx`
- `npm run docs`
- `npm run build:cli`
- `bash test/e2e/e2e-cloud-experimental/check-docs.sh --only-cli`
- Skip-term scan for `docs/.docs-skip` blocked terms across generated
user skills

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Documentation**
* Enhanced local inference setup with interactive model selection
prompts and environment variable overrides
* Improved sandbox upgrade detection using build fingerprints and
version checks
* Clarified configuration restore behavior preserving user settings
during rebuild/restore
  * Added gateway authentication as fifth security layer
  * Expanded Slack messaging validation with live credential checking
* Enhanced troubleshooting guidance for Docker access, DNS issues, and
sandbox recovery
* Updated release notes for v0.0.63 featuring sandbox recovery and
inference improvements

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression v0.0.63 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Ubuntu 24.04][Sandbox] stale sandbox recovery removes registry entry before rebuild can recover

3 participants