fix(sandbox): recover stale sandbox on rebuild instead of dead-ending (#4497)#5034
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (3)
📝 WalkthroughWalkthroughAdds stale-sandbox recovery to rebuild: when a sandbox exists in the local registry but is missing from the live OpenShell gateway, rebuild snapshots registry metadata, skips backup (no live workspace), recreates from preserved metadata, restores the registry entry on recreate failure, and clears stale shields on success. ChangesStale-Sandbox Recovery Implementation
Test Coverage for Stale-Sandbox Recovery
Sequence Diagram (high-level stale-recovery flow): sequenceDiagram
participant Operator
participant CLI
participant rebuildSandbox
participant Registry
participant OpenShell
participant Shields
Operator->>CLI: nemoclaw <sandbox> rebuild --yes
CLI->>rebuildSandbox: start
rebuildSandbox->>OpenShell: sandbox list / status
OpenShell-->>rebuildSandbox: missing
rebuildSandbox->>Registry: snapshot entry (preserve)
rebuildSandbox->>Shields: record locked state (avoid unlock window)
rebuildSandbox->>rebuildSandbox: skip backup (no live state)
rebuildSandbox->>OpenShell: recreate sandbox (onboard)
OpenShell-->>rebuildSandbox: recreate success/failure
alt failure
rebuildSandbox->>Registry: restore preserved entry
rebuildSandbox->>CLI: report recreate failure
else success
rebuildSandbox->>Shields: clear stale shields state
rebuildSandbox->>CLI: report success (stale-recovery)
end
🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/lib/actions/sandbox/rebuild.ts (1)
494-605: Run the targeted nightly E2E suites for this recovery path before merge.As per coding guidelines, changes in
src/lib/actions/sandbox/rebuild.tsand related shields behavior should be validated withchannels-stop-start-e2e,vm-driver-privileged-exec-routing-e2e, andshields-config-e2e.Also applies to: 996-1002, 1226-1233
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/actions/sandbox/rebuild.ts` around lines 494 - 605, Run the targeted nightly E2E suites that validate the stale-sandbox recovery and shields behavior before merging: execute channels-stop-start-e2e, vm-driver-privileged-exec-routing-e2e, and shields-config-e2e and confirm they pass for the changes touching the rebuild flow (notably the stale-sandbox path that sets staleRecovery and staleRegistrySnapshot and the logic around getReconciledSandboxGatewayState / rebuild --yes); if any failures appear, capture logs and fix the failing scenarios in rebuild.ts and related shield handling before merging.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/lib/actions/sandbox/rebuild.ts`:
- Around line 494-605: Run the targeted nightly E2E suites that validate the
stale-sandbox recovery and shields behavior before merging: execute
channels-stop-start-e2e, vm-driver-privileged-exec-routing-e2e, and
shields-config-e2e and confirm they pass for the changes touching the rebuild
flow (notably the stale-sandbox path that sets staleRecovery and
staleRegistrySnapshot and the logic around getReconciledSandboxGatewayState /
rebuild --yes); if any failures appear, capture logs and fix the failing
scenarios in rebuild.ts and related shield handling before merging.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: fc7cba39-5453-4445-8bae-29d537e10cac
📒 Files selected for processing (8)
src/lib/actions/sandbox/gateway-state.tssrc/lib/actions/sandbox/rebuild-gateway-drift.test.tssrc/lib/actions/sandbox/rebuild.tssrc/lib/shields/index.tssrc/lib/state/registry.tstest/e2e/test-double-onboard.shtest/gateway-state-reconcile-2276.test.tstest/rebuild-stale-recovery.test.ts
CodeRabbit E2E suite validation (channels-stop-start-e2e, vm-driver-privileged-exec-routing-e2e, shields-config-e2e)CodeRabbit's review recommends running three nightly E2E suites for the
Why I couldn't dispatch the nightly workflow myself: The #4497 reporter workflow itself is already validated in this PR:
Ask for the maintainer at merge: please dispatch the three nightly suites on this head (or via the e2e-advisor), e.g. Signed-off-by: Yimo Jiang yimoj@nvidia.com |
…NVIDIA#4497) PR NVIDIA#4647 stopped `connect` from deleting the registry entry of a sandbox that is registered locally but absent from a healthy gateway, so the `rebuild --yes` hint `status` prints would have something to recover. But `rebuild` still dead-ended: its backup step aborted with "Sandbox '<name>' is not running. Cannot back up state." whenever the live sandbox was missing — exactly the stale state — so the recommended recovery path stayed broken and the issue reopened. Treat a registered-but-not-live sandbox as a recovery rebuild: there is no live workspace state to back up, so skip the backup/restore steps and recreate from the preserved registry + onboard-session metadata. Guard the no-backup path so it never destroys live state: - Confirm absence authoritatively against the NAMED nemoclaw gateway (getReconciledSandboxGatewayState runs `sandbox get`), and only treat `present` as live when nemoclaw is the active healthy gateway — a foreign active gateway (multi-gateway, NVIDIA#4645) or a list/get inconsistency must not trigger a destructive recreate. - Refuse the destructive path for a sandbox recorded on a non-default per-port gateway; point the operator at `openshell gateway select`. - On recreate failure, restore the captured registry entry verbatim (under the registry lock, target-only) so `rebuild`/`connect` stay retryable without clobbering concurrent changes; reclaim the default pointer. - Reset the gone sandbox's stale shields seal only after a successful recreate, and tell the operator to re-apply `shields up` if it was locked. Add a focused regression suite (rebuild-stale-recovery.test.ts) covering the recover, control, multi-gateway, per-port, and failed-recreate cases; update the NVIDIA#2276/NVIDIA#4497 reconcile Scenario 14 and the gateway-drift retry test to assert recovery instead of the dead-end; and strengthen the double-onboard E2E so the exact reporter workflow (status -> connect -> rebuild --yes) must recreate a live sandbox — the old probe only checked for "does not exist" and would have passed against this bug. Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
d88d2cb to
fc48531
Compare
## Summary - Add the v0.0.63 release-note section using the published development note as source context. - Update source docs for sandbox recovery, OpenClaw config restore safety, managed vLLM selection, Slack Socket Mode conflict handling, and host diagnostics. - Refresh generated `nemoclaw-user-*` skills from the updated Fern MDX docs. - Update the release-doc refresh skill so post-release docs for version `n` look up the matching announcement discussion and use the `n+1` patch release label. - Fix CLI/docs parity by avoiding a `--from <Dockerfile>` flag mention inside the `upgrade-sandboxes` command section. ## Source summary - #5034 -> `docs/reference/troubleshooting.mdx`, `docs/about/release-notes.mdx`: Document safer stale-sandbox recovery through `rebuild --yes` before recreating from scratch. - #5091 -> `docs/reference/troubleshooting.mdx`, `docs/about/release-notes.mdx`: Document Docker-driver post-reboot recovery from OpenShell container labels. - #5101, #5174, #5177 -> `docs/manage-sandboxes/backup-restore.mdx`, `docs/about/release-notes.mdx`: Document OpenClaw `openclaw.json` preservation, merge behavior, and fail-safe restore handling. - #5102 -> `docs/reference/commands.mdx`, `docs/reference/commands-nemohermes.mdx`, `docs/manage-sandboxes/lifecycle.mdx`, `docs/about/release-notes.mdx`: Document `upgrade-sandboxes` image-fingerprint drift detection. - #4201 -> `docs/reference/troubleshooting.mdx`, `docs/about/release-notes.mdx`: Document the installer diagnostic for unexpected Docker daemon access outside the `docker` group. - #5038 -> `docs/inference/inference-options.mdx`, `docs/inference/use-local-inference.mdx`, `docs/about/release-notes.mdx`: Document the interactive managed-vLLM model picker and non-interactive override behavior. - #5040, #5041 -> `docs/reference/troubleshooting.mdx`, `docs/about/release-notes.mdx`: Document Ollama auth-proxy recovery and host DNS preflight diagnostics. - #4986, #5039 -> `docs/manage-sandboxes/messaging-channels.mdx`, `docs/about/release-notes.mdx`: Document Slack validation and duplicate Slack Socket Mode sandbox handling. - #4981, #5168 -> `docs/about/release-notes.mdx`: Capture Hermes gateway secret-guard and wrapped-argv startup hardening in the release surface. - Follow-up -> `.agents/skills/nemoclaw-contributor-update-docs/SKILL.md`: Record the post-release docs workflow, discussion-announcement lookup, and next-patch release label rule. - Follow-up -> `docs/reference/commands.mdx`, `docs/reference/commands-nemohermes.mdx`: Reword custom Dockerfile sandbox text so CLI parity does not treat `--from` as an `upgrade-sandboxes` flag. ## Verification - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx` - `npm run docs` - `npm run build:cli` - `bash test/e2e/e2e-cloud-experimental/check-docs.sh --only-cli` - Skip-term scan for `docs/.docs-skip` blocked terms across generated user skills <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Enhanced local inference setup with interactive model selection prompts and environment variable overrides * Improved sandbox upgrade detection using build fingerprints and version checks * Clarified configuration restore behavior preserving user settings during rebuild/restore * Added gateway authentication as fifth security layer * Expanded Slack messaging validation with live credential checking * Enhanced troubleshooting guidance for Docker access, DNS issues, and sandbox recovery * Updated release notes for v0.0.63 featuring sandbox recovery and inference improvements <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
rebuild --yesno longer dead-ends on a stale sandbox. When a sandbox is registered locally but absent from a healthy OpenShell gateway (stuck/diverged provision, container reaped), rebuild now treats it as a recovery: it skips the impossible backup and recreates from the preserved registry + onboard-session metadata, instead of aborting with "Cannot back up state."Related Issue
Fixes #4497
Changes
rebuild: when the sandbox is missing from the active gateway'ssandbox list, confirm absence authoritatively against the named nemoclaw gateway (getReconciledSandboxGatewayStaterunssandbox get). Only enter the destructive stale-recovery path once it is genuinely gone on a healthy named gateway.sb.policies(the same source the backup manifest uses).present/list result there is not trusted), or (b) the sandbox is recorded on a non-default per-port gateway (fix(onboard): bind gateway name, state dir, and marker per gateway port (#4422) #4645) — the operator is pointed atopenshell gateway selectinstead.rebuild/connectstay retryable.shields upif it had been locked.registry.restoreSandboxEntryandshields.clearShieldsStatehelpers.Type of Change
Verification
npm testpasses (only the pre-existingtest/ssrf-parity.test.tsfails, identically on cleanupstream/main— unrelated to this change)npx biome checkclean on changed files;shfmt/shellcheckclean on the E2E script;tsctype-check passesLocal end-to-end proof (exact reporter workflow on a real OpenShell gateway)
Drove
status -> connect -> rebuild --yesagainst a real healthy gateway with a registered-but-not-live sandbox:No "Cannot back up state", no "Backing up sandbox state" — rebuild crossed the gate that previously dead-ended and proceeded into the recreate (
onboard --resume). The user's other registry entries were untouched.Regression + pipeline coverage
test/rebuild-stale-recovery.test.ts— recover, live-control, foreign-gateway guard, non-default per-port guard, failed-recreate registry rollback (incl.defaultSandbox).test/gateway-state-reconcile-2276.test.tsScenario 14 andrebuild-gateway-drift.test.tsupdated to assert recovery rather than the dead-end.test/e2e/test-double-onboard.shPhase 5 strengthened so the exact reporter workflow must recreate a live sandbox (the prior probe only checked for "does not exist" and would have passed against this bug).Known limitation (pre-existing, shared with normal rebuild)
Custom
policy-add --from-file/--from-diregress rules (customPolicies) are not re-applied to the recreated sandbox — this is identical to a normal rebuild's recreate path (the registry entry is removed beforeonboard --resumeruns) and is out of scope for this dead-end fix.Signed-off-by: Yimo Jiang yimoj@nvidia.com
🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Improvements
New Features
Tests