fix(sandbox): preserve registry entry on stale-sandbox connect recovery (#4497)#4647
Conversation
…ry (NVIDIA#4497) When a sandbox is registered locally but absent from a healthy NemoClaw gateway, `connect` removed the local registry entry. That raced with the `rebuild --yes` recovery hint `status` prints for a stuck/stale sandbox: the routine `connect` deleted the very metadata `rebuild` needs, so the follow-up rebuild failed with "does not exist". Make the missing-live path in ensureLiveSandboxOrExit non-destructive: preserve the registry entry and onboard session, point at rebuild/onboard for recovery, and route intentional purges through the explicit `destroy` command. Update the NVIDIA#2276 reconcile regressions plus cli/e2e coverage to assert the new contract, including that `rebuild` can still locate the preserved sandbox after a non-destructive connect. Signed-off-by: Yimo Jiang <yimoj@nvidia.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughWhen a local sandbox registry entry exists but the sandbox is absent from the live OpenShell gateway, the CLI preserves the registry entry and prints recovery guidance instead of removing it, restoring a usable recovery path via ChangesStale registry preservation for recovery
Estimated code review effort🎯 4 (Complex) | ⏱️ ~40 minutes Possibly related issues
Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
test/e2e/test-double-onboard.sh (1)
747-751: ⚡ Quick winWrap the new recovery probes in
run_with_timeout.These are the only new Phase 5 CLI calls that can still hang the job if a prompt/regression slips back in. Reusing the existing timeout helper here will fail fast instead of burning most of the 80-minute E2E budget.
Suggested change
CONNECT_LOG="$(mktemp)" -NEMOCLAW_NON_INTERACTIVE=1 run_nemoclaw "$SANDBOX_A" connect >"$CONNECT_LOG" 2>&1 +run_with_timeout \ + "$PHASE_TIMEOUT" \ + env NEMOCLAW_NON_INTERACTIVE=1 "${NEMOCLAW_CMD[@]}" "$SANDBOX_A" connect \ + >"$CONNECT_LOG" 2>&1 connect_exit=$? connect_output="$(cat "$CONNECT_LOG")" rm -f "$CONNECT_LOG" @@ REBUILD_LOG="$(mktemp)" -NEMOCLAW_NON_INTERACTIVE=1 run_nemoclaw "$SANDBOX_A" rebuild --yes >"$REBUILD_LOG" 2>&1 || true +run_with_timeout \ + "$PHASE_TIMEOUT" \ + env NEMOCLAW_NON_INTERACTIVE=1 "${NEMOCLAW_CMD[@]}" "$SANDBOX_A" rebuild --yes \ + >"$REBUILD_LOG" 2>&1 || true rebuild_output="$(cat "$REBUILD_LOG")" rm -f "$REBUILD_LOG"Also applies to: 774-777
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/test-double-onboard.sh` around lines 747 - 751, The new Phase 5 connect invocation can hang; wrap the non-interactive run_nemoclaw call in the existing run_with_timeout helper so it fails fast: create the temp CONNECT_LOG as before, invoke run_with_timeout to run NEMOCLAW_NON_INTERACTIVE=1 run_nemoclaw "$SANDBOX_A" connect redirecting stdout/stderr to CONNECT_LOG, then read CONNECT_LOG into connect_output and set connect_exit from the wrapped command exit; apply the same wrapping to the other similar block (the lines around the second connect usage). Use the same timeout value/semantics as existing run_with_timeout calls to preserve behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@test/e2e/test-double-onboard.sh`:
- Around line 747-751: The new Phase 5 connect invocation can hang; wrap the
non-interactive run_nemoclaw call in the existing run_with_timeout helper so it
fails fast: create the temp CONNECT_LOG as before, invoke run_with_timeout to
run NEMOCLAW_NON_INTERACTIVE=1 run_nemoclaw "$SANDBOX_A" connect redirecting
stdout/stderr to CONNECT_LOG, then read CONNECT_LOG into connect_output and set
connect_exit from the wrapped command exit; apply the same wrapping to the other
similar block (the lines around the second connect usage). Use the same timeout
value/semantics as existing run_with_timeout calls to preserve behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c7f14035-c10a-4c2d-b785-8df2e8d559a8
📒 Files selected for processing (4)
src/lib/actions/sandbox/gateway-state.tstest/cli.test.tstest/e2e/test-double-onboard.shtest/gateway-state-reconcile-2276.test.ts
…meout Address CodeRabbit review on PR NVIDIA#4647: wrap the new stale-recovery connect/rebuild/destroy probes in `run_with_timeout` (mirroring the Phase 4 probe-only connect) so a reintroduced prompt or hang fails fast instead of stalling to the phase timeout. Signed-off-by: Yimo Jiang <yimoj@nvidia.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/e2e/test-double-onboard.sh`:
- Around line 781-784: The rebuild probe currently swallows failures by
appending "|| true" and never checking the command's exit status; remove the "||
true", capture the exit code from run_with_timeout when invoking env
NEMOCLAW_NON_INTERACTIVE=1 "${NEMOCLAW_CMD[@]}" "$SANDBOX_A" rebuild --yes
(store it in e.g. rebuild_exit), and assert that rebuild_exit indicates failure
on timeout (fail the test if it equals the timeout/kill code or is non-zero);
keep the existing REBUILD_LOG capture and the grep -q "does not exist" check but
only consider the test passed when the exit code and the log both indicate the
expected failure. Ensure you reference the variables REBUILD_LOG, rebuild_output
and the run_with_timeout invocation when making these changes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 830ea07b-7c8d-411a-a2e3-14e2144974be
📒 Files selected for processing (1)
test/e2e/test-double-onboard.sh
… pass Address CodeRabbit review on PR NVIDIA#4647: the rebuild recovery probe used `|| true` with only a `grep` content assertion, so a run_with_timeout kill (exit 124) produced output lacking "does not exist" and falsely PASSed. Capture the exit code and fail explicitly on a timeout before the content check. The connect probe (asserts exit==1) and destroy probe (caught by the registry check) already fail correctly on timeout. Signed-off-by: Yimo Jiang <yimoj@nvidia.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
prekshivyas
left a comment
There was a problem hiding this comment.
APPROVE.
Traced every branch of ensureLiveSandboxOrExit at head 0dc4f48: no removeSandbox/updateSession call remains in any path (missing, wrong_gateway_active, identity_drift, *_after_restart, stuck-phase, fallthrough), so the registry entry is preserved on all stale-state paths, not just the #4497 repro. The dropped registry/onboard-session imports are genuinely unused, and reconcileMissingAgainstNamedGateway stays read-only.
Tests prove the regression: cli.test.ts and Scenario 1 invert the old remove-and-clear assertions, and Scenario 14 + e2e Phase 5 walk the full status→connect→rebuild chain the issue describes — rebuild now locates the preserved sandbox instead of failing "does not exist". The rebuild-timeout-124 guard from CodeRabbit is in. CI green across macos/wsl/sandbox e2e, vitest, CodeQL, ShellCheck.
Signed-off-by: Prekshi Vyas prekshiv@nvidia.com
## Summary - Add the missing `v0.0.57` release-notes section with links to the detailed docs pages for command, inference, onboarding, messaging, status, installer, and policy changes. - Remove public references to docs-skip terms from source docs and regenerate the NemoClaw user skills from the current Fern MDX docs. - Carry forward generated references for the per-agent documentation split, including Hermes-specific reference files. ## Source summary - #4615 and #4653 -> `docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Release notes now cover host-side `sessions` and `agents` commands plus `NEMOCLAW_EXTRA_AGENTS_JSON` secondary-agent baking. - #4163, #4204, #4611, #4619, and #4676 -> `docs/about/release-notes.mdx`, `docs/inference/use-local-inference.mdx`: Release notes now cover managed vLLM progress/readiness, DGX Spark model default changes, local Ollama streaming usage, and inference route divergence warnings. - #4267, #4601, #4609, #4642, #4645, and #4661 -> `docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Release notes now cover UFW auto-remediation, local-inference reachability gates, gateway reuse/binding, cancel rollback, and policy selection persistence. - #4577, #4582, #4607, and #4660 -> `docs/about/release-notes.mdx`, `docs/manage-sandboxes/messaging-channels.mdx`: Release notes now cover Slack validation, atomic `channels add`, WhatsApp QR diagnostics, and Slack placeholder normalization. - #4388, #4600, #4646, and #4647 -> `docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Release notes now cover status failure layers, paused-container hints, Docker-driver doctor behavior, and non-destructive stale-registry recovery. - #4569, #4579, and #4678 -> `docs/about/release-notes.mdx`, `docs/manage-sandboxes/lifecycle.mdx`, `docs/network-policy/integration-policy-examples.mdx`: Release notes now cover installer tag pinning, PyPI `uv` policy access, and observable Jira validation. - #4632 -> `.agents/skills/`: Regenerated user skills from the current per-agent docs source, including newly generated Hermes reference files. ## Verification - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx` - `rg "permissive mode|shields down|shields up|shields status|config rotate-token|rotate-token" docs --glob "*.mdx"` - `rg "permissive mode|shields down|shields up|shields status|config rotate-token|rotate-token" .agents/skills --glob "*.md"` - `npm run docs` - `npm run build:cli` - Commit hooks: markdownlint, docs-to-skills verification, gitleaks, skills YAML, commitlint <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Restructured documentation to clearly distinguish OpenClaw and Hermes agent variants throughout user guides. * Enhanced security, credential storage, and deployment guidance with clearer setup flows. * Added Hermes plugin installation and ecosystem documentation. * Improved workspace, messaging, and policy management references with variant-specific command examples. * Refined troubleshooting and CLI reference sections for clarity. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
…NVIDIA#4497) PR NVIDIA#4647 stopped `connect` from deleting the registry entry of a sandbox that is registered locally but absent from a healthy gateway, so the `rebuild --yes` hint `status` prints would have something to recover. But `rebuild` still dead-ended: its backup step aborted with "Sandbox '<name>' is not running. Cannot back up state." whenever the live sandbox was missing — exactly the stale state — so the recommended recovery path stayed broken and the issue reopened. Treat a registered-but-not-live sandbox as a recovery rebuild: there is no live workspace state to back up, so skip the backup/restore steps and recreate from the preserved registry + onboard-session metadata. Guard the no-backup path so it never destroys live state: - Confirm absence authoritatively against the NAMED nemoclaw gateway (getReconciledSandboxGatewayState runs `sandbox get`), and only treat `present` as live when nemoclaw is the active healthy gateway — a foreign active gateway (multi-gateway, NVIDIA#4645) or a list/get inconsistency must not trigger a destructive recreate. - Refuse the destructive path for a sandbox recorded on a non-default per-port gateway; point the operator at `openshell gateway select`. - On recreate failure, restore the captured registry entry verbatim (under the registry lock, target-only) so `rebuild`/`connect` stay retryable without clobbering concurrent changes; reclaim the default pointer. - Reset the gone sandbox's stale shields seal only after a successful recreate, and tell the operator to re-apply `shields up` if it was locked. Add a focused regression suite (rebuild-stale-recovery.test.ts) covering the recover, control, multi-gateway, per-port, and failed-recreate cases; update the NVIDIA#2276/NVIDIA#4497 reconcile Scenario 14 and the gateway-drift retry test to assert recovery instead of the dead-end; and strengthen the double-onboard E2E so the exact reporter workflow (status -> connect -> rebuild --yes) must recreate a live sandbox — the old probe only checked for "does not exist" and would have passed against this bug. Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
Summary
Stale-sandbox recovery no longer deletes the local registry entry before
rebuildcan use it. When a sandbox is registered locally but absent from a healthy NemoClaw gateway, a routineconnectused to remove the entry — the very metadata therebuild --yeshint (printed bystatus) depends on — so the follow-up rebuild failed with "does not exist".Related Issue
Fixes #4497
Changes
src/lib/actions/sandbox/gateway-state.ts: make the missing-live branch ofensureLiveSandboxOrExitnon-destructive. It now preserves the registry entry and onboard session, prints recovery guidance (retryrebuild --yes/onboard), and routes intentional purges through the explicitdestroycommand instead of deleting state automatically. Drops the now-unusedregistry/onboard-sessionimports.test/gateway-state-reconcile-2276.test.ts: flip Scenario 1 to the new non-destructive contract and add Scenario 14 ([Ubuntu 24.04][Sandbox] stale sandbox recovery removes registry entry before rebuild can recover #4497) provingstatus-recommended →connectpreserves →rebuild --yescan still locate the sandbox.test/cli.test.ts: update the CLI-dispatch connect test to assert the entry is preserved (not removed).test/e2e/test-double-onboard.sh: Phase 5 now assertsstatusandconnectpreserve the stale entry and thatrebuildcan locate it, then purges it via the explicitdestroywhile the gateway is healthy so cleanup stays reliable.Type of Change
Verification
Reproduced end-to-end on a host whose local registry held entries while
openshell sandbox listreported none (healthy gateway): before the fixconnectprinted "Removed stale local registry entry" and the subsequentrebuild --yesfailed with "does not exist"; after the fixconnectpreserves the entry andrebuild --yeslocates the sandbox and proceeds into its own flow.npx prek run --all-filespassesnpm testpasses (pre-existing, load-induced 5000ms timeouts intest/e2e-scenario/framework-testsare unrelated — they fail on the clean baseline too and pass when run with a larger timeout)Signed-off-by: Yimo Jiang yimoj@nvidia.com
Summary by CodeRabbit
Bug Fixes
rebuild --yes/onboardfor recovery anddestroyfor deliberate removal.Tests
connect+rebuild --yesflows.