fix(sandbox): preserve registry entry on stale-sandbox connect recovery (#4497) by yimoj · Pull Request #4647 · NVIDIA/NemoClaw

yimoj · 2026-06-02T05:44:00Z

Summary

Stale-sandbox recovery no longer deletes the local registry entry before rebuild can use it. When a sandbox is registered locally but absent from a healthy NemoClaw gateway, a routine connect used to remove the entry — the very metadata the rebuild --yes hint (printed by status) depends on — so the follow-up rebuild failed with "does not exist".

Related Issue

Fixes #4497

Changes

src/lib/actions/sandbox/gateway-state.ts: make the missing-live branch of ensureLiveSandboxOrExit non-destructive. It now preserves the registry entry and onboard session, prints recovery guidance (retry rebuild --yes / onboard), and routes intentional purges through the explicit destroy command instead of deleting state automatically. Drops the now-unused registry / onboard-session imports.
test/gateway-state-reconcile-2276.test.ts: flip Scenario 1 to the new non-destructive contract and add Scenario 14 ([Ubuntu 24.04][Sandbox] stale sandbox recovery removes registry entry before rebuild can recover #4497) proving status-recommended → connect preserves → rebuild --yes can still locate the sandbox.
test/cli.test.ts: update the CLI-dispatch connect test to assert the entry is preserved (not removed).
test/e2e/test-double-onboard.sh: Phase 5 now asserts status and connect preserve the stale entry and that rebuild can locate it, then purges it via the explicit destroy while the gateway is healthy so cleanup stays reliable.

Type of Change

Code change (feature, bug fix, or refactor)

Verification

Reproduced end-to-end on a host whose local registry held entries while openshell sandbox list reported none (healthy gateway): before the fix connect printed "Removed stale local registry entry" and the subsequent rebuild --yes failed with "does not exist"; after the fix connect preserves the entry and rebuild --yes locates the sandbox and proceeds into its own flow.

npx prek run --all-files passes
npm test passes (pre-existing, load-induced 5000ms timeouts in test/e2e-scenario/framework-tests are unrelated — they fail on the clean baseline too and pass when run with a larger timeout)
Tests added or updated for new or changed behavior
No secrets, API keys, or credentials committed
Docs updated for user-facing behavior changes (n/a — guidance text only)

Signed-off-by: Yimo Jiang yimoj@nvidia.com

Summary by CodeRabbit

Bug Fixes
- Stale local sandbox registry entries are now preserved when the live sandbox is missing; CLI output now explains the entry is preserved and points to rebuild --yes / onboard for recovery and destroy for deliberate removal.
Tests
- Updated unit and end-to-end tests to expect preserved-registry behavior and added a recovery scenario validating connect + rebuild --yes flows.

…ry (NVIDIA#4497) When a sandbox is registered locally but absent from a healthy NemoClaw gateway, `connect` removed the local registry entry. That raced with the `rebuild --yes` recovery hint `status` prints for a stuck/stale sandbox: the routine `connect` deleted the very metadata `rebuild` needs, so the follow-up rebuild failed with "does not exist". Make the missing-live path in ensureLiveSandboxOrExit non-destructive: preserve the registry entry and onboard session, point at rebuild/onboard for recovery, and route intentional purges through the explicit `destroy` command. Update the NVIDIA#2276 reconcile regressions plus cli/e2e coverage to assert the new contract, including that `rebuild` can still locate the preserved sandbox after a non-destructive connect. Signed-off-by: Yimo Jiang <yimoj@nvidia.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-06-02T05:44:12Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 421615fe-faa2-47bb-9d7c-0280564b2385

📥 Commits

Reviewing files that changed from the base of the PR and between 20a6482 and 0dc4f48.

📒 Files selected for processing (1)

test/e2e/test-double-onboard.sh

🚧 Files skipped from review as they are similar to previous changes (1)

test/e2e/test-double-onboard.sh

📝 Walkthrough

Walkthrough

When a local sandbox registry entry exists but the sandbox is absent from the live OpenShell gateway, the CLI preserves the registry entry and prints recovery guidance instead of removing it, restoring a usable recovery path via rebuild --yes or allowing intentional purge via destroy.

Changes

Stale registry preservation for recovery

Layer / File(s)	Summary
Non-destructive stale registry handling `src/lib/actions/sandbox/gateway-state.ts`	Import reordering and replacement of destructive stale-registry logic in `ensureLiveSandboxOrExit`. When a sandbox is missing from the live gateway, the function now preserves the local registry entry and session state, and prints console guidance explaining that the preserved entry enables recovery paths via `rebuild --yes` or intentional removal via `destroy`.
CLI test validation `test/cli.test.ts`	Updated the "connect stale registry entry" test to assert that `alpha connect` targeting a missing sandbox preserves the local registry entry, emits the new guidance messaging, and keeps the sandbox definition in the saved registry file.
End-to-end and regression test coverage `test/e2e/test-double-onboard.sh`, `test/gateway-state-reconcile-2276.test.ts`	Comprehensive test coverage across Phase 5 stale-registry reconciliation (e2e) and Scenario 1 + new Scenario 14 (regression) validating that `status` and `connect` preserve stale entries, that `rebuild --yes` can locate preserved sandboxes, and that only `destroy` removes the stale registry entry.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Possibly related issues

nightly-e2e: double-onboard-e2e stale-registry reconciliation misaligned with #4578 (run 26790528855) #4639: Describes the same stale local registry entry removal behavior; this PR implements preservation to enable rebuild-based recovery.

Possibly related PRs

NVIDIA/NemoClaw#4605: Modifies the same ensureLiveSandboxOrExit area to change guidance/behavior for gateway failure modes.

Suggested labels

fix, Sandbox, NemoClaw CLI, v0.0.57

Suggested reviewers

cv
cjagwani

Poem

🐰 I saw a registry entry near,
Left behind when the gateway disappeared,
I hugged it close — rebuild can find,
Don't delete by chance, call destroy kind,
Recovery hops back, carrots cheered.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically describes the main change: preserving registry entries during stale-sandbox connect recovery, with issue reference.
Linked Issues check	✅ Passed	Changes preserve registry entry and onboard session during connect when sandbox is missing from gateway, provide recovery guidance, and require explicit destroy for purges—directly addressing `#4497`'s requirement for a consistent recovery path.
Out of Scope Changes check	✅ Passed	All changes are directly related to the stated objective: modifying stale-sandbox recovery behavior, updating test assertions to match the new contract, and adding recovery verification tests.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

test/e2e/test-double-onboard.sh (1)

747-751: ⚡ Quick win

Wrap the new recovery probes in run_with_timeout.

These are the only new Phase 5 CLI calls that can still hang the job if a prompt/regression slips back in. Reusing the existing timeout helper here will fail fast instead of burning most of the 80-minute E2E budget.

Suggested change

 CONNECT_LOG="$(mktemp)"
-NEMOCLAW_NON_INTERACTIVE=1 run_nemoclaw "$SANDBOX_A" connect >"$CONNECT_LOG" 2>&1
+run_with_timeout \
+  "$PHASE_TIMEOUT" \
+  env NEMOCLAW_NON_INTERACTIVE=1 "${NEMOCLAW_CMD[@]}" "$SANDBOX_A" connect \
+  >"$CONNECT_LOG" 2>&1
 connect_exit=$?
 connect_output="$(cat "$CONNECT_LOG")"
 rm -f "$CONNECT_LOG"
 
@@
 REBUILD_LOG="$(mktemp)"
-NEMOCLAW_NON_INTERACTIVE=1 run_nemoclaw "$SANDBOX_A" rebuild --yes >"$REBUILD_LOG" 2>&1 || true
+run_with_timeout \
+  "$PHASE_TIMEOUT" \
+  env NEMOCLAW_NON_INTERACTIVE=1 "${NEMOCLAW_CMD[@]}" "$SANDBOX_A" rebuild --yes \
+  >"$REBUILD_LOG" 2>&1 || true
 rebuild_output="$(cat "$REBUILD_LOG")"
 rm -f "$REBUILD_LOG"

Also applies to: 774-777

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/test-double-onboard.sh` around lines 747 - 751, The new Phase 5
connect invocation can hang; wrap the non-interactive run_nemoclaw call in the
existing run_with_timeout helper so it fails fast: create the temp CONNECT_LOG
as before, invoke run_with_timeout to run NEMOCLAW_NON_INTERACTIVE=1
run_nemoclaw "$SANDBOX_A" connect redirecting stdout/stderr to CONNECT_LOG, then
read CONNECT_LOG into connect_output and set connect_exit from the wrapped
command exit; apply the same wrapping to the other similar block (the lines
around the second connect usage). Use the same timeout value/semantics as
existing run_with_timeout calls to preserve behavior.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@test/e2e/test-double-onboard.sh`:
- Around line 747-751: The new Phase 5 connect invocation can hang; wrap the
non-interactive run_nemoclaw call in the existing run_with_timeout helper so it
fails fast: create the temp CONNECT_LOG as before, invoke run_with_timeout to
run NEMOCLAW_NON_INTERACTIVE=1 run_nemoclaw "$SANDBOX_A" connect redirecting
stdout/stderr to CONNECT_LOG, then read CONNECT_LOG into connect_output and set
connect_exit from the wrapped command exit; apply the same wrapping to the other
similar block (the lines around the second connect usage). Use the same timeout
value/semantics as existing run_with_timeout calls to preserve behavior.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c7f14035-c10a-4c2d-b785-8df2e8d559a8

📥 Commits

Reviewing files that changed from the base of the PR and between cab8f39 and 51c97b5.

📒 Files selected for processing (4)

src/lib/actions/sandbox/gateway-state.ts
test/cli.test.ts
test/e2e/test-double-onboard.sh
test/gateway-state-reconcile-2276.test.ts

…meout Address CodeRabbit review on PR NVIDIA#4647: wrap the new stale-recovery connect/rebuild/destroy probes in `run_with_timeout` (mirroring the Phase 4 probe-only connect) so a reintroduced prompt or hang fails fast instead of stalling to the phase timeout. Signed-off-by: Yimo Jiang <yimoj@nvidia.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/test-double-onboard.sh`:
- Around line 781-784: The rebuild probe currently swallows failures by
appending "|| true" and never checking the command's exit status; remove the "||
true", capture the exit code from run_with_timeout when invoking env
NEMOCLAW_NON_INTERACTIVE=1 "${NEMOCLAW_CMD[@]}" "$SANDBOX_A" rebuild --yes
(store it in e.g. rebuild_exit), and assert that rebuild_exit indicates failure
on timeout (fail the test if it equals the timeout/kill code or is non-zero);
keep the existing REBUILD_LOG capture and the grep -q "does not exist" check but
only consider the test passed when the exit code and the log both indicate the
expected failure. Ensure you reference the variables REBUILD_LOG, rebuild_output
and the run_with_timeout invocation when making these changes.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 830ea07b-7c8d-411a-a2e3-14e2144974be

📥 Commits

Reviewing files that changed from the base of the PR and between 51c97b5 and 20a6482.

📒 Files selected for processing (1)

test/e2e/test-double-onboard.sh

… pass Address CodeRabbit review on PR NVIDIA#4647: the rebuild recovery probe used `|| true` with only a `grep` content assertion, so a run_with_timeout kill (exit 124) produced output lacking "does not exist" and falsely PASSed. Capture the exit code and fail explicitly on a timeout before the content check. The connect probe (asserts exit==1) and destroy probe (caught by the registry check) already fail correctly on timeout. Signed-off-by: Yimo Jiang <yimoj@nvidia.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

prekshivyas

APPROVE.

Traced every branch of ensureLiveSandboxOrExit at head 0dc4f48: no removeSandbox/updateSession call remains in any path (missing, wrong_gateway_active, identity_drift, *_after_restart, stuck-phase, fallthrough), so the registry entry is preserved on all stale-state paths, not just the #4497 repro. The dropped registry/onboard-session imports are genuinely unused, and reconcileMissingAgainstNamedGateway stays read-only.

Tests prove the regression: cli.test.ts and Scenario 1 invert the old remove-and-clear assertions, and Scenario 14 + e2e Phase 5 walk the full status→connect→rebuild chain the issue describes — rebuild now locates the preserved sandbox instead of failing "does not exist". The rebuild-timeout-124 guard from CodeRabbit is in. CI green across macos/wsl/sandbox e2e, vitest, CodeQL, ShellCheck.

Signed-off-by: Prekshi Vyas prekshiv@nvidia.com

## Summary - Add the missing `v0.0.57` release-notes section with links to the detailed docs pages for command, inference, onboarding, messaging, status, installer, and policy changes. - Remove public references to docs-skip terms from source docs and regenerate the NemoClaw user skills from the current Fern MDX docs. - Carry forward generated references for the per-agent documentation split, including Hermes-specific reference files. ## Source summary - #4615 and #4653 -> `docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Release notes now cover host-side `sessions` and `agents` commands plus `NEMOCLAW_EXTRA_AGENTS_JSON` secondary-agent baking. - #4163, #4204, #4611, #4619, and #4676 -> `docs/about/release-notes.mdx`, `docs/inference/use-local-inference.mdx`: Release notes now cover managed vLLM progress/readiness, DGX Spark model default changes, local Ollama streaming usage, and inference route divergence warnings. - #4267, #4601, #4609, #4642, #4645, and #4661 -> `docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Release notes now cover UFW auto-remediation, local-inference reachability gates, gateway reuse/binding, cancel rollback, and policy selection persistence. - #4577, #4582, #4607, and #4660 -> `docs/about/release-notes.mdx`, `docs/manage-sandboxes/messaging-channels.mdx`: Release notes now cover Slack validation, atomic `channels add`, WhatsApp QR diagnostics, and Slack placeholder normalization. - #4388, #4600, #4646, and #4647 -> `docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Release notes now cover status failure layers, paused-container hints, Docker-driver doctor behavior, and non-destructive stale-registry recovery. - #4569, #4579, and #4678 -> `docs/about/release-notes.mdx`, `docs/manage-sandboxes/lifecycle.mdx`, `docs/network-policy/integration-policy-examples.mdx`: Release notes now cover installer tag pinning, PyPI `uv` policy access, and observable Jira validation. - #4632 -> `.agents/skills/`: Regenerated user skills from the current per-agent docs source, including newly generated Hermes reference files. ## Verification - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx` - `rg "permissive mode|shields down|shields up|shields status|config rotate-token|rotate-token" docs --glob "*.mdx"` - `rg "permissive mode|shields down|shields up|shields status|config rotate-token|rotate-token" .agents/skills --glob "*.md"` - `npm run docs` - `npm run build:cli` - Commit hooks: markdownlint, docs-to-skills verification, gitleaks, skills YAML, commitlint  ## Summary by CodeRabbit * **Documentation** * Restructured documentation to clearly distinguish OpenClaw and Hermes agent variants throughout user guides. * Enhanced security, credential storage, and deployment guidance with clearer setup flows. * Added Hermes plugin installation and ecosystem documentation. * Improved workspace, messaging, and policy management references with variant-specific command examples. * Refined troubleshooting and CLI reference sections for clarity.

…NVIDIA#4497) PR NVIDIA#4647 stopped `connect` from deleting the registry entry of a sandbox that is registered locally but absent from a healthy gateway, so the `rebuild --yes` hint `status` prints would have something to recover. But `rebuild` still dead-ended: its backup step aborted with "Sandbox '<name>' is not running. Cannot back up state." whenever the live sandbox was missing — exactly the stale state — so the recommended recovery path stayed broken and the issue reopened. Treat a registered-but-not-live sandbox as a recovery rebuild: there is no live workspace state to back up, so skip the backup/restore steps and recreate from the preserved registry + onboard-session metadata. Guard the no-backup path so it never destroys live state: - Confirm absence authoritatively against the NAMED nemoclaw gateway (getReconciledSandboxGatewayState runs `sandbox get`), and only treat `present` as live when nemoclaw is the active healthy gateway — a foreign active gateway (multi-gateway, NVIDIA#4645) or a list/get inconsistency must not trigger a destructive recreate. - Refuse the destructive path for a sandbox recorded on a non-default per-port gateway; point the operator at `openshell gateway select`. - On recreate failure, restore the captured registry entry verbatim (under the registry lock, target-only) so `rebuild`/`connect` stay retryable without clobbering concurrent changes; reclaim the default pointer. - Reset the gone sandbox's stale shields seal only after a successful recreate, and tell the operator to re-apply `shields up` if it was locked. Add a focused regression suite (rebuild-stale-recovery.test.ts) covering the recover, control, multi-gateway, per-port, and failed-recreate cases; update the NVIDIA#2276/NVIDIA#4497 reconcile Scenario 14 and the gateway-drift retry test to assert recovery instead of the dead-end; and strengthen the double-onboard E2E so the exact reporter workflow (status -> connect -> rebuild --yes) must recreate a live sandbox — the old probe only checked for "does not exist" and would have passed against this bug. Signed-off-by: Yimo Jiang <yimoj@nvidia.com>

coderabbitai Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread test/e2e/test-double-onboard.sh

yimoj added the v0.0.57 Release target label Jun 2, 2026

prekshivyas approved these changes Jun 2, 2026

View reviewed changes

cv merged commit c170aa5 into NVIDIA:main Jun 2, 2026
30 checks passed

prekshivyas self-assigned this Jun 2, 2026

coderabbitai Bot mentioned this pull request Jun 3, 2026

fix(e2e): align double-onboard stale-registry test with non-destructive status (#4578) #4672

Merged

9 tasks

miyoungc mentioned this pull request Jun 3, 2026

docs: refresh 0.0.57 release docs #4716

Merged

wscurran added the bug-fix PR fixes a bug or regression label Jun 8, 2026

coderabbitai Bot mentioned this pull request Jun 9, 2026

fix(sandbox): recover stale sandbox on rebuild instead of dead-ending (#4497) #5034

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(sandbox): preserve registry entry on stale-sandbox connect recovery (#4497)#4647

fix(sandbox): preserve registry entry on stale-sandbox connect recovery (#4497)#4647
cv merged 3 commits into
NVIDIA:mainfrom
yimoj:fix/4497-preserve-registry-on-stale-recovery

yimoj commented Jun 2, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 2, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related issues

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

prekshivyas left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

yimoj commented Jun 2, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Related Issue

Changes

Type of Change

Verification

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related issues

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

prekshivyas left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

yimoj commented Jun 2, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 2, 2026 •

edited

Loading