Skip to content

fix(sandbox): preserve registry entry on stale-sandbox connect recovery (#4497)#4647

Merged
cv merged 3 commits into
NVIDIA:mainfrom
yimoj:fix/4497-preserve-registry-on-stale-recovery
Jun 2, 2026
Merged

fix(sandbox): preserve registry entry on stale-sandbox connect recovery (#4497)#4647
cv merged 3 commits into
NVIDIA:mainfrom
yimoj:fix/4497-preserve-registry-on-stale-recovery

Conversation

@yimoj

@yimoj yimoj commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Summary

Stale-sandbox recovery no longer deletes the local registry entry before rebuild can use it. When a sandbox is registered locally but absent from a healthy NemoClaw gateway, a routine connect used to remove the entry — the very metadata the rebuild --yes hint (printed by status) depends on — so the follow-up rebuild failed with "does not exist".

Related Issue

Fixes #4497

Changes

  • src/lib/actions/sandbox/gateway-state.ts: make the missing-live branch of ensureLiveSandboxOrExit non-destructive. It now preserves the registry entry and onboard session, prints recovery guidance (retry rebuild --yes / onboard), and routes intentional purges through the explicit destroy command instead of deleting state automatically. Drops the now-unused registry / onboard-session imports.
  • test/gateway-state-reconcile-2276.test.ts: flip Scenario 1 to the new non-destructive contract and add Scenario 14 ([Ubuntu 24.04][Sandbox] stale sandbox recovery removes registry entry before rebuild can recover #4497) proving status-recommended → connect preserves → rebuild --yes can still locate the sandbox.
  • test/cli.test.ts: update the CLI-dispatch connect test to assert the entry is preserved (not removed).
  • test/e2e/test-double-onboard.sh: Phase 5 now asserts status and connect preserve the stale entry and that rebuild can locate it, then purges it via the explicit destroy while the gateway is healthy so cleanup stays reliable.

Type of Change

  • Code change (feature, bug fix, or refactor)

Verification

Reproduced end-to-end on a host whose local registry held entries while openshell sandbox list reported none (healthy gateway): before the fix connect printed "Removed stale local registry entry" and the subsequent rebuild --yes failed with "does not exist"; after the fix connect preserves the entry and rebuild --yes locates the sandbox and proceeds into its own flow.

  • npx prek run --all-files passes
  • npm test passes (pre-existing, load-induced 5000ms timeouts in test/e2e-scenario/framework-tests are unrelated — they fail on the clean baseline too and pass when run with a larger timeout)
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes (n/a — guidance text only)

Signed-off-by: Yimo Jiang yimoj@nvidia.com

Summary by CodeRabbit

  • Bug Fixes

    • Stale local sandbox registry entries are now preserved when the live sandbox is missing; CLI output now explains the entry is preserved and points to rebuild --yes / onboard for recovery and destroy for deliberate removal.
  • Tests

    • Updated unit and end-to-end tests to expect preserved-registry behavior and added a recovery scenario validating connect + rebuild --yes flows.

…ry (NVIDIA#4497)

When a sandbox is registered locally but absent from a healthy NemoClaw
gateway, `connect` removed the local registry entry. That raced with the
`rebuild --yes` recovery hint `status` prints for a stuck/stale sandbox:
the routine `connect` deleted the very metadata `rebuild` needs, so the
follow-up rebuild failed with "does not exist".

Make the missing-live path in ensureLiveSandboxOrExit non-destructive:
preserve the registry entry and onboard session, point at rebuild/onboard
for recovery, and route intentional purges through the explicit `destroy`
command. Update the NVIDIA#2276 reconcile regressions plus cli/e2e coverage to
assert the new contract, including that `rebuild` can still locate the
preserved sandbox after a non-destructive connect.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 421615fe-faa2-47bb-9d7c-0280564b2385

📥 Commits

Reviewing files that changed from the base of the PR and between 20a6482 and 0dc4f48.

📒 Files selected for processing (1)
  • test/e2e/test-double-onboard.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/e2e/test-double-onboard.sh

📝 Walkthrough

Walkthrough

When a local sandbox registry entry exists but the sandbox is absent from the live OpenShell gateway, the CLI preserves the registry entry and prints recovery guidance instead of removing it, restoring a usable recovery path via rebuild --yes or allowing intentional purge via destroy.

Changes

Stale registry preservation for recovery

Layer / File(s) Summary
Non-destructive stale registry handling
src/lib/actions/sandbox/gateway-state.ts
Import reordering and replacement of destructive stale-registry logic in ensureLiveSandboxOrExit. When a sandbox is missing from the live gateway, the function now preserves the local registry entry and session state, and prints console guidance explaining that the preserved entry enables recovery paths via rebuild --yes or intentional removal via destroy.
CLI test validation
test/cli.test.ts
Updated the "connect stale registry entry" test to assert that alpha connect targeting a missing sandbox preserves the local registry entry, emits the new guidance messaging, and keeps the sandbox definition in the saved registry file.
End-to-end and regression test coverage
test/e2e/test-double-onboard.sh, test/gateway-state-reconcile-2276.test.ts
Comprehensive test coverage across Phase 5 stale-registry reconciliation (e2e) and Scenario 1 + new Scenario 14 (regression) validating that status and connect preserve stale entries, that rebuild --yes can locate preserved sandboxes, and that only destroy removes the stale registry entry.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Possibly related issues

Possibly related PRs

  • NVIDIA/NemoClaw#4605: Modifies the same ensureLiveSandboxOrExit area to change guidance/behavior for gateway failure modes.

Suggested labels

fix, Sandbox, NemoClaw CLI, v0.0.57

Suggested reviewers

  • cv
  • cjagwani

Poem

🐰 I saw a registry entry near,
Left behind when the gateway disappeared,
I hugged it close — rebuild can find,
Don't delete by chance, call destroy kind,
Recovery hops back, carrots cheered.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: preserving registry entries during stale-sandbox connect recovery, with issue reference.
Linked Issues check ✅ Passed Changes preserve registry entry and onboard session during connect when sandbox is missing from gateway, provide recovery guidance, and require explicit destroy for purges—directly addressing #4497's requirement for a consistent recovery path.
Out of Scope Changes check ✅ Passed All changes are directly related to the stated objective: modifying stale-sandbox recovery behavior, updating test assertions to match the new contract, and adding recovery verification tests.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
test/e2e/test-double-onboard.sh (1)

747-751: ⚡ Quick win

Wrap the new recovery probes in run_with_timeout.

These are the only new Phase 5 CLI calls that can still hang the job if a prompt/regression slips back in. Reusing the existing timeout helper here will fail fast instead of burning most of the 80-minute E2E budget.

Suggested change
 CONNECT_LOG="$(mktemp)"
-NEMOCLAW_NON_INTERACTIVE=1 run_nemoclaw "$SANDBOX_A" connect >"$CONNECT_LOG" 2>&1
+run_with_timeout \
+  "$PHASE_TIMEOUT" \
+  env NEMOCLAW_NON_INTERACTIVE=1 "${NEMOCLAW_CMD[@]}" "$SANDBOX_A" connect \
+  >"$CONNECT_LOG" 2>&1
 connect_exit=$?
 connect_output="$(cat "$CONNECT_LOG")"
 rm -f "$CONNECT_LOG"
 
@@
 REBUILD_LOG="$(mktemp)"
-NEMOCLAW_NON_INTERACTIVE=1 run_nemoclaw "$SANDBOX_A" rebuild --yes >"$REBUILD_LOG" 2>&1 || true
+run_with_timeout \
+  "$PHASE_TIMEOUT" \
+  env NEMOCLAW_NON_INTERACTIVE=1 "${NEMOCLAW_CMD[@]}" "$SANDBOX_A" rebuild --yes \
+  >"$REBUILD_LOG" 2>&1 || true
 rebuild_output="$(cat "$REBUILD_LOG")"
 rm -f "$REBUILD_LOG"

Also applies to: 774-777

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/test-double-onboard.sh` around lines 747 - 751, The new Phase 5
connect invocation can hang; wrap the non-interactive run_nemoclaw call in the
existing run_with_timeout helper so it fails fast: create the temp CONNECT_LOG
as before, invoke run_with_timeout to run NEMOCLAW_NON_INTERACTIVE=1
run_nemoclaw "$SANDBOX_A" connect redirecting stdout/stderr to CONNECT_LOG, then
read CONNECT_LOG into connect_output and set connect_exit from the wrapped
command exit; apply the same wrapping to the other similar block (the lines
around the second connect usage). Use the same timeout value/semantics as
existing run_with_timeout calls to preserve behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@test/e2e/test-double-onboard.sh`:
- Around line 747-751: The new Phase 5 connect invocation can hang; wrap the
non-interactive run_nemoclaw call in the existing run_with_timeout helper so it
fails fast: create the temp CONNECT_LOG as before, invoke run_with_timeout to
run NEMOCLAW_NON_INTERACTIVE=1 run_nemoclaw "$SANDBOX_A" connect redirecting
stdout/stderr to CONNECT_LOG, then read CONNECT_LOG into connect_output and set
connect_exit from the wrapped command exit; apply the same wrapping to the other
similar block (the lines around the second connect usage). Use the same timeout
value/semantics as existing run_with_timeout calls to preserve behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c7f14035-c10a-4c2d-b785-8df2e8d559a8

📥 Commits

Reviewing files that changed from the base of the PR and between cab8f39 and 51c97b5.

📒 Files selected for processing (4)
  • src/lib/actions/sandbox/gateway-state.ts
  • test/cli.test.ts
  • test/e2e/test-double-onboard.sh
  • test/gateway-state-reconcile-2276.test.ts

…meout

Address CodeRabbit review on PR NVIDIA#4647: wrap the new stale-recovery
connect/rebuild/destroy probes in `run_with_timeout` (mirroring the Phase 4
probe-only connect) so a reintroduced prompt or hang fails fast instead of
stalling to the phase timeout.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/test-double-onboard.sh`:
- Around line 781-784: The rebuild probe currently swallows failures by
appending "|| true" and never checking the command's exit status; remove the "||
true", capture the exit code from run_with_timeout when invoking env
NEMOCLAW_NON_INTERACTIVE=1 "${NEMOCLAW_CMD[@]}" "$SANDBOX_A" rebuild --yes
(store it in e.g. rebuild_exit), and assert that rebuild_exit indicates failure
on timeout (fail the test if it equals the timeout/kill code or is non-zero);
keep the existing REBUILD_LOG capture and the grep -q "does not exist" check but
only consider the test passed when the exit code and the log both indicate the
expected failure. Ensure you reference the variables REBUILD_LOG, rebuild_output
and the run_with_timeout invocation when making these changes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 830ea07b-7c8d-411a-a2e3-14e2144974be

📥 Commits

Reviewing files that changed from the base of the PR and between 51c97b5 and 20a6482.

📒 Files selected for processing (1)
  • test/e2e/test-double-onboard.sh

Comment thread test/e2e/test-double-onboard.sh
… pass

Address CodeRabbit review on PR NVIDIA#4647: the rebuild recovery probe used
`|| true` with only a `grep` content assertion, so a run_with_timeout kill
(exit 124) produced output lacking "does not exist" and falsely PASSed.
Capture the exit code and fail explicitly on a timeout before the content
check. The connect probe (asserts exit==1) and destroy probe (caught by the
registry check) already fail correctly on timeout.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@yimoj yimoj added the v0.0.57 Release target label Jun 2, 2026

@prekshivyas prekshivyas left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

APPROVE.

Traced every branch of ensureLiveSandboxOrExit at head 0dc4f48: no removeSandbox/updateSession call remains in any path (missing, wrong_gateway_active, identity_drift, *_after_restart, stuck-phase, fallthrough), so the registry entry is preserved on all stale-state paths, not just the #4497 repro. The dropped registry/onboard-session imports are genuinely unused, and reconcileMissingAgainstNamedGateway stays read-only.

Tests prove the regression: cli.test.ts and Scenario 1 invert the old remove-and-clear assertions, and Scenario 14 + e2e Phase 5 walk the full status→connect→rebuild chain the issue describes — rebuild now locates the preserved sandbox instead of failing "does not exist". The rebuild-timeout-124 guard from CodeRabbit is in. CI green across macos/wsl/sandbox e2e, vitest, CodeQL, ShellCheck.

Signed-off-by: Prekshi Vyas prekshiv@nvidia.com

@cv cv merged commit c170aa5 into NVIDIA:main Jun 2, 2026
30 checks passed
@prekshivyas prekshivyas self-assigned this Jun 2, 2026
cv pushed a commit that referenced this pull request Jun 3, 2026
## Summary
- Add the missing `v0.0.57` release-notes section with links to the
detailed docs pages for command, inference, onboarding, messaging,
status, installer, and policy changes.
- Remove public references to docs-skip terms from source docs and
regenerate the NemoClaw user skills from the current Fern MDX docs.
- Carry forward generated references for the per-agent documentation
split, including Hermes-specific reference files.

## Source summary
- #4615 and #4653 -> `docs/about/release-notes.mdx`,
`docs/reference/commands.mdx`: Release notes now cover host-side
`sessions` and `agents` commands plus `NEMOCLAW_EXTRA_AGENTS_JSON`
secondary-agent baking.
- #4163, #4204, #4611, #4619, and #4676 ->
`docs/about/release-notes.mdx`,
`docs/inference/use-local-inference.mdx`: Release notes now cover
managed vLLM progress/readiness, DGX Spark model default changes, local
Ollama streaming usage, and inference route divergence warnings.
- #4267, #4601, #4609, #4642, #4645, and #4661 ->
`docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Release
notes now cover UFW auto-remediation, local-inference reachability
gates, gateway reuse/binding, cancel rollback, and policy selection
persistence.
- #4577, #4582, #4607, and #4660 -> `docs/about/release-notes.mdx`,
`docs/manage-sandboxes/messaging-channels.mdx`: Release notes now cover
Slack validation, atomic `channels add`, WhatsApp QR diagnostics, and
Slack placeholder normalization.
- #4388, #4600, #4646, and #4647 -> `docs/about/release-notes.mdx`,
`docs/reference/commands.mdx`: Release notes now cover status failure
layers, paused-container hints, Docker-driver doctor behavior, and
non-destructive stale-registry recovery.
- #4569, #4579, and #4678 -> `docs/about/release-notes.mdx`,
`docs/manage-sandboxes/lifecycle.mdx`,
`docs/network-policy/integration-policy-examples.mdx`: Release notes now
cover installer tag pinning, PyPI `uv` policy access, and observable
Jira validation.
- #4632 -> `.agents/skills/`: Regenerated user skills from the current
per-agent docs source, including newly generated Hermes reference files.

## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user --doc-platform fern-mdx`
- `rg "permissive mode|shields down|shields up|shields status|config
rotate-token|rotate-token" docs --glob "*.mdx"`
- `rg "permissive mode|shields down|shields up|shields status|config
rotate-token|rotate-token" .agents/skills --glob "*.md"`
- `npm run docs`
- `npm run build:cli`
- Commit hooks: markdownlint, docs-to-skills verification, gitleaks,
skills YAML, commitlint

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Documentation**
* Restructured documentation to clearly distinguish OpenClaw and Hermes
agent variants throughout user guides.
* Enhanced security, credential storage, and deployment guidance with
clearer setup flows.
  * Added Hermes plugin installation and ecosystem documentation.
* Improved workspace, messaging, and policy management references with
variant-specific command examples.
  * Refined troubleshooting and CLI reference sections for clarity.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@wscurran wscurran added the bug-fix PR fixes a bug or regression label Jun 8, 2026
yimoj added a commit to yimoj/NemoClaw that referenced this pull request Jun 10, 2026
…NVIDIA#4497)

PR NVIDIA#4647 stopped `connect` from deleting the registry entry of a sandbox
that is registered locally but absent from a healthy gateway, so the
`rebuild --yes` hint `status` prints would have something to recover. But
`rebuild` still dead-ended: its backup step aborted with "Sandbox '<name>'
is not running. Cannot back up state." whenever the live sandbox was
missing — exactly the stale state — so the recommended recovery path stayed
broken and the issue reopened.

Treat a registered-but-not-live sandbox as a recovery rebuild: there is no
live workspace state to back up, so skip the backup/restore steps and
recreate from the preserved registry + onboard-session metadata. Guard the
no-backup path so it never destroys live state:

- Confirm absence authoritatively against the NAMED nemoclaw gateway
  (getReconciledSandboxGatewayState runs `sandbox get`), and only treat
  `present` as live when nemoclaw is the active healthy gateway — a foreign
  active gateway (multi-gateway, NVIDIA#4645) or a list/get inconsistency must not
  trigger a destructive recreate.
- Refuse the destructive path for a sandbox recorded on a non-default
  per-port gateway; point the operator at `openshell gateway select`.
- On recreate failure, restore the captured registry entry verbatim (under
  the registry lock, target-only) so `rebuild`/`connect` stay retryable
  without clobbering concurrent changes; reclaim the default pointer.
- Reset the gone sandbox's stale shields seal only after a successful
  recreate, and tell the operator to re-apply `shields up` if it was locked.

Add a focused regression suite (rebuild-stale-recovery.test.ts) covering the
recover, control, multi-gateway, per-port, and failed-recreate cases; update
the NVIDIA#2276/NVIDIA#4497 reconcile Scenario 14 and the gateway-drift retry test to
assert recovery instead of the dead-end; and strengthen the double-onboard
E2E so the exact reporter workflow (status -> connect -> rebuild --yes) must
recreate a live sandbox — the old probe only checked for "does not exist"
and would have passed against this bug.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug-fix PR fixes a bug or regression v0.0.57 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Ubuntu 24.04][Sandbox] stale sandbox recovery removes registry entry before rebuild can recover

4 participants