fix(status): preserve registry on missing live sandbox by ericksoa · Pull Request #4578 · NVIDIA/NemoClaw

ericksoa · 2026-05-31T14:38:25Z

Summary

Addresses the destructive passive-status side effect from #4423. nemoclaw <name> status is now non-destructive when a sandbox is registered locally but missing from the live OpenShell gateway, preventing the first post-reboot status check on DGX Spark from deleting registry state before gateway reconciliation catches up.

This is the v0.0.56 safety mitigation only: it preserves local registry/onboard-session state and makes the failure explicit. The broader #4423 recovery goals remain open on the issue, including DGX Spark gateway auto-start after reboot and full post-reboot sandbox reconciliation/recovery.

Related Issue

Partially addresses #4423.

Changes

Preserve registry and onboard-session state in the status-only missing-live-sandbox path.
Print explicit guidance that status did not remove the entry and should be retried after gateway reconnection.
Keep destructive stale-entry cleanup in active-use paths like connect, and update regression tests to lock that split.

Type of Change

Code change (feature, bug fix, or refactor)
Code change with doc updates
Doc only (prose changes, no code sample modifications)
Doc only (includes code sample changes)

Verification

npx prek run --all-files passes
npm test passes
Tests added or updated for new or changed behavior
No secrets, API keys, or credentials committed
Docs updated for user-facing behavior changes
npm run docs builds without warnings (doc changes only)
Doc pages follow the style guide (doc changes only)
New doc pages include SPDX header and frontmatter (new pages only)

Validation run:

npm run build:cli
npx vitest run test/gateway-state-reconcile-2276.test.ts test/cli.test.ts - 182 passed, 1 skipped
npx vitest run test/cli.test.ts -t "preserves an orphan registry entry" - 1 passed
npx vitest run test/gateway-state-reconcile-2276.test.ts -t "reports the missing live sandbox" - 1 passed

Advisor-selected E2E:

Required sandbox-operations-e2e auto-dispatched by the E2E advisor against PR head 93dd87a67c876358d42ef399a083d86baf3dc5a0: https://github.com/NVIDIA/NemoClaw/actions/runs/26715565554
Required scenario E2E ubuntu-repo-cloud-openclaw dispatched against this branch: https://github.com/NVIDIA/NemoClaw/actions/runs/26715713730
Optional adjacent gateway-drift-preflight-e2e regression dispatched against this branch: https://github.com/NVIDIA/NemoClaw/actions/runs/26715713787

Signed-off-by: Aaron Erickson aerickson@nvidia.com

Summary by CodeRabbit

Bug Fixes
- The status command now preserves local sandbox registry entries when registered locally but missing from the gateway. Clear guidance is provided for recovery steps instead of automatic removal.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

coderabbitai · 2026-05-31T14:38:42Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 05f446c4-08e6-47b5-89be-e79f53ee6b05

📥 Commits

Reviewing files that changed from the base of the PR and between 93dd87a and eabc2f1.

📒 Files selected for processing (2)

src/lib/actions/sandbox/status.ts
test/cli.test.ts

🚧 Files skipped from review as they are similar to previous changes (2)

test/cli.test.ts
src/lib/actions/sandbox/status.ts

📝 Walkthrough

Walkthrough

When nemoclaw status finds a sandbox in the local registry but missing from the live OpenShell gateway, the CLI now prints user guidance and preserves the local registry and onboard session state instead of removing them. Tests were updated to assert the new non-destructive behavior and messaging.

Changes

Preserve orphan entries instead of cleanup

Layer / File(s)	Summary
Status handler refactoring for orphan guidance `src/lib/actions/sandbox/status.ts`	Remove imports related to the destructive lifecycle guard and onboard-session cleanup, add `printMissingLiveSandboxStatusGuidance()` to render guidance for sandboxes registered locally but absent from the live gateway, and replace the `lookup.state === "missing"` branch with a call to the new helper (no registry deletion or session clearing).
Test updates verifying non-destructive behavior `test/cli.test.ts`, `test/gateway-state-reconcile-2276.test.ts`	Rename and update the orphan-registry status test and Scenario 2 to assert preservation of the registry and onboard-session state, expect exit status `1` where appropriate, verify the output contains "registered locally, but is not present in the live OpenShell gateway", and ensure it does not contain "Removed stale local registry entry".

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

NVIDIA/NemoClaw#4423: Addresses the same destructive cleanup behavior in src/lib/actions/sandbox/status.ts, relevant to preserving local registry entries when gateways transiently lose sandboxes.

Suggested labels

NemoClaw CLI, fix, v0.0.56

Suggested reviewers

cv

Poem

🐰 I found an orphaned name in the log,
once slated for loss like a driftwood log.
Now I whisper a hint,
keep the registry mint —
reboot may bring back its catalog 🌱

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately and concisely summarizes the main change: making the status command preserve local registry state when a sandbox is missing from the live gateway, addressing the core issue described in the PR.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/4423-preserve-registry-on-status

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-05-31T14:40:27Z

E2E Advisor Recommendation

Required E2E: sandbox-operations-e2e
Optional E2E: sandbox-survival-e2e, gateway-drift-preflight-e2e

Dispatch hint: sandbox-operations-e2e

Auto-dispatched E2E: sandbox-operations-e2e via nightly-e2e.yaml at eabc2f187169355ce79e3afff7f68bb895c300a8 — nightly run

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

sandbox-operations-e2e (high (~60 min, creates multiple sandboxes)): Required because the source change affects nemoclaw <sandbox> status and sandbox lifecycle behavior. This job exercises live sandbox operations including status output, registry behavior, process recovery through status, destroy cleanup, multi-sandbox flows, and gateway kill recovery.

Optional E2E

sandbox-survival-e2e (medium-high (~30 min, requires NVIDIA_API_KEY/live inference)): Useful adjacent confidence for gateway restart/recovery semantics: it validates that a sandbox remains discoverable, registry state is retained, and nemoclaw <name> status/connect work after gateway stop/start. This is relevant to the changed missing-live-sandbox guidance but broader and more expensive than the targeted required job.
gateway-drift-preflight-e2e (low-medium (~15 min, mocked OpenShell/Docker state)): Optional adjacent regression coverage for fail-closed/no-mutation behavior when OpenShell sandbox-state RPCs are unsafe or drifted. The PR changes another no-mutation status path, so this provides confidence around related gateway-state safety boundaries.

New E2E recommendations

sandbox status missing-live-sandbox reconciliation (high): Existing live E2E coverage exercises normal status and gateway recovery, but there does not appear to be a dedicated E2E that creates a registered local sandbox entry, makes the live OpenShell gateway report NotFound for that sandbox, and asserts nemoclaw <name> status preserves sandboxes.json and onboard session state while printing the new passive guidance.
- Suggested test: Add a targeted E2E/regression case for passive status on a missing live sandbox: registered local sandbox + healthy named gateway + openshell sandbox get NotFound => status exits non-zero, prints 'No local registry entry was removed', and leaves registry/session untouched.

Dispatch hint

Workflow: nightly-e2e.yaml
jobs input: sandbox-operations-e2e

github-actions · 2026-05-31T14:40:28Z

E2E Scenario Advisor Recommendation

Required scenario E2E: ubuntu-repo-cloud-openclaw
Optional scenario E2E: None

Dispatch required scenario E2E:

gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: medium

Required scenario E2E

ubuntu-repo-cloud-openclaw: The PR changes the core sandbox status command behavior for a registered sandbox that is missing from the live OpenShell gateway. The Ubuntu repo cloud OpenClaw scenario is the smallest routed scenario that exercises the main repo-checkout onboarding path and runtime status/list operations for a live sandbox.
- Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

Optional scenario E2E

None.

Relevant changed files

src/lib/actions/sandbox/status.ts

github-actions · 2026-05-31T14:42:17Z

PR Review Advisor

Findings: 0 needs attention, 1 worth checking, 0 nice ideas
Since last review: 2 prior items resolved, 0 still apply, 1 new item found

Review findings

🛠️ Needs attention

None.

🔎 Worth checking

Runtime validation is still recommended for the infrastructure-dependent reboot path (src/lib/actions/sandbox/status.ts:407): The updated CLI and harness tests cover the intended passive-status split, including preserving registry and onboard-session state when the live gateway reports the sandbox missing. However, the motivating DGX Spark/OpenShell post-reboot state is infrastructure-dependent, and the stubbed tests cannot prove the real gateway recovery/reconciliation timing behaves the same way.
- Recommendation: Add or identify targeted runtime/integration validation for the changed behavior: a locally registered sandbox should remain in registry/session when `nemoclaw <name> status` sees a missing live sandbox during gateway recovery, while active-use paths such as `connect` keep their existing cleanup behavior. Do not rely on external E2E status alone as code-review evidence.
- Evidence: The production change is the `lookup.state === "missing"` branch in `showSandboxStatus()`, which now calls `printMissingLiveSandboxStatusGuidance()` and exits. Tests in `test/cli.test.ts` and `test/gateway-state-reconcile-2276.test.ts` use stubbed `openshell` binaries and filesystem fixtures rather than a real rebooted OpenShell gateway.

🌱 Nice ideas

None.

Since last review details

Current findings:

Runtime validation is still recommended for the infrastructure-dependent reboot path (src/lib/actions/sandbox/status.ts:407): The updated CLI and harness tests cover the intended passive-status split, including preserving registry and onboard-session state when the live gateway reports the sandbox missing. However, the motivating DGX Spark/OpenShell post-reboot state is infrastructure-dependent, and the stubbed tests cannot prove the real gateway recovery/reconciliation timing behaves the same way.
- Recommendation: Add or identify targeted runtime/integration validation for the changed behavior: a locally registered sandbox should remain in registry/session when `nemoclaw <name> status` sees a missing live sandbox during gateway recovery, while active-use paths such as `connect` keep their existing cleanup behavior. Do not rely on external E2E status alone as code-review evidence.
- Evidence: The production change is the `lookup.state === "missing"` branch in `showSandboxStatus()`, which now calls `printMissingLiveSandboxStatusGuidance()` and exits. Tests in `test/cli.test.ts` and `test/gateway-state-reconcile-2276.test.ts` use stubbed `openshell` binaries and filesystem fixtures rather than a real rebooted OpenShell gateway.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

ericksoa · 2026-05-31T14:47:21Z

Advisor feedback addressed.

Narrowed the PR body from Fixes #4423 to Partially addresses #4423 so this merge does not close the broader DGX Spark reboot-recovery issue.
Explicitly documented the source boundary: this PR is the v0.0.56 safety mitigation for the destructive passive status side effect only. Full gateway auto-start/reconciliation/recovery remains tracked on [DGX Spark][Sandbox] post-reboot first 'nemoclaw <name> status' destroys sandbox via "Removed stale local registry entry" #4423.
Added links to the advisor-selected E2E runs in the PR body:
- required sandbox-operations-e2e: https://github.com/NVIDIA/NemoClaw/actions/runs/26715565554
- required scenario ubuntu-repo-cloud-openclaw: https://github.com/NVIDIA/NemoClaw/actions/runs/26715713730
- optional adjacent gateway-drift-preflight-e2e: https://github.com/NVIDIA/NemoClaw/actions/runs/26715713787

For the removal condition: passive status no longer removes local state on live-gateway NotFound; active-use paths like connect still retain the existing true-stale cleanup behavior. The focused regression tests assert both sides of that split.

github-actions · 2026-05-31T14:53:50Z

Selective E2E Results — ✅ All requested jobs passed

Run: 26715565554
Target ref: 93dd87a67c876358d42ef399a083d86baf3dc5a0
Workflow ref: main
Requested jobs: sandbox-operations-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job	Result
sandbox-operations-e2e	✅ success

ericksoa · 2026-05-31T15:01:01Z

@cv could you please review/approve this v0.0.56 mitigation when you get a chance?

The advisor scope feedback has been addressed by changing the PR from Fixes #4423 to Partially addresses #4423; this now only preserves registry/onboard-session state for passive nemoclaw <name> status when the live gateway reports the sandbox missing. The broader DGX Spark reboot recovery work remains tracked on #4423 for v0.0.57.

Current validation is green:

required sandbox-operations-e2e: https://github.com/NVIDIA/NemoClaw/actions/runs/26715565554
required scenario ubuntu-repo-cloud-openclaw: https://github.com/NVIDIA/NemoClaw/actions/runs/26715713730
optional adjacent gateway-drift-preflight-e2e: https://github.com/NVIDIA/NemoClaw/actions/runs/26715713787
standard PR checks are passing.

cjagwani

Approving. Right shape for v0.0.56 — keep status passive so DGX Spark first-post-reboot doesn't nuke registry state before the gateway reconciles. Destructive cleanup stays on connect, which is the correct boundary.

Tests pin both sides (gateway-state-reconcile-2276.test.ts flips the assertion correctly, cli.test.ts covers the new guidance string). sandbox-operations-e2e, ubuntu-repo-cloud-openclaw, wsl-e2e, macos-e2e, test-e2e-sandbox, test-e2e-gateway-isolation, and test-e2e-port-overrides all passed. Scope properly narrowed to "partially addresses #4423" per advisor feedback.

github-actions · 2026-06-01T19:00:20Z

Selective E2E Results — ✅ All requested jobs passed

Run: 26774901829
Target ref: eabc2f187169355ce79e3afff7f68bb895c300a8
Workflow ref: main
Requested jobs: sandbox-operations-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job	Result
sandbox-operations-e2e	✅ success

…ve status (#4578) (#4672)  ## Summary Replaces #4635 with the same diff on an upstream `NVIDIA/NemoClaw` branch so maintainers can dispatch E2E from the replacement PR. The `double-onboard-e2e` nightly job failed at Phase 5 because the test expected `nemoclaw status` to reconcile stale registry entries, but #4578 intentionally made `status` non-destructive. This PR aligns the test with that behavior by verifying the non-destructive `status` output first, then triggering actual reconciliation via `connect --probe-only`. ## Related Issue Fixes #4639 ## Changes - `test/e2e/test-double-onboard.sh`: replace the single `nemoclaw status` reconciliation assertion with a two-step sequence. - Verify `status` exits 1 and preserves the registry entry with the "No local registry entry was removed" guidance. - Trigger reconciliation via `connect --probe-only` and verify "Removed stale local registry entry" plus final registry cleanup. ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [ ] `npx prek run --all-files` passes - [ ] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes Note: local checks were not rerun for this replacement branch; it copies #4635 unchanged. #4635 reported focused validation for `double-onboard-e2e` on the same fix commit: https://github.com/hunglp6d/NemoClaw/actions/runs/26792577152. --- Signed-off-by: San Dang <sdang@nvidia.com>  ## Summary by CodeRabbit * **Improvements** * Clarified that the status command is non-destructive and explicitly reports when it does not remove local registry entries. * Status and connect now intentionally preserve stale registry entries unless destroyed. * Cleanup now relies on the explicit destroy action to remove registry entries. * Tests updated to verify the non-destructive status message and to avoid forcing reconciliation after teardown.  --------- Signed-off-by: Hung Le <hple@nvidia.com> Signed-off-by: San Dang <sdang@nvidia.com> Signed-off-by: Carlos Villela <cvillela@nvidia.com> Co-authored-by: Hung Le <hple@nvidia.com> Co-authored-by: Carlos Villela <cvillela@nvidia.com>

## Summary Post-Computex / v0.0.60 follow-up for the gateway lifecycle half of #4423. This is intentionally separate from #4578, which remains the v0.0.56 safety hotfix for non-destructive status behavior. - Use OpenShell's package-managed `openshell-gateway` user service when its vendor/package unit is present, writing Docker-driver gateway env before restart. - Ignore per-user/stale gateway unit files so standalone recovery remains available when there is no package-managed OpenShell service. - Keep the existing standalone gateway launch path as an explicit compatibility fallback when the upstream service is unavailable. - Update docs/tests to describe package-service ownership vs. standalone fallback. ## Validation - `npm run build:cli` - `npm run typecheck:cli` - `npm run checks` - `npm run source-shape:check` - `npm run check:installer-hash` - `bash -n scripts/install-openshell.sh` - `bash test/e2e/test-openshell-version-pin.sh` - `npx vitest run src/lib/onboard/docker-driver-gateway-env.test.ts src/lib/onboard/docker-driver-gateway-service.test.ts test/install-openshell-version-check.test.ts test/runner.test.ts test/onboard-gateway-runtime.test.ts test/gateway-state-reconcile-2276.test.ts` Refs #4423 Follow-up to #4578  ## Summary by CodeRabbit * **Documentation** * Clarified Deployment Topology, uninstall/state-dir contents, Apple Silicon sandbox behavior, and environment-variable guidance for the standalone fallback. * **New Features** * Installer on Linux now attempts to start a package-managed OpenShell gateway and falls back to the standalone gateway when appropriate. * **Tests** * Expanded unit and e2e tests to cover package-managed vs standalone gateway flows and realistic checksum-driven installer scenarios.  --------- Signed-off-by: Aaron Erickson <aerickson@nvidia.com> Co-authored-by: Carlos Villela <cvillela@nvidia.com>

The first end-to-end dispatch with the new probe ordering revealed that the test was hitting #4578's mitigation, not the open #4423 bug. With the gateway stopped and never restarted, status's `getNamedGateway- LifecycleState()` returned NOT healthy_named, which routes through the non-destructive branch — so the registry was correctly preserved and the test failed at gateway-healthy, masking the real signal. Add `openshell gateway start --name nemoclaw` between `docker stop` and `nemoclaw <name> status` so the lifecycle truly mirrors a DGX Spark / Linux Docker-driver post-reboot: * Gateway stopped (sandbox memory dropped). * Sandbox container stopped (so OpenShell sees NotFound). * Gateway restarted FRESH via the explicit `openshell gateway start` invocation — the user-systemd autostart path from #4580 in compressed form. Now the gateway is `healthy_named` AND has no sandbox memory. * Status is then invoked. With healthy_named gateway + missing sandbox + Docker container still present, the destructive `missing` branch in `src/lib/actions/sandbox/status.ts:308` fires and wipes the registry on unfixed code. This is the precise precondition that PR-A's Docker-driver sandbox recovery helper neutralizes: status will see Docker still has the labeled container, restart it, and skip the destructive cleanup. Refs #4423

Initial dispatches of `ubuntu-repo-docker-post-reboot-recovery` against `ubuntu-latest` revealed that the user-visible #4423 bug class cannot be fully reproduced from CI without a real reboot: * `docker stop` of the labeled sandbox container puts OpenShell into `Phase: Provisioning`, not literal `NotFound` — so the destructive `missing` branch in `src/lib/actions/sandbox/status.ts:308` does not fire on unfixed code. * The OpenShell CLI on `ubuntu-latest` does not expose a `gateway start` subcommand (the user-systemd autostart path from #4580 is the real autostart mechanism, not the CLI), so we can't cleanly reset the gateway to a fresh-state-with-no-sandbox-memory configuration that would yield a real `NotFound`. * Gateway HTTP health on the runner is environmental and varies independently of the regression target. Scope the `post-reboot-recovery-ready` expected state to ONLY the host-side preservation invariants this scenario can faithfully exercise from CI: * `cli` installed, * `localRegistry` entry preserved, * `dockerSandboxContainer` (labeled, including `*-nemoclaw-gpu-backup-*` siblings) present. This makes the scenario a "lock-in" guard for #4578's mitigation and for the host-side invariants PR-A's Docker-driver recovery helper must also preserve. A future scenario on a more controlled runner can extend the expected state with gateway/sandbox runtime probes once PR-A has landed and a true post-reboot environment is available. The framework primitives — lifecycle phase fixture, host-side probes, lifecycle whitelist — remain unchanged and remain useful beyond this scenario's current limits. Refs #4423

The PR Review Advisor on #5087 flagged three places that still described the OLD lifecycle (stops gateway runtime + restarts it) even though the code already exercises the scoped lifecycle (stops labeled container only, runs `nemoclaw status`, leaves gateway HEALTHY): * `scenarios/scenarios/baseline.ts` — scenario description. * `manifests/openclaw-nvidia-post-reboot-recovery.yaml` — manifest inline comment. * `live/registry-scenarios.test.ts` — driver comment on the lifecycle-dispatch block. Update each to: - Describe the actual two-step lifecycle (`docker stop` + status). - Note the gateway is left healthy and why (no `gateway start` subcommand on `ubuntu-latest`; #4578's mitigation would mask the regression target if the gateway were down). - Re-frame the assertion surface as host-side preservation invariants (`local-registry-entry-present`, `docker-sandbox-container-present`). - Cross-reference between the three locations so future readers don't have to chase context. No code change. Refs #4423

…-up to #5046) (#5087) ## Summary Follow-up to #5046. Tightens the `ubuntu-repo-docker-post-reboot-recovery` scenario into a stable RED-on-regression / GREEN-on-main guard while PR-A (the `#4423` source fix) is in flight. ## Related Issue Refs #4423 ## Why this is needed Post-merge dispatches of #5046's `ubuntu-repo-docker-post-reboot-recovery` job against `ubuntu-latest` revealed two issues: 1. **The `openshell gateway stop` step in `LifecyclePhaseFixture` was counterproductive.** `#4578`'s mitigation in `src/lib/actions/sandbox/status.ts:296` guards the destructive branch behind a `healthy_named` gateway. Stopping the gateway routed every dispatch through the safe path and masked the regression target. There is no `openshell gateway start` subcommand on `ubuntu-latest` (the user-systemd autostart from #4580 is the real mechanism), so the original lifecycle couldn't restart the gateway cleanly to keep it `healthy_named`. 2. **The `post-reboot-recovery-ready` expected-state asserted on runtime liveness probes** (`gateway-healthy`, `sandbox-running`) that are environmental on `ubuntu-latest` after `docker stop`. They varied independently of the `#4423` regression target and made every dispatch RED for the wrong reason — concretely, `gateway-healthy` failing before the host-side probes had a chance to run. ## Changes ### `LifecyclePhaseFixture.simulatePostReboot` - Drop the `openshell gateway stop` step. The lifecycle now stops only the labeled sandbox container and then runs `nemoclaw <name> status`. Gateway is left `healthy_named`, which is the precondition for the destructive branch we want PR-A to neutralize. - Drop the `GATEWAY_STOP_TIMEOUT_MS` constant (no longer referenced). - Updated docstring + 2 unit tests removed (gateway-stop-failure tolerance tests are no longer applicable). ### `post-reboot-recovery-ready` expected-state Lock down only the host-side preservation invariants this scenario can faithfully exercise: - `cli-installed` - `local-registry-entry-present` - `docker-sandbox-container-present` (running OR stopped OR `*-nemoclaw-gpu-backup-*` sibling) Drop `gateway: { expected: "present", health: "healthy" }` and `sandbox: { expected: "present", status: "running", agent: "openclaw" }`. A follow-up scenario on a more controlled runner (real DGX Spark / Brev) can extend the expected state with runtime liveness probes once PR-A has landed and a true post-reboot environment is available. ### `probesForState` ordering Reorder so host-side observables (`local-registry-entry-present`, `docker-sandbox-container-present`) emit BEFORE runtime-derived gateway/sandbox probes everywhere. The state-validation orchestrator short-circuits on the first probe failure, so host-side regression targets must be observed first. Pre-existing scenarios (`cloud-openclaw-ready`, `preflight-failure-*`, etc.) are unaffected — none declare host-side dimensions, so their probe lists are unchanged. ## RED / GREEN contract - **On `main`** (this PR's HEAD): scenario goes 🟢 GREEN. Locks in `#4578`'s mitigation as a passive guard — registry preserved through `docker stop` + `nemoclaw status` round-trip. - **On a regression PR** that re-introduces unconditional `registry.removeSandbox` in `status.ts:296` or `gateway-state.ts:412`: scenario goes 🔴 RED at `local-registry-entry-present`. - **On PR-A's fix branch** that adds Docker-driver sandbox recovery: scenario stays 🟢, demonstrating the new helper does not regress either host-side invariant. ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes - [x] `npm test` passes — 142 tests across the directly affected files all green; no regressions on the rest of the e2e-scenario framework suite. - [x] Live dispatch (`gh workflow run e2e-vitest-scenarios.yaml -f scenarios=ubuntu-repo-docker-post-reboot-recovery --ref e2e-post-reboot-scope-host-side`) goes GREEN: lifecycle (`docker stop` + `nemoclaw status`) → host-side probes pass → cleanup restores Docker. Will dispatch as a verification step once this PR is open. - [x] Tests added or updated for new or changed behavior. - [x] No secrets, API keys, or credentials committed. ## Follow-ups (post PR-A) - Once PR-A's Docker-driver recovery helper is in place, extend the post-reboot expected-state with gateway/sandbox runtime probes on a controlled runner. - Consider a `post-reboot-spark` lifecycle profile for hardware-runner dispatch that can do a real reboot and exercise the full bug class.  ## Summary by CodeRabbit * **Tests** * Added a validation test for post-reboot-recovery expected-state focusing on host-side invariants. * Updated post-reboot-recovery phase tests to remove the gateway-stop step and assert container stop + host status checks. * **Documentation** * Clarified lifecycle and scenario comments/manifests to describe container-stop + host-status-driven recovery and probe ordering. * **Behavior** * Expected-state and probe emission now prioritize host-side checks and omit runtime gateway/sandbox expectations.

@ericksoa

…4423) (#5091) ## Summary Closes the active-recovery half of #4423 — parts 2 & 3 of @ericksoa's plan. The destructive registry-removal paths were already neutralized by: - **#4578** — `status` no longer calls `registry.removeSandbox` on `NotFound` unless the gateway is `healthy_named`. - **#4497** — `ensureLiveSandboxOrExit` (which `connect` rides) preserves the registry entry on `missing` and routes intentional purges through explicit `destroy`. What's still missing is the *active recovery* path: when the gateway returns to `healthy_named` after a reboot but reports the sandbox as `NotFound`, NemoClaw currently prints "retry in a moment" guidance. This PR makes it walk Docker by the OpenShell labels, restart the labeled container (or rename a `*-nemoclaw-gpu-backup-*` sibling back), re-query OpenShell, and return a `present` lookup with a recovery breadcrumb. The user runs `nemoclaw <name> status` once and the sandbox is live. ## Related Issue Refs #4423 ## Changes ### New: `src/lib/onboard/docker-driver-sandbox-recovery.ts` `recoverDockerDriverSandbox(name, deps)` scans Docker by `openshell.ai/managed-by=openshell` AND `openshell.ai/sandbox-name=<name>` (the same labels `findOpenShellDockerSandboxContainerIds` already uses). Dispatches by container shape: | Branch ID | Trigger | Action | |---|---|---| | `started-running-original` | Labeled container is already running | No-op success | | `started-stopped-original` | Labeled container exists but stopped | `docker start` | | `renamed-and-started-backup` | Only a `*-nemoclaw-gpu-backup-*` sibling exists (from `docker-gpu-patch.ts`'s GPU patch) | `docker rename <backup> <original>` then `docker start` | | Conflict | Stopped original + backup sibling both exist | Start the original, leave the backup alone | | None | No labeled container | `{ recovered: false }` — callers fall through to existing non-destructive guidance | Adapter access is per-call lazy `require` — top-level `import` from `../adapters/docker` pulls in `runner.ts`'s load-time `require("./platform")` which the Vitest TS loader can't resolve. Same escape hatch as `auto-pair-approval.ts`. Label constants are duplicated locally with comments pointing to `docker-gpu-patch.ts` as the source of truth. ### Wiring: `src/lib/actions/sandbox/gateway-state.ts` `reconcileMissingAgainstNamedGateway` now calls the helper: - After the existing `gateway select nemoclaw` retry path lands on `healthy_named` + `missing`. - On the new top-level `healthy_named` + `missing` precondition that #4423's bug class describes (gateway came back fresh after reboot with no sandbox memory). `SandboxGatewayState` gains `recoveredSandbox?: boolean` and `recoverySandboxVia?: string | null` so callers can render a recovery breadcrumb without re-walking the recovery chain. ### User-facing breadcrumb: `src/lib/actions/sandbox/status.ts` Present-state output now prints: ``` Recovered sandbox '<name>' from Docker via <branch>; OpenShell now sees it as live. ``` before the canonical OpenShell sandbox info. ## Test results - ✅ **12 new helper tests** in `docker-driver-sandbox-recovery.test.ts` (all 4 dispatch branches, rename/start failure surfaces, no-labeled-container empty case). - ✅ **Existing 30 reconcile tests** in `gateway-state-reconcile-2276.test.ts` and `gateway-state-drift.test.ts` still pass — the new branch is additive and only fires under `healthy_named` + `missing` which existing tests don't exercise. - ✅ **1621 tests across `src/lib/onboard/`, `src/lib/actions/sandbox/`, `test/gateway-state*.test.ts`** all green. ## Verification surface - The post-reboot recovery scenario from #5087 (`ubuntu-repo-docker-post-reboot-recovery`) continues to go 🟢 GREEN — its host-side invariants (`local-registry-entry-present`, `docker-sandbox-container-present`) are exactly what this fix preserves. - A future controlled-runner scenario can extend `post-reboot-recovery-ready` with gateway/sandbox runtime probes once a real-reboot environment is wired. ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes - [x] `npm test` passes — 1621 affected tests green; full suite to be exercised by CI on this branch. - [x] Tests added or updated for new or changed behavior. - [x] No secrets, API keys, or credentials committed. - [x] Live `ubuntu-repo-docker-post-reboot-recovery` scenario stays green; will dispatch on this branch as a post-push verification step. ## Boundaries kept honest - The helper talks to Docker only. OpenShell re-query is the caller's job. - Profile dispatch is exhaustive (`never`-checked); adding a new branch requires extending both the helper switch and the unit test. - Recovery is best-effort and short-circuits to `{ recovered: false }` on any failure, so callers' existing non-destructive guidance stays the safety net.  ## Summary by CodeRabbit * **New Features** * Sandboxes marked as missing after a system reboot can now be automatically recovered from Docker if the container still exists on disk. * Status messages now display recovery information when a sandbox is restored.  Co-authored-by: Carlos Villela <cvillela@nvidia.com>

fix(status): preserve registry on missing live sandbox

93dd87a

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

ericksoa added bug Something fails against expected or documented behavior Platform: DGX Spark NV QA Bugs found by the NVIDIA QA Team UAT Issues flagged for User Acceptance Testing. v0.0.56 Release target labels May 31, 2026

ericksoa self-assigned this May 31, 2026

ericksoa requested a review from cv May 31, 2026 15:01

This was referenced May 31, 2026

[DGX Spark][Sandbox] post-reboot first 'nemoclaw <name> status' destroys sandbox via "Removed stale local registry entry" #4423

Closed

fix(onboard): use OpenShell gateway user service #4580

Merged

cjagwani self-requested a review June 1, 2026 16:38

cv added v0.0.57 Release target and removed v0.0.56 Release target labels Jun 1, 2026

cjagwani approved these changes Jun 1, 2026

View reviewed changes

Merge branch 'main' into fix/4423-preserve-registry-on-status

eabc2f1

cv enabled auto-merge (squash) June 1, 2026 18:46

cv merged commit 4f7ae09 into main Jun 1, 2026
31 checks passed

cv deleted the fix/4423-preserve-registry-on-status branch June 1, 2026 18:51

coderabbitai Bot mentioned this pull request Jun 1, 2026

fix(sandbox): stop false Docker unhealthy and add paused-container hint (#4503, #4495) #4600

Merged

12 tasks

hunglp6d mentioned this pull request Jun 2, 2026

nightly-e2e: double-onboard-e2e stale-registry reconciliation misaligned with #4578 (run 26790528855) #4639

Closed

2 tasks

coderabbitai Bot mentioned this pull request Jun 2, 2026

fix(sandbox): preserve registry entry on stale-sandbox connect recovery (#4497) #4647

Merged

6 tasks

sandl99 mentioned this pull request Jun 2, 2026

fix(e2e): align double-onboard stale-registry test with non-destructive status (#4578) #4672

Merged

9 tasks

wscurran added area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression platform: dgx-spark Affects DGX Spark hardware or workflows and removed priority: high labels Jun 3, 2026

wscurran removed the bug Something fails against expected or documented behavior label Jun 8, 2026

github-actions Bot mentioned this pull request Jun 9, 2026

test(e2e): scope post-reboot recovery to host-side invariants (follow-up to #5046) #5087

Merged

9 tasks

jyaunches mentioned this pull request Jun 10, 2026

fix(sandbox): recover Docker-driver sandbox from labels post-reboot (#4423) #5091

Merged

9 tasks

Conversation

ericksoa commented May 31, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Related Issue

Changes

Type of Change

Verification

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

github-actions Bot commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

E2E Advisor Recommendation

E2E Recommendation Advisor

Required E2E

Optional E2E

New E2E recommendations

Dispatch hint

Uh oh!

github-actions Bot commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

E2E Scenario Advisor Recommendation

E2E Scenario Advisor

Required scenario E2E

Optional scenario E2E

Relevant changed files

Uh oh!

github-actions Bot commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review Advisor

🛠️ Needs attention

🔎 Worth checking

🌱 Nice ideas

Uh oh!

ericksoa commented May 31, 2026

Uh oh!

github-actions Bot commented May 31, 2026

Selective E2E Results — ✅ All requested jobs passed

Uh oh!

ericksoa commented May 31, 2026

Uh oh!

cjagwani left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

github-actions Bot commented Jun 1, 2026

Selective E2E Results — ✅ All requested jobs passed

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ericksoa commented May 31, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 31, 2026 •

edited

Loading

github-actions Bot commented May 31, 2026 •

edited

Loading

github-actions Bot commented May 31, 2026 •

edited

Loading

github-actions Bot commented May 31, 2026 •

edited

Loading