Skip to content

fix(status): preserve registry on missing live sandbox#4578

Merged
cv merged 2 commits into
mainfrom
fix/4423-preserve-registry-on-status
Jun 1, 2026
Merged

fix(status): preserve registry on missing live sandbox#4578
cv merged 2 commits into
mainfrom
fix/4423-preserve-registry-on-status

Conversation

@ericksoa

@ericksoa ericksoa commented May 31, 2026

Copy link
Copy Markdown
Contributor

Summary

Addresses the destructive passive-status side effect from #4423. nemoclaw <name> status is now non-destructive when a sandbox is registered locally but missing from the live OpenShell gateway, preventing the first post-reboot status check on DGX Spark from deleting registry state before gateway reconciliation catches up.

This is the v0.0.56 safety mitigation only: it preserves local registry/onboard-session state and makes the failure explicit. The broader #4423 recovery goals remain open on the issue, including DGX Spark gateway auto-start after reboot and full post-reboot sandbox reconciliation/recovery.

Related Issue

Partially addresses #4423.

Changes

  • Preserve registry and onboard-session state in the status-only missing-live-sandbox path.
  • Print explicit guidance that status did not remove the entry and should be retried after gateway reconnection.
  • Keep destructive stale-entry cleanup in active-use paths like connect, and update regression tests to lock that split.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • npm run docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Validation run:

  • npm run build:cli
  • npx vitest run test/gateway-state-reconcile-2276.test.ts test/cli.test.ts - 182 passed, 1 skipped
  • npx vitest run test/cli.test.ts -t "preserves an orphan registry entry" - 1 passed
  • npx vitest run test/gateway-state-reconcile-2276.test.ts -t "reports the missing live sandbox" - 1 passed

Advisor-selected E2E:


Signed-off-by: Aaron Erickson aerickson@nvidia.com

Summary by CodeRabbit

  • Bug Fixes
    • The status command now preserves local sandbox registry entries when registered locally but missing from the gateway. Clear guidance is provided for recovery steps instead of automatic removal.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@ericksoa ericksoa added bug Something fails against expected or documented behavior Platform: DGX Spark NV QA Bugs found by the NVIDIA QA Team UAT Issues flagged for User Acceptance Testing. v0.0.56 Release target labels May 31, 2026
@ericksoa ericksoa self-assigned this May 31, 2026
@coderabbitai

coderabbitai Bot commented May 31, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 05f446c4-08e6-47b5-89be-e79f53ee6b05

📥 Commits

Reviewing files that changed from the base of the PR and between 93dd87a and eabc2f1.

📒 Files selected for processing (2)
  • src/lib/actions/sandbox/status.ts
  • test/cli.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • test/cli.test.ts
  • src/lib/actions/sandbox/status.ts

📝 Walkthrough

Walkthrough

When nemoclaw status finds a sandbox in the local registry but missing from the live OpenShell gateway, the CLI now prints user guidance and preserves the local registry and onboard session state instead of removing them. Tests were updated to assert the new non-destructive behavior and messaging.

Changes

Preserve orphan entries instead of cleanup

Layer / File(s) Summary
Status handler refactoring for orphan guidance
src/lib/actions/sandbox/status.ts
Remove imports related to the destructive lifecycle guard and onboard-session cleanup, add printMissingLiveSandboxStatusGuidance() to render guidance for sandboxes registered locally but absent from the live gateway, and replace the lookup.state === "missing" branch with a call to the new helper (no registry deletion or session clearing).
Test updates verifying non-destructive behavior
test/cli.test.ts, test/gateway-state-reconcile-2276.test.ts
Rename and update the orphan-registry status test and Scenario 2 to assert preservation of the registry and onboard-session state, expect exit status 1 where appropriate, verify the output contains "registered locally, but is not present in the live OpenShell gateway", and ensure it does not contain "Removed stale local registry entry".

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#4423: Addresses the same destructive cleanup behavior in src/lib/actions/sandbox/status.ts, relevant to preserving local registry entries when gateways transiently lose sandboxes.

Suggested labels

NemoClaw CLI, fix, v0.0.56

Suggested reviewers

  • cv

Poem

🐰 I found an orphaned name in the log,
once slated for loss like a driftwood log.
Now I whisper a hint,
keep the registry mint —
reboot may bring back its catalog 🌱

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and concisely summarizes the main change: making the status command preserve local registry state when a sandbox is missing from the live gateway, addressing the core issue described in the PR.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/4423-preserve-registry-on-status

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions Bot commented May 31, 2026

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: sandbox-operations-e2e
Optional E2E: sandbox-survival-e2e, gateway-drift-preflight-e2e

Dispatch hint: sandbox-operations-e2e

Auto-dispatched E2E: sandbox-operations-e2e via nightly-e2e.yaml at eabc2f187169355ce79e3afff7f68bb895c300a8nightly run

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • sandbox-operations-e2e (high (~60 min, creates multiple sandboxes)): Required because the source change affects nemoclaw <sandbox> status and sandbox lifecycle behavior. This job exercises live sandbox operations including status output, registry behavior, process recovery through status, destroy cleanup, multi-sandbox flows, and gateway kill recovery.

Optional E2E

  • sandbox-survival-e2e (medium-high (~30 min, requires NVIDIA_API_KEY/live inference)): Useful adjacent confidence for gateway restart/recovery semantics: it validates that a sandbox remains discoverable, registry state is retained, and nemoclaw <name> status/connect work after gateway stop/start. This is relevant to the changed missing-live-sandbox guidance but broader and more expensive than the targeted required job.
  • gateway-drift-preflight-e2e (low-medium (~15 min, mocked OpenShell/Docker state)): Optional adjacent regression coverage for fail-closed/no-mutation behavior when OpenShell sandbox-state RPCs are unsafe or drifted. The PR changes another no-mutation status path, so this provides confidence around related gateway-state safety boundaries.

New E2E recommendations

  • sandbox status missing-live-sandbox reconciliation (high): Existing live E2E coverage exercises normal status and gateway recovery, but there does not appear to be a dedicated E2E that creates a registered local sandbox entry, makes the live OpenShell gateway report NotFound for that sandbox, and asserts nemoclaw <name> status preserves sandboxes.json and onboard session state while printing the new passive guidance.
    • Suggested test: Add a targeted E2E/regression case for passive status on a missing live sandbox: registered local sandbox + healthy named gateway + openshell sandbox get NotFound => status exits non-zero, prints 'No local registry entry was removed', and leaves registry/session untouched.

Dispatch hint

  • Workflow: nightly-e2e.yaml
  • jobs input: sandbox-operations-e2e

@github-actions

github-actions Bot commented May 31, 2026

Copy link
Copy Markdown
Contributor

E2E Scenario Advisor Recommendation

Required scenario E2E: ubuntu-repo-cloud-openclaw
Optional scenario E2E: None

Dispatch required scenario E2E:

  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: medium

Required scenario E2E

  • ubuntu-repo-cloud-openclaw: The PR changes the core sandbox status command behavior for a registered sandbox that is missing from the live OpenShell gateway. The Ubuntu repo cloud OpenClaw scenario is the smallest routed scenario that exercises the main repo-checkout onboarding path and runtime status/list operations for a live sandbox.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

Optional scenario E2E

  • None.

Relevant changed files

  • src/lib/actions/sandbox/status.ts

@github-actions

github-actions Bot commented May 31, 2026

Copy link
Copy Markdown
Contributor

PR Review Advisor

Findings: 0 needs attention, 1 worth checking, 0 nice ideas
Since last review: 2 prior items resolved, 0 still apply, 1 new item found

Review findings

🛠️ Needs attention

  • None.

🔎 Worth checking

  • Runtime validation is still recommended for the infrastructure-dependent reboot path (src/lib/actions/sandbox/status.ts:407): The updated CLI and harness tests cover the intended passive-status split, including preserving registry and onboard-session state when the live gateway reports the sandbox missing. However, the motivating DGX Spark/OpenShell post-reboot state is infrastructure-dependent, and the stubbed tests cannot prove the real gateway recovery/reconciliation timing behaves the same way.
    • Recommendation: Add or identify targeted runtime/integration validation for the changed behavior: a locally registered sandbox should remain in registry/session when `nemoclaw <name> status` sees a missing live sandbox during gateway recovery, while active-use paths such as `connect` keep their existing cleanup behavior. Do not rely on external E2E status alone as code-review evidence.
    • Evidence: The production change is the `lookup.state === "missing"` branch in `showSandboxStatus()`, which now calls `printMissingLiveSandboxStatusGuidance()` and exits. Tests in `test/cli.test.ts` and `test/gateway-state-reconcile-2276.test.ts` use stubbed `openshell` binaries and filesystem fixtures rather than a real rebooted OpenShell gateway.

🌱 Nice ideas

  • None.
Since last review details

Current findings:

  • Runtime validation is still recommended for the infrastructure-dependent reboot path (src/lib/actions/sandbox/status.ts:407): The updated CLI and harness tests cover the intended passive-status split, including preserving registry and onboard-session state when the live gateway reports the sandbox missing. However, the motivating DGX Spark/OpenShell post-reboot state is infrastructure-dependent, and the stubbed tests cannot prove the real gateway recovery/reconciliation timing behaves the same way.
    • Recommendation: Add or identify targeted runtime/integration validation for the changed behavior: a locally registered sandbox should remain in registry/session when `nemoclaw <name> status` sees a missing live sandbox during gateway recovery, while active-use paths such as `connect` keep their existing cleanup behavior. Do not rely on external E2E status alone as code-review evidence.
    • Evidence: The production change is the `lookup.state === "missing"` branch in `showSandboxStatus()`, which now calls `printMissingLiveSandboxStatusGuidance()` and exits. Tests in `test/cli.test.ts` and `test/gateway-state-reconcile-2276.test.ts` use stubbed `openshell` binaries and filesystem fixtures rather than a real rebooted OpenShell gateway.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

@ericksoa

Copy link
Copy Markdown
Contributor Author

Advisor feedback addressed.

For the removal condition: passive status no longer removes local state on live-gateway NotFound; active-use paths like connect still retain the existing true-stale cleanup behavior. The focused regression tests assert both sides of that split.

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26715565554
Target ref: 93dd87a67c876358d42ef399a083d86baf3dc5a0
Workflow ref: main
Requested jobs: sandbox-operations-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
sandbox-operations-e2e ✅ success

@ericksoa

Copy link
Copy Markdown
Contributor Author

@cv could you please review/approve this v0.0.56 mitigation when you get a chance?

The advisor scope feedback has been addressed by changing the PR from Fixes #4423 to Partially addresses #4423; this now only preserves registry/onboard-session state for passive nemoclaw <name> status when the live gateway reports the sandbox missing. The broader DGX Spark reboot recovery work remains tracked on #4423 for v0.0.57.

Current validation is green:

@cjagwani cjagwani left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving. Right shape for v0.0.56 — keep status passive so DGX Spark first-post-reboot doesn't nuke registry state before the gateway reconciles. Destructive cleanup stays on connect, which is the correct boundary.

Tests pin both sides (gateway-state-reconcile-2276.test.ts flips the assertion correctly, cli.test.ts covers the new guidance string). sandbox-operations-e2e, ubuntu-repo-cloud-openclaw, wsl-e2e, macos-e2e, test-e2e-sandbox, test-e2e-gateway-isolation, and test-e2e-port-overrides all passed. Scope properly narrowed to "partially addresses #4423" per advisor feedback.

@cv cv enabled auto-merge (squash) June 1, 2026 18:46
@cv cv merged commit 4f7ae09 into main Jun 1, 2026
31 checks passed
@cv cv deleted the fix/4423-preserve-registry-on-status branch June 1, 2026 18:51
@github-actions

github-actions Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26774901829
Target ref: eabc2f187169355ce79e3afff7f68bb895c300a8
Workflow ref: main
Requested jobs: sandbox-operations-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
sandbox-operations-e2e ✅ success

cv added a commit that referenced this pull request Jun 3, 2026
…ve status (#4578) (#4672)

<!-- markdownlint-disable MD041 -->
## Summary

Replaces #4635 with the same diff on an upstream `NVIDIA/NemoClaw`
branch so maintainers can dispatch E2E from the replacement PR. The
`double-onboard-e2e` nightly job failed at Phase 5 because the test
expected `nemoclaw status` to reconcile stale registry entries, but
#4578 intentionally made `status` non-destructive. This PR aligns the
test with that behavior by verifying the non-destructive `status` output
first, then triggering actual reconciliation via `connect --probe-only`.

## Related Issue

Fixes #4639

## Changes

- `test/e2e/test-double-onboard.sh`: replace the single `nemoclaw
status` reconciliation assertion with a two-step sequence.
- Verify `status` exits 1 and preserves the registry entry with the "No
local registry entry was removed" guidance.
- Trigger reconciliation via `connect --probe-only` and verify "Removed
stale local registry entry" plus final registry cleanup.

## Type of Change

- [x] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification

- [ ] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [x] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [ ] Docs updated for user-facing behavior changes

Note: local checks were not rerun for this replacement branch; it copies
#4635 unchanged. #4635 reported focused validation for
`double-onboard-e2e` on the same fix commit:
https://github.com/hunglp6d/NemoClaw/actions/runs/26792577152.

---
Signed-off-by: San Dang <sdang@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Improvements**
* Clarified that the status command is non-destructive and explicitly
reports when it does not remove local registry entries.
* Status and connect now intentionally preserve stale registry entries
unless destroyed.
* Cleanup now relies on the explicit destroy action to remove registry
entries.
* Tests updated to verify the non-destructive status message and to
avoid forcing reconciliation after teardown.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Hung Le <hple@nvidia.com>
Signed-off-by: San Dang <sdang@nvidia.com>
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
Co-authored-by: Hung Le <hple@nvidia.com>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>
@wscurran wscurran added area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression platform: dgx-spark Affects DGX Spark hardware or workflows and removed priority: high labels Jun 3, 2026
cv added a commit that referenced this pull request Jun 5, 2026
## Summary

Post-Computex / v0.0.60 follow-up for the gateway lifecycle half of
#4423. This is intentionally separate from #4578, which remains the
v0.0.56 safety hotfix for non-destructive status behavior.

- Use OpenShell's package-managed `openshell-gateway` user service when
its vendor/package unit is present, writing Docker-driver gateway env
before restart.
- Ignore per-user/stale gateway unit files so standalone recovery
remains available when there is no package-managed OpenShell service.
- Keep the existing standalone gateway launch path as an explicit
compatibility fallback when the upstream service is unavailable.
- Update docs/tests to describe package-service ownership vs. standalone
fallback.

## Validation

- `npm run build:cli`
- `npm run typecheck:cli`
- `npm run checks`
- `npm run source-shape:check`
- `npm run check:installer-hash`
- `bash -n scripts/install-openshell.sh`
- `bash test/e2e/test-openshell-version-pin.sh`
- `npx vitest run src/lib/onboard/docker-driver-gateway-env.test.ts
src/lib/onboard/docker-driver-gateway-service.test.ts
test/install-openshell-version-check.test.ts test/runner.test.ts
test/onboard-gateway-runtime.test.ts
test/gateway-state-reconcile-2276.test.ts`

Refs #4423
Follow-up to #4578

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Documentation**
* Clarified Deployment Topology, uninstall/state-dir contents, Apple
Silicon sandbox behavior, and environment-variable guidance for the
standalone fallback.
* **New Features**
* Installer on Linux now attempts to start a package-managed OpenShell
gateway and falls back to the standalone gateway when appropriate.
* **Tests**
* Expanded unit and e2e tests to cover package-managed vs standalone
gateway flows and realistic checksum-driven installer scenarios.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>
@wscurran wscurran removed the bug Something fails against expected or documented behavior label Jun 8, 2026
jyaunches added a commit that referenced this pull request Jun 9, 2026
The first end-to-end dispatch with the new probe ordering revealed that
the test was hitting #4578's mitigation, not the open #4423 bug. With
the gateway stopped and never restarted, status's `getNamedGateway-
LifecycleState()` returned NOT healthy_named, which routes through the
non-destructive branch — so the registry was correctly preserved and
the test failed at gateway-healthy, masking the real signal.

Add `openshell gateway start --name nemoclaw` between `docker stop`
and `nemoclaw <name> status` so the lifecycle truly mirrors a DGX
Spark / Linux Docker-driver post-reboot:

  * Gateway stopped (sandbox memory dropped).
  * Sandbox container stopped (so OpenShell sees NotFound).
  * Gateway restarted FRESH via the explicit `openshell gateway
    start` invocation — the user-systemd autostart path from #4580
    in compressed form. Now the gateway is `healthy_named` AND has
    no sandbox memory.
  * Status is then invoked. With healthy_named gateway + missing
    sandbox + Docker container still present, the destructive
    `missing` branch in `src/lib/actions/sandbox/status.ts:308`
    fires and wipes the registry on unfixed code.

This is the precise precondition that PR-A's Docker-driver sandbox
recovery helper neutralizes: status will see Docker still has the
labeled container, restart it, and skip the destructive cleanup.

Refs #4423
jyaunches added a commit that referenced this pull request Jun 9, 2026
Initial dispatches of `ubuntu-repo-docker-post-reboot-recovery` against
`ubuntu-latest` revealed that the user-visible #4423 bug class cannot
be fully reproduced from CI without a real reboot:

  * `docker stop` of the labeled sandbox container puts OpenShell into
    `Phase: Provisioning`, not literal `NotFound` — so the destructive
    `missing` branch in `src/lib/actions/sandbox/status.ts:308` does
    not fire on unfixed code.
  * The OpenShell CLI on `ubuntu-latest` does not expose a
    `gateway start` subcommand (the user-systemd autostart path from
    #4580 is the real autostart mechanism, not the CLI), so we can't
    cleanly reset the gateway to a fresh-state-with-no-sandbox-memory
    configuration that would yield a real `NotFound`.
  * Gateway HTTP health on the runner is environmental and varies
    independently of the regression target.

Scope the `post-reboot-recovery-ready` expected state to ONLY the
host-side preservation invariants this scenario can faithfully
exercise from CI:

  * `cli` installed,
  * `localRegistry` entry preserved,
  * `dockerSandboxContainer` (labeled, including
    `*-nemoclaw-gpu-backup-*` siblings) present.

This makes the scenario a "lock-in" guard for #4578's mitigation and
for the host-side invariants PR-A's Docker-driver recovery helper must
also preserve. A future scenario on a more controlled runner can
extend the expected state with gateway/sandbox runtime probes once
PR-A has landed and a true post-reboot environment is available.

The framework primitives — lifecycle phase fixture, host-side
probes, lifecycle whitelist — remain unchanged and remain useful
beyond this scenario's current limits.

Refs #4423
jyaunches added a commit that referenced this pull request Jun 9, 2026
The PR Review Advisor on #5087 flagged three places that still
described the OLD lifecycle (stops gateway runtime + restarts it)
even though the code already exercises the scoped lifecycle (stops
labeled container only, runs `nemoclaw status`, leaves gateway
HEALTHY):

  * `scenarios/scenarios/baseline.ts` — scenario description.
  * `manifests/openclaw-nvidia-post-reboot-recovery.yaml` — manifest
    inline comment.
  * `live/registry-scenarios.test.ts` — driver comment on the
    lifecycle-dispatch block.

Update each to:
  - Describe the actual two-step lifecycle (`docker stop` + status).
  - Note the gateway is left healthy and why (no `gateway start`
    subcommand on `ubuntu-latest`; #4578's mitigation would mask the
    regression target if the gateway were down).
  - Re-frame the assertion surface as host-side preservation
    invariants (`local-registry-entry-present`,
    `docker-sandbox-container-present`).
  - Cross-reference between the three locations so future readers
    don't have to chase context.

No code change.

Refs #4423
cv pushed a commit that referenced this pull request Jun 9, 2026
…-up to #5046) (#5087)

## Summary

Follow-up to #5046. Tightens the
`ubuntu-repo-docker-post-reboot-recovery` scenario into a stable
RED-on-regression / GREEN-on-main guard while PR-A (the `#4423` source
fix) is in flight.

## Related Issue

Refs #4423

## Why this is needed

Post-merge dispatches of #5046's
`ubuntu-repo-docker-post-reboot-recovery` job against `ubuntu-latest`
revealed two issues:

1. **The `openshell gateway stop` step in `LifecyclePhaseFixture` was
counterproductive.** `#4578`'s mitigation in
`src/lib/actions/sandbox/status.ts:296` guards the destructive branch
behind a `healthy_named` gateway. Stopping the gateway routed every
dispatch through the safe path and masked the regression target. There
is no `openshell gateway start` subcommand on `ubuntu-latest` (the
user-systemd autostart from #4580 is the real mechanism), so the
original lifecycle couldn't restart the gateway cleanly to keep it
`healthy_named`.

2. **The `post-reboot-recovery-ready` expected-state asserted on runtime
liveness probes** (`gateway-healthy`, `sandbox-running`) that are
environmental on `ubuntu-latest` after `docker stop`. They varied
independently of the `#4423` regression target and made every dispatch
RED for the wrong reason — concretely, `gateway-healthy` failing before
the host-side probes had a chance to run.

## Changes

### `LifecyclePhaseFixture.simulatePostReboot`

- Drop the `openshell gateway stop` step. The lifecycle now stops only
the labeled sandbox container and then runs `nemoclaw <name> status`.
Gateway is left `healthy_named`, which is the precondition for the
destructive branch we want PR-A to neutralize.
- Drop the `GATEWAY_STOP_TIMEOUT_MS` constant (no longer referenced).
- Updated docstring + 2 unit tests removed (gateway-stop-failure
tolerance tests are no longer applicable).

### `post-reboot-recovery-ready` expected-state

Lock down only the host-side preservation invariants this scenario can
faithfully exercise:

- `cli-installed`
- `local-registry-entry-present`
- `docker-sandbox-container-present` (running OR stopped OR
`*-nemoclaw-gpu-backup-*` sibling)

Drop `gateway: { expected: "present", health: "healthy" }` and `sandbox:
{ expected: "present", status: "running", agent: "openclaw" }`. A
follow-up scenario on a more controlled runner (real DGX Spark / Brev)
can extend the expected state with runtime liveness probes once PR-A has
landed and a true post-reboot environment is available.

### `probesForState` ordering

Reorder so host-side observables (`local-registry-entry-present`,
`docker-sandbox-container-present`) emit BEFORE runtime-derived
gateway/sandbox probes everywhere. The state-validation orchestrator
short-circuits on the first probe failure, so host-side regression
targets must be observed first. Pre-existing scenarios
(`cloud-openclaw-ready`, `preflight-failure-*`, etc.) are unaffected —
none declare host-side dimensions, so their probe lists are unchanged.

## RED / GREEN contract

- **On `main`** (this PR's HEAD): scenario goes 🟢 GREEN. Locks in
`#4578`'s mitigation as a passive guard — registry preserved through
`docker stop` + `nemoclaw status` round-trip.
- **On a regression PR** that re-introduces unconditional
`registry.removeSandbox` in `status.ts:296` or `gateway-state.ts:412`:
scenario goes 🔴 RED at `local-registry-entry-present`.
- **On PR-A's fix branch** that adds Docker-driver sandbox recovery:
scenario stays 🟢, demonstrating the new helper does not regress either
host-side invariant.

## Type of Change

- [x] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification

- [x] `npx prek run --all-files` passes
- [x] `npm test` passes — 142 tests across the directly affected files
all green; no regressions on the rest of the e2e-scenario framework
suite.
- [x] Live dispatch (`gh workflow run e2e-vitest-scenarios.yaml -f
scenarios=ubuntu-repo-docker-post-reboot-recovery --ref
e2e-post-reboot-scope-host-side`) goes GREEN: lifecycle (`docker stop` +
`nemoclaw status`) → host-side probes pass → cleanup restores Docker.
Will dispatch as a verification step once this PR is open.
- [x] Tests added or updated for new or changed behavior.
- [x] No secrets, API keys, or credentials committed.

## Follow-ups (post PR-A)

- Once PR-A's Docker-driver recovery helper is in place, extend the
post-reboot expected-state with gateway/sandbox runtime probes on a
controlled runner.
- Consider a `post-reboot-spark` lifecycle profile for hardware-runner
dispatch that can do a real reboot and exercise the full bug class.


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Tests**
* Added a validation test for post-reboot-recovery expected-state
focusing on host-side invariants.
* Updated post-reboot-recovery phase tests to remove the gateway-stop
step and assert container stop + host status checks.

* **Documentation**
* Clarified lifecycle and scenario comments/manifests to describe
container-stop + host-status-driven recovery and probe ordering.

* **Behavior**
* Expected-state and probe emission now prioritize host-side checks and
omit runtime gateway/sandbox expectations.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
jyaunches added a commit that referenced this pull request Jun 10, 2026
…4423) (#5091)

## Summary

Closes the active-recovery half of #4423 — parts 2 & 3 of @ericksoa's
plan. The destructive registry-removal paths were already neutralized
by:

- **#4578** — `status` no longer calls `registry.removeSandbox` on
`NotFound` unless the gateway is `healthy_named`.
- **#4497** — `ensureLiveSandboxOrExit` (which `connect` rides)
preserves the registry entry on `missing` and routes intentional purges
through explicit `destroy`.

What's still missing is the *active recovery* path: when the gateway
returns to `healthy_named` after a reboot but reports the sandbox as
`NotFound`, NemoClaw currently prints "retry in a moment" guidance. This
PR makes it walk Docker by the OpenShell labels, restart the labeled
container (or rename a `*-nemoclaw-gpu-backup-*` sibling back), re-query
OpenShell, and return a `present` lookup with a recovery breadcrumb. The
user runs `nemoclaw <name> status` once and the sandbox is live.

## Related Issue

Refs #4423

## Changes

### New: `src/lib/onboard/docker-driver-sandbox-recovery.ts`

`recoverDockerDriverSandbox(name, deps)` scans Docker by
`openshell.ai/managed-by=openshell` AND
`openshell.ai/sandbox-name=<name>` (the same labels
`findOpenShellDockerSandboxContainerIds` already uses). Dispatches by
container shape:

| Branch ID | Trigger | Action |
|---|---|---|
| `started-running-original` | Labeled container is already running |
No-op success |
| `started-stopped-original` | Labeled container exists but stopped |
`docker start` |
| `renamed-and-started-backup` | Only a `*-nemoclaw-gpu-backup-*`
sibling exists (from `docker-gpu-patch.ts`'s GPU patch) | `docker rename
<backup> <original>` then `docker start` |
| Conflict | Stopped original + backup sibling both exist | Start the
original, leave the backup alone |
| None | No labeled container | `{ recovered: false }` — callers fall
through to existing non-destructive guidance |

Adapter access is per-call lazy `require` — top-level `import` from
`../adapters/docker` pulls in `runner.ts`'s load-time
`require("./platform")` which the Vitest TS loader can't resolve. Same
escape hatch as `auto-pair-approval.ts`.

Label constants are duplicated locally with comments pointing to
`docker-gpu-patch.ts` as the source of truth.

### Wiring: `src/lib/actions/sandbox/gateway-state.ts`

`reconcileMissingAgainstNamedGateway` now calls the helper:
- After the existing `gateway select nemoclaw` retry path lands on
`healthy_named` + `missing`.
- On the new top-level `healthy_named` + `missing` precondition that
#4423's bug class describes (gateway came back fresh after reboot with
no sandbox memory).

`SandboxGatewayState` gains `recoveredSandbox?: boolean` and
`recoverySandboxVia?: string | null` so callers can render a recovery
breadcrumb without re-walking the recovery chain.

### User-facing breadcrumb: `src/lib/actions/sandbox/status.ts`

Present-state output now prints:

```
Recovered sandbox '<name>' from Docker via <branch>; OpenShell now sees it as live.
```

before the canonical OpenShell sandbox info.

## Test results

- ✅ **12 new helper tests** in `docker-driver-sandbox-recovery.test.ts`
(all 4 dispatch branches, rename/start failure surfaces,
no-labeled-container empty case).
- ✅ **Existing 30 reconcile tests** in
`gateway-state-reconcile-2276.test.ts` and `gateway-state-drift.test.ts`
still pass — the new branch is additive and only fires under
`healthy_named` + `missing` which existing tests don't exercise.
- ✅ **1621 tests across `src/lib/onboard/`, `src/lib/actions/sandbox/`,
`test/gateway-state*.test.ts`** all green.

## Verification surface

- The post-reboot recovery scenario from #5087
(`ubuntu-repo-docker-post-reboot-recovery`) continues to go 🟢 GREEN —
its host-side invariants (`local-registry-entry-present`,
`docker-sandbox-container-present`) are exactly what this fix preserves.
- A future controlled-runner scenario can extend
`post-reboot-recovery-ready` with gateway/sandbox runtime probes once a
real-reboot environment is wired.

## Type of Change

- [x] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification

- [x] `npx prek run --all-files` passes
- [x] `npm test` passes — 1621 affected tests green; full suite to be
exercised by CI on this branch.
- [x] Tests added or updated for new or changed behavior.
- [x] No secrets, API keys, or credentials committed.
- [x] Live `ubuntu-repo-docker-post-reboot-recovery` scenario stays
green; will dispatch on this branch as a post-push verification step.

## Boundaries kept honest

- The helper talks to Docker only. OpenShell re-query is the caller's
job.
- Profile dispatch is exhaustive (`never`-checked); adding a new branch
requires extending both the helper switch and the unit test.
- Recovery is best-effort and short-circuits to `{ recovered: false }`
on any failure, so callers' existing non-destructive guidance stays the
safety net.


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
* Sandboxes marked as missing after a system reboot can now be
automatically recovered from Docker if the container still exists on
disk.
* Status messages now display recovery information when a sandbox is
restored.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression NV QA Bugs found by the NVIDIA QA Team platform: dgx-spark Affects DGX Spark hardware or workflows UAT Issues flagged for User Acceptance Testing. v0.0.57 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants