Skip to content

fix(sandbox): recover Docker-driver sandbox from labels post-reboot (#4423)#5091

Merged
jyaunches merged 2 commits into
mainfrom
issue-4423-post-reboot-status-destroys-sandbox
Jun 10, 2026
Merged

fix(sandbox): recover Docker-driver sandbox from labels post-reboot (#4423)#5091
jyaunches merged 2 commits into
mainfrom
issue-4423-post-reboot-status-destroys-sandbox

Conversation

@jyaunches

@jyaunches jyaunches commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Summary

Closes the active-recovery half of #4423 — parts 2 & 3 of @ericksoa's plan. The destructive registry-removal paths were already neutralized by:

What's still missing is the active recovery path: when the gateway returns to healthy_named after a reboot but reports the sandbox as NotFound, NemoClaw currently prints "retry in a moment" guidance. This PR makes it walk Docker by the OpenShell labels, restart the labeled container (or rename a *-nemoclaw-gpu-backup-* sibling back), re-query OpenShell, and return a present lookup with a recovery breadcrumb. The user runs nemoclaw <name> status once and the sandbox is live.

Related Issue

Refs #4423

Changes

New: src/lib/onboard/docker-driver-sandbox-recovery.ts

recoverDockerDriverSandbox(name, deps) scans Docker by
openshell.ai/managed-by=openshell AND openshell.ai/sandbox-name=<name> (the same labels findOpenShellDockerSandboxContainerIds already uses). Dispatches by container shape:

Branch ID Trigger Action
started-running-original Labeled container is already running No-op success
started-stopped-original Labeled container exists but stopped docker start
renamed-and-started-backup Only a *-nemoclaw-gpu-backup-* sibling exists (from docker-gpu-patch.ts's GPU patch) docker rename <backup> <original> then docker start
Conflict Stopped original + backup sibling both exist Start the original, leave the backup alone
None No labeled container { recovered: false } — callers fall through to existing non-destructive guidance

Adapter access is per-call lazy require — top-level import from ../adapters/docker pulls in runner.ts's load-time require("./platform") which the Vitest TS loader can't resolve. Same escape hatch as auto-pair-approval.ts.

Label constants are duplicated locally with comments pointing to docker-gpu-patch.ts as the source of truth.

Wiring: src/lib/actions/sandbox/gateway-state.ts

reconcileMissingAgainstNamedGateway now calls the helper:

SandboxGatewayState gains recoveredSandbox?: boolean and recoverySandboxVia?: string | null so callers can render a recovery breadcrumb without re-walking the recovery chain.

User-facing breadcrumb: src/lib/actions/sandbox/status.ts

Present-state output now prints:

Recovered sandbox '<name>' from Docker via <branch>; OpenShell now sees it as live.

before the canonical OpenShell sandbox info.

Test results

  • 12 new helper tests in docker-driver-sandbox-recovery.test.ts (all 4 dispatch branches, rename/start failure surfaces, no-labeled-container empty case).
  • Existing 30 reconcile tests in gateway-state-reconcile-2276.test.ts and gateway-state-drift.test.ts still pass — the new branch is additive and only fires under healthy_named + missing which existing tests don't exercise.
  • 1621 tests across src/lib/onboard/, src/lib/actions/sandbox/, test/gateway-state*.test.ts all green.

Verification surface

  • The post-reboot recovery scenario from test(e2e): scope post-reboot recovery to host-side invariants (follow-up to #5046) #5087 (ubuntu-repo-docker-post-reboot-recovery) continues to go 🟢 GREEN — its host-side invariants (local-registry-entry-present, docker-sandbox-container-present) are exactly what this fix preserves.
  • A future controlled-runner scenario can extend post-reboot-recovery-ready with gateway/sandbox runtime probes once a real-reboot environment is wired.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes — 1621 affected tests green; full suite to be exercised by CI on this branch.
  • Tests added or updated for new or changed behavior.
  • No secrets, API keys, or credentials committed.
  • Live ubuntu-repo-docker-post-reboot-recovery scenario stays green; will dispatch on this branch as a post-push verification step.

Boundaries kept honest

  • The helper talks to Docker only. OpenShell re-query is the caller's job.
  • Profile dispatch is exhaustive (never-checked); adding a new branch requires extending both the helper switch and the unit test.
  • Recovery is best-effort and short-circuits to { recovered: false } on any failure, so callers' existing non-destructive guidance stays the safety net.

Summary by CodeRabbit

  • New Features
    • Sandboxes marked as missing after a system reboot can now be automatically recovered from Docker if the container still exists on disk.
    • Status messages now display recovery information when a sandbox is restored.

Closes the active-recovery half of #4423 (parts 2 & 3 of @ericksoa's
plan). #4578 and #4497 already neutralized the destructive `missing`
branches in `status` and `ensureLiveSandboxOrExit`; the registry
entry survives a healthy_named-gateway + NotFound-sandbox round-trip.
This PR adds the actively recoverable path so the user's next command
sees a live sandbox instead of guidance-and-retry.

Adds `src/lib/onboard/docker-driver-sandbox-recovery.ts` with
`recoverDockerDriverSandbox(name, deps)`. It scans Docker by the
OpenShell labels
(`openshell.ai/managed-by=openshell` AND
 `openshell.ai/sandbox-name=<name>`) and dispatches by container
shape:

  1. `started-running-original`     — labeled container is already
                                      running; no-op success.
  2. `started-stopped-original`     — labeled container exists but
                                      is stopped; `docker start`.
  3. `renamed-and-started-backup`   — only a `*-nemoclaw-gpu-backup-*`
                                      sibling exists (from the GPU
                                      patch path in
                                      `docker-gpu-patch.ts`); rename
                                      back to the original and start.
  4. Conflict (backup + stale stopped original with the same base
     name) — start the original, leave the backup alone.
  5. No labeled container at all → `{ recovered: false }` so callers
     fall through to existing non-destructive guidance.

Wires the helper into
`gateway-state.ts::reconcileMissingAgainstNamedGateway`. After the
existing `gateway select nemoclaw` retry path AND on the new
`healthy_named` + `missing` precondition that #4423 reports, the
reconciler invokes the helper and re-queries OpenShell. When recovery
succeeds, the returned `SandboxGatewayState` carries
`recoveredSandbox: true` + `recoverySandboxVia: <branch id>` so
callers can render a recovery breadcrumb without re-walking the
recovery chain themselves.

`status.ts`'s present-state path now prints
`Recovered sandbox '<name>' from Docker via <branch>; OpenShell now
sees it as live.` so the user knows why a previously-NotFound sandbox
is suddenly Ready.

Boundary:
  * The helper talks to Docker only. OpenShell re-query is the
    caller's job (`getSandboxGatewayState` in `gateway-state.ts`).
  * Adapter access is per-call lazy `require` — top-level import of
    `../adapters/docker` would pull in `runner.ts`'s load-time
    `require("./platform")` which the Vitest TS loader can't
    resolve. Same pattern as `auto-pair-approval.ts`.
  * Label constants are duplicated locally with comments pointing to
    `docker-gpu-patch.ts` as the source of truth.

Tests:
  * `docker-driver-sandbox-recovery.test.ts` — 12 cases across
    discover parsing, all 4 dispatch branches, rename/start failure
    surfaces, and the no-labeled-container empty case.
  * Existing gateway-state tests (15 in
    `gateway-state-reconcile-2276.test.ts`, 2 in
    `gateway-state-drift.test.ts`) remain green; the new branch is
    additive and only fires under `healthy_named` + `missing` which
    none of them exercise.

Verification surface:
  * The post-reboot recovery scenario shipped by #5087
    (`ubuntu-repo-docker-post-reboot-recovery`) continues to go GREEN
    — its host-side invariants are exactly what this fix preserves.
  * A future controlled-runner scenario can extend the
    `post-reboot-recovery-ready` expected state with gateway/sandbox
    runtime probes once a real-reboot environment is wired.

Refs #4423
@jyaunches jyaunches added the v0.0.62 Release target label Jun 10, 2026
@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

This PR introduces Docker-driver sandbox recovery to handle cases where OpenShell reports a labeled sandbox as NotFound after reboot despite the container still existing on disk. The feature probes Docker for labeled containers, attempts recovery by starting stopped originals or restoring from backups, integrates recovery into the missing-sandbox reconciliation flow, and displays recovery status to users.

Changes

Docker-driver sandbox recovery integration

Layer / File(s) Summary
Docker Driver Recovery Module
src/lib/onboard/docker-driver-sandbox-recovery.ts
Implements best-effort recovery via label constants, lazy Docker adapter loaders, and public APIs (findLabeledSandboxContainers, recoverDockerDriverSandbox). Probes Docker for labeled containers, classifies candidates (running/stopped original, most-recent backup sibling), and executes ordered recovery: no-op if running original exists; docker start if stopped original exists; docker rename + docker start if only backup exists; recovered: false if no labeled container found.
Docker Driver Recovery Tests
src/lib/onboard/docker-driver-sandbox-recovery.test.ts
Comprehensive test suite with fakes for Docker adapter calls. Validates labeled-container parsing, running-original no-op, stopped-original start, backup-only recovery with multi-backup selection, preference for original when both original and backup exist, failure surfacing for start/rename errors, and missing-labeled-container handling.
Gateway State Recovery Integration
src/lib/actions/sandbox/gateway-state.ts
Extends SandboxGatewayState with optional recoveredSandbox and recoverySandboxVia fields. Integrates recovery into missing-sandbox reconciliation: connected_other retry path and healthy_named path. Introduces tryRecoverDockerDriverSandbox helper that executes recovery, swallows errors, and re-queries OpenShell state on success before returning annotated result.
Status Display for Recovered Sandboxes
src/lib/actions/sandbox/status.ts
Extends showSandboxStatus to detect and display recovery status when lookup.recoveredSandbox is true, printing "Recovered sandbox …" message with optional recovery-via suffix before existing output/phase handling.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#4578: Both PRs modify src/lib/actions/sandbox/status.ts's showSandboxStatus behavior for missing-sandbox cases.
  • NVIDIA/NemoClaw#4550: Both PRs modify src/lib/actions/sandbox/status.ts, adding additional conditional output logic to status handling.

Suggested labels

bug-fix, Docker, Sandbox, platform: container, v0.0.58

Suggested reviewers

  • prekshivyas
  • cv

Poem

🐇 A Docker-labeled rabbit hopped back to life,
When reboot had tangled its container in strife,
We probe, classify, and recover with care—
Rename and restart, with labels to spare!
Lost sandboxes found in the Docker-dark night. 🌙

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 31.25% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and specifically describes the main change: implementing Docker-driver sandbox recovery after reboot using labels, which is the core objective of this PR.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch issue-4423-post-reboot-status-destroys-sandbox

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: None
Optional E2E: None

Workflow run

Full advisor summary

E2E Recommendation Advisor

Failed: Could not parse JSON from advisor output; see /home/runner/work/NemoClaw/NemoClaw/artifacts/e2e-advisor/e2e-advisor-raw-output.txt

@github-actions

github-actions Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

E2E Scenario Advisor Recommendation

Required scenario E2E: None
Optional scenario E2E: None

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Failed: Could not parse JSON from advisor output; see /home/runner/work/NemoClaw/NemoClaw/artifacts/e2e-advisor/e2e-scenario-advisor-raw-output.txt

@github-actions

github-actions Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

PR Review Advisor

Findings: 2 needs attention, 9 worth checking, 0 nice ideas
Since last review: 0 prior items resolved, 8 still apply, 1 new item found

Review findings

🛠️ Needs attention

  • Docker recovery is returned before the sandbox is proven reconnected (src/lib/actions/sandbox/gateway-state.ts:296): After Docker recovery succeeds, the caller performs one immediate `getSandboxGatewayState` and annotates the returned state with `recoveredSandbox: true` regardless of whether OpenShell actually reports the sandbox as `present`. The linked [DGX Spark][Sandbox] post-reboot first 'nemoclaw <name> status' destroys sandbox via "Removed stale local registry entry" #4423 plan requires starting or restoring the container, then waiting for supervisor reconnect and re-querying `openshell sandbox get <name>` / `sandbox exec -n <name> -- true` before status returns Ready. If OpenShell registration is asynchronous after reboot, this can still fall through to missing or another non-present state even though Docker start/rename succeeded.
    • Recommendation: Add a bounded recovery wait that polls OpenShell until the sandbox is present/Ready or `sandbox exec -n <name> -- true` succeeds. Only attach `recoveredSandbox` when the post-recovery lookup is actually `present`; otherwise return the original missing lookup or a clear non-destructive retry state. Cover success, timeout, and still-NotFound paths with caller-level tests.
    • Evidence: `tryRecoverDockerDriverSandbox` calls `recoverDockerDriverSandbox`, then immediately calls `getSandboxGatewayState(sandboxName)` and returns `{ ...retried, recoveredSandbox: true, recoverySandboxVia: recovery.via }`.
  • Several literal [DGX Spark][Sandbox] post-reboot first 'nemoclaw <name> status' destroys sandbox via "Removed stale local registry entry" #4423 acceptance clauses remain outside the diff: The linked issue/comment acceptance includes gateway autostart and true-stale cleanup behavior, but this PR changes only Docker-side container recovery and status messaging. If this PR is intended to close [DGX Spark][Sandbox] post-reboot first 'nemoclaw <name> status' destroys sandbox via "Removed stale local registry entry" #4423, those clauses are not implemented or evidenced by the changed files.
    • Recommendation: Either implement the missing acceptance clauses in this PR, or narrow the PR/issue linkage so it is clear that this is only the Docker-container recovery portion and the remaining [DGX Spark][Sandbox] post-reboot first 'nemoclaw <name> status' destroys sandbox via "Removed stale local registry entry" #4423 clauses are tracked elsewhere with code evidence.
    • Evidence: No changed file adds a Linux/Spark user-systemd gateway service, launchd entry, linger detection, or equivalent autostart behavior. No changed active-command cleanup path proves true-stale cleanup only happens when both OpenShell and Docker prove the sandbox is gone.

🔎 Worth checking

  • Source-of-truth review needed: Docker-driver sandbox recovery for healthy gateway + OpenShell NotFound: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: `gateway-state.ts` adds `tryRecoverDockerDriverSandbox` as a best-effort fallback; `docker-driver-sandbox-recovery.ts` documents the post-reboot invalid state and Docker-only recovery boundary.
  • Source-of-truth review needed: Duplicated OpenShell Docker label constants: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: `docker-driver-sandbox-recovery.ts` declares `OPENSHELL_MANAGED_BY_LABEL`, `OPENSHELL_MANAGED_BY_VALUE`, and `OPENSHELL_SANDBOX_NAME_LABEL` locally and comments that a parity test pins drift.
  • Source-of-truth review needed: Best-effort Docker recovery exception fallback: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: `tryRecoverDockerDriverSandbox` catches all exceptions from `recoverDockerDriverSandbox` and returns `missingLookup`.
  • Recovery trusts Docker labels alone to start or rename containers (src/lib/onboard/docker-driver-sandbox-recovery.ts:151): The new recovery path discovers containers by OpenShell labels and then starts or renames the selected candidate. Commands are passed as argv arrays, so shell injection is not apparent, but the authorization boundary is label ownership. A local actor with Docker access could label a non-NemoClaw or wrong-target container so this path starts or renames it during recovery. Docker access is already privileged, but this is still a sandbox lifecycle trust boundary.
    • Recommendation: Corroborate label matches with expected registry metadata such as sandbox id/name-derived container shape, or explicitly document labels as the sole authority and add negative tests for label/name disagreement and unexpected candidate names.
    • Evidence: `findLabeledSandboxContainers` filters by `openshell.ai/managed-by=openshell` and `openshell.ai/sandbox-name=<name>`; `classifyCandidate` treats any non-backup labeled container as the original without validating its name against the registered sandbox.
  • Backup restore name reconstruction is lossy for truncated GPU-backup names (src/lib/onboard/docker-driver-sandbox-recovery.ts:127): `docker-gpu-patch.ts` creates backup names by truncating the original container name before appending `-nemoclaw-gpu-backup-<timestamp>`. This recovery helper restores by stripping the suffix from the backup name. For long original container names, that cannot reconstruct the true original name and may rename the backup to a truncated name that OpenShell does not recognize.
    • Recommendation: Use a non-lossy source for the restore target, such as registry metadata, an OpenShell container label, or inspect data, and add a regression test for a backup created from an original name longer than Docker's name limit budget.
    • Evidence: `buildBackupContainerName` in `docker-gpu-patch.ts` truncates before suffixing; `originalNameFromBackup` in this PR only removes the suffix, and `buildBackupRestoreName` truncates again if needed.
  • Duplicated Docker label constants are not pinned to their source of truth (src/lib/onboard/docker-driver-sandbox-recovery.ts:7): The new module duplicates OpenShell label constants and comments that `docker-gpu-patch.ts` is the source of truth and that a parity test pins drift. The added test file covers parser and recovery behavior but does not compare the duplicated constants with the source constants, so future drift could silently break recovery or make it scan different labels than the GPU patch path.
    • Recommendation: Either move the constants to a lightweight shared module that both files import, or add a focused parity test comparing `docker-driver-sandbox-recovery.ts` constants to `docker-gpu-patch.ts` constants.
    • Evidence: `OPENSHELL_MANAGED_BY_LABEL`, `OPENSHELL_MANAGED_BY_VALUE`, and `OPENSHELL_SANDBOX_NAME_LABEL` are duplicated locally; `docker-driver-sandbox-recovery.test.ts` does not import the source constants or assert parity.
  • Recovery helper does not validate sandbox names at its own boundary (src/lib/onboard/docker-driver-sandbox-recovery.ts:144): `recoverDockerDriverSandbox` is exported and uses `sandboxName` directly in the Docker label filter. The surrounding CLI usually works with registered sandbox names, and Docker is invoked through argv arrays with `shell:false`, so this is not an obvious command injection path. Still, the helper itself does not enforce the existing RFC1123 sandbox-name contract at this high-risk sandbox lifecycle boundary.
    • Recommendation: Validate `sandboxName` with the existing `validateName` helper or make the caller/callee contract explicit and covered by a negative test for malformed sandbox names.
    • Evidence: `findLabeledSandboxContainers` builds `label=${OPENSHELL_SANDBOX_NAME_LABEL}=${sandboxName}` without local validation; `runner.ts` already exposes `validateName` for RFC1123 name enforcement.
  • Caller-level recovery and failure paths are not tested (src/lib/actions/sandbox/gateway-state.ts:283): The new tests exercise the Docker helper branches, but the changed behavior also depends on `reconcileMissingAgainstNamedGateway`, OpenShell re-query results, and user-facing status rendering. There is no test proving that recovery metadata is only set after a present re-query, that recovery failure preserves the non-destructive missing guidance, or that the connected-other/select path composes correctly with Docker recovery.
    • Recommendation: Add caller-level tests around `reconcileMissingAgainstNamedGateway` and status output for Docker recovered + present, Docker recovered + still NotFound, Docker recovery exception/failure, and connected-other -> select -> healthy_named missing -> Docker recovery.
    • Evidence: `docker-driver-sandbox-recovery.test.ts` covers helper dispatch only; the existing gateway-state tests do not exercise the new `healthy_named + missing` Docker recovery branch.
  • Large gateway-state hotspot grows with another recovery workaround (src/lib/actions/sandbox/gateway-state.ts:226): `gateway-state.ts` is already a large, frequently changed sandbox lifecycle hotspot and this PR adds more Docker recovery orchestration to it. The deterministic drift context flags about 60 lines of growth in this file, and the source-of-truth story for when this localized workaround can be removed is not captured in code.
    • Recommendation: Keep the gateway-state entry point thin by extracting the recovery orchestration/polling contract into the new recovery module or a small coordinator, and document the removal condition for this workaround once OpenShell/gateway state persistence handles post-reboot recovery directly.
    • Evidence: The PR adds Docker recovery import/types, new state fields, healthy_named recovery branching, and `tryRecoverDockerDriverSandbox` to `gateway-state.ts`; deterministic drift context reports `baseLines: 513`, `headLines: 573`, `delta: 60`.

🌱 Nice ideas

  • None.
Consider writing more tests for
  • **Runtime validation** — reconcileMissingAgainstNamedGateway returns recovered present state only after Docker recovery and post-requery `present`. The patch touches sandbox lifecycle recovery, Docker container operations, OpenShell supervisor reconnect timing, and status output. Unit helper tests are useful but do not prove the runtime post-reboot behavior or caller/callee contract.
  • **Runtime validation** — reconcileMissingAgainstNamedGateway leaves the lookup missing and unmarked when Docker recovery succeeds but OpenShell still returns NotFound. The patch touches sandbox lifecycle recovery, Docker container operations, OpenShell supervisor reconnect timing, and status output. Unit helper tests are useful but do not prove the runtime post-reboot behavior or caller/callee contract.
  • **Runtime validation** — tryRecoverDockerDriverSandbox waits or polls until OpenShell supervisor reconnects, and times out to non-destructive guidance. The patch touches sandbox lifecycle recovery, Docker container operations, OpenShell supervisor reconnect timing, and status output. Unit helper tests are useful but do not prove the runtime post-reboot behavior or caller/callee contract.
  • **Runtime validation** — status prints the Docker recovery breadcrumb only for `lookup.state === "present"` with `recoveredSandbox === true`. The patch touches sandbox lifecycle recovery, Docker container operations, OpenShell supervisor reconnect timing, and status output. Unit helper tests are useful but do not prove the runtime post-reboot behavior or caller/callee contract.
  • **Runtime validation** — recoverDockerDriverSandbox rejects or ignores malformed sandbox names before Docker label lookup. The patch touches sandbox lifecycle recovery, Docker container operations, OpenShell supervisor reconnect timing, and status output. Unit helper tests are useful but do not prove the runtime post-reboot behavior or caller/callee contract.
  • **Duplicated Docker label constants are not pinned to their source of truth** — Either move the constants to a lightweight shared module that both files import, or add a focused parity test comparing `docker-driver-sandbox-recovery.ts` constants to `docker-gpu-patch.ts` constants.
  • **Caller-level recovery and failure paths are not tested** — Add caller-level tests around `reconcileMissingAgainstNamedGateway` and status output for Docker recovered + present, Docker recovered + still NotFound, Docker recovery exception/failure, and connected-other -> select -> healthy_named missing -> Docker recovery.
  • **Acceptance clause:** On DGX Spark after `sudo reboot`, the NemoClaw gateway starts automatically on boot or at first user session, without requiring the first `nemoclaw status` to recreate gateway metadata. — add test evidence or identify existing coverage. No changed file adds or modifies a gateway user-systemd service, launchd entry, linger detection, or equivalent autostart behavior.
Since last review details

Current findings:

  • Source-of-truth review needed: Docker-driver sandbox recovery for healthy gateway + OpenShell NotFound: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: `gateway-state.ts` adds `tryRecoverDockerDriverSandbox` as a best-effort fallback; `docker-driver-sandbox-recovery.ts` documents the post-reboot invalid state and Docker-only recovery boundary.
  • Source-of-truth review needed: Duplicated OpenShell Docker label constants: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: `docker-driver-sandbox-recovery.ts` declares `OPENSHELL_MANAGED_BY_LABEL`, `OPENSHELL_MANAGED_BY_VALUE`, and `OPENSHELL_SANDBOX_NAME_LABEL` locally and comments that a parity test pins drift.
  • Source-of-truth review needed: Best-effort Docker recovery exception fallback: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: `tryRecoverDockerDriverSandbox` catches all exceptions from `recoverDockerDriverSandbox` and returns `missingLookup`.
  • Docker recovery is returned before the sandbox is proven reconnected (src/lib/actions/sandbox/gateway-state.ts:296): After Docker recovery succeeds, the caller performs one immediate `getSandboxGatewayState` and annotates the returned state with `recoveredSandbox: true` regardless of whether OpenShell actually reports the sandbox as `present`. The linked [DGX Spark][Sandbox] post-reboot first 'nemoclaw <name> status' destroys sandbox via "Removed stale local registry entry" #4423 plan requires starting or restoring the container, then waiting for supervisor reconnect and re-querying `openshell sandbox get <name>` / `sandbox exec -n <name> -- true` before status returns Ready. If OpenShell registration is asynchronous after reboot, this can still fall through to missing or another non-present state even though Docker start/rename succeeded.
    • Recommendation: Add a bounded recovery wait that polls OpenShell until the sandbox is present/Ready or `sandbox exec -n <name> -- true` succeeds. Only attach `recoveredSandbox` when the post-recovery lookup is actually `present`; otherwise return the original missing lookup or a clear non-destructive retry state. Cover success, timeout, and still-NotFound paths with caller-level tests.
    • Evidence: `tryRecoverDockerDriverSandbox` calls `recoverDockerDriverSandbox`, then immediately calls `getSandboxGatewayState(sandboxName)` and returns `{ ...retried, recoveredSandbox: true, recoverySandboxVia: recovery.via }`.
  • Several literal [DGX Spark][Sandbox] post-reboot first 'nemoclaw <name> status' destroys sandbox via "Removed stale local registry entry" #4423 acceptance clauses remain outside the diff: The linked issue/comment acceptance includes gateway autostart and true-stale cleanup behavior, but this PR changes only Docker-side container recovery and status messaging. If this PR is intended to close [DGX Spark][Sandbox] post-reboot first 'nemoclaw <name> status' destroys sandbox via "Removed stale local registry entry" #4423, those clauses are not implemented or evidenced by the changed files.
    • Recommendation: Either implement the missing acceptance clauses in this PR, or narrow the PR/issue linkage so it is clear that this is only the Docker-container recovery portion and the remaining [DGX Spark][Sandbox] post-reboot first 'nemoclaw <name> status' destroys sandbox via "Removed stale local registry entry" #4423 clauses are tracked elsewhere with code evidence.
    • Evidence: No changed file adds a Linux/Spark user-systemd gateway service, launchd entry, linger detection, or equivalent autostart behavior. No changed active-command cleanup path proves true-stale cleanup only happens when both OpenShell and Docker prove the sandbox is gone.
  • Recovery trusts Docker labels alone to start or rename containers (src/lib/onboard/docker-driver-sandbox-recovery.ts:151): The new recovery path discovers containers by OpenShell labels and then starts or renames the selected candidate. Commands are passed as argv arrays, so shell injection is not apparent, but the authorization boundary is label ownership. A local actor with Docker access could label a non-NemoClaw or wrong-target container so this path starts or renames it during recovery. Docker access is already privileged, but this is still a sandbox lifecycle trust boundary.
    • Recommendation: Corroborate label matches with expected registry metadata such as sandbox id/name-derived container shape, or explicitly document labels as the sole authority and add negative tests for label/name disagreement and unexpected candidate names.
    • Evidence: `findLabeledSandboxContainers` filters by `openshell.ai/managed-by=openshell` and `openshell.ai/sandbox-name=<name>`; `classifyCandidate` treats any non-backup labeled container as the original without validating its name against the registered sandbox.
  • Backup restore name reconstruction is lossy for truncated GPU-backup names (src/lib/onboard/docker-driver-sandbox-recovery.ts:127): `docker-gpu-patch.ts` creates backup names by truncating the original container name before appending `-nemoclaw-gpu-backup-<timestamp>`. This recovery helper restores by stripping the suffix from the backup name. For long original container names, that cannot reconstruct the true original name and may rename the backup to a truncated name that OpenShell does not recognize.
    • Recommendation: Use a non-lossy source for the restore target, such as registry metadata, an OpenShell container label, or inspect data, and add a regression test for a backup created from an original name longer than Docker's name limit budget.
    • Evidence: `buildBackupContainerName` in `docker-gpu-patch.ts` truncates before suffixing; `originalNameFromBackup` in this PR only removes the suffix, and `buildBackupRestoreName` truncates again if needed.
  • Duplicated Docker label constants are not pinned to their source of truth (src/lib/onboard/docker-driver-sandbox-recovery.ts:7): The new module duplicates OpenShell label constants and comments that `docker-gpu-patch.ts` is the source of truth and that a parity test pins drift. The added test file covers parser and recovery behavior but does not compare the duplicated constants with the source constants, so future drift could silently break recovery or make it scan different labels than the GPU patch path.
    • Recommendation: Either move the constants to a lightweight shared module that both files import, or add a focused parity test comparing `docker-driver-sandbox-recovery.ts` constants to `docker-gpu-patch.ts` constants.
    • Evidence: `OPENSHELL_MANAGED_BY_LABEL`, `OPENSHELL_MANAGED_BY_VALUE`, and `OPENSHELL_SANDBOX_NAME_LABEL` are duplicated locally; `docker-driver-sandbox-recovery.test.ts` does not import the source constants or assert parity.
  • Recovery helper does not validate sandbox names at its own boundary (src/lib/onboard/docker-driver-sandbox-recovery.ts:144): `recoverDockerDriverSandbox` is exported and uses `sandboxName` directly in the Docker label filter. The surrounding CLI usually works with registered sandbox names, and Docker is invoked through argv arrays with `shell:false`, so this is not an obvious command injection path. Still, the helper itself does not enforce the existing RFC1123 sandbox-name contract at this high-risk sandbox lifecycle boundary.
    • Recommendation: Validate `sandboxName` with the existing `validateName` helper or make the caller/callee contract explicit and covered by a negative test for malformed sandbox names.
    • Evidence: `findLabeledSandboxContainers` builds `label=${OPENSHELL_SANDBOX_NAME_LABEL}=${sandboxName}` without local validation; `runner.ts` already exposes `validateName` for RFC1123 name enforcement.
  • Caller-level recovery and failure paths are not tested (src/lib/actions/sandbox/gateway-state.ts:283): The new tests exercise the Docker helper branches, but the changed behavior also depends on `reconcileMissingAgainstNamedGateway`, OpenShell re-query results, and user-facing status rendering. There is no test proving that recovery metadata is only set after a present re-query, that recovery failure preserves the non-destructive missing guidance, or that the connected-other/select path composes correctly with Docker recovery.
    • Recommendation: Add caller-level tests around `reconcileMissingAgainstNamedGateway` and status output for Docker recovered + present, Docker recovered + still NotFound, Docker recovery exception/failure, and connected-other -> select -> healthy_named missing -> Docker recovery.
    • Evidence: `docker-driver-sandbox-recovery.test.ts` covers helper dispatch only; the existing gateway-state tests do not exercise the new `healthy_named + missing` Docker recovery branch.
  • Large gateway-state hotspot grows with another recovery workaround (src/lib/actions/sandbox/gateway-state.ts:226): `gateway-state.ts` is already a large, frequently changed sandbox lifecycle hotspot and this PR adds more Docker recovery orchestration to it. The deterministic drift context flags about 60 lines of growth in this file, and the source-of-truth story for when this localized workaround can be removed is not captured in code.
    • Recommendation: Keep the gateway-state entry point thin by extracting the recovery orchestration/polling contract into the new recovery module or a small coordinator, and document the removal condition for this workaround once OpenShell/gateway state persistence handles post-reboot recovery directly.
    • Evidence: The PR adds Docker recovery import/types, new state fields, healthy_named recovery branching, and `tryRecoverDockerDriverSandbox` to `gateway-state.ts`; deterministic drift context reports `baseLines: 513`, `headLines: 573`, `delta: 60`.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/lib/onboard/docker-driver-sandbox-recovery.ts (1)

103-103: ⚡ Quick win

Unused dependency field: now.

The now field is defined in DockerDriverRecoveryDeps and assigned a default in depsWithDefaults (line 119), but it's never called anywhere in the module. The comment mentions "backup-name normalization" but this module only discovers existing backup containers and compares their names lexicographically—it doesn't construct new backup names.

Consider removing this field unless it's reserved for planned functionality.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard/docker-driver-sandbox-recovery.ts` at line 103, The
DockerDriverRecoveryDeps type includes an unused now?: () => number and
depsWithDefaults assigns a default but it's never used; remove the now field
from the DockerDriverRecoveryDeps interface/type and delete the corresponding
default assignment in depsWithDefaults (and any related fallback code or
imports), then run a quick search in this module for the "backup-name
normalization" comment and remove or update it if it references now; ensure any
callers constructing deps are updated to no longer pass now.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard/docker-driver-sandbox-recovery.test.ts`:
- Around line 15-27: The returned arrow functions in fakeStart, fakeRename, and
fakeCapture declare parameters that are unused; update their parameter names to
be prefixed with an underscore to satisfy the lint rule (e.g., in fakeStart
change (name, opts?) to (_name, _opts?), in fakeRename change (oldName, newName,
opts?) to (_oldName, _newName, _opts?), and in fakeCapture change (args) to
(_args)); keep the same types and return values, only rename the parameters.

---

Nitpick comments:
In `@src/lib/onboard/docker-driver-sandbox-recovery.ts`:
- Line 103: The DockerDriverRecoveryDeps type includes an unused now?: () =>
number and depsWithDefaults assigns a default but it's never used; remove the
now field from the DockerDriverRecoveryDeps interface/type and delete the
corresponding default assignment in depsWithDefaults (and any related fallback
code or imports), then run a quick search in this module for the "backup-name
normalization" comment and remove or update it if it references now; ensure any
callers constructing deps are updated to no longer pass now.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 678c1cb2-110e-4318-878c-02f792c85e72

📥 Commits

Reviewing files that changed from the base of the PR and between 4515eda and 4c739bd.

📒 Files selected for processing (4)
  • src/lib/actions/sandbox/gateway-state.ts
  • src/lib/actions/sandbox/status.ts
  • src/lib/onboard/docker-driver-sandbox-recovery.test.ts
  • src/lib/onboard/docker-driver-sandbox-recovery.ts

Comment on lines +15 to +27
function fakeStart(status = 0): (name: string, opts?: Record<string, unknown>) => FakeRunResult {
return () => ({ status });
}

function fakeRename(
status = 0,
): (oldName: string, newName: string, opts?: Record<string, unknown>) => FakeRunResult {
return () => ({ status });
}

function fakeCapture(output: string): (args: readonly string[]) => string {
return () => output;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Prefix unused parameters with underscore.

The returned arrow functions in fakeStart, fakeRename, and fakeCapture receive parameters that are never used. According to the coding guidelines, unused variables must be prefixed with underscore.

🔧 Proposed fix
-function fakeStart(status = 0): (name: string, opts?: Record<string, unknown>) => FakeRunResult {
-  return () => ({ status });
+function fakeStart(status = 0): (_name: string, _opts?: Record<string, unknown>) => FakeRunResult {
+  return (_name, _opts) => ({ status });
 }
 
 function fakeRename(
   status = 0,
-): (oldName: string, newName: string, opts?: Record<string, unknown>) => FakeRunResult {
-  return () => ({ status });
+): (_oldName: string, _newName: string, _opts?: Record<string, unknown>) => FakeRunResult {
+  return (_oldName, _newName, _opts) => ({ status });
 }
 
-function fakeCapture(output: string): (args: readonly string[]) => string {
-  return () => output;
+function fakeCapture(output: string): (_args: readonly string[]) => string {
+  return (_args) => output;
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
function fakeStart(status = 0): (name: string, opts?: Record<string, unknown>) => FakeRunResult {
return () => ({ status });
}
function fakeRename(
status = 0,
): (oldName: string, newName: string, opts?: Record<string, unknown>) => FakeRunResult {
return () => ({ status });
}
function fakeCapture(output: string): (args: readonly string[]) => string {
return () => output;
}
function fakeStart(status = 0): (_name: string, _opts?: Record<string, unknown>) => FakeRunResult {
return (_name, _opts) => ({ status });
}
function fakeRename(
status = 0,
): (_oldName: string, _newName: string, _opts?: Record<string, unknown>) => FakeRunResult {
return (_oldName, _newName, _opts) => ({ status });
}
function fakeCapture(output: string): (_args: readonly string[]) => string {
return (_args) => output;
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard/docker-driver-sandbox-recovery.test.ts` around lines 15 - 27,
The returned arrow functions in fakeStart, fakeRename, and fakeCapture declare
parameters that are unused; update their parameter names to be prefixed with an
underscore to satisfy the lint rule (e.g., in fakeStart change (name, opts?) to
(_name, _opts?), in fakeRename change (oldName, newName, opts?) to (_oldName,
_newName, _opts?), and in fakeCapture change (args) to (_args)); keep the same
types and return values, only rename the parameters.

Source: Coding guidelines

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 27244345220
Target ref: 4c739bd13d88931d82d314a8a0d46a7eff85ceb4
Workflow ref: main
Requested jobs: sandbox-survival-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
sandbox-survival-e2e ✅ success

@cv cv enabled auto-merge (squash) June 10, 2026 00:35
@cv cv disabled auto-merge June 10, 2026 00:35
@wscurran wscurran added area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression labels Jun 10, 2026
@cv cv added v0.0.63 Release target and removed v0.0.62 Release target labels Jun 10, 2026
@jyaunches jyaunches merged commit 39f20a4 into main Jun 10, 2026
38 checks passed
@jyaunches jyaunches deleted the issue-4423-post-reboot-status-destroys-sandbox branch June 10, 2026 02:41
miyoungc added a commit that referenced this pull request Jun 11, 2026
## Summary
- Add the v0.0.63 release-note section using the published development
note as source context.
- Update source docs for sandbox recovery, OpenClaw config restore
safety, managed vLLM selection, Slack Socket Mode conflict handling, and
host diagnostics.
- Refresh generated `nemoclaw-user-*` skills from the updated Fern MDX
docs.
- Update the release-doc refresh skill so post-release docs for version
`n` look up the matching announcement discussion and use the `n+1` patch
release label.
- Fix CLI/docs parity by avoiding a `--from <Dockerfile>` flag mention
inside the `upgrade-sandboxes` command section.

## Source summary
- #5034 -> `docs/reference/troubleshooting.mdx`,
`docs/about/release-notes.mdx`: Document safer stale-sandbox recovery
through `rebuild --yes` before recreating from scratch.
- #5091 -> `docs/reference/troubleshooting.mdx`,
`docs/about/release-notes.mdx`: Document Docker-driver post-reboot
recovery from OpenShell container labels.
- #5101, #5174, #5177 -> `docs/manage-sandboxes/backup-restore.mdx`,
`docs/about/release-notes.mdx`: Document OpenClaw `openclaw.json`
preservation, merge behavior, and fail-safe restore handling.
- #5102 -> `docs/reference/commands.mdx`,
`docs/reference/commands-nemohermes.mdx`,
`docs/manage-sandboxes/lifecycle.mdx`, `docs/about/release-notes.mdx`:
Document `upgrade-sandboxes` image-fingerprint drift detection.
- #4201 -> `docs/reference/troubleshooting.mdx`,
`docs/about/release-notes.mdx`: Document the installer diagnostic for
unexpected Docker daemon access outside the `docker` group.
- #5038 -> `docs/inference/inference-options.mdx`,
`docs/inference/use-local-inference.mdx`,
`docs/about/release-notes.mdx`: Document the interactive managed-vLLM
model picker and non-interactive override behavior.
- #5040, #5041 -> `docs/reference/troubleshooting.mdx`,
`docs/about/release-notes.mdx`: Document Ollama auth-proxy recovery and
host DNS preflight diagnostics.
- #4986, #5039 -> `docs/manage-sandboxes/messaging-channels.mdx`,
`docs/about/release-notes.mdx`: Document Slack validation and duplicate
Slack Socket Mode sandbox handling.
- #4981, #5168 -> `docs/about/release-notes.mdx`: Capture Hermes gateway
secret-guard and wrapped-argv startup hardening in the release surface.
- Follow-up ->
`.agents/skills/nemoclaw-contributor-update-docs/SKILL.md`: Record the
post-release docs workflow, discussion-announcement lookup, and
next-patch release label rule.
- Follow-up -> `docs/reference/commands.mdx`,
`docs/reference/commands-nemohermes.mdx`: Reword custom Dockerfile
sandbox text so CLI parity does not treat `--from` as an
`upgrade-sandboxes` flag.

## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user --doc-platform fern-mdx`
- `npm run docs`
- `npm run build:cli`
- `bash test/e2e/e2e-cloud-experimental/check-docs.sh --only-cli`
- Skip-term scan for `docs/.docs-skip` blocked terms across generated
user skills

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Documentation**
* Enhanced local inference setup with interactive model selection
prompts and environment variable overrides
* Improved sandbox upgrade detection using build fingerprints and
version checks
* Clarified configuration restore behavior preserving user settings
during rebuild/restore
  * Added gateway authentication as fifth security layer
  * Expanded Slack messaging validation with live credential checking
* Enhanced troubleshooting guidance for Docker access, DNS issues, and
sandbox recovery
* Updated release notes for v0.0.63 featuring sandbox recovery and
inference improvements

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression v0.0.63 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants