fix(sandbox): recover Docker-driver sandbox from labels post-reboot (#4423)#5091
Conversation
Closes the active-recovery half of #4423 (parts 2 & 3 of @ericksoa's plan). #4578 and #4497 already neutralized the destructive `missing` branches in `status` and `ensureLiveSandboxOrExit`; the registry entry survives a healthy_named-gateway + NotFound-sandbox round-trip. This PR adds the actively recoverable path so the user's next command sees a live sandbox instead of guidance-and-retry. Adds `src/lib/onboard/docker-driver-sandbox-recovery.ts` with `recoverDockerDriverSandbox(name, deps)`. It scans Docker by the OpenShell labels (`openshell.ai/managed-by=openshell` AND `openshell.ai/sandbox-name=<name>`) and dispatches by container shape: 1. `started-running-original` — labeled container is already running; no-op success. 2. `started-stopped-original` — labeled container exists but is stopped; `docker start`. 3. `renamed-and-started-backup` — only a `*-nemoclaw-gpu-backup-*` sibling exists (from the GPU patch path in `docker-gpu-patch.ts`); rename back to the original and start. 4. Conflict (backup + stale stopped original with the same base name) — start the original, leave the backup alone. 5. No labeled container at all → `{ recovered: false }` so callers fall through to existing non-destructive guidance. Wires the helper into `gateway-state.ts::reconcileMissingAgainstNamedGateway`. After the existing `gateway select nemoclaw` retry path AND on the new `healthy_named` + `missing` precondition that #4423 reports, the reconciler invokes the helper and re-queries OpenShell. When recovery succeeds, the returned `SandboxGatewayState` carries `recoveredSandbox: true` + `recoverySandboxVia: <branch id>` so callers can render a recovery breadcrumb without re-walking the recovery chain themselves. `status.ts`'s present-state path now prints `Recovered sandbox '<name>' from Docker via <branch>; OpenShell now sees it as live.` so the user knows why a previously-NotFound sandbox is suddenly Ready. Boundary: * The helper talks to Docker only. OpenShell re-query is the caller's job (`getSandboxGatewayState` in `gateway-state.ts`). * Adapter access is per-call lazy `require` — top-level import of `../adapters/docker` would pull in `runner.ts`'s load-time `require("./platform")` which the Vitest TS loader can't resolve. Same pattern as `auto-pair-approval.ts`. * Label constants are duplicated locally with comments pointing to `docker-gpu-patch.ts` as the source of truth. Tests: * `docker-driver-sandbox-recovery.test.ts` — 12 cases across discover parsing, all 4 dispatch branches, rename/start failure surfaces, and the no-labeled-container empty case. * Existing gateway-state tests (15 in `gateway-state-reconcile-2276.test.ts`, 2 in `gateway-state-drift.test.ts`) remain green; the new branch is additive and only fires under `healthy_named` + `missing` which none of them exercise. Verification surface: * The post-reboot recovery scenario shipped by #5087 (`ubuntu-repo-docker-post-reboot-recovery`) continues to go GREEN — its host-side invariants are exactly what this fix preserves. * A future controlled-runner scenario can extend the `post-reboot-recovery-ready` expected state with gateway/sandbox runtime probes once a real-reboot environment is wired. Refs #4423
📝 WalkthroughWalkthroughThis PR introduces Docker-driver sandbox recovery to handle cases where OpenShell reports a labeled sandbox as ChangesDocker-driver sandbox recovery integration
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
E2E Advisor RecommendationRequired E2E: None Full advisor summaryE2E Recommendation AdvisorFailed: Could not parse JSON from advisor output; see /home/runner/work/NemoClaw/NemoClaw/artifacts/e2e-advisor/e2e-advisor-raw-output.txt |
E2E Scenario Advisor RecommendationRequired scenario E2E: None Full scenario advisor summaryE2E Scenario AdvisorFailed: Could not parse JSON from advisor output; see /home/runner/work/NemoClaw/NemoClaw/artifacts/e2e-advisor/e2e-scenario-advisor-raw-output.txt |
PR Review AdvisorFindings: 2 needs attention, 9 worth checking, 0 nice ideas Review findings🛠️ Needs attention
🔎 Worth checking
🌱 Nice ideas
Consider writing more tests for
Since last review detailsCurrent findings:
This is an automated advisory review. A human maintainer must make the final merge decision. |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/lib/onboard/docker-driver-sandbox-recovery.ts (1)
103-103: ⚡ Quick winUnused dependency field:
now.The
nowfield is defined inDockerDriverRecoveryDepsand assigned a default indepsWithDefaults(line 119), but it's never called anywhere in the module. The comment mentions "backup-name normalization" but this module only discovers existing backup containers and compares their names lexicographically—it doesn't construct new backup names.Consider removing this field unless it's reserved for planned functionality.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard/docker-driver-sandbox-recovery.ts` at line 103, The DockerDriverRecoveryDeps type includes an unused now?: () => number and depsWithDefaults assigns a default but it's never used; remove the now field from the DockerDriverRecoveryDeps interface/type and delete the corresponding default assignment in depsWithDefaults (and any related fallback code or imports), then run a quick search in this module for the "backup-name normalization" comment and remove or update it if it references now; ensure any callers constructing deps are updated to no longer pass now.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard/docker-driver-sandbox-recovery.test.ts`:
- Around line 15-27: The returned arrow functions in fakeStart, fakeRename, and
fakeCapture declare parameters that are unused; update their parameter names to
be prefixed with an underscore to satisfy the lint rule (e.g., in fakeStart
change (name, opts?) to (_name, _opts?), in fakeRename change (oldName, newName,
opts?) to (_oldName, _newName, _opts?), and in fakeCapture change (args) to
(_args)); keep the same types and return values, only rename the parameters.
---
Nitpick comments:
In `@src/lib/onboard/docker-driver-sandbox-recovery.ts`:
- Line 103: The DockerDriverRecoveryDeps type includes an unused now?: () =>
number and depsWithDefaults assigns a default but it's never used; remove the
now field from the DockerDriverRecoveryDeps interface/type and delete the
corresponding default assignment in depsWithDefaults (and any related fallback
code or imports), then run a quick search in this module for the "backup-name
normalization" comment and remove or update it if it references now; ensure any
callers constructing deps are updated to no longer pass now.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 678c1cb2-110e-4318-878c-02f792c85e72
📒 Files selected for processing (4)
src/lib/actions/sandbox/gateway-state.tssrc/lib/actions/sandbox/status.tssrc/lib/onboard/docker-driver-sandbox-recovery.test.tssrc/lib/onboard/docker-driver-sandbox-recovery.ts
| function fakeStart(status = 0): (name: string, opts?: Record<string, unknown>) => FakeRunResult { | ||
| return () => ({ status }); | ||
| } | ||
|
|
||
| function fakeRename( | ||
| status = 0, | ||
| ): (oldName: string, newName: string, opts?: Record<string, unknown>) => FakeRunResult { | ||
| return () => ({ status }); | ||
| } | ||
|
|
||
| function fakeCapture(output: string): (args: readonly string[]) => string { | ||
| return () => output; | ||
| } |
There was a problem hiding this comment.
Prefix unused parameters with underscore.
The returned arrow functions in fakeStart, fakeRename, and fakeCapture receive parameters that are never used. According to the coding guidelines, unused variables must be prefixed with underscore.
🔧 Proposed fix
-function fakeStart(status = 0): (name: string, opts?: Record<string, unknown>) => FakeRunResult {
- return () => ({ status });
+function fakeStart(status = 0): (_name: string, _opts?: Record<string, unknown>) => FakeRunResult {
+ return (_name, _opts) => ({ status });
}
function fakeRename(
status = 0,
-): (oldName: string, newName: string, opts?: Record<string, unknown>) => FakeRunResult {
- return () => ({ status });
+): (_oldName: string, _newName: string, _opts?: Record<string, unknown>) => FakeRunResult {
+ return (_oldName, _newName, _opts) => ({ status });
}
-function fakeCapture(output: string): (args: readonly string[]) => string {
- return () => output;
+function fakeCapture(output: string): (_args: readonly string[]) => string {
+ return (_args) => output;
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| function fakeStart(status = 0): (name: string, opts?: Record<string, unknown>) => FakeRunResult { | |
| return () => ({ status }); | |
| } | |
| function fakeRename( | |
| status = 0, | |
| ): (oldName: string, newName: string, opts?: Record<string, unknown>) => FakeRunResult { | |
| return () => ({ status }); | |
| } | |
| function fakeCapture(output: string): (args: readonly string[]) => string { | |
| return () => output; | |
| } | |
| function fakeStart(status = 0): (_name: string, _opts?: Record<string, unknown>) => FakeRunResult { | |
| return (_name, _opts) => ({ status }); | |
| } | |
| function fakeRename( | |
| status = 0, | |
| ): (_oldName: string, _newName: string, _opts?: Record<string, unknown>) => FakeRunResult { | |
| return (_oldName, _newName, _opts) => ({ status }); | |
| } | |
| function fakeCapture(output: string): (_args: readonly string[]) => string { | |
| return (_args) => output; | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/lib/onboard/docker-driver-sandbox-recovery.test.ts` around lines 15 - 27,
The returned arrow functions in fakeStart, fakeRename, and fakeCapture declare
parameters that are unused; update their parameter names to be prefixed with an
underscore to satisfy the lint rule (e.g., in fakeStart change (name, opts?) to
(_name, _opts?), in fakeRename change (oldName, newName, opts?) to (_oldName,
_newName, _opts?), and in fakeCapture change (args) to (_args)); keep the same
types and return values, only rename the parameters.
Source: Coding guidelines
Selective E2E Results — ✅ All requested jobs passedRun: 27244345220
|
|
✨ Related open issues: |
## Summary - Add the v0.0.63 release-note section using the published development note as source context. - Update source docs for sandbox recovery, OpenClaw config restore safety, managed vLLM selection, Slack Socket Mode conflict handling, and host diagnostics. - Refresh generated `nemoclaw-user-*` skills from the updated Fern MDX docs. - Update the release-doc refresh skill so post-release docs for version `n` look up the matching announcement discussion and use the `n+1` patch release label. - Fix CLI/docs parity by avoiding a `--from <Dockerfile>` flag mention inside the `upgrade-sandboxes` command section. ## Source summary - #5034 -> `docs/reference/troubleshooting.mdx`, `docs/about/release-notes.mdx`: Document safer stale-sandbox recovery through `rebuild --yes` before recreating from scratch. - #5091 -> `docs/reference/troubleshooting.mdx`, `docs/about/release-notes.mdx`: Document Docker-driver post-reboot recovery from OpenShell container labels. - #5101, #5174, #5177 -> `docs/manage-sandboxes/backup-restore.mdx`, `docs/about/release-notes.mdx`: Document OpenClaw `openclaw.json` preservation, merge behavior, and fail-safe restore handling. - #5102 -> `docs/reference/commands.mdx`, `docs/reference/commands-nemohermes.mdx`, `docs/manage-sandboxes/lifecycle.mdx`, `docs/about/release-notes.mdx`: Document `upgrade-sandboxes` image-fingerprint drift detection. - #4201 -> `docs/reference/troubleshooting.mdx`, `docs/about/release-notes.mdx`: Document the installer diagnostic for unexpected Docker daemon access outside the `docker` group. - #5038 -> `docs/inference/inference-options.mdx`, `docs/inference/use-local-inference.mdx`, `docs/about/release-notes.mdx`: Document the interactive managed-vLLM model picker and non-interactive override behavior. - #5040, #5041 -> `docs/reference/troubleshooting.mdx`, `docs/about/release-notes.mdx`: Document Ollama auth-proxy recovery and host DNS preflight diagnostics. - #4986, #5039 -> `docs/manage-sandboxes/messaging-channels.mdx`, `docs/about/release-notes.mdx`: Document Slack validation and duplicate Slack Socket Mode sandbox handling. - #4981, #5168 -> `docs/about/release-notes.mdx`: Capture Hermes gateway secret-guard and wrapped-argv startup hardening in the release surface. - Follow-up -> `.agents/skills/nemoclaw-contributor-update-docs/SKILL.md`: Record the post-release docs workflow, discussion-announcement lookup, and next-patch release label rule. - Follow-up -> `docs/reference/commands.mdx`, `docs/reference/commands-nemohermes.mdx`: Reword custom Dockerfile sandbox text so CLI parity does not treat `--from` as an `upgrade-sandboxes` flag. ## Verification - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx` - `npm run docs` - `npm run build:cli` - `bash test/e2e/e2e-cloud-experimental/check-docs.sh --only-cli` - Skip-term scan for `docs/.docs-skip` blocked terms across generated user skills <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Enhanced local inference setup with interactive model selection prompts and environment variable overrides * Improved sandbox upgrade detection using build fingerprints and version checks * Clarified configuration restore behavior preserving user settings during rebuild/restore * Added gateway authentication as fifth security layer * Expanded Slack messaging validation with live credential checking * Enhanced troubleshooting guidance for Docker access, DNS issues, and sandbox recovery * Updated release notes for v0.0.63 featuring sandbox recovery and inference improvements <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
Closes the active-recovery half of #4423 — parts 2 & 3 of @ericksoa's plan. The destructive registry-removal paths were already neutralized by:
statusno longer callsregistry.removeSandboxonNotFoundunless the gateway ishealthy_named.ensureLiveSandboxOrExit(whichconnectrides) preserves the registry entry onmissingand routes intentional purges through explicitdestroy.What's still missing is the active recovery path: when the gateway returns to
healthy_namedafter a reboot but reports the sandbox asNotFound, NemoClaw currently prints "retry in a moment" guidance. This PR makes it walk Docker by the OpenShell labels, restart the labeled container (or rename a*-nemoclaw-gpu-backup-*sibling back), re-query OpenShell, and return apresentlookup with a recovery breadcrumb. The user runsnemoclaw <name> statusonce and the sandbox is live.Related Issue
Refs #4423
Changes
New:
src/lib/onboard/docker-driver-sandbox-recovery.tsrecoverDockerDriverSandbox(name, deps)scans Docker byopenshell.ai/managed-by=openshellANDopenshell.ai/sandbox-name=<name>(the same labelsfindOpenShellDockerSandboxContainerIdsalready uses). Dispatches by container shape:started-running-originalstarted-stopped-originaldocker startrenamed-and-started-backup*-nemoclaw-gpu-backup-*sibling exists (fromdocker-gpu-patch.ts's GPU patch)docker rename <backup> <original>thendocker start{ recovered: false }— callers fall through to existing non-destructive guidanceAdapter access is per-call lazy
require— top-levelimportfrom../adapters/dockerpulls inrunner.ts's load-timerequire("./platform")which the Vitest TS loader can't resolve. Same escape hatch asauto-pair-approval.ts.Label constants are duplicated locally with comments pointing to
docker-gpu-patch.tsas the source of truth.Wiring:
src/lib/actions/sandbox/gateway-state.tsreconcileMissingAgainstNamedGatewaynow calls the helper:gateway select nemoclawretry path lands onhealthy_named+missing.healthy_named+missingprecondition that [DGX Spark][Sandbox] post-reboot first 'nemoclaw <name> status' destroys sandbox via "Removed stale local registry entry" #4423's bug class describes (gateway came back fresh after reboot with no sandbox memory).SandboxGatewayStategainsrecoveredSandbox?: booleanandrecoverySandboxVia?: string | nullso callers can render a recovery breadcrumb without re-walking the recovery chain.User-facing breadcrumb:
src/lib/actions/sandbox/status.tsPresent-state output now prints:
before the canonical OpenShell sandbox info.
Test results
docker-driver-sandbox-recovery.test.ts(all 4 dispatch branches, rename/start failure surfaces, no-labeled-container empty case).gateway-state-reconcile-2276.test.tsandgateway-state-drift.test.tsstill pass — the new branch is additive and only fires underhealthy_named+missingwhich existing tests don't exercise.src/lib/onboard/,src/lib/actions/sandbox/,test/gateway-state*.test.tsall green.Verification surface
ubuntu-repo-docker-post-reboot-recovery) continues to go 🟢 GREEN — its host-side invariants (local-registry-entry-present,docker-sandbox-container-present) are exactly what this fix preserves.post-reboot-recovery-readywith gateway/sandbox runtime probes once a real-reboot environment is wired.Type of Change
Verification
npx prek run --all-filespassesnpm testpasses — 1621 affected tests green; full suite to be exercised by CI on this branch.ubuntu-repo-docker-post-reboot-recoveryscenario stays green; will dispatch on this branch as a post-push verification step.Boundaries kept honest
never-checked); adding a new branch requires extending both the helper switch and the unit test.{ recovered: false }on any failure, so callers' existing non-destructive guidance stays the safety net.Summary by CodeRabbit