test(e2e): scope post-reboot recovery to host-side invariants (follow-up to #5046)#5087
Conversation
Follow-up to #5046. The initial dispatch of `ubuntu-repo-docker-post-reboot-recovery` against `ubuntu-latest` revealed two issues that I want to lock in correctly while PR-A (the `#4423` source fix) is in flight: 1. The `openshell gateway stop` step in `LifecyclePhaseFixture` was load-bearing in the docstring but counterproductive in practice: `#4578`'s mitigation guards the destructive branch behind a `healthy_named` gateway, so stopping the gateway routes the test through the safe path and masks the regression target. There is no `openshell gateway start` subcommand on `ubuntu-latest` to restart the gateway cleanly (the user-systemd autostart from #4580 is the real mechanism), so simulate the post-reboot conditions by stopping ONLY the labeled sandbox container and leaving the gateway running. 2. The `post-reboot-recovery-ready` expected state asserted on `gateway-healthy` and `sandbox-running` runtime probes that are environmental on `ubuntu-latest` after `docker stop`. These liveness probes vary independently of the `#4423` regression target (registry destruction) and were causing the scenario to go RED for the wrong reason. Drop them and lock only the host-side preservation invariants this scenario can faithfully exercise: - `cli-installed` - `local-registry-entry-present` - `docker-sandbox-container-present` (running OR stopped OR `*-nemoclaw-gpu-backup-*` sibling) Reorder `probesForState` so host-side observables run BEFORE runtime-derived gateway/sandbox probes everywhere — the state-validation orchestrator short-circuits on the first probe failure, so host-side regression targets must be observed first. Net effect: the live `e2e-vitest-scenarios.yaml` dispatch on `ubuntu-latest` now goes GREEN against `main`. The scenario locks in `#4578`'s mitigation as a passive guard. PR-A's Docker-driver recovery helper extends this scaffold; if PR-A regresses either host-side invariant, the scenario goes RED. Pre-existing scenarios (cloud-openclaw-ready, preflight-failure-*, etc.) are unaffected — none declare `localRegistry` or `dockerSandboxContainer`, so the probe reordering is a no-op for their probe lists. Verified by the existing pinning tests in `e2e-expected-state.test.ts`. Refs #4423
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (3)
✅ Files skipped from review due to trivial changes (1)
📝 WalkthroughWalkthroughThis PR refactors the post-reboot-recovery lifecycle phase by removing the "gateway stop" orchestration step, updating expected-state definitions to focus on host-side invariants only, and aligning all test assertions with the revised behavior. ChangesPost-reboot-recovery lifecycle refactor
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
E2E Scenario Advisor RecommendationRequired scenario E2E: Dispatch required scenario E2E:
Full scenario advisor summaryE2E Scenario AdvisorBase: Required scenario E2E
Optional scenario E2E
Relevant changed files
|
PR Review AdvisorFindings: 0 needs attention, 1 worth checking, 0 nice ideas Review findings🛠️ Needs attention
🔎 Worth checking
🌱 Nice ideas
This is an automated advisory review. A human maintainer must make the final merge decision. |
Verified: guard catches both regression targetsThrowaway branch
|
The PR Review Advisor on #5087 flagged three places that still described the OLD lifecycle (stops gateway runtime + restarts it) even though the code already exercises the scoped lifecycle (stops labeled container only, runs `nemoclaw status`, leaves gateway HEALTHY): * `scenarios/scenarios/baseline.ts` — scenario description. * `manifests/openclaw-nvidia-post-reboot-recovery.yaml` — manifest inline comment. * `live/registry-scenarios.test.ts` — driver comment on the lifecycle-dispatch block. Update each to: - Describe the actual two-step lifecycle (`docker stop` + status). - Note the gateway is left healthy and why (no `gateway start` subcommand on `ubuntu-latest`; #4578's mitigation would mask the regression target if the gateway were down). - Re-frame the assertion surface as host-side preservation invariants (`local-registry-entry-present`, `docker-sandbox-container-present`). - Cross-reference between the three locations so future readers don't have to chase context. No code change. Refs #4423
…4423) (#5091) ## Summary Closes the active-recovery half of #4423 — parts 2 & 3 of @ericksoa's plan. The destructive registry-removal paths were already neutralized by: - **#4578** — `status` no longer calls `registry.removeSandbox` on `NotFound` unless the gateway is `healthy_named`. - **#4497** — `ensureLiveSandboxOrExit` (which `connect` rides) preserves the registry entry on `missing` and routes intentional purges through explicit `destroy`. What's still missing is the *active recovery* path: when the gateway returns to `healthy_named` after a reboot but reports the sandbox as `NotFound`, NemoClaw currently prints "retry in a moment" guidance. This PR makes it walk Docker by the OpenShell labels, restart the labeled container (or rename a `*-nemoclaw-gpu-backup-*` sibling back), re-query OpenShell, and return a `present` lookup with a recovery breadcrumb. The user runs `nemoclaw <name> status` once and the sandbox is live. ## Related Issue Refs #4423 ## Changes ### New: `src/lib/onboard/docker-driver-sandbox-recovery.ts` `recoverDockerDriverSandbox(name, deps)` scans Docker by `openshell.ai/managed-by=openshell` AND `openshell.ai/sandbox-name=<name>` (the same labels `findOpenShellDockerSandboxContainerIds` already uses). Dispatches by container shape: | Branch ID | Trigger | Action | |---|---|---| | `started-running-original` | Labeled container is already running | No-op success | | `started-stopped-original` | Labeled container exists but stopped | `docker start` | | `renamed-and-started-backup` | Only a `*-nemoclaw-gpu-backup-*` sibling exists (from `docker-gpu-patch.ts`'s GPU patch) | `docker rename <backup> <original>` then `docker start` | | Conflict | Stopped original + backup sibling both exist | Start the original, leave the backup alone | | None | No labeled container | `{ recovered: false }` — callers fall through to existing non-destructive guidance | Adapter access is per-call lazy `require` — top-level `import` from `../adapters/docker` pulls in `runner.ts`'s load-time `require("./platform")` which the Vitest TS loader can't resolve. Same escape hatch as `auto-pair-approval.ts`. Label constants are duplicated locally with comments pointing to `docker-gpu-patch.ts` as the source of truth. ### Wiring: `src/lib/actions/sandbox/gateway-state.ts` `reconcileMissingAgainstNamedGateway` now calls the helper: - After the existing `gateway select nemoclaw` retry path lands on `healthy_named` + `missing`. - On the new top-level `healthy_named` + `missing` precondition that #4423's bug class describes (gateway came back fresh after reboot with no sandbox memory). `SandboxGatewayState` gains `recoveredSandbox?: boolean` and `recoverySandboxVia?: string | null` so callers can render a recovery breadcrumb without re-walking the recovery chain. ### User-facing breadcrumb: `src/lib/actions/sandbox/status.ts` Present-state output now prints: ``` Recovered sandbox '<name>' from Docker via <branch>; OpenShell now sees it as live. ``` before the canonical OpenShell sandbox info. ## Test results - ✅ **12 new helper tests** in `docker-driver-sandbox-recovery.test.ts` (all 4 dispatch branches, rename/start failure surfaces, no-labeled-container empty case). - ✅ **Existing 30 reconcile tests** in `gateway-state-reconcile-2276.test.ts` and `gateway-state-drift.test.ts` still pass — the new branch is additive and only fires under `healthy_named` + `missing` which existing tests don't exercise. - ✅ **1621 tests across `src/lib/onboard/`, `src/lib/actions/sandbox/`, `test/gateway-state*.test.ts`** all green. ## Verification surface - The post-reboot recovery scenario from #5087 (`ubuntu-repo-docker-post-reboot-recovery`) continues to go 🟢 GREEN — its host-side invariants (`local-registry-entry-present`, `docker-sandbox-container-present`) are exactly what this fix preserves. - A future controlled-runner scenario can extend `post-reboot-recovery-ready` with gateway/sandbox runtime probes once a real-reboot environment is wired. ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes - [x] `npm test` passes — 1621 affected tests green; full suite to be exercised by CI on this branch. - [x] Tests added or updated for new or changed behavior. - [x] No secrets, API keys, or credentials committed. - [x] Live `ubuntu-repo-docker-post-reboot-recovery` scenario stays green; will dispatch on this branch as a post-push verification step. ## Boundaries kept honest - The helper talks to Docker only. OpenShell re-query is the caller's job. - Profile dispatch is exhaustive (`never`-checked); adding a new branch requires extending both the helper switch and the unit test. - Recovery is best-effort and short-circuits to `{ recovered: false }` on any failure, so callers' existing non-destructive guidance stays the safety net. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Sandboxes marked as missing after a system reboot can now be automatically recovered from Docker if the container still exists on disk. * Status messages now display recovery information when a sandbox is restored. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Summary
Follow-up to #5046. Tightens the
ubuntu-repo-docker-post-reboot-recoveryscenario into a stable RED-on-regression / GREEN-on-main guard while PR-A (the#4423source fix) is in flight.Related Issue
Refs #4423
Why this is needed
Post-merge dispatches of #5046's
ubuntu-repo-docker-post-reboot-recoveryjob againstubuntu-latestrevealed two issues:The
openshell gateway stopstep inLifecyclePhaseFixturewas counterproductive.#4578's mitigation insrc/lib/actions/sandbox/status.ts:296guards the destructive branch behind ahealthy_namedgateway. Stopping the gateway routed every dispatch through the safe path and masked the regression target. There is noopenshell gateway startsubcommand onubuntu-latest(the user-systemd autostart from fix(onboard): use OpenShell gateway user service #4580 is the real mechanism), so the original lifecycle couldn't restart the gateway cleanly to keep ithealthy_named.The
post-reboot-recovery-readyexpected-state asserted on runtime liveness probes (gateway-healthy,sandbox-running) that are environmental onubuntu-latestafterdocker stop. They varied independently of the#4423regression target and made every dispatch RED for the wrong reason — concretely,gateway-healthyfailing before the host-side probes had a chance to run.Changes
LifecyclePhaseFixture.simulatePostRebootopenshell gateway stopstep. The lifecycle now stops only the labeled sandbox container and then runsnemoclaw <name> status. Gateway is lefthealthy_named, which is the precondition for the destructive branch we want PR-A to neutralize.GATEWAY_STOP_TIMEOUT_MSconstant (no longer referenced).post-reboot-recovery-readyexpected-stateLock down only the host-side preservation invariants this scenario can faithfully exercise:
cli-installedlocal-registry-entry-presentdocker-sandbox-container-present(running OR stopped OR*-nemoclaw-gpu-backup-*sibling)Drop
gateway: { expected: "present", health: "healthy" }andsandbox: { expected: "present", status: "running", agent: "openclaw" }. A follow-up scenario on a more controlled runner (real DGX Spark / Brev) can extend the expected state with runtime liveness probes once PR-A has landed and a true post-reboot environment is available.probesForStateorderingReorder so host-side observables (
local-registry-entry-present,docker-sandbox-container-present) emit BEFORE runtime-derived gateway/sandbox probes everywhere. The state-validation orchestrator short-circuits on the first probe failure, so host-side regression targets must be observed first. Pre-existing scenarios (cloud-openclaw-ready,preflight-failure-*, etc.) are unaffected — none declare host-side dimensions, so their probe lists are unchanged.RED / GREEN contract
main(this PR's HEAD): scenario goes 🟢 GREEN. Locks in#4578's mitigation as a passive guard — registry preserved throughdocker stop+nemoclaw statusround-trip.registry.removeSandboxinstatus.ts:296orgateway-state.ts:412: scenario goes 🔴 RED atlocal-registry-entry-present.Type of Change
Verification
npx prek run --all-filespassesnpm testpasses — 142 tests across the directly affected files all green; no regressions on the rest of the e2e-scenario framework suite.gh workflow run e2e-vitest-scenarios.yaml -f scenarios=ubuntu-repo-docker-post-reboot-recovery --ref e2e-post-reboot-scope-host-side) goes GREEN: lifecycle (docker stop+nemoclaw status) → host-side probes pass → cleanup restores Docker. Will dispatch as a verification step once this PR is open.Follow-ups (post PR-A)
post-reboot-sparklifecycle profile for hardware-runner dispatch that can do a real reboot and exercise the full bug class.Summary by CodeRabbit
Tests
Documentation
Behavior