fix(status): preserve registry on missing live sandbox#4578
Conversation
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughWhen ChangesPreserve orphan entries instead of cleanup
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Auto-dispatched E2E: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
E2E Scenario Advisor RecommendationRequired scenario E2E: Dispatch required scenario E2E:
Full scenario advisor summaryE2E Scenario AdvisorBase: Required scenario E2E
Optional scenario E2E
Relevant changed files
|
PR Review AdvisorFindings: 0 needs attention, 1 worth checking, 0 nice ideas Review findings🛠️ Needs attention
🔎 Worth checking
🌱 Nice ideas
Since last review detailsCurrent findings:
This is an automated advisory review. A human maintainer must make the final merge decision. |
|
Advisor feedback addressed.
For the removal condition: passive |
Selective E2E Results — ✅ All requested jobs passedRun: 26715565554
|
|
@cv could you please review/approve this v0.0.56 mitigation when you get a chance? The advisor scope feedback has been addressed by changing the PR from Current validation is green:
|
cjagwani
left a comment
There was a problem hiding this comment.
Approving. Right shape for v0.0.56 — keep status passive so DGX Spark first-post-reboot doesn't nuke registry state before the gateway reconciles. Destructive cleanup stays on connect, which is the correct boundary.
Tests pin both sides (gateway-state-reconcile-2276.test.ts flips the assertion correctly, cli.test.ts covers the new guidance string). sandbox-operations-e2e, ubuntu-repo-cloud-openclaw, wsl-e2e, macos-e2e, test-e2e-sandbox, test-e2e-gateway-isolation, and test-e2e-port-overrides all passed. Scope properly narrowed to "partially addresses #4423" per advisor feedback.
Selective E2E Results — ✅ All requested jobs passedRun: 26774901829
|
…ve status (#4578) (#4672) <!-- markdownlint-disable MD041 --> ## Summary Replaces #4635 with the same diff on an upstream `NVIDIA/NemoClaw` branch so maintainers can dispatch E2E from the replacement PR. The `double-onboard-e2e` nightly job failed at Phase 5 because the test expected `nemoclaw status` to reconcile stale registry entries, but #4578 intentionally made `status` non-destructive. This PR aligns the test with that behavior by verifying the non-destructive `status` output first, then triggering actual reconciliation via `connect --probe-only`. ## Related Issue Fixes #4639 ## Changes - `test/e2e/test-double-onboard.sh`: replace the single `nemoclaw status` reconciliation assertion with a two-step sequence. - Verify `status` exits 1 and preserves the registry entry with the "No local registry entry was removed" guidance. - Trigger reconciliation via `connect --probe-only` and verify "Removed stale local registry entry" plus final registry cleanup. ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [ ] `npx prek run --all-files` passes - [ ] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes Note: local checks were not rerun for this replacement branch; it copies #4635 unchanged. #4635 reported focused validation for `double-onboard-e2e` on the same fix commit: https://github.com/hunglp6d/NemoClaw/actions/runs/26792577152. --- Signed-off-by: San Dang <sdang@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Improvements** * Clarified that the status command is non-destructive and explicitly reports when it does not remove local registry entries. * Status and connect now intentionally preserve stale registry entries unless destroyed. * Cleanup now relies on the explicit destroy action to remove registry entries. * Tests updated to verify the non-destructive status message and to avoid forcing reconciliation after teardown. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Hung Le <hple@nvidia.com> Signed-off-by: San Dang <sdang@nvidia.com> Signed-off-by: Carlos Villela <cvillela@nvidia.com> Co-authored-by: Hung Le <hple@nvidia.com> Co-authored-by: Carlos Villela <cvillela@nvidia.com>
## Summary Post-Computex / v0.0.60 follow-up for the gateway lifecycle half of #4423. This is intentionally separate from #4578, which remains the v0.0.56 safety hotfix for non-destructive status behavior. - Use OpenShell's package-managed `openshell-gateway` user service when its vendor/package unit is present, writing Docker-driver gateway env before restart. - Ignore per-user/stale gateway unit files so standalone recovery remains available when there is no package-managed OpenShell service. - Keep the existing standalone gateway launch path as an explicit compatibility fallback when the upstream service is unavailable. - Update docs/tests to describe package-service ownership vs. standalone fallback. ## Validation - `npm run build:cli` - `npm run typecheck:cli` - `npm run checks` - `npm run source-shape:check` - `npm run check:installer-hash` - `bash -n scripts/install-openshell.sh` - `bash test/e2e/test-openshell-version-pin.sh` - `npx vitest run src/lib/onboard/docker-driver-gateway-env.test.ts src/lib/onboard/docker-driver-gateway-service.test.ts test/install-openshell-version-check.test.ts test/runner.test.ts test/onboard-gateway-runtime.test.ts test/gateway-state-reconcile-2276.test.ts` Refs #4423 Follow-up to #4578 <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Clarified Deployment Topology, uninstall/state-dir contents, Apple Silicon sandbox behavior, and environment-variable guidance for the standalone fallback. * **New Features** * Installer on Linux now attempts to start a package-managed OpenShell gateway and falls back to the standalone gateway when appropriate. * **Tests** * Expanded unit and e2e tests to cover package-managed vs standalone gateway flows and realistic checksum-driven installer scenarios. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Aaron Erickson <aerickson@nvidia.com> Co-authored-by: Carlos Villela <cvillela@nvidia.com>
The first end-to-end dispatch with the new probe ordering revealed that the test was hitting #4578's mitigation, not the open #4423 bug. With the gateway stopped and never restarted, status's `getNamedGateway- LifecycleState()` returned NOT healthy_named, which routes through the non-destructive branch — so the registry was correctly preserved and the test failed at gateway-healthy, masking the real signal. Add `openshell gateway start --name nemoclaw` between `docker stop` and `nemoclaw <name> status` so the lifecycle truly mirrors a DGX Spark / Linux Docker-driver post-reboot: * Gateway stopped (sandbox memory dropped). * Sandbox container stopped (so OpenShell sees NotFound). * Gateway restarted FRESH via the explicit `openshell gateway start` invocation — the user-systemd autostart path from #4580 in compressed form. Now the gateway is `healthy_named` AND has no sandbox memory. * Status is then invoked. With healthy_named gateway + missing sandbox + Docker container still present, the destructive `missing` branch in `src/lib/actions/sandbox/status.ts:308` fires and wipes the registry on unfixed code. This is the precise precondition that PR-A's Docker-driver sandbox recovery helper neutralizes: status will see Docker still has the labeled container, restart it, and skip the destructive cleanup. Refs #4423
Initial dispatches of `ubuntu-repo-docker-post-reboot-recovery` against `ubuntu-latest` revealed that the user-visible #4423 bug class cannot be fully reproduced from CI without a real reboot: * `docker stop` of the labeled sandbox container puts OpenShell into `Phase: Provisioning`, not literal `NotFound` — so the destructive `missing` branch in `src/lib/actions/sandbox/status.ts:308` does not fire on unfixed code. * The OpenShell CLI on `ubuntu-latest` does not expose a `gateway start` subcommand (the user-systemd autostart path from #4580 is the real autostart mechanism, not the CLI), so we can't cleanly reset the gateway to a fresh-state-with-no-sandbox-memory configuration that would yield a real `NotFound`. * Gateway HTTP health on the runner is environmental and varies independently of the regression target. Scope the `post-reboot-recovery-ready` expected state to ONLY the host-side preservation invariants this scenario can faithfully exercise from CI: * `cli` installed, * `localRegistry` entry preserved, * `dockerSandboxContainer` (labeled, including `*-nemoclaw-gpu-backup-*` siblings) present. This makes the scenario a "lock-in" guard for #4578's mitigation and for the host-side invariants PR-A's Docker-driver recovery helper must also preserve. A future scenario on a more controlled runner can extend the expected state with gateway/sandbox runtime probes once PR-A has landed and a true post-reboot environment is available. The framework primitives — lifecycle phase fixture, host-side probes, lifecycle whitelist — remain unchanged and remain useful beyond this scenario's current limits. Refs #4423
The PR Review Advisor on #5087 flagged three places that still described the OLD lifecycle (stops gateway runtime + restarts it) even though the code already exercises the scoped lifecycle (stops labeled container only, runs `nemoclaw status`, leaves gateway HEALTHY): * `scenarios/scenarios/baseline.ts` — scenario description. * `manifests/openclaw-nvidia-post-reboot-recovery.yaml` — manifest inline comment. * `live/registry-scenarios.test.ts` — driver comment on the lifecycle-dispatch block. Update each to: - Describe the actual two-step lifecycle (`docker stop` + status). - Note the gateway is left healthy and why (no `gateway start` subcommand on `ubuntu-latest`; #4578's mitigation would mask the regression target if the gateway were down). - Re-frame the assertion surface as host-side preservation invariants (`local-registry-entry-present`, `docker-sandbox-container-present`). - Cross-reference between the three locations so future readers don't have to chase context. No code change. Refs #4423
…-up to #5046) (#5087) ## Summary Follow-up to #5046. Tightens the `ubuntu-repo-docker-post-reboot-recovery` scenario into a stable RED-on-regression / GREEN-on-main guard while PR-A (the `#4423` source fix) is in flight. ## Related Issue Refs #4423 ## Why this is needed Post-merge dispatches of #5046's `ubuntu-repo-docker-post-reboot-recovery` job against `ubuntu-latest` revealed two issues: 1. **The `openshell gateway stop` step in `LifecyclePhaseFixture` was counterproductive.** `#4578`'s mitigation in `src/lib/actions/sandbox/status.ts:296` guards the destructive branch behind a `healthy_named` gateway. Stopping the gateway routed every dispatch through the safe path and masked the regression target. There is no `openshell gateway start` subcommand on `ubuntu-latest` (the user-systemd autostart from #4580 is the real mechanism), so the original lifecycle couldn't restart the gateway cleanly to keep it `healthy_named`. 2. **The `post-reboot-recovery-ready` expected-state asserted on runtime liveness probes** (`gateway-healthy`, `sandbox-running`) that are environmental on `ubuntu-latest` after `docker stop`. They varied independently of the `#4423` regression target and made every dispatch RED for the wrong reason — concretely, `gateway-healthy` failing before the host-side probes had a chance to run. ## Changes ### `LifecyclePhaseFixture.simulatePostReboot` - Drop the `openshell gateway stop` step. The lifecycle now stops only the labeled sandbox container and then runs `nemoclaw <name> status`. Gateway is left `healthy_named`, which is the precondition for the destructive branch we want PR-A to neutralize. - Drop the `GATEWAY_STOP_TIMEOUT_MS` constant (no longer referenced). - Updated docstring + 2 unit tests removed (gateway-stop-failure tolerance tests are no longer applicable). ### `post-reboot-recovery-ready` expected-state Lock down only the host-side preservation invariants this scenario can faithfully exercise: - `cli-installed` - `local-registry-entry-present` - `docker-sandbox-container-present` (running OR stopped OR `*-nemoclaw-gpu-backup-*` sibling) Drop `gateway: { expected: "present", health: "healthy" }` and `sandbox: { expected: "present", status: "running", agent: "openclaw" }`. A follow-up scenario on a more controlled runner (real DGX Spark / Brev) can extend the expected state with runtime liveness probes once PR-A has landed and a true post-reboot environment is available. ### `probesForState` ordering Reorder so host-side observables (`local-registry-entry-present`, `docker-sandbox-container-present`) emit BEFORE runtime-derived gateway/sandbox probes everywhere. The state-validation orchestrator short-circuits on the first probe failure, so host-side regression targets must be observed first. Pre-existing scenarios (`cloud-openclaw-ready`, `preflight-failure-*`, etc.) are unaffected — none declare host-side dimensions, so their probe lists are unchanged. ## RED / GREEN contract - **On `main`** (this PR's HEAD): scenario goes 🟢 GREEN. Locks in `#4578`'s mitigation as a passive guard — registry preserved through `docker stop` + `nemoclaw status` round-trip. - **On a regression PR** that re-introduces unconditional `registry.removeSandbox` in `status.ts:296` or `gateway-state.ts:412`: scenario goes 🔴 RED at `local-registry-entry-present`. - **On PR-A's fix branch** that adds Docker-driver sandbox recovery: scenario stays 🟢, demonstrating the new helper does not regress either host-side invariant. ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes - [x] `npm test` passes — 142 tests across the directly affected files all green; no regressions on the rest of the e2e-scenario framework suite. - [x] Live dispatch (`gh workflow run e2e-vitest-scenarios.yaml -f scenarios=ubuntu-repo-docker-post-reboot-recovery --ref e2e-post-reboot-scope-host-side`) goes GREEN: lifecycle (`docker stop` + `nemoclaw status`) → host-side probes pass → cleanup restores Docker. Will dispatch as a verification step once this PR is open. - [x] Tests added or updated for new or changed behavior. - [x] No secrets, API keys, or credentials committed. ## Follow-ups (post PR-A) - Once PR-A's Docker-driver recovery helper is in place, extend the post-reboot expected-state with gateway/sandbox runtime probes on a controlled runner. - Consider a `post-reboot-spark` lifecycle profile for hardware-runner dispatch that can do a real reboot and exercise the full bug class. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Tests** * Added a validation test for post-reboot-recovery expected-state focusing on host-side invariants. * Updated post-reboot-recovery phase tests to remove the gateway-stop step and assert container stop + host status checks. * **Documentation** * Clarified lifecycle and scenario comments/manifests to describe container-stop + host-status-driven recovery and probe ordering. * **Behavior** * Expected-state and probe emission now prioritize host-side checks and omit runtime gateway/sandbox expectations. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
…4423) (#5091) ## Summary Closes the active-recovery half of #4423 — parts 2 & 3 of @ericksoa's plan. The destructive registry-removal paths were already neutralized by: - **#4578** — `status` no longer calls `registry.removeSandbox` on `NotFound` unless the gateway is `healthy_named`. - **#4497** — `ensureLiveSandboxOrExit` (which `connect` rides) preserves the registry entry on `missing` and routes intentional purges through explicit `destroy`. What's still missing is the *active recovery* path: when the gateway returns to `healthy_named` after a reboot but reports the sandbox as `NotFound`, NemoClaw currently prints "retry in a moment" guidance. This PR makes it walk Docker by the OpenShell labels, restart the labeled container (or rename a `*-nemoclaw-gpu-backup-*` sibling back), re-query OpenShell, and return a `present` lookup with a recovery breadcrumb. The user runs `nemoclaw <name> status` once and the sandbox is live. ## Related Issue Refs #4423 ## Changes ### New: `src/lib/onboard/docker-driver-sandbox-recovery.ts` `recoverDockerDriverSandbox(name, deps)` scans Docker by `openshell.ai/managed-by=openshell` AND `openshell.ai/sandbox-name=<name>` (the same labels `findOpenShellDockerSandboxContainerIds` already uses). Dispatches by container shape: | Branch ID | Trigger | Action | |---|---|---| | `started-running-original` | Labeled container is already running | No-op success | | `started-stopped-original` | Labeled container exists but stopped | `docker start` | | `renamed-and-started-backup` | Only a `*-nemoclaw-gpu-backup-*` sibling exists (from `docker-gpu-patch.ts`'s GPU patch) | `docker rename <backup> <original>` then `docker start` | | Conflict | Stopped original + backup sibling both exist | Start the original, leave the backup alone | | None | No labeled container | `{ recovered: false }` — callers fall through to existing non-destructive guidance | Adapter access is per-call lazy `require` — top-level `import` from `../adapters/docker` pulls in `runner.ts`'s load-time `require("./platform")` which the Vitest TS loader can't resolve. Same escape hatch as `auto-pair-approval.ts`. Label constants are duplicated locally with comments pointing to `docker-gpu-patch.ts` as the source of truth. ### Wiring: `src/lib/actions/sandbox/gateway-state.ts` `reconcileMissingAgainstNamedGateway` now calls the helper: - After the existing `gateway select nemoclaw` retry path lands on `healthy_named` + `missing`. - On the new top-level `healthy_named` + `missing` precondition that #4423's bug class describes (gateway came back fresh after reboot with no sandbox memory). `SandboxGatewayState` gains `recoveredSandbox?: boolean` and `recoverySandboxVia?: string | null` so callers can render a recovery breadcrumb without re-walking the recovery chain. ### User-facing breadcrumb: `src/lib/actions/sandbox/status.ts` Present-state output now prints: ``` Recovered sandbox '<name>' from Docker via <branch>; OpenShell now sees it as live. ``` before the canonical OpenShell sandbox info. ## Test results - ✅ **12 new helper tests** in `docker-driver-sandbox-recovery.test.ts` (all 4 dispatch branches, rename/start failure surfaces, no-labeled-container empty case). - ✅ **Existing 30 reconcile tests** in `gateway-state-reconcile-2276.test.ts` and `gateway-state-drift.test.ts` still pass — the new branch is additive and only fires under `healthy_named` + `missing` which existing tests don't exercise. - ✅ **1621 tests across `src/lib/onboard/`, `src/lib/actions/sandbox/`, `test/gateway-state*.test.ts`** all green. ## Verification surface - The post-reboot recovery scenario from #5087 (`ubuntu-repo-docker-post-reboot-recovery`) continues to go 🟢 GREEN — its host-side invariants (`local-registry-entry-present`, `docker-sandbox-container-present`) are exactly what this fix preserves. - A future controlled-runner scenario can extend `post-reboot-recovery-ready` with gateway/sandbox runtime probes once a real-reboot environment is wired. ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes - [x] `npm test` passes — 1621 affected tests green; full suite to be exercised by CI on this branch. - [x] Tests added or updated for new or changed behavior. - [x] No secrets, API keys, or credentials committed. - [x] Live `ubuntu-repo-docker-post-reboot-recovery` scenario stays green; will dispatch on this branch as a post-push verification step. ## Boundaries kept honest - The helper talks to Docker only. OpenShell re-query is the caller's job. - Profile dispatch is exhaustive (`never`-checked); adding a new branch requires extending both the helper switch and the unit test. - Recovery is best-effort and short-circuits to `{ recovered: false }` on any failure, so callers' existing non-destructive guidance stays the safety net. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Sandboxes marked as missing after a system reboot can now be automatically recovered from Docker if the container still exists on disk. * Status messages now display recovery information when a sandbox is restored. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Summary
Addresses the destructive passive-status side effect from #4423.
nemoclaw <name> statusis now non-destructive when a sandbox is registered locally but missing from the live OpenShell gateway, preventing the first post-reboot status check on DGX Spark from deleting registry state before gateway reconciliation catches up.This is the v0.0.56 safety mitigation only: it preserves local registry/onboard-session state and makes the failure explicit. The broader #4423 recovery goals remain open on the issue, including DGX Spark gateway auto-start after reboot and full post-reboot sandbox reconciliation/recovery.
Related Issue
Partially addresses #4423.
Changes
connect, and update regression tests to lock that split.Type of Change
Verification
npx prek run --all-filespassesnpm testpassesnpm run docsbuilds without warnings (doc changes only)Validation run:
npm run build:clinpx vitest run test/gateway-state-reconcile-2276.test.ts test/cli.test.ts- 182 passed, 1 skippednpx vitest run test/cli.test.ts -t "preserves an orphan registry entry"- 1 passednpx vitest run test/gateway-state-reconcile-2276.test.ts -t "reports the missing live sandbox"- 1 passedAdvisor-selected E2E:
sandbox-operations-e2eauto-dispatched by the E2E advisor against PR head93dd87a67c876358d42ef399a083d86baf3dc5a0: https://github.com/NVIDIA/NemoClaw/actions/runs/26715565554ubuntu-repo-cloud-openclawdispatched against this branch: https://github.com/NVIDIA/NemoClaw/actions/runs/26715713730gateway-drift-preflight-e2eregression dispatched against this branch: https://github.com/NVIDIA/NemoClaw/actions/runs/26715713787Signed-off-by: Aaron Erickson aerickson@nvidia.com
Summary by CodeRabbit
statuscommand now preserves local sandbox registry entries when registered locally but missing from the gateway. Clear guidance is provided for recovery steps instead of automatic removal.