fix(onboard): use OpenShell gateway user service#4580
Conversation
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
|
Worried about impact? Review this PR in Change Stack to explore blast radius before you approve or request changes. No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (3)
💤 Files with no reviewable changes (3)
📝 WalkthroughWalkthroughAdds Linux systemd user-service orchestration for openshell-gateway, gates Debian env-override writes on service presence, integrates package-managed startup into onboarding (short-circuiting standalone fallback on success), broadens unit and e2e tests, and updates docs and readiness comments. ChangesLinux Gateway User Service Feature
Sequence DiagramssequenceDiagram
participant Onboard as startDockerDriverGateway
participant EnvOverride as writeDockerGatewayDebEnvOverride
participant PkgMgd as startPackageManagedDockerDriverGateway
participant Service as startOpenShellGatewayUserService
participant Systemctl as systemctl --user
participant Gateway as openshell-gateway
Onboard->>Onboard: Compute gatewayEnv from OpenShell --version
Onboard->>EnvOverride: Write DEB gateway.env override
EnvOverride-->>Onboard: wrote boolean
Onboard->>PkgMgd: startPackageManagedDockerDriverGateway(...)
PkgMgd->>PkgMgd: Check unit file exists
alt Unit exists
PkgMgd->>Service: startOpenShellGatewayUserService()
Service->>Systemctl: daemon-reload
Systemctl-->>Service: result
Service->>Systemctl: enable openshell-gateway
Systemctl-->>Service: result
Service->>Systemctl: restart openshell-gateway
Systemctl-->>Gateway: restart signal
Gateway-->>Service: started / failure
alt Service started
Service-->>PkgMgd: started true
PkgMgd-->>Onboard: success
else Service failed
Service-->>PkgMgd: fallbackAllowed / reason
PkgMgd-->>Onboard: false or exit/throw
end
else Unit not found
PkgMgd-->>Onboard: false
end
Onboard->>Onboard: Standalone gateway fallback path
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
🌿 Preview your docs: https://nvidia-preview-pr-4580.docs.buildwithfern.com/nemoclaw |
E2E Advisor RecommendationRequired E2E: Auto-dispatched E2E: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
|
E2E Scenario Advisor RecommendationRequired scenario E2E: Dispatch required scenario E2E:
Full scenario advisor summaryE2E Scenario AdvisorBase: Required scenario E2E
Optional scenario E2E
Relevant changed files
|
PR Review AdvisorFindings: 0 needs attention, 1 worth checking, 0 nice ideas Review findings🛠️ Needs attention
🔎 Worth checking
🌱 Nice ideas
Consider writing more tests for
This is an automated advisory review. A human maintainer must make the final merge decision. |
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Selective E2E Results — ❌ Some jobs failedRun: 26717014158
|
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 2414-2471: The helper function
tryStartPackageManagedDockerDriverGateway should be moved out of
src/lib/onboard.ts into a new sibling module (e.g.,
src/lib/onboard-package-managed.ts) so the big block no longer inflates
onboard.ts; keep startDockerDriverGateway() in onboard.ts as the coordinator and
have it import and call the relocated tryStartPackageManagedDockerDriverGateway.
Ensure you preserve the function signature and all referenced symbols
(dockerDriverGatewayService, clearDockerDriverGatewayRuntimeFiles,
registerDockerDriverGatewayEndpoint, runCaptureOpenshell, isGatewayHealthy,
isGatewayTcpReady, verifySandboxBridgeGatewayReachableOrExit, envInt,
sleepSeconds, GATEWAY_NAME) and their imports/exports so behavior is unchanged,
update module imports in onboard.ts, and export the helper from the new module
for testing/consumption.
- Around line 2443-2461: The call to clearDockerDriverGatewayRuntimeFiles
currently runs before the health-poll loop and removes runtime PID/marker files
prematurely; move that cleanup so it only runs after the gateway is confirmed
healthy (i.e., after isGatewayHealthy(status, namedInfo, currentInfo) && await
isGatewayTcpReady() succeeds) — update the code so
clearDockerDriverGatewayRuntimeFiles is invoked after the successful health
checks (for functions/variables involved: clearDockerDriverGatewayRuntimeFiles,
registerDockerDriverGatewayEndpoint, isGatewayHealthy, isGatewayTcpReady,
verifySandboxBridgeGatewayReachableOrExit) so recovery/fallback logic retains
runtime breadcrumbs until the service is truly up.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ad5de779-9431-49cf-972d-2394b910a053
📒 Files selected for processing (9)
docs/reference/architecture.mdxdocs/reference/commands.mdxscripts/install-openshell.shsrc/lib/onboard.tssrc/lib/onboard/docker-driver-gateway-env.test.tssrc/lib/onboard/docker-driver-gateway-env.tssrc/lib/onboard/docker-driver-gateway-service.test.tssrc/lib/onboard/docker-driver-gateway-service.tstest/install-openshell-version-check.test.ts
Selective E2E Results — ✅ All requested jobs passedRun: 26717115183
|
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 26717180037
|
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/lib/onboard.ts (1)
2424-2426: Please run one service-path E2E before merge.This branch now short-circuits the legacy startup flow when the package-managed gateway takes ownership, so
openshell-gateway-upgrade-e2eplus one happy-path onboard flow such ascloud-e2eorsandbox-operations-e2ewould give good coverage of the new handoff.As per coding guidelines,
src/lib/onboard.ts: "This file contains core onboarding logic. Changes here affect the full sandbox creation and configuration flow."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 2424 - 2426, Run a service-path E2E before merging to validate the new short-circuit in src/lib/onboard.ts: execute the openshell-gateway-upgrade-e2e test plus one happy-path onboarding E2E (cloud-e2e or sandbox-operations-e2e) to confirm dockerDriverGatewayEnv.startPackageManagedDockerDriverGateway correctly takes ownership and that the legacy startup path still behaves when it should; verify the flow around verifySandboxBridgeGatewayReachableOrExit and GATEWAY_NAME, observe logs/errors and ensure no regressions in gateway handoff or sandbox creation before merging.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 2424-2426: Run a service-path E2E before merging to validate the
new short-circuit in src/lib/onboard.ts: execute the
openshell-gateway-upgrade-e2e test plus one happy-path onboarding E2E (cloud-e2e
or sandbox-operations-e2e) to confirm
dockerDriverGatewayEnv.startPackageManagedDockerDriverGateway correctly takes
ownership and that the legacy startup path still behaves when it should; verify
the flow around verifySandboxBridgeGatewayReachableOrExit and GATEWAY_NAME,
observe logs/errors and ensure no regressions in gateway handoff or sandbox
creation before merging.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e9cba808-c0cf-473b-950a-5efbb2171a45
📒 Files selected for processing (3)
src/lib/onboard.tssrc/lib/onboard/docker-driver-gateway-env.tssrc/lib/onboard/docker-driver-gateway-service.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- src/lib/onboard/docker-driver-gateway-env.ts
Selective E2E Results — ❌ Some jobs failedRun: 26717266019
|
Selective E2E Results — ❌ Some jobs failedRun: 26718114347
|
Selective E2E Results — ✅ All requested jobs passedRun: 26718192181
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/reference/commands.mdx`:
- Line 1422: Update the description for the environment variable
NEMOCLAW_OPENSHELL_SANDBOX_BIN to use active voice to match the parallel entry
for NEMOCLAW_OPENSHELL_GATEWAY_BIN: replace the passive phrase "passed to the
Linux Docker-driver standalone fallback" with the active construction "used by
the Linux Docker-driver standalone fallback" while keeping the rest of the
sentence unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b6c6b2a5-0d6f-4db2-a49c-f202dd4ad0d1
📒 Files selected for processing (2)
docs/reference/commands.mdxsrc/lib/onboard/docker-driver-gateway-service.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- src/lib/onboard/docker-driver-gateway-service.ts
Selective E2E Results — ✅ All requested jobs passedRun: 26718200252
|
Selective E2E Results — ✅ All requested jobs passedRun: 26720778107
|
…rvice-lifecycle-v60 # Conflicts: # docs/reference/commands.mdx # src/lib/onboard.ts
Selective E2E Results — ✅ All requested jobs passedRun: 27042899089
|
Selective E2E Results — ✅ All requested jobs passedRun: 27043502178
|
## Summary - Adds the `v0.0.60` section to `docs/about/release-notes.mdx` using the dev announcement from discussion #4877. - Fills the source-doc gaps found during release-prep review across inference, policy tiers, command behavior, security boundaries, Hermes dashboard/tooling, runtime context, and troubleshooting. - Refreshes generated agent skills under `.agents/skills/` from the current Fern docs output and upgrades Fern from `5.44.3` to `5.45.0`. ## Source summary - #4037 -> `docs/reference/architecture.mdx`, `docs/about/how-it-works.mdx`, `docs/about/release-notes.mdx`: Documents system-only runtime context that stays out of visible chat. - #4875 -> `docs/reference/architecture.mdx`, `docs/about/how-it-works.mdx`, `docs/about/release-notes.mdx`: Documents try-first sandbox network/filesystem guidance and clearer failure classification. - #4788 -> `docs/security/best-practices.mdx`, `docs/about/release-notes.mdx`: Documents shared OpenClaw device-approval policy for startup and connect. - #4768 -> `docs/reference/network-policies.mdx`, `docs/network-policy/integration-policy-examples.mdx`, `docs/get-started/quickstart.mdx`, `docs/get-started/quickstart-hermes.mdx`, `docs/reference/commands.mdx`: Documents `weather`, `public-reference`, and Hermes managed-tool gateway preset behavior. - #3788 and #4864 -> `docs/reference/network-policies.mdx`, `docs/reference/commands.mdx`: Documents non-interactive policy-tier fail-fast behavior and interactive prompt fallback. - #4756 and #4866 -> `docs/reference/commands.mdx`: Documents env-aware default sandbox resolution for `list`, `status`, and `tunnel` commands. - #4320 -> `docs/reference/commands.mdx`: Documents `$$nemoclaw tunnel status` behavior. - #4328 -> `docs/reference/commands.mdx`: Documents line-scoped policy preset descriptions in `policy-list`. - #4580 and #4748 -> `docs/reference/architecture.mdx`: Documents package-managed OpenShell gateway service and Docker-driver gateway-marker behavior. - #4598 -> `docs/manage-sandboxes/lifecycle.mdx`: Documents concurrent gateway/dashboard cleanup isolation by sandbox name and port. - #4777 -> `docs/reference/troubleshooting.mdx`: Documents Docker GPU patch rollback behavior. - #4610 -> `docs/reference/troubleshooting.mdx`, `docs/reference/commands.mdx`: Keeps mutable OpenClaw config permission guidance aligned and removes skipped experimental wording. - #4868 -> `docs/reference/commands.mdx`: Keeps `.dockerignore` handling for custom `onboard --from <Dockerfile>` contexts in generated skills. - #4870 -> `docs/reference/commands.mdx`, `docs/manage-sandboxes/runtime-controls.mdx`: Documents `NEMOCLAW_MINIMAL_BOOTSTRAP` and generated skill coverage. - #4641 -> `docs/inference/inference-options.mdx`, `docs/reference/troubleshooting.mdx`: Documents local NVIDIA NIM platform-digest pulls and served-model id adoption. - #4810 and #4867 -> `docs/inference/inference-options.mdx`: Documents stable NGC managed-vLLM image lineage and DGX Station DeepSeek V4 Flash coverage. - #4852 -> `docs/inference/use-local-inference.mdx`, `docs/reference/troubleshooting.mdx`: Documents Ollama model fit filtering, 16K context floor, cold-load retry, and failed-model exclusion. - #4847 -> `docs/inference/switch-inference-providers.mdx`: Documents API-family sync, Hermes `api_mode`, and Bedrock Runtime exception. - #4800 -> `docs/inference/tool-calling-reliability.mdx`: Documents Nemotron managed-inference native tool-search fallback. - #4333 -> `docs/inference/switch-inference-providers.mdx`: Documents interactive multimodal input prompting. - #4086 -> `docs/reference/troubleshooting.mdx`: Keeps proxy bypass normalization in generated troubleshooting coverage. - #4811 and #4855 -> `docs/get-started/quickstart-hermes.mdx`: Documents prebuilt Hermes dashboard assets and TUI recovery without runtime rebuilds. - #4854 -> `docs/inference/switch-inference-providers.mdx`, `docs/reference/commands.mdx`: Documents Hermes proxy API-key placeholder preservation during inference switches. - #4248 -> `docs/manage-sandboxes/messaging-channels.mdx`, `.agents/skills/`: Keeps messaging enrollment behavior aligned with manifest-hook implementation. - #4771 -> `docs/security/best-practices.mdx`, `docs/security/credential-storage.mdx`: Documents Hermes placeholder-only secret boundary for sandbox-visible runtime files. - #4787 -> `docs/security/best-practices.mdx`, `docs/about/release-notes.mdx`: Documents expanded memory scanner examples for OpenAI project keys and Slack app-level tokens. - #4848 -> `docs/reference/commands.mdx`: Documents OpenClaw skill install mirroring into the agent home directory. - #4790 -> `docs/about/release-notes.mdx`: Uses the prior release-prep structure and generated `.agents/skills/` refresh as the template for this release. ## Verification - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx` - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ skills/ --prefix nemoclaw-user --doc-platform fern-mdx --dry-run` - `npm run docs` - `git diff --check` - skip-term scan across `docs/`, `.agents/skills/`, and `skills/` - `npm run build:cli` - `npm run typecheck:cli` - Commit and pre-push hook suites, including markdownlint, gitleaks, env-var docs gate, docs-to-skills verification, and skills YAML tests <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes * **New Features** * DeepSeek-V4-Flash now available as default inference model for DGX Station. * Hermes dashboard improved with dedicated port and OAuth-authenticated tool gateway selection. * Added weather and public-reference policy presets for expanded agent capabilities. * Enhanced Ollama model selection with GPU memory filtering and automatic retry for timeouts. * **Bug Fixes** * Improved policy tier validation to prevent invalid configurations. * Better sandbox cleanup scoping by port to prevent conflicts across deployments. * Added GPU patch failure recovery with automatic rollback. * **Documentation** * Expanded troubleshooting guides for inference, security, and sandbox lifecycle. * Added .dockerignore best practices for custom deployments. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Registers the failing-test-first guard for #4423 in the typed scenario registry so the live Vitest matrix from #5006 fans it out as a dedicated CI job. Builds on the framework primitives added earlier in this PR (lifecycle phase fixture, host-side probes, lifecycle whitelist). Additions: * `post-reboot-recovery-ready` expected-state in `scenarios/expected-states.ts` declaring the user-visible invariants that must hold after a `nemoclaw <name> status` call on a freshly-rebooted DGX Spark / Linux Docker-driver host: - cli installed, - gateway healthy (the user-systemd unit from #4580 brings it back up before status runs), - sandbox running (recovery completed in time), - localRegistry entry preserved (the user-visible regression target — destroyed on unfixed `main`), - dockerSandboxContainer present (recovery didn't delete the labeled container or its `*-nemoclaw-gpu-backup-*` sibling). * `ubuntu-repo-docker-post-reboot-recovery` scenario in `scenarios/scenarios/baseline.ts` wiring `ubuntuRepoDockerLifecycle("cloud-openclaw", "post-reboot-recovery")` against the new expected-state and a smoke suite. Carries a description that explains the RED/GREEN contract and points to the PR-A fix landing in `src/lib/`. * `manifests/openclaw-nvidia-post-reboot-recovery.yaml` declares `lifecycle: post-reboot-recovery` and the same NVIDIA_API_KEY credential ref the cloud-openclaw scenarios use. * `.github/workflows/e2e-scenarios.yaml` ROUTES table gains the new scenario so the workflow-boundary test (`e2e-scenarios-workflow.test.ts`) routes every typed id. Test pinning: * `e2e-scenario-matrix.test.ts` updated from a 1-entry to a 2-entry live matrix expectation. The new entry asserts on `expectedStateId: "post-reboot-recovery-ready"` so a future accidental dropped-lifecycle change to the scenario regresses loudly. * `e2e-live-registry-discovery.test.ts` swaps the synthetic whitelist-coverage test for an assertion against the real `ubuntu-repo-docker-post-reboot-recovery` registry entry. Behavior: * On unfixed `main`, the live runner's lifecycle phase stops the OpenShell gateway runtime and `docker stop`s the labeled sandbox container. State-validation then runs `nemoclaw <name> status` (which restarts the gateway via systemd) and the destructive `missing` branch in `src/lib/actions/sandbox/status.ts` wipes the local registry entry. The `local-registry-entry-present` probe fails. Scenario goes RED. * On the PR-A fix branch, the new Docker-driver sandbox recovery helper restarts the labeled container before stale-removal can fire, registry survives, all five probes pass. Scenario flips GREEN. The bash-side legacy compiler emits a `lifecycle.profile.post-reboot-recovery` PhaseAction pointing at `nemoclaw_scenarios/lifecycle/dispatch.sh`, but the legacy bash worker is intentionally not provided: this scenario is Vitest-only. The typed runner's `LifecyclePhaseFixture` handles dispatch directly. If the legacy runner is invoked against this scenario it errors out at the dispatcher; that's the right failure mode while the bash side stays on its own retirement clock. Refs #4423
The first dispatch of the new post-reboot-recovery scenario surfaced a flow gap: state-validation asserted gateway-healthy immediately after lifecycle.simulate stopped the gateway, never running the user-visible action that actually triggers #4423's regression. Add `nemoclaw <name> status` as the final step of simulatePostReboot. This is faithful to the bug semantics — "after a reboot, the FIRST status invocation is what destroys state" — and lets state-validation observe the post-status host invariants: * On unfixed code: status restarts the gateway via the user-systemd unit from #4580, sees sandbox NotFound, runs the destructive `missing` branch in src/lib/actions/sandbox/status.ts, registry is wiped → `local-registry-entry-present` probe fails → scenario RED. * On the PR-A fix branch: status restarts the gateway, recovery helper restarts the labeled container before stale-removal can fire, registry survives → all probes pass → scenario GREEN. Status exit code is intentionally not asserted: the bug is precisely that status "succeeds" at destroying state. The post-action invariants are owned by state-validation, not lifecycle. Refs #4423
## Summary Failing-test-first regression guard for #4423, plus the framework primitives that make it expressible. After this PR lands, the live Vitest matrix from #5006 fans out a dedicated `ubuntu-repo-docker-post-reboot-recovery` job that goes 🔴 RED on `main` (because the registry-destruction bug is still there) and is the executable acceptance criterion for the source fix in PR-A. ## Related Issue Refs #4423 ## What this PR adds ### Framework primitives (commits 1–3) 1. **`SUPPORTED_LIFECYCLES` whitelist** in `scenarios/runtime-support.ts`. Replaces the unconditional reject for `environment.lifecycle` with a lock-step whitelist, so lifecycle scenarios can graduate one profile at a time. Initial whitelist: `post-reboot-recovery`. 2. **Two host-side state-validation probes**: - `local-registry-entry-present` — reads `~/.nemoclaw/sandboxes.json` and asserts the scenario's sandbox name is recorded. This is the user-visible regression target for #4423. - `docker-sandbox-container-present` — runs `docker ps -a --filter label=openshell.ai/sandbox-name=…` and accepts running, stopped, and `*-nemoclaw-gpu-backup-*` siblings. Mirrors `OPENSHELL_SANDBOX_NAME_LABEL` / `findOpenShellDockerSandboxContainerIds` in `src/lib/onboard/docker-gpu-patch.ts`. New `ExpectedState.localRegistry` and `ExpectedState.dockerSandboxContainer` dimensions; `probesForState` emits the probes only for `expected: "present"`. `StateValidationPhaseFixture.from()` now accepts an inline `ExpectedState` so unit tests don't need to register synthetic states. Optional `ProbeIO` injection for the registry reader. 3. **`LifecyclePhaseFixture`** in `framework/phases/lifecycle.ts`. Dispatches `simulate(profile, instance, opts)`. Today the only profile is `post-reboot-recovery`, with two modes: - `stop-original` (default) — `openshell gateway stop` + `docker stop` of the labeled container. - `rename-to-gpu-backup` — additionally renames the container to a `*-nemoclaw-gpu-backup-<ts>` sibling, mirroring `buildBackupContainerName()` in `src/lib/onboard/docker-gpu-patch.ts`. Both modes register reverse-order cleanups so teardown leaves Docker usable. Wired into `E2EScenarioFixtures` and into `live/registry-scenarios.test.ts` between `onboard.from()` and `stateValidation.from()` whenever the scenario declares `environment.lifecycle`. ### The failing regression guard (commit 4) 4. **`ubuntu-repo-docker-post-reboot-recovery` scenario** + **`post-reboot-recovery-ready` expected-state** + manifest YAML. The expected-state requires: - `cli` installed, - `gateway` healthy (the user-systemd unit from #4580 brings it back up), - `sandbox` running (recovery completes in time), - **`localRegistry` entry preserved** (the user-visible regression target), - `dockerSandboxContainer` present (recovery didn't delete the labeled container or its `*-nemoclaw-gpu-backup-*` sibling). The `.github/workflows/e2e-scenarios.yaml` ROUTES table gains the new id so `e2e-scenarios-workflow.test.ts` keeps every typed id routed. ## RED / GREEN contract * **On unfixed `main`:** the live runner stops the OpenShell gateway runtime and `docker stop`s the labeled container. State-validation runs `nemoclaw <name> status`, which restarts the gateway via systemd; the destructive `missing` branch in `src/lib/actions/sandbox/status.ts` then wipes the local registry entry. The `local-registry-entry-present` probe fails. **Scenario goes 🔴 RED.** * **On the PR-A fix branch:** the new Docker-driver sandbox recovery helper restarts the labeled container before stale-removal can fire, registry survives, all five probes pass. **Scenario flips 🟢 GREEN.** ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes - [x] `npm test` passes — the 150 tests across the 6 directly affected files are all green; full e2e-scenario framework suite has only the pre-existing `shell probe reaps timed-out command process groups` load-flake in `e2e-fixture-context.test.ts` (unchanged on `main`). - [x] `--emit-live-matrix` now emits both `ubuntu-repo-cloud-openclaw` and `ubuntu-repo-docker-post-reboot-recovery` as supported live scenarios; the workflow fan-out from #5006 will dispatch each as its own CI job. - [x] Tests added or updated for new or changed behavior. - [x] No secrets, API keys, or credentials committed. - [ ] Docs updated for user-facing behavior changes. - [ ] `npm run docs` builds without warnings (doc changes only). ## Boundaries kept honest - No imports from `src/lib/**` into `framework/**`. The two duplicated constants (`OPENSHELL_SANDBOX_NAME_LABEL`, `~/.nemoclaw/sandboxes.json` path) carry comments pointing back to the CLI source-of-truth. Drift gets caught by the live scenario itself once the lifecycle phase exercises real Docker labels. - `LifecyclePhaseFixture` profile dispatch is exhaustive (`never`-checked). New profiles must add the dispatcher branch and the `SUPPORTED_LIFECYCLES` whitelist together. - Negative-direction host probes (`expected: "absent"` for `localRegistry` / `dockerSandboxContainer`) deliberately emit nothing today, with a `probesForState` test pinning the gap so a future negative scenario PR has to add them. - The legacy bash dispatcher (`nemoclaw_scenarios/lifecycle/dispatch.sh`) is intentionally **not** updated for `post-reboot-recovery`. This scenario is Vitest-only; the typed runner's `LifecyclePhaseFixture` handles dispatch directly. If the legacy bash runner is ever invoked against this scenario, it errors at the dispatcher — that's the right failure mode while the bash side stays on its own retirement clock. ## Why this doesn't yet flip green The fix lives in `src/lib/actions/sandbox/{status,gateway-state}.ts` plus a new `src/lib/onboard/docker-driver-sandbox-recovery.ts` helper, tracked as **PR-A**. PR-A implements parts 2 & 3 of @ericksoa's #4423 plan: - Docker-driver sandbox recovery from `findOpenShellDockerSandboxContainerIds` labels (including `*-nemoclaw-gpu-backup-*` rename), and - tightened stale-removal in active paths (`status`, `connect`, `ensureLiveSandboxOrExit`) requiring Docker-corroborated absence before destroying the registry. PR-A's CI run will dispatch this scenario via `--scenarios ubuntu-repo-docker-post-reboot-recovery` to prove 🔴 → 🟢. ## Diffstat ``` .github/workflows/e2e-scenarios.yaml | 1 + test/e2e-scenario/framework-tests/e2e-expected-state.test.ts | 35 +++ test/e2e-scenario/framework-tests/e2e-live-registry-discovery.test.ts | 35 +++ test/e2e-scenario/framework-tests/e2e-phase-lifecycle.test.ts | 239 +++++++++++++++++++++ test/e2e-scenario/framework-tests/e2e-phase-state-validation.test.ts | 120 ++++++++++- test/e2e-scenario/framework-tests/e2e-scenario-matrix.test.ts | 20 +- test/e2e-scenario/framework/e2e-test.ts | 5 + test/e2e-scenario/framework/phases/index.ts | 8 + test/e2e-scenario/framework/phases/lifecycle.ts | 207 ++++++++++++++++++ test/e2e-scenario/framework/phases/state-validation.ts | 113 +++++++++- test/e2e-scenario/live/registry-scenarios.test.ts | 44 +++- test/e2e-scenario/manifests/openclaw-nvidia-post-reboot-recovery.yaml | 41 ++++ test/e2e-scenario/scenarios/expected-states.ts | 35 +++ test/e2e-scenario/scenarios/runtime-support.ts | 8 +- test/e2e-scenario/scenarios/scenarios/baseline.ts | 31 +++ test/e2e-scenario/scenarios/types.ts | 25 ++- 16 files changed, 950 insertions(+), 17 deletions(-) ``` --------- Co-authored-by: Carlos Villela <cvillela@nvidia.com>
The first end-to-end dispatch with the new probe ordering revealed that the test was hitting #4578's mitigation, not the open #4423 bug. With the gateway stopped and never restarted, status's `getNamedGateway- LifecycleState()` returned NOT healthy_named, which routes through the non-destructive branch — so the registry was correctly preserved and the test failed at gateway-healthy, masking the real signal. Add `openshell gateway start --name nemoclaw` between `docker stop` and `nemoclaw <name> status` so the lifecycle truly mirrors a DGX Spark / Linux Docker-driver post-reboot: * Gateway stopped (sandbox memory dropped). * Sandbox container stopped (so OpenShell sees NotFound). * Gateway restarted FRESH via the explicit `openshell gateway start` invocation — the user-systemd autostart path from #4580 in compressed form. Now the gateway is `healthy_named` AND has no sandbox memory. * Status is then invoked. With healthy_named gateway + missing sandbox + Docker container still present, the destructive `missing` branch in `src/lib/actions/sandbox/status.ts:308` fires and wipes the registry on unfixed code. This is the precise precondition that PR-A's Docker-driver sandbox recovery helper neutralizes: status will see Docker still has the labeled container, restart it, and skip the destructive cleanup. Refs #4423
The previous attempt to restart the gateway via `openshell gateway start --name nemoclaw` failed at runtime: that subcommand doesn't exist on the OpenShell CLI shipped on `ubuntu-latest`. The hint text the NemoClaw codebase prints to users (`openshell gateway start --name nemoclaw`) is a USER-FACING string, not an actual CLI invocation; the real autostart path is the user-systemd unit installed by #4580. Reframe the lifecycle: stopping just the labeled sandbox container (while leaving the gateway healthy_named) is sufficient to reach the remaining #4423 destructive branch. With: * Gateway healthy_named (untouched by lifecycle) * `openshell sandbox get <name>` returning NotFound (because the container is stopped and OpenShell doesn't auto-recover stopped sandboxes) * Docker still has the labeled container (recoverable) …on unfixed code, status's `missing` branch in `src/lib/actions/sandbox/status.ts:308` calls `registry.removeSandbox` and the local-registry-entry-present probe fails. PR-A's Docker-driver recovery helper is what needs to neutralize this branch. Removed: * `openshell gateway stop` step * `openshell gateway start --name nemoclaw` step * `GATEWAY_STOP_TIMEOUT_MS` / `GATEWAY_START_TIMEOUT_MS` constants * Unit tests for the now-removed gateway-{stop,start} flows Refs #4423
Initial dispatches of `ubuntu-repo-docker-post-reboot-recovery` against `ubuntu-latest` revealed that the user-visible #4423 bug class cannot be fully reproduced from CI without a real reboot: * `docker stop` of the labeled sandbox container puts OpenShell into `Phase: Provisioning`, not literal `NotFound` — so the destructive `missing` branch in `src/lib/actions/sandbox/status.ts:308` does not fire on unfixed code. * The OpenShell CLI on `ubuntu-latest` does not expose a `gateway start` subcommand (the user-systemd autostart path from #4580 is the real autostart mechanism, not the CLI), so we can't cleanly reset the gateway to a fresh-state-with-no-sandbox-memory configuration that would yield a real `NotFound`. * Gateway HTTP health on the runner is environmental and varies independently of the regression target. Scope the `post-reboot-recovery-ready` expected state to ONLY the host-side preservation invariants this scenario can faithfully exercise from CI: * `cli` installed, * `localRegistry` entry preserved, * `dockerSandboxContainer` (labeled, including `*-nemoclaw-gpu-backup-*` siblings) present. This makes the scenario a "lock-in" guard for #4578's mitigation and for the host-side invariants PR-A's Docker-driver recovery helper must also preserve. A future scenario on a more controlled runner can extend the expected state with gateway/sandbox runtime probes once PR-A has landed and a true post-reboot environment is available. The framework primitives — lifecycle phase fixture, host-side probes, lifecycle whitelist — remain unchanged and remain useful beyond this scenario's current limits. Refs #4423
…-up to #5046) (#5087) ## Summary Follow-up to #5046. Tightens the `ubuntu-repo-docker-post-reboot-recovery` scenario into a stable RED-on-regression / GREEN-on-main guard while PR-A (the `#4423` source fix) is in flight. ## Related Issue Refs #4423 ## Why this is needed Post-merge dispatches of #5046's `ubuntu-repo-docker-post-reboot-recovery` job against `ubuntu-latest` revealed two issues: 1. **The `openshell gateway stop` step in `LifecyclePhaseFixture` was counterproductive.** `#4578`'s mitigation in `src/lib/actions/sandbox/status.ts:296` guards the destructive branch behind a `healthy_named` gateway. Stopping the gateway routed every dispatch through the safe path and masked the regression target. There is no `openshell gateway start` subcommand on `ubuntu-latest` (the user-systemd autostart from #4580 is the real mechanism), so the original lifecycle couldn't restart the gateway cleanly to keep it `healthy_named`. 2. **The `post-reboot-recovery-ready` expected-state asserted on runtime liveness probes** (`gateway-healthy`, `sandbox-running`) that are environmental on `ubuntu-latest` after `docker stop`. They varied independently of the `#4423` regression target and made every dispatch RED for the wrong reason — concretely, `gateway-healthy` failing before the host-side probes had a chance to run. ## Changes ### `LifecyclePhaseFixture.simulatePostReboot` - Drop the `openshell gateway stop` step. The lifecycle now stops only the labeled sandbox container and then runs `nemoclaw <name> status`. Gateway is left `healthy_named`, which is the precondition for the destructive branch we want PR-A to neutralize. - Drop the `GATEWAY_STOP_TIMEOUT_MS` constant (no longer referenced). - Updated docstring + 2 unit tests removed (gateway-stop-failure tolerance tests are no longer applicable). ### `post-reboot-recovery-ready` expected-state Lock down only the host-side preservation invariants this scenario can faithfully exercise: - `cli-installed` - `local-registry-entry-present` - `docker-sandbox-container-present` (running OR stopped OR `*-nemoclaw-gpu-backup-*` sibling) Drop `gateway: { expected: "present", health: "healthy" }` and `sandbox: { expected: "present", status: "running", agent: "openclaw" }`. A follow-up scenario on a more controlled runner (real DGX Spark / Brev) can extend the expected state with runtime liveness probes once PR-A has landed and a true post-reboot environment is available. ### `probesForState` ordering Reorder so host-side observables (`local-registry-entry-present`, `docker-sandbox-container-present`) emit BEFORE runtime-derived gateway/sandbox probes everywhere. The state-validation orchestrator short-circuits on the first probe failure, so host-side regression targets must be observed first. Pre-existing scenarios (`cloud-openclaw-ready`, `preflight-failure-*`, etc.) are unaffected — none declare host-side dimensions, so their probe lists are unchanged. ## RED / GREEN contract - **On `main`** (this PR's HEAD): scenario goes 🟢 GREEN. Locks in `#4578`'s mitigation as a passive guard — registry preserved through `docker stop` + `nemoclaw status` round-trip. - **On a regression PR** that re-introduces unconditional `registry.removeSandbox` in `status.ts:296` or `gateway-state.ts:412`: scenario goes 🔴 RED at `local-registry-entry-present`. - **On PR-A's fix branch** that adds Docker-driver sandbox recovery: scenario stays 🟢, demonstrating the new helper does not regress either host-side invariant. ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes - [x] `npm test` passes — 142 tests across the directly affected files all green; no regressions on the rest of the e2e-scenario framework suite. - [x] Live dispatch (`gh workflow run e2e-vitest-scenarios.yaml -f scenarios=ubuntu-repo-docker-post-reboot-recovery --ref e2e-post-reboot-scope-host-side`) goes GREEN: lifecycle (`docker stop` + `nemoclaw status`) → host-side probes pass → cleanup restores Docker. Will dispatch as a verification step once this PR is open. - [x] Tests added or updated for new or changed behavior. - [x] No secrets, API keys, or credentials committed. ## Follow-ups (post PR-A) - Once PR-A's Docker-driver recovery helper is in place, extend the post-reboot expected-state with gateway/sandbox runtime probes on a controlled runner. - Consider a `post-reboot-spark` lifecycle profile for hardware-runner dispatch that can do a real reboot and exercise the full bug class. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Tests** * Added a validation test for post-reboot-recovery expected-state focusing on host-side invariants. * Updated post-reboot-recovery phase tests to remove the gateway-stop step and assert container stop + host status checks. * **Documentation** * Clarified lifecycle and scenario comments/manifests to describe container-stop + host-status-driven recovery and probe ordering. * **Behavior** * Expected-state and probe emission now prioritize host-side checks and omit runtime gateway/sandbox expectations. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
Post-Computex / v0.0.60 follow-up for the gateway lifecycle half of #4423. This is intentionally separate from #4578, which remains the v0.0.56 safety hotfix for non-destructive status behavior.
openshell-gatewayuser service when its vendor/package unit is present, writing Docker-driver gateway env before restart.Validation
npm run build:clinpm run typecheck:clinpm run checksnpm run source-shape:checknpm run check:installer-hashbash -n scripts/install-openshell.shbash test/e2e/test-openshell-version-pin.shnpx vitest run src/lib/onboard/docker-driver-gateway-env.test.ts src/lib/onboard/docker-driver-gateway-service.test.ts test/install-openshell-version-check.test.ts test/runner.test.ts test/onboard-gateway-runtime.test.ts test/gateway-state-reconcile-2276.test.tsRefs #4423
Follow-up to #4578
Summary by CodeRabbit