Skip to content

fix(onboard): use OpenShell gateway user service#4580

Merged
cv merged 13 commits into
mainfrom
fix/4423-openshell-service-lifecycle-v60
Jun 5, 2026
Merged

fix(onboard): use OpenShell gateway user service#4580
cv merged 13 commits into
mainfrom
fix/4423-openshell-service-lifecycle-v60

Conversation

@ericksoa

@ericksoa ericksoa commented May 31, 2026

Copy link
Copy Markdown
Contributor

Summary

Post-Computex / v0.0.60 follow-up for the gateway lifecycle half of #4423. This is intentionally separate from #4578, which remains the v0.0.56 safety hotfix for non-destructive status behavior.

  • Use OpenShell's package-managed openshell-gateway user service when its vendor/package unit is present, writing Docker-driver gateway env before restart.
  • Ignore per-user/stale gateway unit files so standalone recovery remains available when there is no package-managed OpenShell service.
  • Keep the existing standalone gateway launch path as an explicit compatibility fallback when the upstream service is unavailable.
  • Update docs/tests to describe package-service ownership vs. standalone fallback.

Validation

  • npm run build:cli
  • npm run typecheck:cli
  • npm run checks
  • npm run source-shape:check
  • npm run check:installer-hash
  • bash -n scripts/install-openshell.sh
  • bash test/e2e/test-openshell-version-pin.sh
  • npx vitest run src/lib/onboard/docker-driver-gateway-env.test.ts src/lib/onboard/docker-driver-gateway-service.test.ts test/install-openshell-version-check.test.ts test/runner.test.ts test/onboard-gateway-runtime.test.ts test/gateway-state-reconcile-2276.test.ts

Refs #4423
Follow-up to #4578

Summary by CodeRabbit

  • Documentation
    • Clarified Deployment Topology, uninstall/state-dir contents, Apple Silicon sandbox behavior, and environment-variable guidance for the standalone fallback.
  • New Features
    • Installer on Linux now attempts to start a package-managed OpenShell gateway and falls back to the standalone gateway when appropriate.
  • Tests
    • Expanded unit and e2e tests to cover package-managed vs standalone gateway flows and realistic checksum-driven installer scenarios.

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@ericksoa ericksoa added bug Something fails against expected or documented behavior Platform: DGX Spark NV QA Bugs found by the NVIDIA QA Team UAT Issues flagged for User Acceptance Testing. v0.0.60 Release target labels May 31, 2026
@ericksoa ericksoa self-assigned this May 31, 2026
@coderabbitai

coderabbitai Bot commented May 31, 2026

Copy link
Copy Markdown
Contributor

Worried about impact? Review this PR in Change Stack to explore blast radius before you approve or request changes.

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 682e97c9-8f61-41a6-9aab-5a4a1482060f

📥 Commits

Reviewing files that changed from the base of the PR and between 7523a1e and eee1212.

📒 Files selected for processing (3)
  • docs/reference/architecture.mdx
  • docs/reference/commands-nemohermes.mdx
  • docs/reference/commands.mdx
💤 Files with no reviewable changes (3)
  • docs/reference/architecture.mdx
  • docs/reference/commands-nemohermes.mdx
  • docs/reference/commands.mdx

📝 Walkthrough

Walkthrough

Adds Linux systemd user-service orchestration for openshell-gateway, gates Debian env-override writes on service presence, integrates package-managed startup into onboarding (short-circuiting standalone fallback on success), broadens unit and e2e tests, and updates docs and readiness comments.

Changes

Linux Gateway User Service Feature

Layer / File(s) Summary
Service detection types and unit path helpers
src/lib/onboard/docker-driver-gateway-service.ts
Exports constants/types and helpers to compute systemd user unit paths and check unit-file existence on Linux with injectable existsSync.
Service startup systemctl orchestration
src/lib/onboard/docker-driver-gateway-service.ts
Command-availability probing, spawn-sync abstraction, and startOpenShellGatewayUserService() implementing gated systemctl --user daemon-reload → enable → restart with structured result and fallback allowance.
Package-managed gateway startup wrapper
src/lib/onboard/docker-driver-gateway-service.ts
startPackageManagedDockerDriverGateway() attempts package-managed startup (or returns false if unit missing), polls for endpoint registration, validates health/TCP readiness, clears runtime files, and verifies sandbox-bridge reachability or exits/throws on failure.
Env override conditional write logic
src/lib/onboard/docker-driver-gateway-env.ts
Imports hasOpenShellGatewayUserService, re-exports startPackageManagedDockerDriverGateway, and changes writeDockerGatewayDebEnvOverride() to accept opts and return boolean (false when service absent; true after write + perms).
Env override conditional write tests
src/lib/onboard/docker-driver-gateway-env.test.ts
Updates existing test to assert wrote === true when systemd unit present; adds test asserting wrote === false and no gateway.env when only standalone gateway binary exists.
Onboard package-managed startup integration
src/lib/onboard.ts
Imports reportDockerDriverGatewayStartFailure; startDockerDriverGateway() computes gatewayEnv, writes DEB env override, delegates to startPackageManagedDockerDriverGateway(), and returns early on success.
Service module unit tests
src/lib/onboard/docker-driver-gateway-service.test.ts
Adds helpers and Vitest cases verifying platform detection, systemctl call sequence, fallback allowance on bus errors, non-fallback on restart failures, package-managed orchestration timing, and health-failure behavior.
E2E fake asset & checksum generation
test/e2e/test-openshell-version-pin.sh
Fake gh/curl handlers now create fake OpenShell assets, compute SHA-256 digests, and emit correct *-checksums-sha256.txt entries for openshell, gateway, and sandbox archives.
Documentation and TCP readiness comment updates
docs/reference/architecture.mdx, docs/reference/commands.mdx, docs/reference/commands-nemohermes.mdx, src/lib/onboard/gateway-tcp-readiness.ts
Deployment topology documents Linux package-managed service restart with standalone fallback and Apple Silicon Docker sandbox behavior; uninstall state dir contents and env-var descriptions updated for standalone-fallback gateway; TCP-readiness comment clarified.

Sequence Diagrams

sequenceDiagram
  participant Onboard as startDockerDriverGateway
  participant EnvOverride as writeDockerGatewayDebEnvOverride
  participant PkgMgd as startPackageManagedDockerDriverGateway
  participant Service as startOpenShellGatewayUserService
  participant Systemctl as systemctl --user
  participant Gateway as openshell-gateway

  Onboard->>Onboard: Compute gatewayEnv from OpenShell --version
  Onboard->>EnvOverride: Write DEB gateway.env override
  EnvOverride-->>Onboard: wrote boolean

  Onboard->>PkgMgd: startPackageManagedDockerDriverGateway(...)
  PkgMgd->>PkgMgd: Check unit file exists

  alt Unit exists
    PkgMgd->>Service: startOpenShellGatewayUserService()
    Service->>Systemctl: daemon-reload
    Systemctl-->>Service: result
    Service->>Systemctl: enable openshell-gateway
    Systemctl-->>Service: result
    Service->>Systemctl: restart openshell-gateway
    Systemctl-->>Gateway: restart signal
    Gateway-->>Service: started / failure

    alt Service started
      Service-->>PkgMgd: started true
      PkgMgd-->>Onboard: success
    else Service failed
      Service-->>PkgMgd: fallbackAllowed / reason
      PkgMgd-->>Onboard: false or exit/throw
    end
  else Unit not found
    PkgMgd-->>Onboard: false
  end

  Onboard->>Onboard: Standalone gateway fallback path
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#4602: Modifies Docker-driver gateway startup behavior in startDockerDriverGateway, related to gateway health/liveness checks during polling.

Suggested labels

Docker, OpenShell

Suggested reviewers

  • prekshivyas
  • cv

Poem

🐇 I hopped through systemd gates at dawn,
Wrote envs, checked units, then moved on.
Fake assets hum, checksums crisp and bright,
Gateway wakes or falls back in the night.
Carrots for tests — delightfully light!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 8.70% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title clearly and specifically describes the main change: implementing the use of OpenShell's package-managed gateway user service as the primary ownership mechanism for gateway lifecycle management.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/4423-openshell-service-lifecycle-v60

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

Copy link
Copy Markdown
Contributor

@github-actions

github-actions Bot commented May 31, 2026

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: cloud-onboard-e2e, sandbox-survival-e2e, openshell-gateway-upgrade-e2e, gateway-health-honest-e2e, gateway-drift-preflight-e2e
Optional E2E: docker-unreachable-gateway-start-e2e, openshell-version-pin-e2e, openclaw-onboard-security-posture-e2e

Auto-dispatched E2E: cloud-onboard-e2e, sandbox-survival-e2e, openshell-gateway-upgrade-e2e via nightly-e2e.yaml at eee12120d7be889187c125d5dd9a4bf7d3bcc557nightly run

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • cloud-onboard-e2e (medium): Runs a real clean Linux onboard path and should catch regressions in the changed Docker-driver gateway startup handoff before sandbox creation.
  • sandbox-survival-e2e (high): Directly exercises sandbox survival across gateway stop/start and recovery; the PR changes gateway ownership, runtime-file cleanup, and fallback/reuse logic that this flow depends on.
  • openshell-gateway-upgrade-e2e (high): Validates old-to-current OpenShell gateway upgrade and durable sandbox state across gateway replacement, which is the closest existing end-to-end coverage for standalone-to-current gateway handoff behavior.
  • gateway-health-honest-e2e (medium): The PR changes Docker-driver gateway readiness and service startup sequencing; this regression guard ensures NemoClaw does not report a healthy gateway when the underlying gateway is not actually serving.
  • gateway-drift-preflight-e2e (medium): The package-managed service path clears standalone runtime breadcrumbs and bypasses some standalone drift handling; this existing regression checks stale/drifted gateway runtime detection remains fail-closed.

Optional E2E

  • docker-unreachable-gateway-start-e2e (low): Useful adjacent coverage for gateway-start failure handling and fallback diagnostics when host services are unavailable, but the primary changed behavior is service handoff rather than Docker outage handling.
  • openshell-version-pin-e2e (low): The PR edits this E2E script and the gateway-service distinction is adjacent to OpenShell helper binary/version installation behavior; useful to verify the regression guard still runs as intended.
  • openclaw-onboard-security-posture-e2e (high): Optional non-root full onboard check for security posture around host-side gateway ownership and trusted runtime guards after introducing trusted systemd user service detection.

New E2E recommendations

  • package-managed Docker-driver gateway service handoff (high): Existing E2Es cover general onboard, gateway health, drift, and upgrade behavior, but none appears to install or fake a trusted package-managed openshell-gateway user unit and assert that NemoClaw writes gateway.env, starts via systemctl --user, registers the endpoint, clears standalone PID breadcrumbs, and creates a sandbox through that service path.
    • Suggested test: package-managed-docker-gateway-service-e2e
  • untrusted/per-user systemd unit fallback (medium): The critical security rule that per-user units, partial units, and untrusted ExecStart paths must not take over gateway ownership is unit-tested but not covered by a real host/onboard E2E.
    • Suggested test: docker-gateway-untrusted-user-service-fallback-e2e
  • user systemd manager outage fallback (medium): The code intentionally falls back to standalone gateway startup when systemctl --user cannot reach the user manager or bus; an E2E would protect the real diagnostics and ensure onboarding still succeeds or fails with the intended standalone path.
    • Suggested test: docker-gateway-user-manager-outage-fallback-e2e

@github-actions

github-actions Bot commented May 31, 2026

Copy link
Copy Markdown
Contributor

E2E Scenario Advisor Recommendation

Required scenario E2E: ubuntu-repo-cloud-openclaw
Optional scenario E2E: ubuntu-repo-cloud-hermes, wsl-repo-cloud-openclaw

Dispatch required scenario E2E:

  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

  • ubuntu-repo-cloud-openclaw: Onboarding Docker-driver gateway startup was changed to prefer the package-managed OpenShell user service with standalone fallback. The Ubuntu repo cloud OpenClaw scenario is the smallest routed scenario that exercises Linux repo-current onboarding, Docker-driver gateway health, sandbox creation, and inference-route readiness.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

Optional scenario E2E

  • ubuntu-repo-cloud-hermes: Optional adjacent coverage for the same Linux Docker-driver gateway/onboarding path with the Hermes agent stack.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-hermes
  • wsl-repo-cloud-openclaw: Optional special-runner coverage for the Linux/WSL Docker-driver gateway path, useful if the package-managed user-service fallback behavior is suspected to differ under WSL.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=wsl-repo-cloud-openclaw

Relevant changed files

  • src/lib/onboard.ts
  • src/lib/onboard/docker-driver-gateway-env.ts
  • src/lib/onboard/docker-driver-gateway-service.ts
  • src/lib/onboard/gateway-tcp-readiness.ts

@github-actions

github-actions Bot commented May 31, 2026

Copy link
Copy Markdown
Contributor

PR Review Advisor

Findings: 0 needs attention, 1 worth checking, 0 nice ideas
Top item: PR review advisor unavailable

Review findings

🛠️ Needs attention

  • None.

🔎 Worth checking

  • PR review advisor unavailable: The automated advisor could not complete: Could not parse JSON from PR review advisor output; see /home/runner/work/NemoClaw/NemoClaw/artifacts/pr-review-advisor/pr-review-advisor-raw-output.txt
    • Recommendation: Re-run the PR Review Advisor or perform a manual review.
    • Evidence: Could not parse JSON from PR review advisor output; see /home/runner/work/NemoClaw/NemoClaw/artifacts/pr-review-advisor/pr-review-advisor-raw-output.txt

🌱 Nice ideas

  • None.
Consider writing more tests for
  • **Runtime validation** — Add or identify targeted runtime/integration validation for the changed behavior; do not report external E2E job pass/fail here.. Runtime/sandbox/infrastructure paths need behavioral runtime validation: docs/reference/architecture.mdx, docs/reference/commands-nemohermes.mdx, docs/reference/commands.mdx, src/lib/onboard.ts, src/lib/onboard/docker-driver-gateway-env.ts, src/lib/onboard/docker-driver-gateway-service.ts, src/lib/onboard/gateway-tcp-readiness.ts.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

ericksoa added 2 commits May 31, 2026 08:44
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 26717014158
Target ref: 04b7523a4b3e0692374bc8229b4eab11be5ee0df
Workflow ref: main
Requested jobs: cloud-onboard-e2e,openshell-gateway-upgrade-e2e,sandbox-survival-e2e,runtime-overrides-e2e
Summary: 0 passed, 3 failed, 0 skipped

Job Result
cloud-onboard-e2e ❌ failure
openshell-gateway-upgrade-e2e ⚠️ cancelled
runtime-overrides-e2e ❌ failure
sandbox-survival-e2e ❌ failure

Failed jobs: cloud-onboard-e2e, runtime-overrides-e2e, sandbox-survival-e2e. Check run artifacts for logs.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 2414-2471: The helper function
tryStartPackageManagedDockerDriverGateway should be moved out of
src/lib/onboard.ts into a new sibling module (e.g.,
src/lib/onboard-package-managed.ts) so the big block no longer inflates
onboard.ts; keep startDockerDriverGateway() in onboard.ts as the coordinator and
have it import and call the relocated tryStartPackageManagedDockerDriverGateway.
Ensure you preserve the function signature and all referenced symbols
(dockerDriverGatewayService, clearDockerDriverGatewayRuntimeFiles,
registerDockerDriverGatewayEndpoint, runCaptureOpenshell, isGatewayHealthy,
isGatewayTcpReady, verifySandboxBridgeGatewayReachableOrExit, envInt,
sleepSeconds, GATEWAY_NAME) and their imports/exports so behavior is unchanged,
update module imports in onboard.ts, and export the helper from the new module
for testing/consumption.
- Around line 2443-2461: The call to clearDockerDriverGatewayRuntimeFiles
currently runs before the health-poll loop and removes runtime PID/marker files
prematurely; move that cleanup so it only runs after the gateway is confirmed
healthy (i.e., after isGatewayHealthy(status, namedInfo, currentInfo) && await
isGatewayTcpReady() succeeds) — update the code so
clearDockerDriverGatewayRuntimeFiles is invoked after the successful health
checks (for functions/variables involved: clearDockerDriverGatewayRuntimeFiles,
registerDockerDriverGatewayEndpoint, isGatewayHealthy, isGatewayTcpReady,
verifySandboxBridgeGatewayReachableOrExit) so recovery/fallback logic retains
runtime breadcrumbs until the service is truly up.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ad5de779-9431-49cf-972d-2394b910a053

📥 Commits

Reviewing files that changed from the base of the PR and between 9641ce0 and 04b7523.

📒 Files selected for processing (9)
  • docs/reference/architecture.mdx
  • docs/reference/commands.mdx
  • scripts/install-openshell.sh
  • src/lib/onboard.ts
  • src/lib/onboard/docker-driver-gateway-env.test.ts
  • src/lib/onboard/docker-driver-gateway-env.ts
  • src/lib/onboard/docker-driver-gateway-service.test.ts
  • src/lib/onboard/docker-driver-gateway-service.ts
  • test/install-openshell-version-check.test.ts

Comment thread src/lib/onboard.ts Outdated
Comment thread src/lib/onboard.ts Outdated
@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26717115183
Target ref: 90a5959cb9b39e7058ef96abd1468cbd46bd8b83
Workflow ref: main
Requested jobs: cloud-onboard-e2e,openshell-gateway-upgrade-e2e
Summary: 0 passed, 0 failed, 0 skipped

Job Result
cloud-onboard-e2e ⚠️ cancelled
openshell-gateway-upgrade-e2e ⚠️ cancelled

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26717180037
Target ref: 795b3c376c96cd021ac9091c6160b37ab9ae2f51
Workflow ref: main
Requested jobs: cloud-onboard-e2e,openshell-gateway-upgrade-e2e,sandbox-survival-e2e
Summary: 0 passed, 0 failed, 0 skipped

Job Result
cloud-onboard-e2e ⚠️ cancelled
openshell-gateway-upgrade-e2e ⚠️ cancelled
sandbox-survival-e2e ⚠️ cancelled

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/lib/onboard.ts (1)

2424-2426: Please run one service-path E2E before merge.

This branch now short-circuits the legacy startup flow when the package-managed gateway takes ownership, so openshell-gateway-upgrade-e2e plus one happy-path onboard flow such as cloud-e2e or sandbox-operations-e2e would give good coverage of the new handoff.

As per coding guidelines, src/lib/onboard.ts: "This file contains core onboarding logic. Changes here affect the full sandbox creation and configuration flow."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 2424 - 2426, Run a service-path E2E before
merging to validate the new short-circuit in src/lib/onboard.ts: execute the
openshell-gateway-upgrade-e2e test plus one happy-path onboarding E2E (cloud-e2e
or sandbox-operations-e2e) to confirm
dockerDriverGatewayEnv.startPackageManagedDockerDriverGateway correctly takes
ownership and that the legacy startup path still behaves when it should; verify
the flow around verifySandboxBridgeGatewayReachableOrExit and GATEWAY_NAME,
observe logs/errors and ensure no regressions in gateway handoff or sandbox
creation before merging.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 2424-2426: Run a service-path E2E before merging to validate the
new short-circuit in src/lib/onboard.ts: execute the
openshell-gateway-upgrade-e2e test plus one happy-path onboarding E2E (cloud-e2e
or sandbox-operations-e2e) to confirm
dockerDriverGatewayEnv.startPackageManagedDockerDriverGateway correctly takes
ownership and that the legacy startup path still behaves when it should; verify
the flow around verifySandboxBridgeGatewayReachableOrExit and GATEWAY_NAME,
observe logs/errors and ensure no regressions in gateway handoff or sandbox
creation before merging.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e9cba808-c0cf-473b-950a-5efbb2171a45

📥 Commits

Reviewing files that changed from the base of the PR and between 04b7523 and 795b3c3.

📒 Files selected for processing (3)
  • src/lib/onboard.ts
  • src/lib/onboard/docker-driver-gateway-env.ts
  • src/lib/onboard/docker-driver-gateway-service.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/onboard/docker-driver-gateway-env.ts

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 26717266019
Target ref: 098007d653201720acea8b9f796c267c9feff776
Workflow ref: main
Requested jobs: cloud-onboard-e2e,sandbox-survival-e2e,openshell-gateway-upgrade-e2e
Summary: 1 passed, 2 failed, 0 skipped

Job Result
cloud-onboard-e2e ❌ failure
openshell-gateway-upgrade-e2e ✅ success
sandbox-survival-e2e ❌ failure

Failed jobs: cloud-onboard-e2e, sandbox-survival-e2e. Check run artifacts for logs.

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 26718114347
Target ref: c24aa4e2f9f2f1d765c4d68ad3d19b78bf33f12ba
Workflow ref: main
Requested jobs: cloud-onboard-e2e,sandbox-survival-e2e,openshell-gateway-upgrade-e2e
Summary: 0 passed, 3 failed, 0 skipped

Job Result
cloud-onboard-e2e ❌ failure
openshell-gateway-upgrade-e2e ❌ failure
sandbox-survival-e2e ❌ failure

Failed jobs: cloud-onboard-e2e, openshell-gateway-upgrade-e2e, sandbox-survival-e2e. Check run artifacts for logs.

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26718192181
Target ref: c24aa4e2f798b44cedc7665ffc5750b610d2e28a
Workflow ref: main
Requested jobs: cloud-e2e,double-onboard-e2e,openshell-gateway-upgrade-e2e
Summary: 0 passed, 0 failed, 0 skipped

Job Result
cloud-e2e ⚠️ cancelled
double-onboard-e2e ⚠️ cancelled
openshell-gateway-upgrade-e2e ⚠️ cancelled

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/reference/commands.mdx`:
- Line 1422: Update the description for the environment variable
NEMOCLAW_OPENSHELL_SANDBOX_BIN to use active voice to match the parallel entry
for NEMOCLAW_OPENSHELL_GATEWAY_BIN: replace the passive phrase "passed to the
Linux Docker-driver standalone fallback" with the active construction "used by
the Linux Docker-driver standalone fallback" while keeping the rest of the
sentence unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b6c6b2a5-0d6f-4db2-a49c-f202dd4ad0d1

📥 Commits

Reviewing files that changed from the base of the PR and between 098007d and c24aa4e.

📒 Files selected for processing (2)
  • docs/reference/commands.mdx
  • src/lib/onboard/docker-driver-gateway-service.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/onboard/docker-driver-gateway-service.ts

Comment thread docs/reference/commands.mdx Outdated
@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26718200252
Target ref: 3f68c539819164ab6324c26c69a524181c2c3451
Workflow ref: main
Requested jobs: cloud-onboard-e2e,sandbox-survival-e2e,openshell-gateway-upgrade-e2e
Summary: 0 passed, 0 failed, 0 skipped

Job Result
cloud-onboard-e2e ⚠️ cancelled
openshell-gateway-upgrade-e2e ⚠️ cancelled
sandbox-survival-e2e ⚠️ cancelled

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26720778107
Target ref: 7523a1ebf586a3749bd3b091f2d22c64add9b021
Workflow ref: main
Requested jobs: openshell-gateway-upgrade-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
openshell-gateway-upgrade-e2e ✅ success

@wscurran wscurran added area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression platform: dgx-spark Affects DGX Spark hardware or workflows and removed priority: high bug Something fails against expected or documented behavior labels Jun 3, 2026
@github-actions

github-actions Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 27042899089
Target ref: 19b68534075c8d4680ac587a208dbb1f40a82f0d
Workflow ref: main
Requested jobs: cloud-onboard-e2e,openshell-gateway-upgrade-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
cloud-onboard-e2e ✅ success
openshell-gateway-upgrade-e2e ⚠️ cancelled

@cv cv enabled auto-merge (squash) June 5, 2026 22:33
@cv cv merged commit 2e3daf7 into main Jun 5, 2026
34 checks passed
@cv cv deleted the fix/4423-openshell-service-lifecycle-v60 branch June 5, 2026 22:36
@github-actions

github-actions Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 27043502178
Target ref: eee12120d7be889187c125d5dd9a4bf7d3bcc557
Workflow ref: main
Requested jobs: cloud-onboard-e2e,sandbox-survival-e2e,openshell-gateway-upgrade-e2e
Summary: 3 passed, 0 failed, 0 skipped

Job Result
cloud-onboard-e2e ✅ success
openshell-gateway-upgrade-e2e ✅ success
sandbox-survival-e2e ✅ success

miyoungc added a commit that referenced this pull request Jun 6, 2026
## Summary
- Adds the `v0.0.60` section to `docs/about/release-notes.mdx` using the
dev announcement from discussion #4877.
- Fills the source-doc gaps found during release-prep review across
inference, policy tiers, command behavior, security boundaries, Hermes
dashboard/tooling, runtime context, and troubleshooting.
- Refreshes generated agent skills under `.agents/skills/` from the
current Fern docs output and upgrades Fern from `5.44.3` to `5.45.0`.

## Source summary
- #4037 -> `docs/reference/architecture.mdx`,
`docs/about/how-it-works.mdx`, `docs/about/release-notes.mdx`: Documents
system-only runtime context that stays out of visible chat.
- #4875 -> `docs/reference/architecture.mdx`,
`docs/about/how-it-works.mdx`, `docs/about/release-notes.mdx`: Documents
try-first sandbox network/filesystem guidance and clearer failure
classification.
- #4788 -> `docs/security/best-practices.mdx`,
`docs/about/release-notes.mdx`: Documents shared OpenClaw
device-approval policy for startup and connect.
- #4768 -> `docs/reference/network-policies.mdx`,
`docs/network-policy/integration-policy-examples.mdx`,
`docs/get-started/quickstart.mdx`,
`docs/get-started/quickstart-hermes.mdx`, `docs/reference/commands.mdx`:
Documents `weather`, `public-reference`, and Hermes managed-tool gateway
preset behavior.
- #3788 and #4864 -> `docs/reference/network-policies.mdx`,
`docs/reference/commands.mdx`: Documents non-interactive policy-tier
fail-fast behavior and interactive prompt fallback.
- #4756 and #4866 -> `docs/reference/commands.mdx`: Documents env-aware
default sandbox resolution for `list`, `status`, and `tunnel` commands.
- #4320 -> `docs/reference/commands.mdx`: Documents `$$nemoclaw tunnel
status` behavior.
- #4328 -> `docs/reference/commands.mdx`: Documents line-scoped policy
preset descriptions in `policy-list`.
- #4580 and #4748 -> `docs/reference/architecture.mdx`: Documents
package-managed OpenShell gateway service and Docker-driver
gateway-marker behavior.
- #4598 -> `docs/manage-sandboxes/lifecycle.mdx`: Documents concurrent
gateway/dashboard cleanup isolation by sandbox name and port.
- #4777 -> `docs/reference/troubleshooting.mdx`: Documents Docker GPU
patch rollback behavior.
- #4610 -> `docs/reference/troubleshooting.mdx`,
`docs/reference/commands.mdx`: Keeps mutable OpenClaw config permission
guidance aligned and removes skipped experimental wording.
- #4868 -> `docs/reference/commands.mdx`: Keeps `.dockerignore` handling
for custom `onboard --from <Dockerfile>` contexts in generated skills.
- #4870 -> `docs/reference/commands.mdx`,
`docs/manage-sandboxes/runtime-controls.mdx`: Documents
`NEMOCLAW_MINIMAL_BOOTSTRAP` and generated skill coverage.
- #4641 -> `docs/inference/inference-options.mdx`,
`docs/reference/troubleshooting.mdx`: Documents local NVIDIA NIM
platform-digest pulls and served-model id adoption.
- #4810 and #4867 -> `docs/inference/inference-options.mdx`: Documents
stable NGC managed-vLLM image lineage and DGX Station DeepSeek V4 Flash
coverage.
- #4852 -> `docs/inference/use-local-inference.mdx`,
`docs/reference/troubleshooting.mdx`: Documents Ollama model fit
filtering, 16K context floor, cold-load retry, and failed-model
exclusion.
- #4847 -> `docs/inference/switch-inference-providers.mdx`: Documents
API-family sync, Hermes `api_mode`, and Bedrock Runtime exception.
- #4800 -> `docs/inference/tool-calling-reliability.mdx`: Documents
Nemotron managed-inference native tool-search fallback.
- #4333 -> `docs/inference/switch-inference-providers.mdx`: Documents
interactive multimodal input prompting.
- #4086 -> `docs/reference/troubleshooting.mdx`: Keeps proxy bypass
normalization in generated troubleshooting coverage.
- #4811 and #4855 -> `docs/get-started/quickstart-hermes.mdx`: Documents
prebuilt Hermes dashboard assets and TUI recovery without runtime
rebuilds.
- #4854 -> `docs/inference/switch-inference-providers.mdx`,
`docs/reference/commands.mdx`: Documents Hermes proxy API-key
placeholder preservation during inference switches.
- #4248 -> `docs/manage-sandboxes/messaging-channels.mdx`,
`.agents/skills/`: Keeps messaging enrollment behavior aligned with
manifest-hook implementation.
- #4771 -> `docs/security/best-practices.mdx`,
`docs/security/credential-storage.mdx`: Documents Hermes
placeholder-only secret boundary for sandbox-visible runtime files.
- #4787 -> `docs/security/best-practices.mdx`,
`docs/about/release-notes.mdx`: Documents expanded memory scanner
examples for OpenAI project keys and Slack app-level tokens.
- #4848 -> `docs/reference/commands.mdx`: Documents OpenClaw skill
install mirroring into the agent home directory.
- #4790 -> `docs/about/release-notes.mdx`: Uses the prior release-prep
structure and generated `.agents/skills/` refresh as the template for
this release.

## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user --doc-platform fern-mdx`
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ skills/
--prefix nemoclaw-user --doc-platform fern-mdx --dry-run`
- `npm run docs`
- `git diff --check`
- skip-term scan across `docs/`, `.agents/skills/`, and `skills/`
- `npm run build:cli`
- `npm run typecheck:cli`
- Commit and pre-push hook suites, including markdownlint, gitleaks,
env-var docs gate, docs-to-skills verification, and skills YAML tests

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes

* **New Features**
* DeepSeek-V4-Flash now available as default inference model for DGX
Station.
* Hermes dashboard improved with dedicated port and OAuth-authenticated
tool gateway selection.
* Added weather and public-reference policy presets for expanded agent
capabilities.
* Enhanced Ollama model selection with GPU memory filtering and
automatic retry for timeouts.

* **Bug Fixes**
  * Improved policy tier validation to prevent invalid configurations.
* Better sandbox cleanup scoping by port to prevent conflicts across
deployments.
  * Added GPU patch failure recovery with automatic rollback.

* **Documentation**
* Expanded troubleshooting guides for inference, security, and sandbox
lifecycle.
  * Added .dockerignore best practices for custom deployments.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Carlos Villela <cvillela@nvidia.com>
jyaunches added a commit that referenced this pull request Jun 9, 2026
Registers the failing-test-first guard for #4423 in the typed scenario
registry so the live Vitest matrix from #5006 fans it out as a
dedicated CI job. Builds on the framework primitives added earlier in
this PR (lifecycle phase fixture, host-side probes, lifecycle whitelist).

Additions:

  * `post-reboot-recovery-ready` expected-state in
    `scenarios/expected-states.ts` declaring the user-visible
    invariants that must hold after a `nemoclaw <name> status` call
    on a freshly-rebooted DGX Spark / Linux Docker-driver host:
      - cli installed,
      - gateway healthy (the user-systemd unit from #4580 brings it
        back up before status runs),
      - sandbox running (recovery completed in time),
      - localRegistry entry preserved (the user-visible regression
        target — destroyed on unfixed `main`),
      - dockerSandboxContainer present (recovery didn't delete the
        labeled container or its `*-nemoclaw-gpu-backup-*` sibling).

  * `ubuntu-repo-docker-post-reboot-recovery` scenario in
    `scenarios/scenarios/baseline.ts` wiring
    `ubuntuRepoDockerLifecycle("cloud-openclaw", "post-reboot-recovery")`
    against the new expected-state and a smoke suite. Carries a
    description that explains the RED/GREEN contract and points to the
    PR-A fix landing in `src/lib/`.

  * `manifests/openclaw-nvidia-post-reboot-recovery.yaml` declares
    `lifecycle: post-reboot-recovery` and the same NVIDIA_API_KEY
    credential ref the cloud-openclaw scenarios use.

  * `.github/workflows/e2e-scenarios.yaml` ROUTES table gains the new
    scenario so the workflow-boundary test
    (`e2e-scenarios-workflow.test.ts`) routes every typed id.

Test pinning:

  * `e2e-scenario-matrix.test.ts` updated from a 1-entry to a 2-entry
    live matrix expectation. The new entry asserts on
    `expectedStateId: "post-reboot-recovery-ready"` so a future
    accidental dropped-lifecycle change to the scenario regresses
    loudly.

  * `e2e-live-registry-discovery.test.ts` swaps the synthetic
    whitelist-coverage test for an assertion against the real
    `ubuntu-repo-docker-post-reboot-recovery` registry entry.

Behavior:

  * On unfixed `main`, the live runner's lifecycle phase stops the
    OpenShell gateway runtime and `docker stop`s the labeled sandbox
    container. State-validation then runs `nemoclaw <name> status`
    (which restarts the gateway via systemd) and the destructive
    `missing` branch in `src/lib/actions/sandbox/status.ts` wipes the
    local registry entry. The `local-registry-entry-present` probe
    fails. Scenario goes RED.

  * On the PR-A fix branch, the new Docker-driver sandbox recovery
    helper restarts the labeled container before stale-removal can
    fire, registry survives, all five probes pass. Scenario flips
    GREEN.

The bash-side legacy compiler emits a
`lifecycle.profile.post-reboot-recovery` PhaseAction pointing at
`nemoclaw_scenarios/lifecycle/dispatch.sh`, but the legacy bash worker
is intentionally not provided: this scenario is Vitest-only. The
typed runner's `LifecyclePhaseFixture` handles dispatch directly. If
the legacy runner is invoked against this scenario it errors out at
the dispatcher; that's the right failure mode while the bash side
stays on its own retirement clock.

Refs #4423
jyaunches added a commit that referenced this pull request Jun 9, 2026
The first dispatch of the new post-reboot-recovery scenario surfaced a
flow gap: state-validation asserted gateway-healthy immediately after
lifecycle.simulate stopped the gateway, never running the user-visible
action that actually triggers #4423's regression.

Add `nemoclaw <name> status` as the final step of simulatePostReboot.
This is faithful to the bug semantics — "after a reboot, the FIRST
status invocation is what destroys state" — and lets state-validation
observe the post-status host invariants:

  * On unfixed code: status restarts the gateway via the user-systemd
    unit from #4580, sees sandbox NotFound, runs the destructive
    `missing` branch in src/lib/actions/sandbox/status.ts, registry is
    wiped → `local-registry-entry-present` probe fails → scenario RED.
  * On the PR-A fix branch: status restarts the gateway, recovery
    helper restarts the labeled container before stale-removal can
    fire, registry survives → all probes pass → scenario GREEN.

Status exit code is intentionally not asserted: the bug is precisely
that status "succeeds" at destroying state. The post-action invariants
are owned by state-validation, not lifecycle.

Refs #4423
cv added a commit that referenced this pull request Jun 9, 2026
## Summary

Failing-test-first regression guard for #4423, plus the framework
primitives that make it expressible. After this PR lands, the live
Vitest matrix from #5006 fans out a dedicated
`ubuntu-repo-docker-post-reboot-recovery` job that goes 🔴 RED on `main`
(because the registry-destruction bug is still there) and is the
executable acceptance criterion for the source fix in PR-A.

## Related Issue

Refs #4423

## What this PR adds

### Framework primitives (commits 1–3)

1. **`SUPPORTED_LIFECYCLES` whitelist** in
`scenarios/runtime-support.ts`. Replaces the unconditional reject for
`environment.lifecycle` with a lock-step whitelist, so lifecycle
scenarios can graduate one profile at a time. Initial whitelist:
`post-reboot-recovery`.
2. **Two host-side state-validation probes**:
- `local-registry-entry-present` — reads `~/.nemoclaw/sandboxes.json`
and asserts the scenario's sandbox name is recorded. This is the
user-visible regression target for #4423.
- `docker-sandbox-container-present` — runs `docker ps -a --filter
label=openshell.ai/sandbox-name=…` and accepts running, stopped, and
`*-nemoclaw-gpu-backup-*` siblings. Mirrors
`OPENSHELL_SANDBOX_NAME_LABEL` /
`findOpenShellDockerSandboxContainerIds` in
`src/lib/onboard/docker-gpu-patch.ts`.

New `ExpectedState.localRegistry` and
`ExpectedState.dockerSandboxContainer` dimensions; `probesForState`
emits the probes only for `expected: "present"`.
`StateValidationPhaseFixture.from()` now accepts an inline
`ExpectedState` so unit tests don't need to register synthetic states.
Optional `ProbeIO` injection for the registry reader.
3. **`LifecyclePhaseFixture`** in `framework/phases/lifecycle.ts`.
Dispatches `simulate(profile, instance, opts)`. Today the only profile
is `post-reboot-recovery`, with two modes:
- `stop-original` (default) — `openshell gateway stop` + `docker stop`
of the labeled container.
- `rename-to-gpu-backup` — additionally renames the container to a
`*-nemoclaw-gpu-backup-<ts>` sibling, mirroring
`buildBackupContainerName()` in `src/lib/onboard/docker-gpu-patch.ts`.

Both modes register reverse-order cleanups so teardown leaves Docker
usable. Wired into `E2EScenarioFixtures` and into
`live/registry-scenarios.test.ts` between `onboard.from()` and
`stateValidation.from()` whenever the scenario declares
`environment.lifecycle`.

### The failing regression guard (commit 4)

4. **`ubuntu-repo-docker-post-reboot-recovery` scenario** +
**`post-reboot-recovery-ready` expected-state** + manifest YAML. The
expected-state requires:
   - `cli` installed,
- `gateway` healthy (the user-systemd unit from #4580 brings it back
up),
   - `sandbox` running (recovery completes in time),
- **`localRegistry` entry preserved** (the user-visible regression
target),
- `dockerSandboxContainer` present (recovery didn't delete the labeled
container or its `*-nemoclaw-gpu-backup-*` sibling).

The `.github/workflows/e2e-scenarios.yaml` ROUTES table gains the new id
so `e2e-scenarios-workflow.test.ts` keeps every typed id routed.

## RED / GREEN contract

* **On unfixed `main`:** the live runner stops the OpenShell gateway
runtime and `docker stop`s the labeled container. State-validation runs
`nemoclaw <name> status`, which restarts the gateway via systemd; the
destructive `missing` branch in `src/lib/actions/sandbox/status.ts` then
wipes the local registry entry. The `local-registry-entry-present` probe
fails. **Scenario goes 🔴 RED.**
* **On the PR-A fix branch:** the new Docker-driver sandbox recovery
helper restarts the labeled container before stale-removal can fire,
registry survives, all five probes pass. **Scenario flips 🟢 GREEN.**

## Type of Change

- [x] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification

- [x] `npx prek run --all-files` passes
- [x] `npm test` passes — the 150 tests across the 6 directly affected
files are all green; full e2e-scenario framework suite has only the
pre-existing `shell probe reaps timed-out command process groups`
load-flake in `e2e-fixture-context.test.ts` (unchanged on `main`).
- [x] `--emit-live-matrix` now emits both `ubuntu-repo-cloud-openclaw`
and `ubuntu-repo-docker-post-reboot-recovery` as supported live
scenarios; the workflow fan-out from #5006 will dispatch each as its own
CI job.
- [x] Tests added or updated for new or changed behavior.
- [x] No secrets, API keys, or credentials committed.
- [ ] Docs updated for user-facing behavior changes.
- [ ] `npm run docs` builds without warnings (doc changes only).

## Boundaries kept honest

- No imports from `src/lib/**` into `framework/**`. The two duplicated
constants (`OPENSHELL_SANDBOX_NAME_LABEL`, `~/.nemoclaw/sandboxes.json`
path) carry comments pointing back to the CLI source-of-truth. Drift
gets caught by the live scenario itself once the lifecycle phase
exercises real Docker labels.
- `LifecyclePhaseFixture` profile dispatch is exhaustive
(`never`-checked). New profiles must add the dispatcher branch and the
`SUPPORTED_LIFECYCLES` whitelist together.
- Negative-direction host probes (`expected: "absent"` for
`localRegistry` / `dockerSandboxContainer`) deliberately emit nothing
today, with a `probesForState` test pinning the gap so a future negative
scenario PR has to add them.
- The legacy bash dispatcher
(`nemoclaw_scenarios/lifecycle/dispatch.sh`) is intentionally **not**
updated for `post-reboot-recovery`. This scenario is Vitest-only; the
typed runner's `LifecyclePhaseFixture` handles dispatch directly. If the
legacy bash runner is ever invoked against this scenario, it errors at
the dispatcher — that's the right failure mode while the bash side stays
on its own retirement clock.

## Why this doesn't yet flip green

The fix lives in `src/lib/actions/sandbox/{status,gateway-state}.ts`
plus a new `src/lib/onboard/docker-driver-sandbox-recovery.ts` helper,
tracked as **PR-A**. PR-A implements parts 2 & 3 of @ericksoa's #4423
plan:
- Docker-driver sandbox recovery from
`findOpenShellDockerSandboxContainerIds` labels (including
`*-nemoclaw-gpu-backup-*` rename), and
- tightened stale-removal in active paths (`status`, `connect`,
`ensureLiveSandboxOrExit`) requiring Docker-corroborated absence before
destroying the registry.

PR-A's CI run will dispatch this scenario via `--scenarios
ubuntu-repo-docker-post-reboot-recovery` to prove 🔴 → 🟢.

## Diffstat

```
 .github/workflows/e2e-scenarios.yaml                                  |   1 +
 test/e2e-scenario/framework-tests/e2e-expected-state.test.ts          |  35 +++
 test/e2e-scenario/framework-tests/e2e-live-registry-discovery.test.ts |  35 +++
 test/e2e-scenario/framework-tests/e2e-phase-lifecycle.test.ts         | 239 +++++++++++++++++++++
 test/e2e-scenario/framework-tests/e2e-phase-state-validation.test.ts  | 120 ++++++++++-
 test/e2e-scenario/framework-tests/e2e-scenario-matrix.test.ts         |  20 +-
 test/e2e-scenario/framework/e2e-test.ts                               |   5 +
 test/e2e-scenario/framework/phases/index.ts                           |   8 +
 test/e2e-scenario/framework/phases/lifecycle.ts                       | 207 ++++++++++++++++++
 test/e2e-scenario/framework/phases/state-validation.ts                | 113 +++++++++-
 test/e2e-scenario/live/registry-scenarios.test.ts                     |  44 +++-
 test/e2e-scenario/manifests/openclaw-nvidia-post-reboot-recovery.yaml |  41 ++++
 test/e2e-scenario/scenarios/expected-states.ts                        |  35 +++
 test/e2e-scenario/scenarios/runtime-support.ts                        |   8 +-
 test/e2e-scenario/scenarios/scenarios/baseline.ts                     |  31 +++
 test/e2e-scenario/scenarios/types.ts                                  |  25 ++-
 16 files changed, 950 insertions(+), 17 deletions(-)
```

---------

Co-authored-by: Carlos Villela <cvillela@nvidia.com>
jyaunches added a commit that referenced this pull request Jun 9, 2026
The first end-to-end dispatch with the new probe ordering revealed that
the test was hitting #4578's mitigation, not the open #4423 bug. With
the gateway stopped and never restarted, status's `getNamedGateway-
LifecycleState()` returned NOT healthy_named, which routes through the
non-destructive branch — so the registry was correctly preserved and
the test failed at gateway-healthy, masking the real signal.

Add `openshell gateway start --name nemoclaw` between `docker stop`
and `nemoclaw <name> status` so the lifecycle truly mirrors a DGX
Spark / Linux Docker-driver post-reboot:

  * Gateway stopped (sandbox memory dropped).
  * Sandbox container stopped (so OpenShell sees NotFound).
  * Gateway restarted FRESH via the explicit `openshell gateway
    start` invocation — the user-systemd autostart path from #4580
    in compressed form. Now the gateway is `healthy_named` AND has
    no sandbox memory.
  * Status is then invoked. With healthy_named gateway + missing
    sandbox + Docker container still present, the destructive
    `missing` branch in `src/lib/actions/sandbox/status.ts:308`
    fires and wipes the registry on unfixed code.

This is the precise precondition that PR-A's Docker-driver sandbox
recovery helper neutralizes: status will see Docker still has the
labeled container, restart it, and skip the destructive cleanup.

Refs #4423
jyaunches added a commit that referenced this pull request Jun 9, 2026
The previous attempt to restart the gateway via `openshell gateway start
--name nemoclaw` failed at runtime: that subcommand doesn't exist on
the OpenShell CLI shipped on `ubuntu-latest`. The hint text the
NemoClaw codebase prints to users (`openshell gateway start --name
nemoclaw`) is a USER-FACING string, not an actual CLI invocation; the
real autostart path is the user-systemd unit installed by #4580.

Reframe the lifecycle: stopping just the labeled sandbox container
(while leaving the gateway healthy_named) is sufficient to reach the
remaining #4423 destructive branch. With:

  * Gateway healthy_named (untouched by lifecycle)
  * `openshell sandbox get <name>` returning NotFound (because the
    container is stopped and OpenShell doesn't auto-recover stopped
    sandboxes)
  * Docker still has the labeled container (recoverable)

…on unfixed code, status's `missing` branch in
`src/lib/actions/sandbox/status.ts:308` calls `registry.removeSandbox`
and the local-registry-entry-present probe fails. PR-A's Docker-driver
recovery helper is what needs to neutralize this branch.

Removed:
  * `openshell gateway stop` step
  * `openshell gateway start --name nemoclaw` step
  * `GATEWAY_STOP_TIMEOUT_MS` / `GATEWAY_START_TIMEOUT_MS` constants
  * Unit tests for the now-removed gateway-{stop,start} flows

Refs #4423
jyaunches added a commit that referenced this pull request Jun 9, 2026
Initial dispatches of `ubuntu-repo-docker-post-reboot-recovery` against
`ubuntu-latest` revealed that the user-visible #4423 bug class cannot
be fully reproduced from CI without a real reboot:

  * `docker stop` of the labeled sandbox container puts OpenShell into
    `Phase: Provisioning`, not literal `NotFound` — so the destructive
    `missing` branch in `src/lib/actions/sandbox/status.ts:308` does
    not fire on unfixed code.
  * The OpenShell CLI on `ubuntu-latest` does not expose a
    `gateway start` subcommand (the user-systemd autostart path from
    #4580 is the real autostart mechanism, not the CLI), so we can't
    cleanly reset the gateway to a fresh-state-with-no-sandbox-memory
    configuration that would yield a real `NotFound`.
  * Gateway HTTP health on the runner is environmental and varies
    independently of the regression target.

Scope the `post-reboot-recovery-ready` expected state to ONLY the
host-side preservation invariants this scenario can faithfully
exercise from CI:

  * `cli` installed,
  * `localRegistry` entry preserved,
  * `dockerSandboxContainer` (labeled, including
    `*-nemoclaw-gpu-backup-*` siblings) present.

This makes the scenario a "lock-in" guard for #4578's mitigation and
for the host-side invariants PR-A's Docker-driver recovery helper must
also preserve. A future scenario on a more controlled runner can
extend the expected state with gateway/sandbox runtime probes once
PR-A has landed and a true post-reboot environment is available.

The framework primitives — lifecycle phase fixture, host-side
probes, lifecycle whitelist — remain unchanged and remain useful
beyond this scenario's current limits.

Refs #4423
cv pushed a commit that referenced this pull request Jun 9, 2026
…-up to #5046) (#5087)

## Summary

Follow-up to #5046. Tightens the
`ubuntu-repo-docker-post-reboot-recovery` scenario into a stable
RED-on-regression / GREEN-on-main guard while PR-A (the `#4423` source
fix) is in flight.

## Related Issue

Refs #4423

## Why this is needed

Post-merge dispatches of #5046's
`ubuntu-repo-docker-post-reboot-recovery` job against `ubuntu-latest`
revealed two issues:

1. **The `openshell gateway stop` step in `LifecyclePhaseFixture` was
counterproductive.** `#4578`'s mitigation in
`src/lib/actions/sandbox/status.ts:296` guards the destructive branch
behind a `healthy_named` gateway. Stopping the gateway routed every
dispatch through the safe path and masked the regression target. There
is no `openshell gateway start` subcommand on `ubuntu-latest` (the
user-systemd autostart from #4580 is the real mechanism), so the
original lifecycle couldn't restart the gateway cleanly to keep it
`healthy_named`.

2. **The `post-reboot-recovery-ready` expected-state asserted on runtime
liveness probes** (`gateway-healthy`, `sandbox-running`) that are
environmental on `ubuntu-latest` after `docker stop`. They varied
independently of the `#4423` regression target and made every dispatch
RED for the wrong reason — concretely, `gateway-healthy` failing before
the host-side probes had a chance to run.

## Changes

### `LifecyclePhaseFixture.simulatePostReboot`

- Drop the `openshell gateway stop` step. The lifecycle now stops only
the labeled sandbox container and then runs `nemoclaw <name> status`.
Gateway is left `healthy_named`, which is the precondition for the
destructive branch we want PR-A to neutralize.
- Drop the `GATEWAY_STOP_TIMEOUT_MS` constant (no longer referenced).
- Updated docstring + 2 unit tests removed (gateway-stop-failure
tolerance tests are no longer applicable).

### `post-reboot-recovery-ready` expected-state

Lock down only the host-side preservation invariants this scenario can
faithfully exercise:

- `cli-installed`
- `local-registry-entry-present`
- `docker-sandbox-container-present` (running OR stopped OR
`*-nemoclaw-gpu-backup-*` sibling)

Drop `gateway: { expected: "present", health: "healthy" }` and `sandbox:
{ expected: "present", status: "running", agent: "openclaw" }`. A
follow-up scenario on a more controlled runner (real DGX Spark / Brev)
can extend the expected state with runtime liveness probes once PR-A has
landed and a true post-reboot environment is available.

### `probesForState` ordering

Reorder so host-side observables (`local-registry-entry-present`,
`docker-sandbox-container-present`) emit BEFORE runtime-derived
gateway/sandbox probes everywhere. The state-validation orchestrator
short-circuits on the first probe failure, so host-side regression
targets must be observed first. Pre-existing scenarios
(`cloud-openclaw-ready`, `preflight-failure-*`, etc.) are unaffected —
none declare host-side dimensions, so their probe lists are unchanged.

## RED / GREEN contract

- **On `main`** (this PR's HEAD): scenario goes 🟢 GREEN. Locks in
`#4578`'s mitigation as a passive guard — registry preserved through
`docker stop` + `nemoclaw status` round-trip.
- **On a regression PR** that re-introduces unconditional
`registry.removeSandbox` in `status.ts:296` or `gateway-state.ts:412`:
scenario goes 🔴 RED at `local-registry-entry-present`.
- **On PR-A's fix branch** that adds Docker-driver sandbox recovery:
scenario stays 🟢, demonstrating the new helper does not regress either
host-side invariant.

## Type of Change

- [x] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification

- [x] `npx prek run --all-files` passes
- [x] `npm test` passes — 142 tests across the directly affected files
all green; no regressions on the rest of the e2e-scenario framework
suite.
- [x] Live dispatch (`gh workflow run e2e-vitest-scenarios.yaml -f
scenarios=ubuntu-repo-docker-post-reboot-recovery --ref
e2e-post-reboot-scope-host-side`) goes GREEN: lifecycle (`docker stop` +
`nemoclaw status`) → host-side probes pass → cleanup restores Docker.
Will dispatch as a verification step once this PR is open.
- [x] Tests added or updated for new or changed behavior.
- [x] No secrets, API keys, or credentials committed.

## Follow-ups (post PR-A)

- Once PR-A's Docker-driver recovery helper is in place, extend the
post-reboot expected-state with gateway/sandbox runtime probes on a
controlled runner.
- Consider a `post-reboot-spark` lifecycle profile for hardware-runner
dispatch that can do a real reboot and exercise the full bug class.


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Tests**
* Added a validation test for post-reboot-recovery expected-state
focusing on host-side invariants.
* Updated post-reboot-recovery phase tests to remove the gateway-stop
step and assert container stop + host status checks.

* **Documentation**
* Clarified lifecycle and scenario comments/manifests to describe
container-stop + host-status-driven recovery and probe ordering.

* **Behavior**
* Expected-state and probe emission now prioritize host-side checks and
omit runtime gateway/sandbox expectations.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression NV QA Bugs found by the NVIDIA QA Team platform: dgx-spark Affects DGX Spark hardware or workflows UAT Issues flagged for User Acceptance Testing. v0.0.60 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants